# Introduction

There are many different approaches to teaching people something new, and many
opinions about them. However, most people agree that hands-on experience is one
of the best ways to make the human brain remember a new skill. Learning must be
entertaining and interactive, with fast and frequent feedback. Some areas
are more suitable for this practical way of learning than others, and
fortunately, programming is one of them.

The university education system is one of the areas where this knowledge can be
applied. In computer programming, there are several requirements a program
should satisfy, such as the code being syntactically correct, efficient and easy
to read, maintain and extend.

Checking programs written by students by hand takes time and requires a lot of
repetitive work -- reviewing the source code, compiling it and
running it through test scenarios. It is therefore desirable to automate as
much of this process as possible.

The first idea of an automatic evaluation system
comes from Stanford University professors in 1965. They implemented a system
which evaluated code in Algol submitted on punch cards. In the following years,
many similar products were written.

Nowadays, properties like correctness and efficiency can be tested
to a large extent automatically. This fact should be exploited to help teachers
save time for tasks such as examining bad design, bad coding habits, or logical
mistakes, which are difficult to perform automatically.

There are two basic ways of automatically evaluating code:

- **statically** -- by checking the source code without running it.
  This is safe, but not very practical.
- **dynamically** -- by running the code on test inputs and checking the
  correctness of its outputs. This provides good real-world experience, but
  requires extensive security measures.

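To make the distinction concrete, a static check can be as small as parsing the
source and walking its syntax tree. The following is a minimal sketch only: the
function name `find_forbidden_calls` and the set of forbidden names are
assumptions made for illustration, not part of any existing grader.

```python
import ast

# Hypothetical policy: functions a submission is not allowed to call.
FORBIDDEN = {"eval", "exec", "open"}


def find_forbidden_calls(source):
    """Return the sorted names of forbidden functions called in `source`.

    The code is only parsed, never executed, so the check is safe even
    for malicious submissions.
    """
    tree = ast.parse(source)
    found = set()
    for node in ast.walk(tree):
        # Only direct calls such as `eval(...)` are detected here.
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            if node.func.id in FORBIDDEN:
                found.add(node.func.id)
    return sorted(found)
```

A check like this says nothing about whether the program computes the right
answer, which is why dynamic evaluation is usually needed as well.
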
This project focuses on the machine-controlled part of source code evaluation.
First, we examined the general concepts of grading systems and discussed the
problems of the software previously used at Charles University in Prague.
Then we specified the new requirements and examined projects with similar
functionality. With the knowledge acquired from these projects, we set
goals for the new evaluation system, designed the architecture and implemented a
fully operational solution based on dynamic evaluation. The system is now ready
for production testing at the university.

## Assignment

The major goal of this project is to create a grading application which will be
used for programming classes at the Faculty of Mathematics and Physics of
Charles University in Prague. However, the application should be designed in a
modular fashion so that it can be easily extended or even modified to make
other ways of usage possible.

The system should be capable of performing a dynamic analysis of the submitted
source code. This consists of the following basic steps:

1. compile the code and check for compilation errors
2. run the compiled program in a sandbox with predefined inputs
3. check that the memory and time limits were not exceeded
4. compare the outputs of the program with the defined expected outputs
5. award the solution a numeric score

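The steps above can be sketched in a few lines. This is an illustration under
strong assumptions, not the design of the system: submissions are Python
programs so that the example is self-contained, the built-in `compile()` stands
in for a real compiler, the program is run *without* any sandbox, only the time
limit is enforced (memory is not measured), and the score is simply the
fraction of passed tests. The name `evaluate` and the shape of `tests` are
hypothetical.

```python
import subprocess
import sys


def evaluate(source, tests, time_limit=2.0):
    """Return a score in [0, 1] for Python `source` on `tests`.

    `tests` is a list of (stdin_text, expected_stdout) pairs.
    """
    # Step 1: "compile" -- for Python this only catches syntax errors.
    try:
        compile(source, "<submission>", "exec")
    except SyntaxError:
        return 0.0

    passed = 0
    for stdin_text, expected in tests:
        try:
            # Step 2: run on a predefined input (NOT sandboxed here).
            result = subprocess.run(
                [sys.executable, "-c", source],
                input=stdin_text,
                capture_output=True,
                text=True,
                timeout=time_limit,  # Step 3: time limit only.
            )
        except subprocess.TimeoutExpired:
            continue
        # Step 4: compare the output with the expected output.
        if result.returncode == 0 and result.stdout.strip() == expected.strip():
            passed += 1

    # Step 5: award a numeric score.
    return passed / len(tests) if tests else 0.0
```

For example, `evaluate("print(int(input()) * 2)", [("2\n", "4")])` runs the
one-line program on the input `2` and compares its output with `4`.
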
The whole system is intended to help both the teachers (supervisors) and the
students. To achieve this, it is crucial for us to keep in mind the typical
usage scenarios of the system and to try to make these tasks as simple as
possible. The project has a great starting point for this -- there is an old
grading system currently used at the university (CodEx), so its flaws and
weaknesses can be addressed. Furthermore, many teachers want to use and test
the new system and they are willing to discuss our ideas and problems with us
during development.

## Current System

The grading solution currently used at the Faculty of Mathematics and Physics of
Charles University in Prague was implemented in 2006 by a group of students.
It is called [CodEx -- The Code Examiner](http://codex.ms.mff.cuni.cz/project/)
and it has been used with some improvements since then. The original plan was to
use the system only for the basic programming courses, but there was a demand for
adapting it for several different courses.

CodEx is based on dynamic analysis. It features a web-based interface, where
supervisors can assign exercises to their students and the students have a time
window to submit their solutions. Each solution is compiled and run in a sandbox
(MO-Eval). The checked metrics are the correctness of the output and compliance
with the time and memory limits. It supports programs written in C, C++, C#,
Java, Pascal, Python and Haskell.

The system has a database of users. Each user is assigned a role, which
corresponds to their privileges. There are user groups reflecting the
structure of lectured courses.

A database of exercises (algorithmic problems) is another part of the project.
Each exercise consists of a text describing the problem, a configuration of the
evaluation (machine-readable instructions on how to evaluate solutions to the
exercise), time and memory limits for all supported runtimes (e.g. programming
languages), a configuration for calculating the final score and a set of inputs
and reference outputs. Exercises are created by instructed privileged users.
Assigning an exercise to a group means choosing one of the available exercises
and specifying additional properties: a deadline (optionally a second deadline),
a maximum amount of points, a maximum number of submissions and a list of
supported runtime environments.

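The relationship between an exercise and its assignment can be pictured as two
record types. The sketch below only mirrors the properties listed in the
paragraph above; all class and field names are hypothetical, and the `accepts`
method is an assumed helper that is not taken from CodEx.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, List, Optional, Tuple


@dataclass
class Limits:
    time_seconds: float
    memory_kib: int


@dataclass
class Exercise:
    text: str                   # description of the problem
    evaluation_config: str      # machine-readable evaluation instructions
    limits: Dict[str, Limits]   # limits per runtime, e.g. "c" or "java"
    score_config: str           # how the final score is calculated
    # Pairs of (input, reference output).
    tests: List[Tuple[str, str]] = field(default_factory=list)


@dataclass
class Assignment:
    exercise: Exercise
    deadline: datetime
    max_points: int
    max_submissions: int
    runtimes: List[str]
    second_deadline: Optional[datetime] = None

    def accepts(self, now, submissions_so_far, runtime):
        """Check deadline, submission count and runtime (assumed rules)."""
        last = self.second_deadline or self.deadline
        return (
            now <= last
            and submissions_so_far < self.max_submissions
            and runtime in self.runtimes
        )
```

Keeping the assignment separate from the exercise lets the same exercise be
assigned to several groups with different deadlines and point limits.
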
The typical use cases for the user roles are the following:

- **student**
  - create a new user account via a registration form
  - join groups (e.g., the courses they attend)
  - get assignments in the groups
  - submit a solution to an assignment -- upload one source file and start the
    evaluation process
  - view the results of the solution -- which parts succeeded and failed, the
    total number of the acquired points, bonus points
- **supervisor** (similar to CodEx *operator*)
  - create a new exercise -- create description text and evaluation
    configuration (for each programming environment), upload testing inputs and
    outputs
  - assign an exercise to a group -- choose an exercise and set the deadlines,
    the number of allowed submissions, the weights of all test cases and the
    amount of points for the correct solutions
  - modify an assignment
  - view all of the results of the students in a group
  - review the automatic solution evaluation -- view the submitted source files
    and optionally set bonus points (including negative points)
- **administrator**
  - create groups
  - alter user privileges -- make supervisor accounts
  - check system logs

### Exercise Evaluation Chain

The most important part of the system is the evaluation of solutions submitted by
the students. The process from the source code to final results (score) is
described in more detail below to give readers a solid overview of what is
happening during the evaluation process.

The first thing students have to do is to submit their solutions through the web
user interface. The system checks assignment invariants (e.g., deadlines, number
of submissions) and stores the submitted code. The runtime environment is
automatically detected based on the extension of the input file, and a suitable
evaluation configuration type is chosen (one exercise can have multiple
variants; for example, both C and Java may be allowed). This exercise
configuration is then used for the evaluation process.

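Detecting the runtime from the file extension amounts to a table lookup. The
sketch below assumes a hypothetical extension table and runtime names; the
actual mapping depends on the installation.

```python
import os

# Hypothetical mapping from file extension to runtime environment.
EXTENSION_TO_RUNTIME = {
    ".c": "c",
    ".cpp": "cpp",
    ".cs": "csharp",
    ".java": "java",
    ".pas": "pascal",
    ".py": "python",
    ".hs": "haskell",
}


def detect_runtime(filename, allowed):
    """Return the runtime for `filename`, restricted to `allowed` runtimes.

    Raises ValueError when the extension is unknown or the runtime is not
    allowed for the assignment.
    """
    extension = os.path.splitext(filename)[1].lower()
    runtime = EXTENSION_TO_RUNTIME.get(extension)
    if runtime is None or runtime not in allowed:
        raise ValueError(f"unsupported file type: {filename!r}")
    return runtime
```

Rejecting the file at this point means an unsupported submission never reaches
the evaluation queue.
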
There is a pool of uniform worker machines dedicated to evaluation jobs.
Incoming jobs are kept in a queue until a free worker picks them up. Workers
evaluate jobs sequentially, one at a time.

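The queue-and-workers arrangement can be sketched with threads standing in for
worker machines. This is an assumption made to keep the example runnable on one
computer; the names `run_workers` and `evaluate` are hypothetical.

```python
import queue
import threading


def run_workers(jobs, evaluate, worker_count=2):
    """Evaluate `jobs` with a pool of identical workers.

    `jobs` is a list of (job_id, payload) pairs; the result maps each
    job_id to `evaluate(payload)`.
    """
    pending = queue.Queue()
    for job in jobs:
        pending.put(job)

    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            try:
                job_id, payload = pending.get_nowait()
            except queue.Empty:
                return  # No more work for this worker.
            outcome = evaluate(payload)  # One job at a time per worker.
            with lock:
                results[job_id] = outcome

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

Because every worker takes jobs from the same queue, adding workers increases
throughput without changing how jobs are submitted.
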
The worker obtains the solution and its evaluation configuration, parses it and
starts executing the instructions it contains. Each job should have multiple
test cases which examine invalid inputs, corner cases and data of different
sizes to estimate the program complexity. It is crucial to keep the computer
running the worker secure and stable, so a sandboxed environment is used for
dealing with unknown source code. When the execution is finished, the results
are saved, and the student is notified.

The output of the worker contains data about the evaluation, such as time and
memory spent on running the program for each test input and whether its output
is correct. The system then calculates a numeric score from the data, and the
score is presented to the student. If the solution is incorrect (e.g., it
produces incorrect output or exceeds the memory or time limits), error messages
are also displayed to the student.

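One possible way to turn the worker output into a score is a weighted sum over
the test cases. The sketch below is an assumption for illustration: the weighted
formula, the message texts and the names `CaseResult` and `grade` are not taken
from CodEx.

```python
from dataclasses import dataclass


@dataclass
class CaseResult:
    name: str
    correct: bool
    time_seconds: float
    memory_kib: int


def grade(results, weights, max_points, time_limit, memory_limit_kib):
    """Return (points, messages) for one submission.

    A test case earns its weight only if it stays within both limits
    and its output is correct.
    """
    earned = 0.0
    messages = []
    for result in results:
        if result.time_seconds > time_limit:
            messages.append(f"{result.name}: time limit exceeded")
        elif result.memory_kib > memory_limit_kib:
            messages.append(f"{result.name}: memory limit exceeded")
        elif not result.correct:
            messages.append(f"{result.name}: incorrect output")
        else:
            earned += weights[result.name]
    total = sum(weights[result.name] for result in results)
    points = max_points * earned / total if total else 0.0
    return points, messages
```

With weights of 1, 2 and 1 and a maximum of 8 points, a submission that passes
only the first test case receives 8 * 1 / 4 = 2 points.
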
### Possible Improvements

The current system is old, but robust. There were no major security incidents
in the course of its usage. However, from the present-day perspective there are
several major drawbacks:

- **web interface** -- The web interface is simple and fully functional.
  However, the recent rapid development in web technologies provides us with new
  possibilities for building web interfaces.
- **public API** -- CodEx offers a very limited public XML API based on outdated
  technologies that are not sufficient for users who would like to create their
  custom interfaces such as a command line tool or a mobile application.
- **sandboxing** -- The MO-Eval sandbox is based on the principle of monitoring
  system calls and blocking the forbidden ones. This can be sufficient with
  single-threaded programs, but proves to be difficult with multi-threaded ones.
  Because parallelism is now a very important area of computing, multi-threaded
  programs must be securely testable as well.
- **instances** -- Different uses of CodEx require separate
  installations (e.g., Programming I and II, Java, C#). This configuration is
  not user friendly, as students have to register in each installation
  separately, and it burdens administrators with unnecessary work. The CodEx
  architecture does not allow sharing workers between installations, which
  results in an inefficient use of hardware for evaluation.
- **task extensibility** -- There is a need to test and evaluate complicated
  programs for courses such as *Parallel programming* or *Compiler principles*,
  which have a more complex evaluation chain than the simple
  *compilation/execution/evaluation* provided by CodEx.

## Requirements

There are many different formal requirements for the system. Some of them
are necessary for any system for source code evaluation, some of them are
specific for university deployment and some of them arose during the ten-year
lifetime of the old system. There are not many ways of improving the CodEx
experience from the perspective of a student, but a lot of feature requests come
from administrators and supervisors. The ideas were gathered mostly from our
personal experience with the system and from meetings with the faculty staff
who use the current system.

In general, CodEx features should be preserved, so only the differences are
presented here. For clarity, all the requirements and wishes are
grouped by user category.

### Requirements of the Users

- _group hierarchy_ -- creating an arbitrarily nested tree structure should be
  supported to keep related groups together, such as in the example
  below. CodEx supported only a flat group structure. A group hierarchy also
  allows archiving data from past courses.

  ```
  Summer term 2016
  |-- Language C# and .NET platform
  |   |-- Labs Monday 10:30
  |   `-- Labs Thursday 9:00
  |-- Programming I
  |   |-- Labs Monday 14:00
  ...
  ```

- _a database of exercises_ -- teachers should be able to filter the displayed
  exercises according to several criteria, for example by the supported runtime
  environments or by the author. It should also be possible to link exercises to
  a group so that group supervisors do not have to browse hundreds of exercises
  when their group only uses a few of them
- _advanced exercises_ -- the system should support a more advanced evaluation
  pipeline than the basic *compilation/execution/evaluation* offered by CodEx
- _customizable grading system_ -- teachers need to specify the way of
  calculating the final score which will be allocated to the submissions
  depending on their correctness and quality
- _marking a solution as accepted_ -- a supervisor should be able to choose
  one of the submitted solutions of a student as accepted. The score of this
  particular solution will be used as the score which the student receives
  for the given assignment instead of the one with the highest score.
- _solution resubmission_ -- teachers should be able to edit the solutions of the
  student and privately resubmit them, optionally saving all results (including
  temporary ones); this feature can be used to quickly fix obvious errors in the
  solution and see if it is otherwise correct
- _localization_ -- all texts (the UI and the assignments of the exercises) should
  be translatable into several languages
- _formatted texts of assignments_ -- Markdown or another lightweight markup
  language should be supported for the formatting of the texts of the exercises
- _comments_ -- adding both private and public comments to exercises, tests and
  solutions should be supported
- _plagiarism detection_

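The group hierarchy requirement from the list above maps naturally onto a tree
of nested groups. The following is only a sketch of that idea; the class `Group`
and its methods are hypothetical, and archiving is reduced to setting a flag on
a whole subtree.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Group:
    name: str
    children: List["Group"] = field(default_factory=list)
    archived: bool = False

    def add(self, name):
        """Create a nested group and return it."""
        child = Group(name)
        self.children.append(child)
        return child

    def archive(self):
        """Archive this group together with all nested groups."""
        self.archived = True
        for child in self.children:
            child.archive()

    def paths(self, prefix=""):
        """Yield the full path of this group and of every descendant."""
        path = f"{prefix}/{self.name}" if prefix else self.name
        yield path
        for child in self.children:
            yield from child.paths(path)
```

With such a structure, archiving a finished term is a single call on the
top-level group, for example `term.archive()`.
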
### Administrative Requirements

- _independent user interface_ -- the system should allow the use of an alternative
  user interface, such as a command line client; implementation of such clients
  should be as straightforward as possible
- _privilege separation_ -- there should be at least two roles -- _student_ and
  _supervisor_. The cases when a student of a course is also a teacher of another
  course must be handled correctly.
- _alternative authentication methods_ -- logging in through a university
  authentication system (e.g. LDAP) and potentially other services, such as
  GitHub or some other OAuth service, should be supported
- _querying SIS_ -- loading user data from the university information system (SIS)
  should be supported
- _sandboxing_ -- there should be a more advanced sandbox which supports
  execution of parallel programs and an easy integration of different programming
  environments and tools; the sandboxed environment should have the minimum
  possible impact on the measurement of results (most importantly on the measured
  duration of execution)
- _heterogeneous worker pool_ -- there must be support for submission evaluation
  in multiple programming environments in a single installation to avoid
  an unacceptable workload for the administrator (i.e., maintaining a separate
  installation for every course) and high hardware requirements
- advanced low-level evaluation flow configuration with a high-level abstraction
  layer for ordinary configuration cases; the configuration should be able to
  express more complicated flows than just compiling the source code and running
  the program against the test inputs -- for example, some exercises need to build
  the source code with a tool, run some tests, then run the program through
  another tool and perform additional tests
- use of modern technologies with state-of-the-art compilers

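One way to express the more complicated evaluation flows mentioned in the list
above is a set of tasks with dependencies, executed in topological order. The
sketch below is an assumption about how such a configuration could look, not the
configuration format of the system; the task names are hypothetical. It relies
on the `graphlib` module from the Python standard library (Python 3.9 or newer).

```python
from graphlib import TopologicalSorter


def execution_order(tasks):
    """Return task names in an order that respects their dependencies.

    `tasks` maps a task name to the list of tasks it depends on.
    Raises graphlib.CycleError if the dependencies are circular.
    """
    return list(TopologicalSorter(tasks).static_order())


# Hypothetical exercise: build, test, post-process, then test again.
pipeline = {
    "build": [],
    "unit-tests": ["build"],
    "post-process": ["unit-tests"],
    "extra-tests": ["post-process"],
}
```

A high-level abstraction layer for ordinary exercises could then simply
generate the common build, run and judge chain in this low-level form.
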
### Non-functional Requirements

- _no installation_ -- the primary user interface of the system must be
  accessible on the computers of the users without the need to install any
  additional software except for a web browser which is installed on almost
  every personal computer
- _performance_ -- the system must be ready for at least hundreds of students
  and tens of supervisors who are using it at the same time
- _automated deployment_ -- all of the components of the system must be easy to
  deploy in an automated fashion
- _open source licensing_ -- the source code should be released under a
  permissive licence allowing further development; this also applies to the used
  libraries and frameworks
- _multi-platform worker_ -- worker machines running Linux, Windows and
  potentially other operating systems must be supported

### Conclusion

The survey shows that there are a lot of different requirements and wishes for
the new system. When the system is ready, it is likely that there will be new
ideas on how to use it, and thus the system must be designed to be easily
extensible so that these new ideas can be implemented, either by us or by
community members. This also means that widely used programming languages and
techniques should be used so that programmers can quickly understand the code
and make changes easily.

## Related work

To find out the current state of the field, we conducted a short market survey
of automatic grading systems used at universities, programming contests, and
other places where similar tools are used.

This is not a complete list of available evaluators, but only a few projects
which are used these days and can serve as inspiration for our project. Each
project on the list is provided with a brief description and some of its key
features.

### Progtest

[Progtest](https://progtest.fit.cvut.cz/) is a private project of [FIT
ČVUT](https://fit.cvut.cz) in Prague. As far as we know, it is used for C/C++,
Bash programming, and knowledge-based quizzes. Each submitted solution can
receive several bonus points or penalties, and a few hints about what is
incorrect in the solution can also be attached. It is very strict about source
code quality: for example, the `-pedantic` option of GCC is used, Valgrind is
used for detection of memory leaks, and array boundaries are checked via the
`mudflap` library.

### Codility

[Codility](https://codility.com/) is a web-based solution primarily targeted at
company recruiters. It is a commercial product available as SaaS and it
supports 16 programming languages. The
[UI](http://1.bp.blogspot.com/-_isqWtuEvvY/U8_SbkUMP-I/AAAAAAAAAL0/Hup_amNYU2s/s1600/cui.png)
of Codility is [open source](https://github.com/Codility/cui), but the rest of
the source code is not available. One interesting feature is the 'task
timeline' -- the captured progress of writing the code for each user.

### CMS

[CMS](http://cms-dev.github.io/index.html) is an open source distributed system
for running and organizing programming contests. It is written in Python and
contains several modules. CMS supports the C/C++, Pascal, Python, PHP, and Java
programming languages. PostgreSQL is a single point of failure, since all
modules heavily depend on the database connection. Task evaluation can only be
a three-step pipeline -- compilation, execution, evaluation. Execution is
performed in [Isolate](https://github.com/ioi/isolate), a sandbox written by the
consultant of our project, Mgr. Martin Mareš, Ph.D.

### MOE

[MOE](http://www.ucw.cz/moe/) is a grading system written in Shell scripts, C
and Python. It does not provide a default GUI; all actions have to be
performed from the command line. The system does not evaluate submissions in
real time; results are computed in a batch mode after the exercise deadline,
using Isolate for sandboxing. Parts of MOE are used in other systems like CodEx
or CMS, but the system is obsolete in general.

### Kattis

[Kattis](http://www.kattis.com/) is another SaaS solution. It provides a clean
and functional web UI, but the rest of the application is too simple. A nice
feature is the usage of a [standardized
format](http://www.problemarchive.org/wiki/index.php/Problem_Format) for
exercises. Kattis is primarily used by programming contest organizers, company
recruiters and also some universities.

<!---
// vim: set formatoptions=tqn flp+=\\\|^\\*\\s* textwidth=80 colorcolumn=+1:
-->