# Introduction
In general, there are many different ways and opinions on how to teach people
something new. However, most people agree that a hands-on experience is one of
the best ways to make the human brain remember a new skill. Learning must be
entertaining and interactive, with fast and frequent feedback. Some areas
are more suitable for this practical way of learning than others, and
fortunately, programming is one of them.
The university education system is one of the areas where this knowledge can be
applied. In computer programming, there are several requirements a program
should satisfy, such as the code being syntactically correct, efficient and easy
to read, maintain and extend.
Checking programs written by students by hand takes time and requires a lot of
repetitive work -- reviewing the source code, compiling it and
running it through test scenarios. It is therefore desirable to automate as
much of this process as possible.
The first idea of an automatic evaluation system
came from Stanford University professors in 1965. They implemented a system
which evaluated Algol code submitted on punch cards. In the following years, many
similar products were written.
Nowadays, properties like correctness and efficiency can be tested automatically
to a large extent. This fact should be exploited to help teachers
save time for tasks such as examining bad design, bad coding habits, or logical
mistakes, which are difficult to perform automatically.
There are two basic ways of automatically evaluating code:
- **statically** -- by checking the source code without running it.
This is safe, but not very practical.
- **dynamically** -- by running the code on test inputs and checking the correctness of
its outputs. This provides a good real-world experience, but requires extensive
security measures.
This project focuses on the machine-controlled part of source code evaluation.
First, we reviewed the general concepts of grading systems and discussed the problems of the
software previously used at Charles University in Prague.
Then the new requirements were specified and we examined projects with similar functionality.
With the acquired knowledge from these projects, we set up
goals for the new evaluation system, designed the architecture and implemented a
fully operational solution based on dynamic evaluation. The system is now ready
for production testing at the university.
## Assignment
The major goal of this project is to create a grading application which will be
used for programming classes at the Faculty of Mathematics and Physics of
Charles University in Prague. However, the application should be designed in a
modular fashion so that it can be easily extended or even modified to support
other kinds of usage.
The system should be capable of performing a dynamic analysis of the submitted
source code. This consists of the following basic steps (a minimal sketch follows the list):
1. compile the code and check for compilation errors
2. run the compiled program in a sandbox with predefined inputs
3. check the constraints of the amount of used memory and time
4. compare the outputs of the program with the defined expected outputs
5. award the solution with a numeric score
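
To illustrate these five steps, here is a minimal Python sketch for a single C submission. The file names, limits, and the use of a bare `subprocess` with POSIX resource limits are our own simplifications; a real worker would run the program inside a proper sandbox.
```python
import resource
import subprocess

def evaluate(source="solution.c", tests=(("1 2\n", "3\n"),),
             time_limit=2.0, mem_limit_mb=64):
    """Run the five basic steps for one submitted C source file."""
    # 1. compile the code and check for compilation errors
    build = subprocess.run(["gcc", source, "-o", "solution"],
                           capture_output=True, text=True)
    if build.returncode != 0:
        return {"score": 0.0, "compilation_error": build.stderr}

    def limit_memory():
        # 3. constrain the amount of memory the program may use
        resource.setrlimit(resource.RLIMIT_AS,
                           (mem_limit_mb * 1024 * 1024,) * 2)

    passed = 0
    for stdin_data, expected in tests:
        try:
            # 2. run the compiled program with a predefined input
            #    (a real system would use a proper sandbox, not a bare subprocess)
            run = subprocess.run(["./solution"], input=stdin_data,
                                 capture_output=True, text=True,
                                 timeout=time_limit, preexec_fn=limit_memory)
        except subprocess.TimeoutExpired:
            continue  # 3. time limit exceeded -> this test fails
        # 4. compare the output with the expected output
        if run.returncode == 0 and run.stdout.strip() == expected.strip():
            passed += 1

    # 5. award the solution with a numeric score
    return {"score": passed / len(tests)}
```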
The whole system is intended to help both the teachers (supervisors) and the students.
To achieve this, it is crucial for us to keep in mind the typical usage scenarios of
the system and to try to make these tasks as simple as possible. In this regard,
the project has a great starting point -- there is an old grading system
currently used at the university (CodEx), so its flaws and weaknesses can be
addressed. Furthermore, many teachers desire to use and test the new system and
they are willing to discuss our ideas and problems with us during the development.
## Current System
The grading solution currently used at the Faculty of Mathematics and Physics of
Charles University in Prague was implemented in 2006 by a group of students.
It is called [CodEx -- The Code Examiner](http://codex.ms.mff.cuni.cz/project/)
and it has been used with some improvements since then. The original plan was to
use the system only for the basic programming courses, but there was a demand for
adapting it for several different courses.
CodEx is based on dynamic analysis. It features a web-based interface, where
supervisors can assign exercises to their students and the students have a time
window to submit their solutions. Each solution is compiled and run in a sandbox
(MO-Eval). The checked metrics are the correctness of the output and compliance with
the time and memory limits. It supports programs written in C, C++, C#, Java, Pascal,
Python and Haskell.
The system has a database of users. Each user is assigned a role, which
corresponds to his/her privileges. There are user groups reflecting the
structure of lectured courses.
A database of exercises (algorithmic problems) is another part of the project.
Each exercise consists of a text describing the problem, a configuration of the
evaluation (machine-readable instructions on how to evaluate solutions to the
exercise), time and memory limits for all supported runtimes (e.g. programming
languages), a configuration for calculating the final score and a set of inputs
and reference outputs. Exercises are created by privileged users who have been
instructed how to do so.
Assigning an exercise to a group means choosing one of the available exercises
and specifying additional properties: a deadline (optionally a second deadline),
a maximum amount of points, a maximum number of submissions and a list of
supported runtime environments.
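
For illustration, the following sketch shows what the data of an exercise and of an assignment described above might look like. The structure and all field names are hypothetical and do not mirror the actual CodEx schema.
```python
# Hypothetical data shapes; the field names are illustrative, not CodEx's actual schema.
exercise = {
    "text": "Sum two integers read from standard input.",
    "runtimes": {                       # per-runtime evaluation configuration and limits
        "c":    {"time_limit_s": 1.0, "memory_limit_kb": 65536},
        "java": {"time_limit_s": 2.0, "memory_limit_kb": 262144},
    },
    "tests": [                          # inputs and reference outputs
        {"input": "1 2\n", "expected": "3\n", "weight": 1},
        {"input": "-5 5\n", "expected": "0\n", "weight": 2},
    ],
    "scoring": "weighted_sum",          # configuration for calculating the final score
}

assignment = {                          # an exercise assigned to a particular group
    "exercise": exercise,
    "deadline": "2016-05-31 23:59",
    "second_deadline": None,            # optional
    "max_points": 10,
    "max_submissions": 20,
    "allowed_runtimes": ["c", "java"],
}
```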
The typical use cases for the user roles are the following:
- **student**
  - create a new user account via a registration form
  - join groups (e.g., the courses he attends)
  - get assignments in the groups
  - submit a solution to an assignment -- upload one source file and start the
    evaluation process
  - view the results of the solution -- which parts succeeded and failed, the total
    number of the acquired points, bonus points
- **supervisor** (similar to CodEx *operator*)
  - create a new exercise -- create description text and evaluation configuration
    (for each programming environment), upload testing inputs and outputs
  - assign an exercise to a group -- choose an exercise and set the deadlines,
    the number of allowed submissions, the weights of all test cases and the amount
    of points for the correct solutions
  - modify an assignment
  - view all of the results of the students in a group
  - review the automatic solution evaluation -- view the submitted source files
    and optionally set bonus points (including negative points)
- **administrator**
  - create groups
  - alter user privileges -- make supervisor accounts
  - check system logs
### Exercise Evaluation Chain
The most important part of the system is the evaluation of solutions submitted by
the students. The process from the source code to final results (score) is
described in more detail below to give readers a solid overview of what is happening
during the evaluation process.
The first thing students have to do is to submit their solutions through the web user
interface. The system checks assignment invariants (e.g., deadlines, number of
submissions) and stores the submitted code. The runtime environment is
automatically detected based on the extension of the submitted file, and a suitable
evaluation configuration variant is chosen (one exercise can have multiple variants,
for example both C and Java may be allowed). This exercise configuration is then used for
the evaluation process.
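
A simple way to implement such detection is a lookup table keyed by the file extension. The sketch below uses a hypothetical mapping which may not match the table actually used by CodEx.
```python
from pathlib import Path

# Hypothetical extension-to-runtime table; the mapping used by CodEx may differ.
RUNTIME_BY_EXTENSION = {
    ".c": "c", ".cpp": "cpp", ".cs": "csharp", ".java": "java",
    ".pas": "pascal", ".py": "python", ".hs": "haskell",
}

def detect_runtime(filename: str) -> str:
    """Pick the evaluation configuration variant from the submitted file's extension."""
    ext = Path(filename).suffix.lower()
    if ext not in RUNTIME_BY_EXTENSION:
        raise ValueError(f"No supported runtime for file {filename!r}")
    return RUNTIME_BY_EXTENSION[ext]

# e.g. detect_runtime("solution.java") -> "java"
```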
There is a pool of uniform worker machines dedicated to evaluation jobs.
Incoming jobs are kept in a queue until a free worker picks them up. Workers
evaluate jobs sequentially, one at a time.
The worker obtains the solution and its evaluation configuration, parses the
configuration and starts executing the contained instructions. Each job should have
several test cases which examine invalid inputs, corner cases and data of different
sizes to estimate the program complexity. It is crucial to keep the computer running
the worker secure and stable, so a sandboxed environment is used for dealing with
unknown source code. When the execution is finished, the results are saved and the
student is notified.
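
The worker behaviour described above can be summarised by the following sketch; the `queue`, `sandbox`, `storage` and `notifier` objects are hypothetical placeholders standing in for the real components.
```python
# Sketch of a worker's main loop; 'queue', 'sandbox', 'storage' and 'notifier'
# are hypothetical placeholder objects, not actual CodEx components.
def worker_loop(queue, sandbox, storage, notifier):
    while True:
        job = queue.take()                      # blocks until a job is available
        results = []
        for test in job.configuration.tests:    # corner cases, invalid inputs, sizes
            # untrusted code is only ever executed inside the sandbox
            outcome = sandbox.run(job.solution, test.input,
                                  time_limit=test.time_limit,
                                  memory_limit=test.memory_limit)
            results.append({"test": test.name,
                            "time": outcome.time,
                            "memory": outcome.memory,
                            "correct": outcome.stdout == test.expected_output})
        storage.save(job.id, results)           # results are saved ...
        notifier.notify(job.student, job.id)    # ... and the student is notified
```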
The output of the worker contains data about the evaluation, such as time and
memory spent on running the program for each test input and whether its output
is correct. The system then calculates a numeric score from this data, which is
presented to the student. If the solution is incorrect (e.g., it produces incorrect
output or exceeds the memory or time limits), error messages are also displayed to the student.
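
As an example of how such a score could be derived from per-test results, here is a hypothetical weighted-average calculation; the dictionary shape and the weighting scheme are our own, not the actual CodEx scoring configuration.
```python
def score(test_results):
    """Weighted fraction of passed tests; the dict shape is a hypothetical example."""
    total = sum(t["weight"] for t in test_results)
    passed = sum(t["weight"] for t in test_results if t["passed"])
    return passed / total if total else 0.0

# Three tests with weights 1, 2 and 2, of which only the first two pass -> 3/5 = 0.6
print(score([{"weight": 1, "passed": True},
             {"weight": 2, "passed": True},
             {"weight": 2, "passed": False}]))
```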
### Possible Improvements
The current system is old, but robust. There have been no major security incidents
in the course of its usage. However, from the present-day perspective there are
several major drawbacks:
- **web interface** -- The web interface is simple and fully functional.
However, the recent rapid development in web technologies opens new
possibilities for building more modern web interfaces.
- **public API** -- CodEx offers a very limited public XML API based on outdated
technologies that are not sufficient for users who would like to create their
custom interfaces such as a command line tool or a mobile application.
- **sandboxing** -- the MO-Eval sandbox is based on the principle of monitoring
system calls and blocking the forbidden ones. This can be sufficient with
single-threaded programs, but proves to be difficult with multi-threaded ones.
Nowadays, parallelism is a very important area of computing, so it is required that
multi-threaded programs can be securely tested as well.
- **instances** -- Different use cases of CodEx require separate
installations (e.g., Programming I and II, Java, C#). This configuration is
not user friendly, as students have to register in each installation separately,
and it burdens administrators with unnecessary work. The CodEx architecture does not
allow sharing workers between installations, which results in an inefficient
use of the evaluation hardware.
- **task extensibility** -- There is a need to test and evaluate complicated
programs for courses such as *Parallel programming* or *Compiler principles*,
which have a more complex evaluation chain than the simple
*compilation/execution/evaluation* provided by CodEx.
## Requirements
There are many different formal requirements for the system. Some of them
are necessary for any source code evaluation system, some are
specific to university deployment and some arose during the ten-year
lifetime of the old system. There are not many ways of improving the CodEx
experience from the perspective of a student, but a lot of feature requests come
from administrators and supervisors. The ideas were gathered mostly from our
personal experience with the system and from meetings with the faculty staff
who use the current system.
In general, CodEx features should be preserved, so only the differences are
presented here. For clarity, all the requirements and wishes are
grouped by user category.
### Requirements of the Users
- _group hierarchy_ -- creating an arbitrarily nested tree structure should be
supported to keep related groups together, such as in the example
below. CodEx supported only a flat group structure. A group hierarchy also
makes it possible to archive data from past courses.
```
Summer term 2016
|-- Language C# and .NET platform
| |-- Labs Monday 10:30
| `-- Labs Thursday 9:00
|-- Programming I
| |-- Labs Monday 14:00
...
```
- _a database of exercises_ -- teachers should be able to filter the displayed
exercises according to several criteria, for example by the supported runtime
environments or by the author. It should also be possible to link exercises to a group
so that group supervisors do not have to browse hundreds of exercises when
their group only uses a few of them
- _advanced exercises_ -- the system should support a more advanced evaluation
pipeline than the basic *compilation/execution/evaluation* used in CodEx
- _customizable grading system_ -- teachers need to specify the way of
calculating the final score which will be allocated to the submissions
depending on their correctness and quality
- _marking a solution as accepted_ -- a supervisor should be able to choose
one of the submitted solutions of a student as accepted. The score of this
particular solution will be used as the score which the student receives
for the given assignment instead of the one with the highest score.
- _solution resubmission_ -- teachers should be able to edit the solutions of the
student and privately resubmit them, optionally saving all results (including
temporary ones); this feature can be used to quickly fix obvious errors in the
solution and see if it is otherwise correct
- _localization_ -- all texts (the UI and the assignments of the exercises) should
be translatable into several languages
- _formatted texts of assignments_ -- Markdown or another lightweight markup language
should be supported for the formatting of the texts of the exercises
- _comments_ -- adding both private and public comments to exercises, tests and
solutions should be supported
- _plagiarism detection_
### Administrative Requirements
- _independent user interface_ -- the system should allow the use of an alternative
user interface, such as a command line client; implementation of such clients
should be as straightforward as possible
- _privilege separation_ -- there should be at least two roles -- _student_ and
_supervisor_. The cases when a student of a course is also a teacher of another
course must be handled correctly.
- _alternative authentication methods_ -- logging in through a university
authentication system (e.g. LDAP) and potentially other services, such as
GitHub or some other OAuth service, should be supported
- _querying SIS_ -- loading user data from the university information system (SIS)
should be supported
- _sandboxing_ -- there should be a more advanced sandbox which supports
execution of parallel programs and an easy integration of different programming
environments and tools; the sandboxed environment should have the minimum
possible impact on the measurement of results (most importantly on the measured
duration of execution)
- _heterogeneous worker pool_ -- there must be support for submission evaluation
in multiple programming environments in a single installation to avoid an
unacceptable workload for the administrator (i.e., maintaining a separate
installation for every course) and high hardware requirements
- advanced low-level evaluation flow configuration with a high-level abstraction
layer for ordinary configuration cases; the configuration should be able to
express more complicated flows than just compiling a source code and running
the program against the test inputs -- for example, some exercises need to build
the source code with a tool, run some tests, then run the program through
another tool and perform additional tests (a hypothetical sketch follows this list)
- use of modern technologies with state-of-the-art compilers
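
To illustrate what such a multi-step flow could look like, here is a hypothetical sketch of a staged job description and a driver that stops at the first failing stage. The stage names, commands and the `executor` callable are illustrative only.
```python
# Hypothetical multi-stage job description; the stage names and commands are illustrative.
pipeline = [
    {"stage": "build",      "run": ["make", "all"]},             # build with a tool
    {"stage": "unit-tests", "run": ["./run_unit_tests"]},        # run some tests
    {"stage": "execute",    "run": ["./program", "test01.in"]},  # run the program itself
    {"stage": "analyse",    "run": ["./checker", "test01.out"]}, # another tool, more tests
]

def run_pipeline(stages, executor):
    """Run the stages in order; 'executor' is any callable running a command in a sandbox."""
    for stage in stages:
        result = executor(stage["run"])
        if result.returncode != 0:      # a failed stage stops the whole flow
            return {"failed_stage": stage["stage"]}
    return {"failed_stage": None}
```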
### Non-functional Requirements
- _no installation_ -- the primary user interface of the system must be
accessible on the users' computers without the need to install any
additional software except for a web browser, which is installed on almost
every personal computer
- _performance_ -- the system must be ready for at least hundreds of students
and tens of supervisors who are using it at the same time
- _automated deployment_ -- all of the components of the system must be easy to
deploy in an automated fashion
- _open source licensing_ -- the source code should be released under a
permissive licence allowing further development; this also applies to the used
libraries and frameworks
- _multi-platform worker_ -- worker machines running Linux, Windows and
potentially other operating systems must be supported
### Conclusion
The survey shows that there are many different requirements and wishes for
the new system. When the system is ready, new ideas on how to use it are likely
to appear, so the system must be designed to be easily extensible, allowing these
ideas to be implemented either by us or by community members. This also means that
widely used programming languages and techniques should be used, so that programmers
can quickly understand the code and make changes easily.
## Related work
To find out the current state of the field, we conducted a short survey of
automatic grading systems used at universities, in programming contests, and in
other places where similar tools are used.
This is not a complete list of available evaluators, but only a few projects
which are in use these days and can serve as an inspiration for our project. Each
project on the list is provided with a brief description and some of its key features.
### Progtest
[Progtest](https://progtest.fit.cvut.cz/) is a private project of [FIT
ČVUT](https://fit.cvut.cz) in Prague. As far as we know, it is used for C/C++ and
Bash programming, and for knowledge-based quizzes. Each submitted solution can receive
several bonus points or penalties, and a few hints about what is incorrect in the
solution can also be attached. It is very strict about source code quality,
for example the `-pedantic` option of GCC is used; Valgrind is used for detection of
memory leaks; array boundaries are checked via the `mudflap` library.
### Codility
[Codility](https://codility.com/) is a web-based solution primarily targeted at
company recruiters. It is a commercial product available as SaaS and it
supports 16 programming languages. The
[UI](http://1.bp.blogspot.com/-_isqWtuEvvY/U8_SbkUMP-I/AAAAAAAAAL0/Hup_amNYU2s/s1600/cui.png)
of Codility is [open source](https://github.com/Codility/cui), but the rest of the source
code is not available. One interesting feature is the 'task timeline' -- the captured
progress of writing the code for each user.
### CMS
[CMS](http://cms-dev.github.io/index.html) is an open-source distributed system
for running and organizing programming contests. It is written in Python and
contains several modules. CMS supports C/C++, Pascal, Python, PHP, and Java
programming languages. PostgreSQL is a single point of failure, as all modules
heavily depend on the database connection. Task evaluation can only be a three-step
pipeline -- compilation, execution, evaluation. Execution is performed in
[Isolate](https://github.com/ioi/isolate), a sandbox written by the consultant
of our project, Mgr. Martin Mareš, Ph.D.
### MOE
[MOE](http://www.ucw.cz/moe/) is a grading system written in Shell scripts, C
and Python. It does not provide a default GUI; all actions have to be
performed from the command line. The system does not evaluate submissions in real
time; results are computed in batch mode after the exercise deadline, using Isolate
for sandboxing. Parts of MOE are used in other systems like CodEx or CMS, but
the system as a whole is obsolete.
### Kattis
[Kattis](http://www.kattis.com/) is another SaaS solution. It provides a clean
and functional web UI, but the rest of the application is too simple. A nice
feature is the usage of a [standardized
format](http://www.problemarchive.org/wiki/index.php/Problem_Format) for
exercises. Kattis is primarily used by programming contest organizers, company
recruiters and also some universities.
<!---
// vim: set formatoptions=tqn flp+=\\\|^\\*\\s* textwidth=80 colorcolumn=+1:
-->