First part of DB problems

master
Petr Stefan 8 years ago
parent 0fc0c1be8f
commit 8cdac3f704

@ -1 +0,0 @@
,petr,felicity,18.01.2017 13:51,file:///home/petr/.config/libreoffice/4;

@ -139,19 +139,21 @@ corresponds to his/her privileges. There are user groups reflecting the
structure of lectured courses. structure of lectured courses.
A database of exercises (algorithmic problems) is another part of the project. A database of exercises (algorithmic problems) is another part of the project.
Each exercise consists of a text describing the problem in multiple language Each exercise consists of a text describing the problem (optionally in two
variants, an evaluation configuration (machine-readable instructions on how to language variants -- Czech and English), an evaluation configuration
evaluate solutions to the exercise) and a set of inputs and reference outputs. (machine-readable instructions on how to evaluate solutions to the exercise) and
Exercises are created by instructed privileged users. Assigning an exercise to a a set of inputs and reference outputs. Exercises are created by instructed
group means choosing one of the available exercises and specifying additional privileged users. Assigning an exercise to a group means choosing one of the
properties: a deadline (optionally a second deadline), a maximum amount of available exercises and specifying additional properties: a deadline (optionally
points, a configuration for calculating the score, a maximum number of a second deadline), a maximum amount of points, a configuration for calculating
submissions, and a list of supported runtime environments (e.g. programming the score, a maximum number of submissions, and a list of supported runtime
languages) including specific time and memory limits for each one. environments (e.g. programming languages) including specific time and memory
limits for each one.
Typical use cases for supported user roles are following: Typical use cases for supported user roles are following:
- **student** - **student**
- create new user account via registration form
- join a group - join a group
- get assignments in group - get assignments in group
- submit solution to assignment -- upload one source file and trigger - submit solution to assignment -- upload one source file and trigger
@ -180,10 +182,10 @@ students. Concepts of consecutive steps from source code to final results
is described in more detail below to give readers solid overview of what have to is described in more detail below to give readers solid overview of what have to
happen during evaluation process. happen during evaluation process.
First thing users have to do is to submit their solutions through web user First thing students have to do is to submit their solutions through web user
interface. The system checks assignment invariants (deadlines, count of interface. The system checks assignment invariants (deadlines, count of
submissions, ...) and stores the submitted file. The runtime environment is submissions, ...) and stores the submitted code. The runtime environment is
automatically detected based on input file and a suitable evaluation automatically detected based on input file extension and a suitable evaluation
configuration variant is chosen (one exercise can have multiple variants, for configuration variant is chosen (one exercise can have multiple variants, for
example C and Java languages). This exercise configuration is then used for example C and Java languages). This exercise configuration is then used for
taking care of evaluation process. taking care of evaluation process.
@ -449,20 +451,18 @@ restarted.
At this point there is a clear idea how the new system will be used and what are At this point there is a clear idea how the new system will be used and what are
the major enhancements for future releases. With this in mind, the overall the major enhancements for future releases. With this in mind, the overall
architecture can be sketched. From the previous research, several goals are set architecture can be sketched. To sum up, here is a list of key features of the
up for the new project. They mostly reflect drawbacks of the current version of new system. They come from previous research of current system's drawbacks,
CodEx and some reasonable wishes of university users. Most notable features are reasonable wishes of university users and our major design choices.
following:
- modern HTML5 web frontend written in JavaScript using a suitable framework - modern HTML5 web frontend written in JavaScript using a suitable framework
- REST API implemented in PHP, communicating with database, evaluation backend - REST API communicating with database, evaluation backend and a file server
and a file server
- evaluation backend implemented as a distributed system on top of a message - evaluation backend implemented as a distributed system on top of a message
queue framework (ZeroMQ) with master-worker architecture queue framework with master-worker architecture
- multi-platform worker supporting Linux and Windows environment (latter - multi-platform worker supporting Linux and Windows environment (latter
without sandbox, no general purpose suitable tool available yet) without sandbox, no general purpose suitable tool available yet)
- evaluation procedure configured in a YAML file, compound of small tasks - evaluation procedure configured in a human readable text file, compound of
connected into an arbitrary oriented acyclic graph small tasks connected into an arbitrary oriented acyclic graph
The reasons supporting these decisions are explained in the rest of analysis The reasons supporting these decisions are explained in the rest of analysis
chapter. Also a lot of smaller design choices are mentioned including possible chapter. Also a lot of smaller design choices are mentioned including possible
@ -532,18 +532,18 @@ is implemented. The relative value is set in percents and is called threshold.
Our university has a few partner grammar schools. There was an idea that they Our university has a few partner grammar schools. There was an idea that they
could use CodEx for teaching informatics classes. To make the setup simple for could use CodEx for teaching informatics classes. To make the setup simple for
them, all the software and hardware would be provided by university and hosted them, all the software and hardware would be provided by the university as a
in their datacentre. However, CodEx were not prepared to support this kind of completely ready-to-use remote service. However, CodEx were not prepared to
usage and no one had time to manage a separate instance. With ReCodEx it is support this kind of usage and no one had time to manage a separate instance.
possible to offer hosted environment as a service to other subjects. The concept With ReCodEx it is possible to offer hosted environment as a service to other
we figured out is based on user and group separation inside the system. There subjects. The concept we figured out is based on user and group separation
are multiple _instances_ in the system, which means unit of separation. Each inside the system. There are multiple _instances_ in the system, which means
instance has own set of users and groups, exercises can be optionally shared. unit of separation. Each instance has own set of users and groups, exercises can
Evaluation backend is common for all instances. To keep track of active be optionally shared. Evaluation backend is common for all instances. To keep
instances and paying customers, each instance must have a valid _licence_ to track of active instances and paying customers, each instance must have a valid
allow users submit their solutions. licence is granted for defined period of _licence_ to allow users submit their solutions. licence is granted for defined
time and can be revoked in advance if the subject do not keep approved terms and period of time and can be revoked in advance if the subject do not keep approved
conditions. terms and conditions.
The main work for the system is to evaluate programming exercises. The exercise The main work for the system is to evaluate programming exercises. The exercise
is quite similar to homework assignment during school labs. When a homework is is quite similar to homework assignment during school labs. When a homework is
@ -561,36 +561,6 @@ for every assignment of the same exercise. This separation is natural for all
users, in CodEx it is implemented in similar way and no other considerable users, in CodEx it is implemented in similar way and no other considerable
solution was found. solution was found.
### Forgotten password
With authentication and some sort of dealing with passwords is related a problem
with forgotten credentials, especially passwords. People easily forget them and
there has to be some kind of mechanism to retrieve a new password or change the
old one. Problem is that it cannot be done in totally secure way, but we can at
least come quite close to it. First, there are absolutely not secure and
recommendable ways how to handle that, for example sending the old password
through email. A better, but still not secure solution is to generate a new one
and again send it through email. This solution was provided in CodEx, users had
to write an email to administrator, who generated a new password and sent it
back to the sender. This simple solution could be also automated, but
administrator had quite a big control over whole process. This might come in
handy if there could be some additional checkups for example, but on the other
hand it can be quite time consuming.
Probably the best solution which is often used and is fairly secure is
following. Let us consider only case in which all users have to fill their
email addresses into the system and these addresses are safely in the hands of
the right users. When user finds out that he/she does not remember a password,
he/she requests a password reset and fill in his/her unique identifier; it might
be email or unique nickname. Based on matched user account the system generates
unique access token and sends it to user via email address. This token should be
time limited and usable only once, so it cannot be misused. User then takes the
token or URL address which is provided in the email and go to the system's
appropriate section, where new password can be set. After that user can sign in
with his/her new password. As previously stated, this solution is quite safe and
user can handle it on its own, so administrator does not have to worry about it.
That is the main reason why this approach was chosen to be used.
### Evaluation unit executed by ReCodEx ### Evaluation unit executed by ReCodEx
One of the bigger requests for the new system is to support a complex One of the bigger requests for the new system is to support a complex
@ -623,14 +593,23 @@ so no sandbox needs to be used as in external tasks case.
For a job evaluation, the tasks needs to be executed sequentially in a specified For a job evaluation, the tasks needs to be executed sequentially in a specified
order. The idea of running independent tasks in parallel is bad because exact order. The idea of running independent tasks in parallel is bad because exact
time measurement needs controlled environment on target computer with time measurement needs controlled environment on target computer with
minimization of interrupts by other processes. It seems that connecting tasks minimization of interrupts by other processes. It would be possible to run tasks
into directed acyclic graph (DAG) can handle all possible problem cases. None of which do not need exact time measurement in parallel, but in this case a
the authors, supervisors and involved faculty staff can think of a problem that synchronization mechanism has to be developed to exclude parallelism for
cannot be decomposed into tasks connected in a DAG. The goal of evaluation is measured tasks. Usually, there are about four times more unmeasured tasks than
to satisfy as many tasks as possible. During execution there are sometimes tasks with time measurement, but measured tasks tend to be much longer. With
multiple choices of next task. To control that, each task can have a priority, [Amdahl's law](https://en.wikipedia.org/wiki/Amdahl's_law) in mind, the
which is used as a secondary ordering criterion. For better understanding, here parallelism seems not to provide a huge benefit in overall execution speed and
is a small example. brings troubles with synchronization. However, if there are speed issues,
this approach could be reconsidered.
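The Amdahl's law argument above can be checked with a few lines of arithmetic. The following is only a sketch under assumed numbers: the task durations (four unmeasured tasks of 0.1 s each and one measured task of 2 s) and the function name `amdahl_speedup` are illustrative and do not come from real ReCodEx measurements.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Upper bound on speedup when only `parallel_fraction` of the total
    run time can be spread over `workers` parallel executors."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / workers)


# Assumed job: four times more unmeasured tasks than measured ones,
# but the measured task is much longer (durations are hypothetical).
unmeasured_time = 4 * 0.1  # seconds, may run in parallel
measured_time = 1 * 2.0    # seconds, must run alone for exact timing
parallel_fraction = unmeasured_time / (unmeasured_time + measured_time)

print(f"parallelizable share of the job: {parallel_fraction:.2f}")
print(f"speedup with 4 workers: {amdahl_speedup(parallel_fraction, 4):.2f}")
```

With these assumed durations only about one sixth of the job can run in parallel, so even four workers shorten the whole job by roughly 14 %, which matches the conclusion that the gain does not justify the synchronization effort.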
It seems that connecting tasks into directed acyclic graph (DAG) can handle all
possible problem cases. None of the authors, supervisors and involved faculty
staff can think of a problem that cannot be decomposed into tasks connected in a
DAG. The goal of evaluation is to satisfy as many tasks as possible. During
execution there are sometimes multiple choices of next task. To control that,
each task can have a priority, which is used as a secondary ordering criterion.
For better understanding, here is a small example.
![Task serialization](https://github.com/ReCodEx/wiki/raw/master/images/Assignment_overview.png) ![Task serialization](https://github.com/ReCodEx/wiki/raw/master/images/Assignment_overview.png)
@ -639,20 +618,34 @@ _CompileA_ task is finished, the _RunAA_ task is started (or _RunAB_, but should
be deterministic by position in configuration file -- tasks stated earlier be deterministic by position in configuration file -- tasks stated earlier
should be executed earlier). The task priorities guarantee that after should be executed earlier). The task priorities guarantee that after
_CompileA_ task all dependent tasks are executed before _CompileB_ task (they _CompileA_ task all dependent tasks are executed before _CompileB_ task (they
have higher priority number). For example this is useful to control which files have higher priority number). To sum up, connection of tasks represents
are present in a working directory at every moment. To sum up, there are 3 dependencies and priorities can be used to order unrelated tasks and with this
ordering criteria: dependencies, then priorities and finally position of task in provide a total ordering of them. For well written jobs the priorities may not
configuration. Together, they define a unambiguous linear ordering of all tasks. be so useful, but they can help control execution order for example to avoid
a situation where each test of the job generates a large temporary file and a
valid execution order exists which keeps all the temporary files on disk at the
same time for later processing. A better approach is to finish execution of one
test, clean up the big temporary file and proceed with the following test. If
there is still an ambiguity in task ordering at this point, the tasks are
executed in the order of the input task configuration.
The total linear ordering of tasks could be achieved more easily by just
executing them in the order of the input configuration. But this structure
cannot handle well the cases when a task fails. There is no easy and nice way to
tell which task should be executed next. However, this issue can be solved with
graph-structured dependencies of the tasks. In a graph structure, it is clear
that all dependent tasks have to be skipped and execution continues with an
unrelated task. This is the main reason why the tasks are connected in a DAG.
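The ordering and failure handling described above can be sketched in a few lines of Python. This is a minimal illustration, not the ReCodEx worker implementation: the `Task` class, the functions `linearize` and `execute`, and the priority values in the usage example are all assumptions made for this sketch. Tasks become ready once their dependencies are ordered; among ready tasks the higher priority number wins and the position in the configuration breaks remaining ties.

```python
import heapq
from dataclasses import dataclass, field


@dataclass
class Task:
    name: str
    priority: int = 0                              # higher number runs earlier
    deps: list[str] = field(default_factory=list)  # names of prerequisite tasks


def linearize(tasks: list[Task]) -> list[str]:
    """Total ordering: dependencies, then priority, then configuration position."""
    position = {t.name: i for i, t in enumerate(tasks)}
    by_name = {t.name: t for t in tasks}
    waiting_for = {t.name: len(t.deps) for t in tasks}
    dependents = {t.name: [] for t in tasks}
    for t in tasks:
        for dep in t.deps:
            dependents[dep].append(t.name)

    # Min-heap keyed by (-priority, position) so the best candidate pops first.
    ready = [(-t.priority, position[t.name], t.name) for t in tasks if not t.deps]
    heapq.heapify(ready)
    order = []
    while ready:
        _, _, name = heapq.heappop(ready)
        order.append(name)
        for child in dependents[name]:
            waiting_for[child] -= 1
            if waiting_for[child] == 0:
                key = (-by_name[child].priority, position[child], child)
                heapq.heappush(ready, key)
    if len(order) != len(tasks):
        raise ValueError("dependency cycle detected")
    return order


def execute(tasks: list[Task], run) -> dict[str, str]:
    """Run tasks in linear order; skip every task whose dependency did not succeed."""
    by_name = {t.name: t for t in tasks}
    status = {}
    for name in linearize(tasks):
        if any(status[dep] != "ok" for dep in by_name[name].deps):
            status[name] = "skipped"
        else:
            status[name] = "ok" if run(name) else "failed"
    return status


# Hypothetical job loosely following the CompileA/RunAA naming used above.
job = [
    Task("CompileA", priority=1),
    Task("RunAA", priority=2, deps=["CompileA"]),
    Task("RunAB", priority=2, deps=["CompileA"]),
    Task("CompileB", priority=1),
    Task("RunBA", priority=2, deps=["CompileB"]),
]
print(linearize(job))
print(execute(job, run=lambda name: name != "CompileA"))
```

In the usage example the simulated failure of _CompileA_ causes both of its dependent tasks to be skipped, while the unrelated _CompileB_ branch still runs. A plain list in configuration order could not express this without extra bookkeeping.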
For grading there are several important tasks. First, tasks executing submitted For grading there are several important tasks. First, tasks executing submitted
code need to be checked for time and memory limits. Second, outputs of judging code need to be checked for time and memory limits. Second, outputs of judging
tasks need to be checked for correctness (represented by return value or by data tasks need to be checked for correctness (represented by return value or by data
on standard output) and should not fail on time or memory limits. This division on standard output) and should not fail. This division can be transparent for
can be transparent for backend, each task is executed the same way. But frontend backend, each task is executed the same way. But frontend must know which tasks
must know which tasks from whole job are important and what is their kind. It is from whole job are important and what is their kind. It is reasonable, to keep
reasonable, to keep this piece of information alongside the tasks in job this piece of information alongside the tasks in job configuration, so each task
configuration, so each task can have a label about its purpose. Unlabeled tasks can have a label about its purpose. Unlabeled tasks have an internal type
have an internal type _inner_. There are four categories of tasks: _inner_. There are four categories of tasks:
- _initiation_ -- setting up the environment, compiling code, etc.; for users - _initiation_ -- setting up the environment, compiling code, etc.; for users
failure means error in their sources which are not compatible with running it failure means error in their sources which are not compatible with running it
@ -724,29 +717,21 @@ what kind of reward for users solutions should be chosen.
At first let us focus on all kinds of outputs from executed programs within job. At first let us focus on all kinds of outputs from executed programs within job.
Out of discussion is that supervisors should be able to view almost all outputs Out of discussion is that supervisors should be able to view almost all outputs
from solutions if they choose them to be visible and recorded. This feature is from solutions if they choose them to be visible and recorded. This feature is
critical in debugging either whole exercises or users solutions. But should it critical in debugging either whole exercises or users solutions. Supervisor
be default behaviour to record every output? Absolutely not, supervisor should should have a choice to turn on preserving the data while the default behaviour
have a choice to turn it on, but discarding the outputs has to be the default is to discard them to keep a file base around whole ReCodEx system in sensible
option. Even without this functionality a file base around whole ReCodEx system limits.
can become quite large and on top of that outputs from executed programs can be
sometimes very extensive. Storing this amount of data is inefficient and More interesting question is if students should see the logs from execution of
unnecessary to most of the solutions. However, on supervisor request this their solution. The usual approach is to keep this information private because of
feature should be available. possibility of leaking input data. This may lead students to hack their
solutions to pass just the ReCodEx testing cases instead of properly solving the
More interesting question is what should regular users see from execution of assigned problem. Martin Mareš strongly recommended to use this strategy of
their solution. Simple answer is of course that they should not see anything hiding sensitive data too, so ReCodEx does. One exception are compilation
which is partly true. Outputs from their programs can be anything and users can outputs which can help students a lot during troubleshooting. These logs shall
somehow analyze inputs or even redirect them to output. So outputs from be visible unless the supervisor decides otherwise. Note, that due to lack of
execution should not be visible at all or under very special circumstances. But frontend developers, this feature was not implemented in the very first release
that is not so straightforward for compilation or other kinds of initiation, of ReCodEx, but will be definitely available in the future.
where it really depends on the particular case. Generally it is quite harmless
to display user some kind of compilation error which can help a lot during
troubleshooting. Of course again this kind of functionality should be
configurable by supervisors and disabled by default. There is also the last kind
of tasks which can output some information which is evaluation tasks. Output of
these tasks is somehow important to whole system and again can contain some
information about inputs or reference outputs. So outputs of evaluation tasks
should not be visible to regular users too.
The overall concept of grading solutions was presented earlier. To briefly The overall concept of grading solutions was presented earlier. To briefly
remind that, backend returns only exact measured values (used time and memory, remind that, backend returns only exact measured values (used time and memory,
@ -799,7 +784,7 @@ factor. There are several ways how to save structured data:
- relational database - relational database
Another important factor is amount and size of stored data. Our guess is about Another important factor is amount and size of stored data. Our guess is about
1000 users, 100 exercises, 200 assignments per year and 400000 unique solutions 1000 users, 100 exercises, 200 assignments per year and 200000 unique solutions
per year. The data are mostly structured and there are a lot of them with the per year. The data are mostly structured and there are a lot of them with the
same format. For example, there is a thousand of users and each one has the same same format. For example, there is a thousand of users and each one has the same
values -- name, email, age, etc. These kind of data are relatively small, name values -- name, email, age, etc. These kind of data are relatively small, name
@ -1449,8 +1434,8 @@ of connection with no message loss.
### API server ### API server
The API server must handle HTTP requests and manage the state of the application The API server must handle HTTP requests and manage the state of the application
in some kind of a database. It must also be able to communicate with the in some kind of a database. It must also be able to communicate with the backend
backend over ZeroMQ. over ZeroMQ.
We considered several technologies which could be used: We considered several technologies which could be used:
@ -1566,6 +1551,36 @@ including generating the signature and signature verification is done through a
widely used third-party library which lowers the risk of having a bug in the widely used third-party library which lowers the risk of having a bug in the
implementation of this critical security feature. implementation of this critical security feature.
#### Forgotten password
Authentication and password handling bring along the problem of forgotten
credentials, especially passwords. People easily forget them and there has to be
some kind of mechanism to retrieve a new password or change the old one. The
problem is that it cannot be done in a totally secure way, but we can at least
come quite close to it. First, there are ways which are neither secure nor
recommendable, for example sending the old password through email. A better, but
still not secure solution is to generate a new one and again send it through
email. This solution was provided in CodEx: users had to write an email to the
administrator, who generated a new password and sent it back to the sender. This
simple solution could also be automated, but the administrator had quite a big
control over the whole process. This might come in handy if some additional
checks are needed, but on the other hand it can be quite time consuming.
Probably the best solution, which is often used and is fairly secure, is the
following. Let us consider only the case in which all users have to fill their
email addresses into the system and these addresses are safely in the hands of
the right users. When a user finds out that he/she does not remember a password,
he/she requests a password reset and fills in his/her unique identifier; it
might be an email or a unique nickname. Based on the matched user account, the
system generates a unique access token and sends it to the user's email address.
This token should be time limited and usable only once, so it cannot be misused.
The user then takes the token or URL address provided in the email and goes to
the appropriate section of the system, where a new password can be set. After
that the user can sign in with his/her new password. As previously stated, this
solution is quite safe and users can handle it on their own, so the
administrator does not have to worry about it. That is the main reason why this
approach was chosen.
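The two required properties of the reset token (time limited, usable only once) can be illustrated with a small sketch. It is not the ReCodEx API code: the class `PasswordResetTokens`, its methods, the in-memory dictionary and the 15 minute lifetime are assumptions chosen for the example, and sending the email is left out entirely.

```python
import hashlib
import secrets
import time

TOKEN_LIFETIME_SECONDS = 15 * 60  # assumed validity window


class PasswordResetTokens:
    """In-memory store of pending reset tokens (hypothetical helper)."""

    def __init__(self, lifetime=TOKEN_LIFETIME_SECONDS, clock=time.time):
        self._lifetime = lifetime
        self._clock = clock
        self._pending = {}  # token digest -> (user id, expiry timestamp)

    @staticmethod
    def _digest(token: str) -> str:
        # Only a hash is stored, so a leaked store does not reveal usable tokens.
        return hashlib.sha256(token.encode("utf-8")).hexdigest()

    def issue(self, user_id: str) -> str:
        """Create a random token for the user; the caller emails it as a link."""
        token = secrets.token_urlsafe(32)
        expires_at = self._clock() + self._lifetime
        self._pending[self._digest(token)] = (user_id, expires_at)
        return token

    def redeem(self, token: str):
        """Return the user id for a valid token, otherwise None.

        The entry is removed on the first attempt, so a token works only once.
        """
        entry = self._pending.pop(self._digest(token), None)
        if entry is None:
            return None
        user_id, expires_at = entry
        if self._clock() > expires_at:
            return None
        return user_id


tokens = PasswordResetTokens()
token = tokens.issue("student42")
print(tokens.redeem(token))  # first use identifies the user
print(tokens.redeem(token))  # second use is rejected
```

After a successful `redeem` call the application would let the identified user set a new password. A real deployment would keep the pending tokens in the database, or encode the expiry into a signed token instead of storing it.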
#### Uploading files #### Uploading files
There are two cases when users need to upload files using the API -- submitting There are two cases when users need to upload files using the API -- submitting
