diff --git a/Rewritten-docs.md b/Rewritten-docs.md
index 14f2776..3d5fbab 100644
--- a/Rewritten-docs.md
+++ b/Rewritten-docs.md
@@ -512,9 +512,95 @@ is implemented. The relative value is set in percents and is called threashold.
 
 @todo: explain why there is exercise and assignment division, what means what and how they are used
 
-@todo: extended execution pipeline (not just compilation/execution/evaluation) and why it is needed
+### Evaluation unit executed by ReCodEx
+
+One of the bigger requests for the new system is support for a complex
+configuration of the execution pipeline. The idea comes from the lecturers of
+the Compiler Principles class, who want to migrate their semi-manual evaluation
+process to CodEx. Unfortunately, CodEx is not capable of such a complicated
+exercise setup, and none of the evaluation systems we found can handle such a
+task, so a design from scratch is needed.
+
+There are two main approaches to designing a complex execution configuration.
+It can be composed of a small number of relatively big components, or of many
+small tasks. Big components are easy to write and the whole configuration stays
+reasonably small, but because the components are tailored to the current
+problems, such a design does not scale well for future use. This can be solved
+by introducing a small set of single-purposed tasks which can be composed
+together. The whole configuration then grows larger, but it adapts much better
+to new requirements and the individual tasks require less programming work. For
+a better user experience, configuration generators for common cases can be
+introduced.
+
+ReCodEx is meant to be continuously developed and used for many years, so the
+smaller tasks are the right choice. Observation of the CodEx system shows that
+only a few kinds of tasks are needed. In the extreme case, a single task would
+be enough -- execute a binary. However, for better portability of
+configurations across different systems, it is better to implement a reasonable
+subset of operations directly, without calling system-provided binaries. These
+operations -- copy a file, create a new directory, extract an archive and so
+on -- are altogether called internal tasks. Another benefit of implementing
+these tasks ourselves is guaranteed safety, so no sandbox needs to be used,
+unlike in the case of external tasks.
+
+For a job evaluation, the tasks need to be executed sequentially in a specified
+order. Running independent tasks in parallel is not an option, because exact
+time measurement requires a controlled environment on the target computer with
+as few interruptions by other processes as possible. Connecting the tasks into
+a directed acyclic graph (DAG) seems to cover all possible cases -- none of the
+authors, supervisors and involved faculty staff could think of a problem that
+cannot be decomposed into tasks connected in a DAG. The goal of the evaluation
+is to satisfy as many tasks as possible. During execution there are sometimes
+multiple choices for the next task; to control that, each task can have a
+priority, which is used as a secondary ordering criterion. For better
+understanding, here is a small example.
+
+![Task serialization](https://github.com/ReCodEx/wiki/raw/master/images/Assignment_overview.png)
+
+The _job root_ task is an imaginary single starting point of each job. When the
+_CompileA_ task is finished, the _RunAA_ task is started (or _RunAB_, but the
+choice has to be deterministic, given by the position in the configuration
+file -- tasks stated earlier are executed earlier). The task priorities
+guarantee that after the _CompileA_ task all of its dependent tasks are
+executed before the _CompileB_ task (they have a higher priority number). This
+is useful, for example, to control which files are present in the working
+directory at any given moment. To sum up, there are three ordering criteria:
+dependencies, then priorities and finally the position of the task in the
+configuration. Together they define an unambiguous linear ordering of all
+tasks.
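+
+A minimal sketch of how these three criteria could be combined follows. It is
+an illustration only, not the actual worker implementation; the `Task` record
+and the `linearize` helper are hypothetical, and the task names loosely follow
+the figure above.
+
+```python
+from dataclasses import dataclass, field
+
+@dataclass
+class Task:
+    name: str
+    priority: int = 1                     # higher number = runs earlier among ready tasks
+    dependencies: list = field(default_factory=list)
+
+def linearize(tasks):
+    """Return task names in execution order.
+
+    `tasks` is given in configuration-file order. A task becomes ready once
+    all of its dependencies are finished; among ready tasks a higher priority
+    wins, and the position in the configuration breaks the remaining ties.
+    """
+    done, order = set(), []
+    pending = dict(enumerate(tasks))      # remember the configuration position
+    while pending:
+        ready = [(pos, t) for pos, t in pending.items()
+                 if all(dep in done for dep in t.dependencies)]
+        if not ready:
+            raise ValueError("cycle detected -- the tasks do not form a DAG")
+        pos, task = min(ready, key=lambda pt: (-pt[1].priority, pt[0]))
+        order.append(task.name)
+        done.add(task.name)
+        del pending[pos]
+    return order
+
+tasks = [
+    Task("CompileA", priority=2),
+    Task("RunAA", priority=2, dependencies=["CompileA"]),
+    Task("RunAB", priority=2, dependencies=["CompileA"]),
+    Task("CompileB", priority=1),
+    Task("RunBA", priority=1, dependencies=["CompileB"]),
+]
+print(linearize(tasks))                   # CompileA, RunAA, RunAB, CompileB, RunBA
+```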
+
+For grading, several tasks are important. First, the tasks executing the
+submitted code need to be checked against the time and memory limits. Second,
+the outputs of the judging tasks need to be checked for correctness
+(represented by the return value or by data on the standard output), and these
+tasks should not fail on time or memory limits either. This division can be
+transparent for the backend, which executes every task the same way, but the
+frontend must know which tasks of the whole job are important and of what kind
+they are. It is reasonable to keep this piece of information alongside the
+tasks in the job configuration, so each task can carry a label describing its
+purpose. Unlabeled tasks have the internal type _inner_. There are four
+categories of tasks:
+
+- _initiation_ -- setting up the environment, compiling code, etc.; for users,
+  a failure means an error in their sources which prevents running them on the
+  examination data
+- _execution_ -- running the user code on the examination data; it must not
+  exceed the time and memory limits; for users, a failure means a wrong design,
+  slow data structures, etc.
+- _evaluation_ -- comparing the user outputs with the examination outputs; for
+  users, a failure means that the program does not compute the right results
+- _inner_ -- no special meaning for the frontend; technical tasks for fetching
+  and copying files, creating directories, etc.
+
+Each job is composed of multiple tasks of these types, which are semantically
+grouped into tests. A test can represent one set of examination data for the
+user code. Another task label can be used to mark this grouping. Each test must
+have exactly one _evaluation_ task (to show success or failure to the users)
+and an arbitrary number of tasks of the other types.
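+
+To illustrate, a job configuration could attach these labels to its tasks
+roughly as sketched below. The structure and key names are hypothetical (the
+concrete configuration format is not decided here); the sketch only shows one
+test ("A") with an initiation, an execution and exactly one evaluation task,
+while unlabeled tasks default to the _inner_ type.
+
+```python
+from collections import defaultdict
+
+# Illustrative structure only -- not the final configuration format.
+job_tasks = [
+    {"name": "fetch-input", "cmd": "fetch", "args": ["input.A.txt"]},   # inner
+    {"name": "CompileA", "test": "A", "type": "initiation",
+     "cmd": "/usr/bin/gcc", "args": ["-o", "solution", "solution.c"]},
+    {"name": "RunAA", "test": "A", "type": "execution",
+     "dependencies": ["CompileA"], "cmd": "./solution",
+     "limits": {"time": 2.0, "memory": 65536}},
+    {"name": "JudgeA", "test": "A", "type": "evaluation",
+     "dependencies": ["RunAA"], "cmd": "judge",
+     "args": ["expected.A.txt", "actual.A.txt"]},
+]
+
+def validate_tests(tasks):
+    """Check the constraint above: each test has exactly one evaluation task."""
+    by_test = defaultdict(list)
+    for task in tasks:
+        if "test" in task:
+            by_test[task["test"]].append(task.get("type", "inner"))
+    for test_id, types in by_test.items():
+        if types.count("evaluation") != 1:
+            raise ValueError(f"test {test_id!r} needs exactly one evaluation task")
+
+validate_tests(job_tasks)   # passes for the example above
+```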
+
+### Evaluation progress state
+
+Users surely want to know the progress state of their submitted solution; this
+kind of functionality comes particularly handy in long-running exercises.
+Thanks to progress reporting, users know immediately if anything goes wrong,
+not to mention the psychological effect of seeing that the whole system and its
+parts are alive and doing something. That is why this feature was considered
+from the beginning, but there are multiple ways to look at it.
+
+The very first idea is to report the progress based on "done" messages from
+compilation, execution and evaluation, which is what a lot of evaluation
+systems provide. This information is high-level enough for users, who can tell
+what is being executed right now: if the compilation fails, they know their
+solution does not compile; if the execution fails, there were problems with
+their program. The clarity of this kind of progress state is nice and
+understandable. However, as explained above, ReCodEx has a more general
+execution pipeline, so there can be several compilations or executions in one
+job. Moreover, the parts of the system which execute the users' solutions do
+not have to know precisely what they are executing at the moment, so this kind
+of information may be meaningless for them.
+
-@todo: progress state, how it can be done and displayed to user, why random messages
+That is why another form of the progress state was considered. As stated
+above, one of the best ways to ensure generality is to compose jobs of
+single-purpose tasks. These tasks can be anything -- an internal operation or
+the execution of an external, sandboxed program. Based on this, there is a very
+simple way to provide a general progress state which is independent of the task
+types: a job consists of a known number of tasks, so a state message can be
+sent after the execution of every task, which yields the percentage of the job
+completed so far. Admittedly, this is a rather plain and standard approach, but
+something more appealing to the users can be built on top of it.
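+
+A minimal sketch of this type-agnostic progress reporting might look as
+follows; the `execute_task` and `send_progress` callbacks are hypothetical
+placeholders for the actual execution and messaging machinery.
+
+```python
+def run_job(tasks, execute_task, send_progress):
+    """Run the tasks sequentially and emit an overall percentage after each.
+
+    The message does not depend on what a task does, so the executing part
+    of the system does not need to understand the task it just finished.
+    """
+    total = len(tasks)
+    for finished, task in enumerate(tasks, start=1):
+        execute_task(task)                # internal operation or sandboxed program
+        send_progress({
+            "task": task["name"],
+            "finished": finished,
+            "total": total,
+            "percent": round(100 * finished / total),
+        })
+
+# Wiring the reporter to a plain print would show 25 %, 50 %, 75 % and 100 %
+# for the four-task job_tasks list above:
+# run_job(job_tasks, execute_task=lambda task: None, send_progress=print)
+```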
+
+Displaying the progress to users can then be done in numerous ways. The
+percentage of completed tasks naturally suggests the simplest solution --
+showing the users just the percentage. ... todo done this
 
 @todo: how to display generally all outputs of executed programs to user (supervisor, student), what students can or cannot see and why
@@ -623,87 +709,6 @@ will be introduced separately and covered in more detail.
 The communication protocol between these two logical parts will be described as well.
 
-### Evaluation unit executed by ReCodEx
-
-One of the bigger requests for the new system is to support a complex
-configuration of execution pipeline. The idea comes from lecturers of Compiler
-principles class who want to migrate their semi-manual evaluation process to
-CodEx. Unfortunately, CodEx is not capable of such complicated exercise setup.
-None of evaluation systems we found can handle such task, so design from
-scratch is needed.
-
-There are two main approaches to design a complex execution configuration. It
-can be composed of small amount of relatively big components or much more small
-tasks. Big components are easy to write and whole configuration is reasonably
-small. The components are designed for current problems, so it is not scalable
-enough for pleasant future usage. This can be solved by introducing small set of
-single-purposed tasks which can be composed together. The whole configuration is
-then quite bigger, but with great adaptation ability for new conditions and also
-less amount of work programming them. For better user experience, configuration
-generators for some common cases can be introduced.
-
-ReCodEx target is to be continuously developed and used for many years, so the
-smaller tasks are the right choice. Observation of CodEx system shows that
-only a few tasks are needed. In extreme case, only one task is enough -- execute
-a binary. However, for better portability of configurations along different
-systems it is better to implement reasonable subset of operations directly
-without calling system provided binaries. These operations are copy file, create
-new directory, extract archive and so on, altogether called internal tasks.
-Another benefit from custom implementation of these tasks is guarantied safety,
-so no sandbox needs to be used as in external tasks case.
-
-For a job evaluation, the tasks needs to be executed sequentially in a specified
-order. The idea of running independent tasks in parallel is bad because exact
-time measurement needs controlled environment on target computer with
-minimization of interrupts by other processes. It seems that connecting tasks
-into directed acyclic graph (DAG) can handle all possible problem cases. None of
-the authors, supervisors and involved faculty staff can think of a problem that
-cannot be decomposed into tasks connected in a DAG. The goal of evaluation is
-to satisfy as many tasks as possible. During execution there are sometimes
-multiple choices of next task. To control that, each task can have a priority,
-which is used as a secondary ordering criterion. For better understanding, here
-is a small example.
-
-![Task serialization](https://github.com/ReCodEx/wiki/raw/master/images/Assignment_overview.png)
-
-The _job root_ task is imaginary single starting point of each job. When the
-_CompileA_ task is finished, the _RunAA_ task is started (or _RunAB_, but should
-be deterministic by position in configuration file -- tasks stated earlier
-should be executed earlier). The task priorities guaranties, that after
-_CompileA_ task all dependent tasks are executed before _CompileB_ task (they
-have higher priority number). For example this is useful to control which files
-are present in a working directory at every moment. To sum up, there are 3
-ordering criteria: dependencies, then priorities and finally position of task in
-configuration. Together, they define a unambiguous linear ordering of all tasks.
-
-For grading there are several important tasks. First, tasks executing submitted
-code need to be checked for time and memory limits. Second, outputs of judging
-tasks need to be checked for correctness (represented by return value or by data
-on standard output) and should not fail on time or memory limits. This division
-is transparent for backend, each task is executed the same way. But frontend
-must know which tasks from whole job are important and what is their kind. It is
-reasonable, to keep this piece of information alongside the tasks in job
-configuration, so each task can have a label about its purpose. Unlabeled tasks
-have an internal type _inner_. There are four categories of tasks:
-
-- _initiation_ -- setting up the environment, compiling code, etc.; for users
-  failure means error in their sources which are not compatible with running it
-  with examination data
-- _execution_ -- running the user code with examination data, must not exceed
-  time and memory limits; for users failure means wrong design, slow data
-  structures, etc.
-- _evaluation_ -- comparing user and examination outputs; for user failure means
-  that the program does not compute the right results
-- _inner_ -- no special meaning for frontend, technical tasks for fetching and
-  copying files, creating directories, etc.
-
-Each job is composed of multiple tasks of these types which are semantically
-grouped into tests. A test can represent one set of examination data for user
-code. To mark the grouping, another task label can be used. Each test must have
-exactly one _evaluation_ task (to show success or failure to users) and
-arbitrary number of tasks with other types.
-
-
 ## Implementation analysis
 
 When developing a project like ReCodEx there has to be some discussion over