# Analysis
None of the existing projects we came across meets all the features requested
by the new system. There is no grading system which supports an arbitrary-length
evaluation pipeline, so we have to implement this feature ourselves.
No existing solution is extensible enough to be used as a base for the new system.
After considering all these
facts, a new system has to be written from scratch. This
implies that only a subset of all the features will be implemented in the first
version, and more of them will come in the following releases.
The requested features are categorized based on priorities for the whole system. The
highest priority is the functionality present in the current CodEx. It is the
baseline for being useful in a production environment. The design of the new
solution should allow the system to be extended easily. The ideas from faculty staff
have lower priority, but most of them will be implemented as part of the project.
The most complicated tasks from this category are an advanced low-level evaluation
configuration format, use of modern tools, connection to a university system, and
combining the currently separate instances into one installation of the system.
Other tasks are scheduled
for the next releases after the first version of the project is completed.
Namely, these are a high-level exercise evaluation configuration with a
user-friendly UI, SIS integration (when a public API becomes available for
the system), and a command-line submission tool. Plagiarism detection is not
likely to be part of any release in the near future unless someone else implements a
sufficiently capable and extendable solution -- this problem is too complex to be
solved as a part of this project.
We named the new project **ReCodEx -- ReCodEx Code Examiner**. The name
refers to the old CodEx; the **Re** prefix stands for redesigned, rewritten,
renewed, or restarted.
At this point there is a clear idea of how the new system will be used and what
the major enhancements for future releases are. With this in mind, it is possible
to sketch the overall architecture. To sum up, here is a list of the key features of
the new system. They come from the preceding research of the drawbacks of the current
system, reasonable wishes of university users, and our major design choices:
- modern HTML5 web frontend written in JavaScript using a suitable framework
- REST API communicating with a persistent database, evaluation backend, and a file server
- evaluation backend implemented as a distributed system on top of a messaging
framework with a master-worker architecture
- multi-platform worker supporting Linux and Windows environments (the latter without
a sandbox, as no suitable general-purpose tool is available yet)
- evaluation procedure configured in a human readable text file, consisting of
small tasks forming an arbitrary oriented acyclic dependency graph
## Basic Concepts
The requirements specify that the user
interface must be accessible to students without the need to install additional
software. This immediately implies that users have to be connected to the
Internet. Nowadays, there are two main ways
of designing graphical user interfaces -- as a native application or a web page.
Creating a user-friendly and multi-platform application with graphical UI
is almost impossible because of the large number of different operating systems.
These applications typically require installation or at least downloading
their files (source code or binaries). On the other hand, distributing a web
application is easier, because every personal computer has an internet browser
installed. Browsers support a (mostly) unified and standardized
environment of HTML5 and JavaScript. CodEx is also a web application and
everybody seems to be satisfied with this fact. There are other communicating channels
most programmers use, such as e-mail or git, but they are inappropriate for
designing user interfaces on top of them.
It is clear from the assignment of the project
that the system has to keep personalized data of the users. User data cannot
be publicly available, which implies the necessity of user authentication.
The application also has to support
multiple ways of authentication (e.g., university authentication systems, a company
LDAP server, an OAuth server), and permit adding more security measures in the
future, such as two-factor authentication.
Each user has a specific role in the system. From the assignment it is required to
have at least two such roles, _student_ and _supervisor_. However, it is advisable
to add an _administrator_ level for users who take care of the system as a whole and are
responsible for the setup, monitoring, or updates. The student role has the
minimum access rights, basically a student can only view assignments and submit solutions.
Supervisors have more authority, so they can create exercises and assignments, and
view results of their students. Based on the organization of the university, one
more level could be introduced, _course guarantor_. However, in practice all
duties related to teaching labs are already associated with supervisors,
so this role does not seem useful. In addition, no one requested more than a
three-level privilege scheme.
School labs are lessons for groups of students led by supervisors. All students in a
lab have the same homework and supervisors evaluate their solutions. This
arrangement has to be transferred into the new system. The groups in the system
correspond to the real-life labs. This concept was already discussed in the
previous chapter including the need for a hierarchical structure of the groups.
To allow restriction of group members in ReCodEx, there are two types of groups
-- _public_ and _private_. Public groups are open for all registered users, but
to become a member of a private group, one of its supervisors has to add the
user to the group. This could be done automatically at the beginning of a term with data
from the information system, but unfortunately there is no API for this yet.
However, creating this API is now being considered by university staff.
Supervisors using CodEx in their labs usually set a minimum amount of points
required to get a credit. These points can be acquired by solving assigned
exercises. To show users whether they already have enough points, ReCodEx also
supports setting this limit for the groups. There are two equivalent ways of setting
a limit -- an absolute number of points or a percentage of the total possible
number of points. We decided to implement the latter and we call it the threshold.
Our university has a few partners among grammar schools. There was an idea that they
could use CodEx for teaching IT classes. To simplify the setup for
them, all the software and hardware would be provided by the university as a
SaaS. However, CodEx is not prepared for this
kind of usage and no one has the time to manage another separate instance. With
ReCodEx it is possible to offer a hosted environment as a service to other
institutions.
The system is divided into multiple separate units called _instances_.
Each instance has its own set of users and groups. Exercises can be optionally
shared. The rest of the system (the API server and the evaluation backend) is shared
between the instances. To keep track of the active instances and allow access
to the infrastructure to other, paying, customers, each instance must have a
valid _licence_ to allow its users to submit their solutions.
Each licence is granted for a specific period of time and can be revoked in advance
if the institution does not conform to the approved terms and conditions.
The problems the students solve are broken down into two parts in the system:
- the problem itself (an _exercise_),
- and its _assignment_.
Exercises only describe the problem and provide testing data with the description
of how to evaluate them. In fact, these are templates for the assignments. A particular
assignment then contains data from the exercise and some additional metadata, which can be
different for every assignment of the same exercise (e.g., the deadline, maximum number
of points).
### Evaluation Unit Executed by ReCodEx
One of the bigger requests for the new system is to support a complex
configuration of the execution pipeline. The idea comes from the lecturers of the
Compiler Principles class who want to migrate their semi-manual evaluation process
to CodEx. Unfortunately, CodEx is not capable of such a complicated exercise setup.
None of the evaluation systems we found can handle such a task, so a design from
scratch is needed.
There are two main approaches to designing a complex execution configuration. It
can be composed of a small number of relatively big components, or of many more
small tasks. Big components are easy to write and help keep the configuration
reasonably small. However, such components are designed for current problems
and they might not hold up well against future requirements. This can be solved by
introducing a small set of single-purpose tasks which can be composed together.
The whole configuration becomes bigger, but more flexible for new conditions.
Moreover, the tasks will not require as much programming effort as bigger evaluation
units. For a better user experience, configuration generators for some common
cases can be introduced.
A goal of ReCodEx is to be continuously developed and used for many years.
Therefore, we chose to use smaller tasks, because this approach is better for
future extensibility. Observation of the CodEx system shows that only a few kinds
of tasks are needed. In an extreme case, only one task is enough -- execute a binary.
However, for better portability of configurations between different systems it
is better to implement a reasonable subset of operations ourselves instead of
directly calling binaries provided by the system. These operations are copying a
file, creating a new directory, extracting an archive and so on, altogether called
internal tasks. Another benefit of a custom implementation of these tasks is
guaranteed safety, so no sandbox needs to be used as in the case of external tasks.
For a job evaluation, the tasks need to be executed sequentially in a specified
order. Running independent tasks is possible, but there are complications --
exact time measurement requires a controlled environment with as few
interruptions as possible from other processes. It would be possible to run
tasks that do not need exact time measurement in parallel, but in this case a
synchronization mechanism has to be developed to exclude parallelism for
measured tasks. Usually, there are about four times more unmeasured tasks than
tasks with time measurement, but measured tasks tend to be much longer. With
[Amdahl's law](https://en.wikipedia.org/wiki/Amdahl's_law) in mind, the
parallelism does not seem to provide a notable benefit in overall execution
speed and brings trouble with synchronization. Moreover, most of the internal
tasks are also limited by IO speed (most notably copying and downloading files
and reading archives). However, if there are performance issues, this approach
could be reconsidered, along with using a RAM disk for storing supplementary
files.
It seems that connecting tasks into a directed acyclic graph (DAG) can handle all
possible problem cases. None of the authors, supervisors and involved faculty
staff can think of a problem that cannot be decomposed into tasks connected in a
DAG. The goal of the evaluation is to satisfy as many tasks as possible. During
execution there are sometimes multiple choices for the next task. To control this,
each task can have a priority, which is used as a secondary ordering criterion.
For better understanding, here is a small example.
![Task serialization](https://github.com/ReCodEx/wiki/raw/master/images/Assignment_overview.png)
The _job root_ task is an imaginary single starting point of each job. When the
_CompileA_ task is finished, the _RunAA_ task is started (or _RunAB_, but the
choice should be deterministic by position in the configuration file -- tasks
stated earlier should be executed earlier). The task priorities guarantee that
after the _CompileA_ task all dependent tasks are executed before the _CompileB_
task (they have a higher priority number). To sum up, connections between tasks
represent dependencies, and priorities can be used to order unrelated tasks and
thus provide a total ordering of them. For well written jobs the priorities may
not be very useful, but they can help control the execution order, for example to
avoid a situation where each test of the job generates a large temporary file and
the only valid execution order keeps all the temporary files around at one time
for later processing. A better approach is to finish the execution of one test,
clean up its big temporary file and proceed with the following test. If there is
an ambiguity in the task ordering at this point, the tasks are executed in the
order of the input configuration.
A total linear ordering of the tasks could be achieved more easily by just
executing them in the order of the input configuration. However, this structure
cannot handle task failures very well -- there is no easy way of telling which
task should be executed next. This issue can be solved with graph-structured
dependencies of the tasks. In a graph structure, it is clear that all dependent
tasks have to be skipped and execution must be resumed with an unrelated task.
This is the main reason why the tasks are connected in a DAG.
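As a rough illustration of this ordering rule (dependencies first, then priority,
then position in the configuration file), here is a minimal sketch; the task names
come from the example above, while the data structures and the function are purely
illustrative and not taken from the actual worker implementation.

```python
# Minimal sketch: linearize a task DAG, using dependencies as the primary
# constraint, a higher priority number as the secondary criterion and the
# position in the configuration file as the final tie-breaker.
from dataclasses import dataclass, field

@dataclass
class Task:
    name: str
    priority: int = 1
    dependencies: list = field(default_factory=list)  # names of prerequisite tasks

def linearize(tasks):
    order_in_config = {t.name: i for i, t in enumerate(tasks)}
    by_name = {t.name: t for t in tasks}
    remaining = {t.name: set(t.dependencies) for t in tasks}
    result = []
    while remaining:
        ready = [name for name, deps in remaining.items() if not deps]
        if not ready:
            raise ValueError("cycle detected -- the tasks do not form a DAG")
        # higher priority first, then the position in the configuration file
        ready.sort(key=lambda name: (-by_name[name].priority, order_in_config[name]))
        chosen = ready[0]
        result.append(chosen)
        del remaining[chosen]
        for deps in remaining.values():
            deps.discard(chosen)
    return result

tasks = [
    Task("CompileA", priority=2),
    Task("RunAA", priority=2, dependencies=["CompileA"]),
    Task("RunAB", priority=2, dependencies=["CompileA"]),
    Task("CompileB", priority=1),
    Task("RunBA", priority=1, dependencies=["CompileB"]),
]
print(linearize(tasks))  # ['CompileA', 'RunAA', 'RunAB', 'CompileB', 'RunBA']
```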
For grading there are several important tasks. First, tasks executing submitted
code need to be checked against time and memory limits. Second, the outputs of
judging tasks need to be checked for correctness (represented by a return value or
by data on the standard output) and should not fail. This division can be
transparent for the backend, where each task is executed the same way. But the
frontend must know which tasks of the whole job are important and of what kind
they are. It is reasonable to keep this piece of information alongside the tasks
in the job configuration, so each task can have a label describing its purpose.
Unlabeled tasks have the internal type _inner_. There are four categories of tasks:
- _initiation_ -- setting up the environment, compiling code, etc.; for users,
a failure means an error in their sources which prevents running them with the
examination data
- _execution_ -- running the user code with the examination data, must not exceed
time and memory limits; for users, a failure means a wrong design, slow data
structures, etc.
- _evaluation_ -- comparing the user and examination outputs; for users, a failure
means that the program does not compute the right results
- _inner_ -- no special meaning for the frontend; technical tasks for fetching and
copying files, creating directories, etc.
Each job is composed of multiple tasks of these types which are semantically
grouped into tests. A test can represent one set of examination data for user
code. To mark the grouping, another task label can be used. Each test must have
exactly one _evaluation_ task (to show success or failure to users) and an
arbitrary number of tasks of other types.
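The following sketch illustrates this constraint; the `test-id` and `type` keys are
illustrative names for the grouping and category labels described above, not
necessarily the keys used in the real configuration.

```python
# Illustrative check: every test group must contain exactly one evaluation task.
from collections import Counter

def tests_are_well_formed(tasks):
    evaluations = Counter(
        t["test-id"] for t in tasks
        if "test-id" in t and t.get("type") == "evaluation"
    )
    tests = {t["test-id"] for t in tasks if "test-id" in t}
    return all(evaluations[test] == 1 for test in tests)

job = [
    {"name": "fetch-input", "type": "inner"},                    # no test label
    {"name": "compile", "test-id": "A", "type": "initiation"},
    {"name": "run", "test-id": "A", "type": "execution"},
    {"name": "judge", "test-id": "A", "type": "evaluation"},
]
print(tests_are_well_formed(job))  # True
```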
### Evaluation Progress State
Users want to know the state of their submitted solution (whether it is waiting
in a queue, compiling, etc.). The very first idea would be to report a state
based on "done" messages from compilation, execution and evaluation like many
evaluation systems are already providing. However ReCodEx has a more complicated
execution pipeline where there can be more compilation or execution tasks per
test and also other internal tasks that control the job execution flow.
The users do not know the technical details of the evaluation and data about
completion of tasks may confuse them. A solution is to show users only the overall
percentage of completion of the job, without any additional information about task
types. This solution works well for all of the jobs and is very user friendly.
It is possible to expand upon this by adding a special "send progress message"
task to the job configuration that would mark the completion of a specific part
of the evaluation. However, the benefits of this feature are not worth the
effort of implementing it and unnecessarily complicating the job configuration
files.
### Results of Evaluation
The evaluation data have to be processed and then presented in a human readable
form. This is done through a single numeric value called points. Also, the results
of the job tests should be available so that users know what kind of error is in
the solution. For deeper debugging, the outputs of tasks could optionally be made
available to the users.
#### Scoring and Assigning Points
The overall concept of grading solutions was presented earlier. To briefly
recapitulate, the backend returns only exact measured values (used time and memory,
return code of the judging task, ...) and on top of these a single value is
computed. The way of computing it can differ greatly between supervisors, so it has
to be easily extendable. The best way is to provide an interface which can be
implemented, and any sort of magic behind it can return the final value.
We identified several computational possibilities: the basic arithmetic, weighted
arithmetic, geometric or harmonic mean of the results of each test (the result
being a logical value succeeded/failed, optionally with a weight), some kind of
interpolation of the amount of time used for each test, the same for the amount of
memory used, and surely many others. To keep the project simple, we decided to
design an appropriate interface and implement only the weighted arithmetic mean
computation, which is used in about 90% of all assignments. Of course, a different
scheme can be chosen for every assignment and it can also be configured -- for
example by specifying the test weights for the implemented weighted arithmetic
mean. Advanced ways of computation can be implemented when there is real demand
for them.
To avoid assigning points for insufficient solutions (like a program only printing
"File error", which happens to be the valid answer in two tests), a minimal point
threshold can be specified. If the solution would get fewer points than specified,
it gets zero points instead. This functionality could be embedded into the grading
algorithm itself, but it would have to be present in each implementation
separately, which is not maintainable. Because of this, the threshold feature is
separated from the score computation.
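A minimal sketch of the chosen scheme follows; it assumes each test yields a
boolean result and a supervisor-defined weight, which matches the description above
but not necessarily the exact interface of the implementation.

```python
# Weighted arithmetic mean of boolean test results plus the point threshold.
def weighted_score(results, weights):
    """Return a score in [0, 1]; results maps test names to passed/failed."""
    total = sum(weights[name] for name in results)
    passed = sum(weights[name] for name, ok in results.items() if ok)
    return passed / total if total else 0.0

def assign_points(score, max_points, threshold):
    """Scale the score to points; below the threshold the solution gets zero."""
    return round(score * max_points) if score >= threshold else 0

results = {"test-1": True, "test-2": True, "test-3": False}
weights = {"test-1": 1, "test-2": 2, "test-3": 2}
score = weighted_score(results, weights)                    # (1 + 2) / 5 = 0.6
print(assign_points(score, max_points=10, threshold=0.5))   # 6
```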
Automatic grading cannot reflect all aspects of submitted code, for example the
structure of the code, the number and quality of comments, and so on. To allow
supervisors to bring these manually checked aspects into the grading, there is a
concept of bonus points, which can be positive or negative. Generally, the solution
with the most assigned points is used for grading that particular assignment.
However, if the supervisor is not satisfied with a student solution (really bad
code, cheating, ...), he or she assigns the student negative bonus points. To
prevent this decision from being overridden by the system choosing another solution
with more points, or by the student submitting the same code again so that it
evaluates to more points, the supervisor can mark a particular solution, and the
marked solution is then used for grading instead of the solution with the most
points.
#### Evaluation Outputs
In addition to the exact measured values used for the score calculation described
in the previous chapter, there are also text or binary outputs of the executed
tasks. Knowing them helps users identify and solve their potential issues, but on
the other hand it opens the possibility of leaking input data. This may lead
students to tailor their solutions to pass just the ReCodEx test cases instead
of properly solving the assigned problem. The usual approach is to keep this
information private. This was also strongly recommended by Martin Mareš, who has
experience with several programming contests.
The only exception to hiding the logs are the compilation outputs, which can help
students a lot during troubleshooting, and there is only a small possibility of
input data leakage there. The supervisors have access to all of the logs and they
can decide whether students are allowed to see the compilation outputs.
Note that due to a lack of frontend developers, showing compilation logs to the
students is not implemented in the very first release of ReCodEx.
### Persistence
Previous parts of the analysis show that the system has to keep some state. This
could be user settings, group membership, evaluated assignments and so on. The
data have to be kept across restarts, so persistence is an important decision
factor. There are several ways to save structured data:
- plain files
- NoSQL database
- relational database
Another important factor is the amount and size of the stored data. Our guess is
about 1000 users, 100 exercises, 200 assignments per year and 20000 unique
solutions per year. The data are mostly structured and a lot of the records share
the same format. For example, there are a thousand users and each one has the same
attributes -- name, email, age, etc. These data items are relatively small; the
name and email are short strings, the age is an integer. Considering this,
relational databases or formatted plain files (CSV for example) fit them best.
However, the data often have to support searching, so they have to be sorted and
allow random access for resolving cross references. Also, addition and deletion of
entries should take reasonable time (at most logarithmic time complexity in the
number of saved values). This practically excludes plain files, so we decided to
use a relational database.
On the other hand, there is data with basically no structure and much larger
size. These can be evaluation logs, sample input files for exercises or sources
submitted by students. Saving this kind of data into a relational database is
not appropriate. It is better to keep them as ordinary files or store them in
some kind of NoSQL database. Since they are already files and do not need to be
backed up in multiple copies, it is easier to keep them as ordinary files in the
filesystem. Also, this solution is more lightweight and does not require
additional dependencies on third-party software. Files can be identified using
their filesystem paths or a unique index stored as a value in a relational
database. Both approaches are equally good; the final decision depends on the
actual implementation.
## Structure of The Project
The ReCodEx project is divided into two logical parts -- the *backend* and the
*frontend* -- which interact with each other and together cover the whole area
of code examination. Both of these logical parts are independent of each other in
the sense that they can be installed on separate machines at different locations
and that one of the parts can be replaced with a different implementation; as long
as the communication protocols are preserved, the system will continue working as
expected.
### Backend
Backend is the part which is responsible solely for the process of evaluating
a solution of an exercise. Each evaluation of a solution is referred to as a
*job*. For each job, the system expects a configuration document of the job,
supplementary files for the exercise (e.g., test inputs, expected outputs,
predefined header files), and the solution of the exercise (typically source
codes created by a student). There might be some specific requirements for the
job, such as a specific runtime environment, a specific version of a compiler, or
a requirement that the job be evaluated on a processor with a specific number of
cores. The
backend infrastructure decides whether it will accept a job or decline it based
on the specified requirements. In case it accepts the job, it will be placed in
a queue and it will be processed as soon as possible.
The backend publishes the progress of processing of the queued jobs and the
results of the evaluations can be queried after the job processing is finished.
The backend produces a log of the evaluation which can be used for further score
calculation or debugging.
To make the backend scalable, two components are necessary -- one which executes
jobs and another which distributes jobs to the instances of the first one. This
ensures scalability by means of parallel execution of numerous jobs, which is
exactly what is needed. The implementations of these services are called
**broker** and **worker**; the first one handles distribution, the other one
execution.
These components should be enough to fulfill all the tasks mentioned above, but
for the sake of simplicity and better communication with the frontend, two other
components were added as gateways -- the **fileserver** and the **monitor**. The
fileserver is a simple component whose purpose is to store files which are
exchanged between the frontend and the backend. The monitor is also quite a simple
service which is able to forward job progress data from the workers to the web
application. These two additional services sit at the border between the frontend
and the backend (like gateways), but logically they are more connected with the
backend, so we consider them to belong there.
### Frontend
The frontend, on the other hand, is responsible for providing users with
convenient access to the backend infrastructure and for interpreting the raw
evaluation data from the backend.
There are two main purposes of the frontend -- holding the state of the whole
system (database of users, exercises, solutions, points, etc.) and presenting
the state to users through some kind of a user interface (e.g., a web
application, mobile application, or a command-line tool). According to
contemporary trends in development of frontend parts of applications, we decided
to split the frontend in two logical parts -- a server side and a client side.
The server side is responsible for managing the state and the client side gives
instructions to the server side based on the inputs from the user. This
decoupling gives us the ability to create multiple client side tools which may
address different needs of the users.
The frontend developed as part of this project is a web application created with
the needs of the Faculty of Mathematics and Physics of Charles University in
Prague in mind. The users are the students and their teachers, groups correspond
to the different courses, the teachers are the supervisors of these groups. This
model is applicable to the needs of other universities, schools, and IT
companies, which can use the same system for their needs. It is also possible to
develop a custom frontend with its own user management system and use the
capabilities of the backend without any changes.
### Possible Connection
One possible configuration of the ReCodEx system is illustrated in the following
picture, where there is one shared backend with three workers and two separate
instances of the whole frontend. This configuration may be suitable for MFF UK --
the basic programming course and the KSP competition. However, a setup that shares
the web API and fileserver and differs only in custom client instances (the web
application or a custom implementation) is perhaps more likely to be used. Note
that the connections between the components are not fully accurate.
![Overall architecture](https://github.com/ReCodEx/wiki/blob/master/images/Overall_Architecture.png)
In the following parts of the documentation, both the backend and frontend parts
will be introduced separately and covered in more detail. The communication
protocol between these two logical parts will be described as well.
## Implementation Analysis
Some of the most important implementation problems or interesting observations
will be discussed in this chapter.
### Communication Between the Backend Components
The overall design of the project is discussed above. There is a bunch of
components, each with its own responsibility. An important thing to design is the
communication between these components. To choose a suitable protocol, there are
some additional requirements that should be met:
- reliability -- if a message is sent between components, the protocol has to
ensure that it is received by the target component
- working over the IP protocol
- multi-platform and multi-language usage
The TCP/IP protocol meets these conditions, however it is quite low level and
working with it usually requires a platform dependent, non-object-oriented API.
A common way to address these concerns is to use a framework which provides a
better abstraction and a more suitable API. We decided to go this way, so the
following options were considered:
- CORBA (or some other form of RPC) -- CORBA is a well known framework for
remote procedure calls. There are multiple implementations for almost every
known programming language. It fits nicely into object oriented programming
environment.
- RabbitMQ -- RabbitMQ is a messaging framework written in Erlang. It features a
message broker, to which nodes connect and declare the message queues they
work with. It is also capable of routing requests, which could be a useful
feature for job load-balancing. Bindings exist for a large number of languages
and there is a large community supporting the project.
- ZeroMQ -- ZeroMQ is another messaging framework, which is different from
RabbitMQ and others (such as ActiveMQ) because it features a "brokerless
design". This means there is no need to launch a message broker service to
which clients have to connect -- ZeroMQ based clients are capable of
communicating directly. However, it only provides an interface for passing
messages (basically vectors of 255B strings) and any additional features such
as load balancing or acknowledgement schemes have to be implemented on top of
this. The ZeroMQ library is written in C++ with a huge number of bindings.
CORBA is a large framework that would satisfy all our needs, but we are aiming
towards a more loosely-coupled system, and asynchronous messaging seems better
for this approach than RPC. Moreover, we rarely need to receive replies to our
requests immediately.
RabbitMQ seems well suited for many use cases, but implementing a job routing
mechanism between heterogeneous workers would be complicated -- we would probably
have to create a separate load balancing service, which cancels the advantage of
a message broker already being provided by the framework. It is also written in
Erlang, which nobody from our team understands.
ZeroMQ is the best option for us, even with the drawback of having to implement
a load balancer ourselves (which could also be seen as a benefit and there is a
notable chance we would have to do the same with RabbitMQ). It also gives us
complete control over the transmitted messages and communication patterns.
However, any of the three options would have been usable.
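To give a feel for the chosen style of communication, here is a small pyzmq sketch
of a broker-like ROUTER socket talking to a worker-like DEALER socket; the port
number, identity and message contents are made up for the example and do not
describe the actual ReCodEx protocol.

```python
# Brokerless messaging with ZeroMQ: multipart messages of byte strings; the
# ROUTER side learns peer identities from incoming traffic.
import zmq

ctx = zmq.Context.instance()

broker = ctx.socket(zmq.ROUTER)
broker.bind("tcp://127.0.0.1:9657")

worker = ctx.socket(zmq.DEALER)
worker.setsockopt(zmq.IDENTITY, b"worker-1")
worker.connect("tcp://127.0.0.1:9657")

# the worker announces itself and its capabilities as plain frames
worker.send_multipart([b"init", b"env=c", b"threads=4"])

identity, command, *headers = broker.recv_multipart()
print(identity, command, headers)      # b'worker-1' b'init' [b'env=c', b'threads=4']

# the broker can now address the worker directly by its identity
broker.send_multipart([identity, b"eval", b"job-42"])
print(worker.recv_multipart())         # [b'eval', b'job-42']
```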
### File Transfers
There has to be a way to access files stored on the fileserver (and also upload
them) from both the worker and frontend server machines. The protocol used for this
should handle large files efficiently and be resilient to network failures.
Security features are not a primary concern, because all communication with the
fileserver will happen in an internal network. However, a basic form of
authentication can be useful to ensure correct configuration (if a development
fileserver uses different credentials than production, production workers will
not be able to use it by accident). Lastly, the protocol must have a client
library for platforms (languages) used in the backend. We will present some of
the possible options:
- HTTP(S) -- a de-facto standard for web communication that has far more
features than just file transfers. Thanks to being used on the web, a large
effort has been put into the development of its servers. It supports
authentication and it can handle short-term network failures (thanks to being
built on TCP and supporting resuming interrupted transfers). We will use HTTP
for communication with clients, so there is no added cost in maintaining a
server. HTTP requests can be made using libcurl.
- FTP -- an old protocol designed only for transferring files. It has all the
required features, but doesn't offer anything over HTTP. It is also supported
by libcurl.
- SFTP -- a file transfer protocol most frequently used as a subsystem of the
SSH protocol implementations. It doesn't provide authentication, but it
supports working with large files and resuming failed transfers. The libcurl
library supports SFTP.
- A network-shared file system (such as NFS) -- an obvious advantage of a
network-shared file system is that applications can work with remote files the
same way they would with local files. However, it brings an overhead for the
administrator, who has to configure access to this filesystem for every
machine that needs to access the storage.
- A custom protocol over ZeroMQ -- it is possible to design a custom file
transfer protocol that uses ZeroMQ for sending data, but it is not a trivial
task -- we would have to find a way to transfer large files efficiently,
implement an acknowledgement scheme and support resuming transfers. Using
ZeroMQ as the underlying layer does not help a lot with this. The sole
advantage of this is that the backend components would not need another
library for communication.
We chose HTTPS because it is widely used and client libraries exist in all
relevant environments. In addition, it is highly probable we will have to run an
HTTP server, because it is intended for ReCodEx to have a web frontend.
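A short sketch of how a backend component might talk to the fileserver over
HTTP(S); the URL layout, credentials and the use of the Python `requests` library
are all illustrative assumptions (the backend itself can use libcurl, as noted
above).

```python
# Upload a result archive and download a supplementary file over HTTP(S),
# streaming the download so that large files are not held in memory.
import requests

FILESERVER = "https://fileserver.example.org"   # hypothetical address
AUTH = ("worker", "secret")                     # basic authentication

with open("result.zip", "rb") as f:
    r = requests.put(f"{FILESERVER}/results/job-42.zip", data=f, auth=AUTH)
    r.raise_for_status()

with requests.get(f"{FILESERVER}/exercises/input.txt", auth=AUTH, stream=True) as r:
    r.raise_for_status()
    with open("input.txt", "wb") as out:
        for chunk in r.iter_content(chunk_size=1 << 16):
            out.write(chunk)
```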
### Frontend - Backend Communication
Our choices when considering how clients will communicate with the backend have
to stem from the fact that ReCodEx should primarily be a web application. This
rules out ZeroMQ -- while it is very useful for asynchronous communication
between backend components, it is practically impossible to use it from a web
browser. There are several other options:
- *WebSockets* -- The WebSocket standard is built on top of TCP. It enables a
web browser to connect to a server over a TCP socket. WebSockets are
implemented in recent versions of all modern web browsers and there are
libraries for several programming languages like Python or JavaScript (running
in Node.js). Encryption of the communication over a WebSocket is supported as
a standard.
- *HTTP protocol* -- The HTTP protocol is a stateless protocol implemented on
top of the TCP protocol. The communication between the client and the server
consists of requests sent by the client and responses to these requests sent
back by the server. The client can send as many requests as needed and it may
ignore the responses from the server, but the server may only respond to the
requests of the client and it cannot initiate communication on its own.
End-to-end encryption can be achieved easily using SSL (HTTPS).
We chose the HTTP(S) protocol because of the simple implementation in all sorts
of operating systems and runtime environments on both the client and the server
side.
The API of the server should expose basic CRUD (Create, Read, Update, Delete)
operations. There are several options for what kind of messages to send over
HTTP:
- SOAP -- a protocol for exchanging XML messages. It is very robust and complex.
- REST -- is a stateless architecture style, not a protocol or a technology. It
relies on HTTP (but not necessarily) and its method verbs (e.g., GET, POST,
PUT, DELETE). It can fully implement the CRUD operations.
Even though there are some other technologies, we chose the REST style over the
HTTP protocol. It is widely used, there are many tools available for development
and testing, and it is well understood by programmers, so it should be easy for a
new developer with some experience in client-side applications to get to know the
ReCodEx API and develop a client application.
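As an illustration of how CRUD operations map onto REST over HTTP, consider the
following client-side sketch; the endpoint paths are hypothetical and do not
describe the real ReCodEx API.

```python
# CRUD operations expressed as HTTP verbs against a REST API.
import requests

API = "https://recodex.example.org/api/v1"      # hypothetical base URL

requests.post(f"{API}/exercises", json={"name": "Hello world"})   # Create
requests.get(f"{API}/exercises/42")                               # Read
requests.put(f"{API}/exercises/42", json={"name": "Hello"})       # Update
requests.delete(f"{API}/exercises/42")                            # Delete
```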
A high level view of the chosen communication protocols in ReCodEx can be seen in
the following image. Red arrows mark connections through ZeroMQ sockets, blue
arrows mark WebSocket communication and green arrows connect nodes that communicate
through
HTTP(S).
![Communication schema](https://github.com/ReCodEx/wiki/raw/master/images/Backend_Connections.png)
### Job Configuration File
As discussed previously in 'Evaluation Unit Executed by ReCodEx', an evaluation
unit has the form of a job which contains small tasks, each representing one piece
of work executed by the worker. This implies that jobs have to be passed from the
frontend to the backend. The best option for this is to use some kind of
configuration file which describes the job details. The configuration file is
specified in the frontend, and in the backend (namely in the worker) it is parsed
and executed.
There are many formats which can be used for representing the configuration. The
ones considered are:
- *XML* -- a broadly used general markup language complemented by document type
definitions (DTD) which can express and check the XML file structure, so it does
not have to be checked within the application. However, XML with its tags can
sometimes be quite 'chatty' and verbose, which is not desirable, and overall XML
with all its features and properties can be a bit heavyweight.
- *JSON* -- a notation which was developed to represent JavaScript objects. As
such it is quite simple; only key-value structures, arrays and primitive values
can be expressed. The structure and hierarchy of the data is expressed by braces
and brackets.
- *INI* -- a very simple configuration format which is able to represent only
key-value structures grouped into sections. This is not enough to represent a job
and its task hierarchy.
- *YAML* -- a format very similar to JSON in its capabilities, with the small
difference that the structure and hierarchy of the configuration is expressed not
with braces but with indentation. This makes YAML easily readable by both humans
and machines.
- *a specific format* -- a newly created format used just for the job
configuration. The obvious drawback is the absence of existing parsers, which
would have to be written from scratch.
Given the previous list of different formats, we decided to use YAML. There are
existing parsers for most programming languages and it is easy enough to learn
and understand. Another choice which would make sense is JSON, but in the end
YAML seemed to be the better fit.
Job configuration including design and implementation notes is described in 'Job
configuration' appendix.
#### Task Types
From the low-level point of view there are only two types of tasks in a job. The
first ones perform some internal operation which should work the same way on all
platforms and operating systems. The second type are external tasks which execute
an external binary.
Internal tasks should handle at least these operations:
- *fetch* -- fetch single file from fileserver
- *copy* -- copy file between directories
- *remove* -- remove single file or folder
- *extract* -- extract files from downloaded archive
These internal operations are essential, but many more can eventually be
implemented.
External tasks executing an external binary should optionally be runnable in a
sandbox, but for security's sake there is no reason to execute them outside of a
sandbox. So all external tasks are executed within a general, configurable
sandbox. The configuration options for sandboxes are called limits, and they can
specify for example time or memory limits.
#### Configuration File Content
The content of the configuration file can be divided into two parts; the first
concerns the job in general and its metadata, the second relates to the tasks and
their specification.
There is not much to express in the general job metadata. There can be an
identification of the job and some general options, like enabling or disabling
logging. But a really necessary item is the address of the fileserver from which
supplementary files are downloaded. This option is crucial because there can be
multiple fileservers and the worker has no other way to figure out where the files
might be.
A more interesting situation arises with the metadata of the tasks. From the
initial analysis of the evaluation unit and its structure, at least the following
generally needed items are derived:
- *task identification* -- an identifier used at least for specifying
dependencies
- *type* -- as described before, one of: 'initiation', 'execution', 'evaluation'
or 'inner'
- *priority* -- the priority can additionally control the execution flow in the
task graph
- *dependencies* -- a necessary item for arranging the tasks into a DAG
- *execution command* -- the command which should be executed within this task,
with possible parameters
The previous list of items is applicable to both internal and external tasks.
Internal tasks do not need any more items, but external ones do. The additional
items are exclusively related to sandboxing and limits (a configuration sketch
follows the list):
- *sandbox name* -- there should be a possibility to have multiple sandboxes, so
an identification of the right one is needed
- *limits* -- hardware and software resource limits
    - *time limit* -- limits the execution time
    - *memory limit* -- the maximum amount of memory which can be consumed by the
      external program
    - *I/O operations* -- limits concerning disk operations
    - *restrict filesystem* -- restricts or enables access to directories
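To make the listed items more tangible, here is a configuration fragment loaded
with PyYAML; the key names are assumptions for illustration only -- the
authoritative format is defined in the 'Job configuration' appendix.

```python
# An illustrative job configuration: general metadata plus one internal and
# one external (sandboxed) task with limits.
import yaml  # PyYAML

config = yaml.safe_load("""
submission:
    job-id: example-job
    file-collector: http://fileserver.example.org    # where 'fetch' tasks download from
tasks:
    - task-id: fetch-input
      type: inner
      priority: 2
      cmd:
          bin: fetch
          args: ["input.txt"]
    - task-id: run-solution
      type: execution
      priority: 1
      dependencies: [fetch-input]
      cmd:
          bin: ./solution
      sandbox:
          name: isolate
          limits:
              - time: 2         # seconds
                memory: 65536   # kilobytes
""")
print(config["tasks"][1]["sandbox"]["limits"][0]["memory"])  # 65536
```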
#### Supplementary Files
An interesting problem arises with supplementary files (e.g., inputs, sample
outputs). There are two main approaches: supplementary files can be downloaded
either at the start of the execution or during the execution.
If the files are downloaded at the beginning, the execution has not really started
at this point, and thus if there are network problems, the worker finds out right
away and can abort the execution without executing a single task. Slight problems
can arise if some of the files need to have a specific name (e.g. the solution
assumes that the input is `input.txt`). In this scenario the downloaded files
cannot be renamed at the beginning, only during the execution, which is
impractical and not easily observed by the authors of job configurations.
The second solution, where files are downloaded on the fly, has quite the opposite
problem. If there are network problems, the worker finds out during the execution,
for instance when almost the whole execution is done. This is also not an ideal
solution if we care about wasted hardware resources. On the other hand, with this
approach users have finer control of the execution flow and know exactly which
files are available during the execution, which is probably more appealing from
the users' perspective than the first solution. Based on that, downloading
supplementary files using 'fetch' tasks during the execution was chosen and
implemented.
#### Job Variables
Considering the fact that jobs can be executed by workers on different machines
with specific settings, it can be handy to have some kind of mechanism in the job
configuration which hides these particular worker details, most notably the
specific directory structure. For this purpose, placeholders in the form of
broadly used variables can be used.
Variables in general can be used everywhere where configuration values (not keys)
are expected. This implies that the substitution should be done after parsing the
job configuration, not before. The only usage of variables which was considered is
for directories within the worker, but in the future this might be subject to
change.
The final form of a variable is `${...}`, where the triple dot stands for its
textual name. This format was chosen because of the special dollar sign character,
which is not commonly used within paths on regular filesystems. The braces only
delimit the textual name of the variable.
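A tiny sketch of such a substitution follows; the variable names are examples of
worker-specific directories and the regular expression is illustrative, not the
actual worker implementation.

```python
# Replace ${NAME} placeholders in configuration values (never in keys).
import re

variables = {
    "SOURCE_DIR": "/var/recodex/eval/worker-1/job-42",
    "RESULT_DIR": "/var/recodex/results/worker-1/job-42",
}

def substitute(value, variables):
    return re.sub(r"\$\{([^}]+)\}", lambda m: variables[m.group(1)], value)

print(substitute("${SOURCE_DIR}/solution.c", variables))
# /var/recodex/eval/worker-1/job-42/solution.c
```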
### Broker
The broker is responsible for keeping track of available workers and
distributing jobs that it receives from the frontend between them.
#### Worker Management
It is intended for the broker to be a fixed part of the backend infrastructure
to which workers connect at will. Thanks to this design, workers can be added
and removed when necessary (and possibly in an automated fashion), without
changing the configuration of the broker. An alternative solution would be
configuring a list of workers before startup, thus making them passive in the
communication (in the sense that they just wait for incoming jobs instead of
connecting to the broker). However, this approach comes with a notable
administration overhead -- in addition to starting a worker, the administrator
would have to update the worker list.
Worker management must also take into account the possibility of worker
disconnection, either because of a network or software failure (or termination).
A common way to detect such events in distributed systems is to periodically
send short messages to other nodes and expect a response. When these messages
stop arriving, we presume that the other node encountered a failure. Both the
broker and workers can be made responsible for initiating these exchanges and it
seems that there are no differences stemming from this choice. We decided that
the workers will be the active party that initiates the exchange.
#### Scheduling
Jobs should be scheduled in a way that ensures that they will be processed
without unnecessary waiting. This depends on the fairness of the scheduling
algorithm (no worker machine should be overloaded).
The design of such a scheduling algorithm is complicated by the requirements on
the diversity of workers -- they can differ in operating systems, available
software, computing power and many other aspects.
We decided to keep the details of connected workers hidden from the frontend,
which should lead to a better separation of responsibilities and flexibility.
Therefore, the frontend needs a way of communicating its requirements on the
machine that processes a job without knowing anything about the available
workers. A key-value structure is suitable for representing such requirements.
With respect to these constraints, and because the analysis and design of a more
sophisticated solution was declared out of scope of our project assignment, a
rather simple scheduling algorithm was chosen. The broker shall maintain a queue
of available workers. When assigning a job, it traverses this queue and chooses
the first machine that matches the requirements of the job. This machine is then
moved to the end of the queue.
The presented algorithm results in a simple round-robin load balancing strategy,
which should be sufficient for small-scale deployments (such as a single
university). However, with a large amount of jobs, some workers will easily
become overloaded. The implementation must allow for a simple replacement of the
load balancing strategy so that this problem can be solved in the near future.
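The following sketch captures the described strategy, assuming both worker
capabilities and job requirements are plain key-value headers; the header names
are made up for the example.

```python
# Round-robin matching: pick the first worker whose headers satisfy the job
# requirements and move that worker to the end of the queue.
from collections import deque

workers = deque([
    {"id": "w1", "headers": {"env": "c", "hwgroup": "group1"}},
    {"id": "w2", "headers": {"env": "java", "hwgroup": "group1"}},
])

def assign(job_headers, workers):
    for i, worker in enumerate(workers):
        if all(worker["headers"].get(k) == v for k, v in job_headers.items()):
            del workers[i]              # move only the chosen worker ...
            workers.append(worker)      # ... to the end of the queue
            return worker
    return None                         # no worker satisfies the requirements

print(assign({"env": "c"}, workers)["id"])   # w1
```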
#### Forwarding Jobs
Information about a job can be divided in two disjoint parts -- what the worker
needs to know to process it and what the broker needs to forward it to the
correct worker. It remains to be decided how this information will be
transferred to its destination.
It is technically possible to transfer all the data required by the worker at
once through the broker. This package could contain submitted files, test
data, requirements on the worker, etc. A drawback of this solution is that
both submitted files and test data can be rather large. Furthermore, it is
likely that test data would be transferred many times.
Because of these facts, we decided to store data required by the worker using a
shared storage space and only send a link to this data through the broker. This
approach leads to a more efficient network and resource utilization (the broker
doesn't have to process data that it doesn't need), but also makes the job
submission flow more complicated.
#### Further Requirements
The broker can be viewed as a central point of the backend. While it has only
two primary, closely related responsibilities, other requirements have arisen
(forwarding messages about job evaluation progress back to the frontend) and
will arise in the future. To facilitate such requirements, its architecture
should allow simply adding new communication flows. It should also be as
asynchronous as possible to enable efficient communication with external
services, for example via HTTP.
### Worker
The worker is a component which is supposed to execute jobs incoming from the
broker. As such, the worker should support a wide range of different
infrastructures and maybe even platforms and operating systems. Support of at
least the two main operating systems is desirable and should be implemented.
The worker as a service does not have to be very complicated, but a bit of complex
behaviour is needed. The mentioned complexity is almost exclusively concerned with
robust communication with the broker, which has to be checked regularly. A ping
mechanism is usually used for this in all kinds of projects. This means that the
worker should be able to send ping messages even during execution. So the worker
has to be divided into two separate parts: one which handles the communication
with the broker and another which executes jobs.
The easiest solution is to have these parts in separate threads which communicate
tightly with each other. Numerous technologies can be used for this communication,
from shared memory to condition variables or some kind of in-process messages. The
ZeroMQ library which we already use provides in-process messages that work on the
same principles as network communication, which is convenient and solves problems
with thread synchronization.
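A sketch of this two-part layout using pyzmq in-process PAIR sockets between two
threads is shown below; the real worker's internal message format may differ, so
this only illustrates the principle.

```python
# Listening thread and execution thread connected by a ZeroMQ inproc socket.
import threading
import zmq

ctx = zmq.Context.instance()

def execution_thread():
    jobs = ctx.socket(zmq.PAIR)
    jobs.connect("inproc://jobs")
    job_id = jobs.recv_string()
    # ... download the solution, run the tasks, upload the results ...
    jobs.send_string(f"done {job_id}")

listener = ctx.socket(zmq.PAIR)
listener.bind("inproc://jobs")          # bind before the other thread connects

t = threading.Thread(target=execution_thread)
t.start()

listener.send_string("job-42")          # hand an incoming job over to the executor
print(listener.recv_string())           # "done job-42"; meanwhile the listener
                                        # thread stays free to answer pings
t.join()
```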
#### Execution of Jobs
At this point we have a worker with two internal parts, a listening one and an
executing one. The implementation of the first one is quite straightforward and
clear, so let us discuss what should be happening in the execution subsystem.
After a job successfully arrives from the broker at the listening thread, it is
immediately redirected to the execution thread. There the worker has to prepare a
new execution environment, and the solution archive has to be downloaded from the
fileserver and extracted. The job configuration, located within these files, is
loaded into internal structures and executed. After that, the results are uploaded
back to the fileserver. These steps are the basic ones which are really necessary
for the whole execution and have to be executed in this precise order.
The evaluation unit executed by ReCodEx and the job configuration were already
discussed above. The conclusion was that jobs containing small tasks will be used.
The particular format of the actual job configuration can be found in the 'Job
configuration' appendix. The implementation of parsing and storing these data in
the worker is then quite straightforward.
The worker has internal structures into which it loads and stores the metadata
given in the configuration. The whole job is mapped to a job metadata structure
and the tasks are mapped to either external or internal ones (internal commands
have to be defined within the worker); the two differ in whether they are executed
in a sandbox or as internal worker commands.
#### Task Execution Failure
Another division of tasks is by the task-type field in the configuration. This
field can have four values: initiation, execution, evaluation and inner. All of
them were discussed and described above in the evaluation unit analysis. What is
important to the worker is how to behave if the execution of a task of some
particular type fails.
There are two possible situations: the execution fails due to a bad user solution,
or due to some internal error. If the execution fails because of an internal
error, the solution cannot be declared as failed overall. The user should not be
punished for a bad configuration or some network error. This is where the task
types are useful.
Initiation, execution and evaluation tasks usually execute code which was given by
the user who submitted the solution of the exercise. If these kinds of tasks fail,
it is probably connected with a bad user solution and the job can be evaluated.
But if some inner task fails, the solution should be re-executed, in the best case
scenario on a different worker. That is why, if an inner task fails, the job is
sent back to the broker, which will reassign it to another worker. More on this
subject is discussed in the broker job assignment section.
#### Job Working Directories
There is also a question about the working directory or directories of a job --
which directories should be used and what for. There is one simple answer: every
job has only one specified directory which contains every file the worker works
with in the scope of the whole job execution. This solution is easy but fails for
logical and security reasons.
The least which must be done is to use two folders, one for internal temporary
files and a second one for the evaluation. The directory for temporary files is
enough to cover all kinds of internal work with the filesystem, but a single
directory for the whole evaluation is somehow not enough.
The solution which was chosen in the end is to have folders for the downloaded
archive, the decompressed solution, an evaluation directory in which the user
solution is executed, and then folders for temporary files and for results and
generally for files which should be uploaded back to the fileserver with the
solution results.
There also has to be a hierarchy which separates folders of different workers on
the same machine. That is why paths to directories are in the format
`${DEFAULT}/${FOLDER}/${WORKER_ID}/${JOB_ID}`, where default means the default
working directory of the whole worker and folder is the particular directory for
some purpose (archives, evaluation, ...).
The mentioned division of job directories proved to be flexible and detailed
enough; everything is in logical units and where it is supposed to be, which means
that searching through this system should be easy. In addition, if user solutions
have access only to the evaluation directory, then they do not have access to
unnecessary files, which is better for the overall security of ReCodEx.
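A small helper illustrating this layout; the folder names below are examples
derived from the description above, not necessarily the exact names used by the
worker.

```python
# Build ${DEFAULT}/${FOLDER}/${WORKER_ID}/${JOB_ID} paths for one job.
import os

def job_dir(default, folder, worker_id, job_id):
    return os.path.join(default, folder, str(worker_id), str(job_id))

for folder in ("downloads", "submission", "eval", "temp", "results"):
    print(job_dir("/var/recodex-worker", folder, 1, "job-42"))
# /var/recodex-worker/downloads/1/job-42
# /var/recodex-worker/submission/1/job-42
# ... and so on for eval, temp and results
```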
### Sandboxing
There are numerous ways to approach sandboxing on different platforms; describing
all possible approaches is out of the scope of this document. Instead, let us have
a look at some of the features which are certainly needed for ReCodEx and propose
some particular sandbox implementations on Linux and Windows.
The general purpose of a sandbox is to safely execute software in any form, from
scripts to binaries. Various sandboxes differ in how safe they are and what
limiting features they have. The ideal situation is a sandbox with numerous
options and corresponding features which allow administrators to set up the
environment as they like and which do not allow user programs to damage the
executing machine in any way.
For ReCodEx and its evaluation at least these features are needed: execution time
and memory limitation, a disk operations limit, disk accessibility restrictions
and network restrictions. All these features, if combined and implemented well,
give a pretty safe sandbox which can be used for all kinds of user solutions and
should be able to restrict and stop any standard kind of attack or error.
#### Linux
Linux systems have quite extensive support for sandboxing in the kernel; kernel
namespaces and cgroups were introduced, which combined can limit hardware
resources (CPU, memory) and separate the executed program into its own namespace
(PID, network). These two features fulfil the sandbox requirements of ReCodEx, so
there were two options: either find an existing solution or implement a new one.
Luckily an existing solution was found and its name is **isolate**. Isolate does
not use all possible kernel features, but only a subset which is still enough to
be used by ReCodEx.
#### Windows
The opposite situation is in the Windows world; there is limited support in its
kernel, which makes sandboxing a bit trickier. The Windows kernel only has ways to
restrict the privileges of a process through the restriction of internal access
tokens. Monitoring of hardware resources is not possible, but the used resources
can be obtained through newly created job objects.
There are numerous sandboxes for Windows, but they all focus on different things;
in a lot of cases they serve as a safe environment for malicious programs, viruses
in particular, or they are designed as a separate filesystem namespace for
installing a lot of temporarily used programs. Of these we can mention Sandboxie,
Comodo Internet Security, Cuckoo sandbox and many others. None of these fits as a
sandbox solution for ReCodEx. With this being said, we can safely state that
designing and implementing a new general sandbox for Windows is out of the scope
of this project.
However, designing a sandbox only for a specific environment is possible, namely
for C# and .NET. The CLR as a virtual machine and runtime environment has pretty
good security support for restrictions and separation, which also transfers to C#.
This makes it quite easy to implement a simple sandbox within C#, but there are no
well known general purpose implementations.
As mentioned in the previous paragraphs, implementing our own solution is out of
the scope of the project. However, a C# sandbox is quite a good topic for another
project, for example a term project for the C# course, so it might be written and
integrated in the future.
### Fileserver
The fileserver provides access over HTTP to a shared storage space that contains
files submitted by students, supplementary files such as test inputs and outputs
and results of evaluation. In other words, it acts as an intermediate storage
node for data passed between the frontend and the backend. This functionality
can be easily separated from the rest of the backend features, which led to
designing the fileserver as a standalone component. Such design helps
encapsulate the details of how the files are stored (e.g. on a file system, in a
database or using a cloud storage service), while also making it possible to
share the storage between multiple ReCodEx frontends.
For early releases of the system, we chose to store all files on the file system
-- it is the least complicated solution (in terms of implementation complexity)
and the storage backend can be rather easily migrated to a different technology.
One of the facts we learned from CodEx is that many exercises share test input
and output files, and also that these files can be rather large (hundreds of
megabytes). A direct consequence of this is that we cannot add these files to
submission archives that are to be downloaded by workers -- the combined size of
the archives would quickly exceed gigabytes, which is impractical. Another
conclusion we made is that a way to deal with duplicate files must be
introduced.
A simple solution to this problem is storing supplementary files under the
hashes of their content. This ensures that every file is stored only once. On
the other hand, it makes it more difficult to understand what the content of a
file is at a glance, which might prove problematic for the administrator.
However, human-readable identification is not as important as removing
duplicates -- administrators rarely need to inspect stored files (and when they
do, they should know their hashes), but duplicate files occupied a large part of
the disk space used by CodEx.
A notable part of the work of the fileserver is done by a web server (e.g.
listening to HTTP requests and caching recently accessed files in memory for
faster access). What remains to be implemented is handling requests that upload
files -- student submissions should be stored in archives to facilitate simple
downloading and supplementary exercise files need to be stored under their
hashes.
We decided to use Python and the Flask web framework. This combination makes it
possible to express the logic in ~100 SLOC and also provides means to run the
fileserver as a standalone service (without a web server), which is useful for
development.
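
The following is a minimal sketch of the hash-based storage in Flask. The
endpoint names, the storage path and the use of SHA-1 are illustrative
assumptions only -- the real fileserver API may differ.

```python
import hashlib
import os
from flask import Flask, request, send_from_directory

app = Flask(__name__)
STORAGE_DIR = "/var/recodex-fileserver"  # made-up location for the example

@app.route("/supplementary", methods=["POST"])
def upload_supplementary():
    # store the uploaded file under the hash of its content so that
    # identical files are kept on the disk only once
    data = request.get_data()
    digest = hashlib.sha1(data).hexdigest()
    path = os.path.join(STORAGE_DIR, digest)
    if not os.path.exists(path):
        with open(path, "wb") as f:
            f.write(data)
    return {"hash": digest}

@app.route("/supplementary/<digest>", methods=["GET"])
def download_supplementary(digest):
    # workers request supplementary files by their hash
    return send_from_directory(STORAGE_DIR, digest)

if __name__ == "__main__":
    app.run()  # standalone mode without a web server, handy for development
```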
### Cleaner
The worker can use a caching mechanism for files from the fileserver under one
condition: the provided files have to have unique names. This means there has to
be a system which can download a file, store it in a cache and delete it after
some period of inactivity. Because multiple worker instances may run on a
particular server, it is not efficient for every worker to maintain such a
system on its own, so it is desirable to share this functionality among all
workers on the same machine.
One solution would be yet another separate service connected to the workers over
the network, but that would introduce a component with additional communication
where it is not really needed, and above all it would be a single point of
failure -- if it stopped working, it would be quite a problem.
Instead, another solution was chosen which assumes that the worker has access to
a specified cache folder. The worker downloads supplementary files into this
folder and copies them from there into its working directories. This means every
worker is able to maintain downloads to the cache, but what a worker cannot
properly do is delete files that have been unused for some time.
#### Architecture
For that functionality a single-purpose component called the 'cleaner' is
introduced. It is a simple script executed by cron which deletes files that have
been unused for some time. Together with the fetching feature of the worker, the
cleaner completes the per-server caching system.
As mentioned, the cleaner is a simple script executed regularly as a cron job.
Given the caching system introduced above, there are only a few reasonable ways
the cleaner can be implemented.
Filesystems usually support two relevant timestamps: `last access time` and
`last modification time`. Files in the cache are downloaded once and then only
copied, so the last modification time is set only when the file is created,
while the last access time should be updated on every copy. From this we might
conclude that the last access time is what is needed here. However, unlike the
last modification time, updating the last access time is usually disabled on
conventional filesystems (more on this subject can be found
[here](https://en.wikipedia.org/wiki/Stat_%28system_call%29#Criticism_of_atime)).
If we chose the last access time, the filesystem used for the cache folder would
have to have access time tracking enabled; for this reason, the access time was
not chosen for the implementation.
However, the broadly supported last modification time can be used instead. This
solution is not automatic: the worker has to 'touch' a cached file (update its
modification time) whenever it is accessed. Since this behaviour can be built
into the worker, it was chosen instead of the last access time for the latest
releases.
#### Caching Flow
With the cleaner as a separate component and the caching itself handled in the
worker, it is not immediately obvious that the whole scheme works without
problems. The goal is a system which can recover from every kind of error. A
description of one possible implementation follows. The whole mechanism relies
on the ability of the worker to recover from an internal failure of a fetch
task; in case of such an error, the job is reassigned to another worker, where
the problem hopefully does not arise.
Let us start with the worker implementation (a sketch follows the list):
- the worker encounters a fetch task which should download a supplementary file
- the worker takes the name of the file and tries to copy it from the cache
  folder to its working folder
  - if the copy succeeds, the worker itself updates the last modification time
    of the cached file and the whole operation is done
  - if the copy fails, the file has to be downloaded
    - the file is downloaded from the fileserver into the working folder
    - then it is copied to a temporary file and moved (atomically) into the
      cache
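
A minimal sketch of this fetch logic follows. It is written in Python purely for
illustration (regardless of the language of the actual worker), and the cache
path, URL scheme and function name are assumptions made up for the example.

```python
import os
import shutil
import tempfile
import urllib.request

CACHE_DIR = "/var/cache/recodex-worker"  # hypothetical cache folder

def fetch_supplementary(file_hash, working_dir, fileserver_url):
    cached = os.path.join(CACHE_DIR, file_hash)
    target = os.path.join(working_dir, file_hash)
    try:
        # fast path: the file is already cached -- copy it and 'touch' it
        shutil.copyfile(cached, target)
        os.utime(cached)  # update the last modification time for the cleaner
        return target
    except OSError:
        pass  # not cached (or deleted by the cleaner in the meantime)
    # slow path: download the file from the fileserver into the working folder
    urllib.request.urlretrieve(fileserver_url + "/" + file_hash, target)
    # then publish it into the cache atomically through a temporary file
    fd, tmp = tempfile.mkstemp(dir=CACHE_DIR)
    os.close(fd)
    shutil.copyfile(target, tmp)
    os.replace(tmp, cached)  # atomic rename within the same filesystem
    return target
```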
The logic above lives entirely in the worker; the cleaner can intervene at any
time and delete files. The cleaner implementation follows (again with a sketch
after the list):
- on startup, the cleaner stores the current timestamp, which will be used as a
  reference for comparison, and loads the configuration values: the cache folder
  and the maximal file age
- it loops through all files and even directories in the specified cache folder
  - if the difference between the reference timestamp and the last modification
    time is greater than the specified maximal file age, the file or folder is
    deleted
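
The following is a minimal sketch of such a cleaner script, meant to be run
periodically from cron. The cache folder and the maximal age are hard-coded here
only for the example; the real cleaner loads them from its configuration.

```python
#!/usr/bin/env python3
# Minimal cleaner sketch: delete cache entries unused for too long.
import os
import shutil
import time

CACHE_DIR = "/var/cache/recodex-worker"  # hypothetical configuration values
MAX_AGE = 24 * 3600                      # maximal file age in seconds

def clean():
    now = time.time()  # reference timestamp, taken once at startup
    for name in os.listdir(CACHE_DIR):
        path = os.path.join(CACHE_DIR, name)
        # delete entries whose last modification time is older than MAX_AGE
        if now - os.path.getmtime(path) > MAX_AGE:
            if os.path.isdir(path):
                shutil.rmtree(path, ignore_errors=True)
            else:
                os.remove(path)

if __name__ == "__main__":
    clean()
```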
The description above implies that there is a gap between reading the last
modification time and deleting the file in the cleaner. In this gap a worker may
access a file which then gets deleted anyway; this is fine, because the worker
has already copied it. If the worker has not copied the whole file (or has not
even started copying it) when the file is deleted, the copy operation fails.
This causes an internal task failure which is handled by reassigning the job to
another worker.
Two workers downloading the same file at the same time is not a problem either,
because the file is first downloaded into the working folder and only afterwards
copied into the cache.
Even if something else unexpectedly fails and the fetch task fails during
execution, that should still be fine, as mentioned previously: reassigning the
job is the last resort in case everything else goes wrong.
### Monitor
Users want to view the real-time evaluation progress of their solutions. This is
easy to do over an established bidirectional connection stream, but hard to
achieve with plain HTTP, which works on a request-by-request basis with no
long-term connection. The HTML5 specification contains Server-Sent Events -- a
means of sending text messages unidirectionally from an HTTP server to a
subscribed website. Sadly, it is not supported in Internet Explorer and Edge.
However, there is another widely used technology that can solve this problem --
the WebSocket protocol. It is more general than necessary (it enables
bidirectional communication) and requires additional web server configuration,
but it is supported in recent versions of all major web browsers.
Working with the WebSocket protocol directly from the backend is possible, but
not ideal from the design point of view. The backend should be hidden from the
public Internet to minimize the attack surface. With this in mind, there are two
possible options:
- send progress messages through the API
- make a separate component that forwards progress messages to clients
Both possibilities have their benefits and drawbacks. The first one requires no
additional component and the API is already publicly visible. On the other hand,
working with WebSockets from PHP is complicated (though possible with the help
of third-party libraries) and embedding this functionality into the API is not
easily extensible. The second approach makes it easier to change the protocol in
the future or to implement extensions such as caching of messages. Also, the
progress feature is considered optional, because there may be clients for which
it is useless. The major drawback of a separate component is that it is yet
another part which needs to be publicly exposed.
We decided to make a separate component, mainly because it is a small component
with a single role, it is easier to maintain, and the progress callback is only
an optional feature.
There are several possibilities how to write the component. Notably, the
considered options were the languages already used in the project: C++, PHP,
JavaScript and Python. In the end, Python was chosen for its simplicity, its
good support for all the technologies involved, and because there were available
Python developers in our team.
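
A minimal sketch of such a forwarding component is shown below. It assumes that
progress messages arrive as two-frame ZeroMQ messages (job id and payload) and
that a client announces the job id it wants to follow as its first WebSocket
message; the addresses, the port and the message format are made-up assumptions,
not the actual ReCodEx protocol.

```python
import asyncio
import zmq
import zmq.asyncio
import websockets

subscribers = {}  # job id -> set of connected WebSocket clients

async def ws_handler(websocket, path=None):  # 'path' kept for older versions
    # the client first sends the id of the job it wants to follow
    msg = await websocket.recv()
    job_id = msg.decode() if isinstance(msg, bytes) else msg
    subscribers.setdefault(job_id, set()).add(websocket)
    try:
        await websocket.wait_closed()
    finally:
        subscribers[job_id].discard(websocket)

async def forward_progress():
    ctx = zmq.asyncio.Context()
    socket = ctx.socket(zmq.PULL)
    socket.bind("tcp://127.0.0.1:7894")  # made-up address of the backend link
    while True:
        job_id, payload = await socket.recv_multipart()
        for ws in list(subscribers.get(job_id.decode(), ())):
            await ws.send(payload.decode())  # forward the message as-is

async def main():
    async with websockets.serve(ws_handler, "0.0.0.0", 4567):
        await forward_progress()

asyncio.run(main())
```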
### API Server
The API server must handle HTTP requests and manage the state of the application
in some kind of a database. The API server will be a RESTful service and will
return data encoded as JSON documents. It must also be able to communicate with
the backend over ZeroMQ.
We considered several technologies which could be used:
- PHP + Apache -- one of the most widely used technologies for creating web
servers. It is a suitable technology for this kind of a project. It has all
the features we need when some additional extensions are installed (to support
LDAP or ZeroMQ).
- Ruby on Rails, Python (Django), etc. -- popular web technologies that appeared
in the last decade. Both support ZeroMQ and LDAP via extensions and have large
developer communities.
- ASP.NET (C#), JSP (Java) -- these technologies are very robust and are used to
create server technologies in many big enterprises. Both can run on Windows
and Linux servers (ASP.NET using the .NET Core).
- JavaScript (Node.js) -- a relatively new technology which has lately been used
  to create REST APIs. Applications running on Node.js are quite performant and
  the number of open-source libraries available on the Internet is huge.
We chose PHP and Apache mainly because we were familiar with these technologies
and could develop all the required features without learning a new technology,
which mattered because the number of features was quite high and we had to meet
a strict deadline. This does not mean that we consider the other technologies
superior to PHP in every other aspect -- PHP 7 is a mature language with a huge
community and a wide range of tools, libraries, and frameworks.
We decided to use an ORM framework to manage the database, namely the widely
used PHP ORM Doctrine 2. Using an ORM tool means we do not have to write SQL
queries by hand. Instead, we work with persistent objects, which provides a
higher level of abstraction. Doctrine also has a robust database abstraction
layer so the database engine is not very important and it can be changed without
any need for changing the code. MariaDB was chosen as the storage backend.
To speed up the development process of the PHP server application we decided to
use a web framework. After evaluating and trying several frameworks, such as
Lumen, Laravel, and Symfony, we ended up using Nette.
- **Lumen** and **Laravel** seemed promising but the default ORM framework
Eloquent is an implementation of ActiveRecord which we wanted to avoid. It
was also surprisingly complicated to implement custom middleware for validation
of access tokens in the headers of incoming HTTP requests.
- **Symfony** is a very good framework and has Doctrine "built-in". The reason
why we did not use Symfony in the end was our lack of experience with this
framework.
- **Nette framework** is very popular in the Czech Republic -- its lead
developer is a well-known Czech programmer David Grudl. We were already
familiar with the patterns used in this framework, such as dependency
injection, authentication, routing. These concepts are useful even when
developing a REST application which might be a surprise considering that
Nette focuses on "traditional" web applications.
Nette is inspired by Symfony and many of the Symfony bundles are available
as components or extensions for Nette. There is for example a Nette
extension which makes integration of Doctrine 2 very straightforward.
#### Architecture of The System
The Nette framework is an MVP (Model, View, Presenter) framework. It has many
tools for creating complex websites, but we need only a subset of them, or we
use different libraries which suit our purposes better:
- **Model** - the model layer is implemented using the Doctrine 2 ORM instead of
Nette Database
- **View** - the whole view layer of the Nette framework (e.g., the Latte engine
used for HTML template rendering) is unnecessary since we will return all the
responses encoded in JSON. JSON is a common format used in APIs and we decided
to prefer it to XML or a custom format.
- **Presenter** - the request-processing lifecycle of the Nette framework is
  used as is. The presenters are used to group the logic of the individual API
  endpoints. The routing mechanism is modified to distinguish the actions by
  both the URL and the HTTP method of the request.
#### Authentication
To make certain data and actions accessible only to specific users, there must
be a way for these users to prove their identity. We decided to avoid PHP
sessions in order to keep the server stateless (a session ID would have to be
stored in the cookies of the HTTP requests and responses). Instead, the server
issues a token for the user after his/her identity is verified (e.g., by
providing an email and password); the token is sent to the client in the body of
the HTTP response. The client must remember this token and attach it to every
following request in the *Authorization* header.
The token must be valid only for a certain time period (to "log out" the user
after a few hours of inactivity) and it must be protected against abuse (e.g.,
an attacker must not be able to issue a token which the system would consider
valid and which would let the attacker pretend to be a different user). We
decided to use the JWT standard (specifically its signed variant, JWS).
A JWT is a string of three base64url-encoded parts separated by dots -- a
header, a payload, and a signature. The interesting parts are the payload and
the signature: the payload can contain any data which identifies the user and
metadata of the token (e.g., the time when the token was issued and the time of
its expiration). The last part is a digital signature of the header and the
payload, and it ensures that nobody can issue their own token and steal the
identity of someone else. Together, these properties give us the opportunity to
validate a token without storing all of the issued tokens in the database.
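
The principle can be illustrated with a short sketch. The actual API server is
written in PHP and uses a PHP JWT library; the Python code below (using the
PyJWT package) only demonstrates issuing and validating such tokens, and the
claim names and validity period are illustrative.

```python
import datetime
import jwt  # PyJWT

SECRET = "server-side secret key"  # never leaves the API server

def issue_token(user_id, validity_hours=4):
    payload = {
        "sub": user_id,                                     # token subject
        "iat": datetime.datetime.utcnow(),                  # issued at
        "exp": datetime.datetime.utcnow()
               + datetime.timedelta(hours=validity_hours),  # expiration
    }
    # the signature is computed over the header and payload using the secret
    return jwt.encode(payload, SECRET, algorithm="HS256")

def verify_token(token):
    # raises an exception if the signature is invalid or the token has expired,
    # so no list of issued tokens has to be kept in the database
    return jwt.decode(token, SECRET, algorithms=["HS256"])
```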
To implement JWT in Nette, we have to implement some of its security-related
interfaces such as IAuthenticator and IUserStorage, which is rather easy thanks
to the simple authentication flow. Replacing these services in a Nette
application is also straightforward, thanks to its dependency injection
container implementation. The encoding and decoding of the tokens itself
including generating the signature and signature verification is done through a
widely used third-party library which lowers the risk of having a bug in the
implementation of this critical security feature.
##### Backend Monitoring
The next thing related to communication with the backend is monitoring its
current state. This namely concerns which workers are available for processing
different hardware groups, and consequently which languages can be used in
exercises.
A further step would be reporting the overall backend state, such as how many
jobs were processed by a particular worker, the workload of the broker and the
workers, etc. The easiest solution is to manage this information by hand: every
instance of the API server has an administrator who fills it in. This works only
for the currently available workers and runtime environments, which do not
change very often; real-time statistics of the backend cannot reasonably be made
accessible this way.
A better solution is to update this information automatically. This can be
done in two ways:
- It can be provided by the backend on demand when the API needs it.
- The backend sends this information to the API periodically.
Information such as the currently available workers or runtime environments
should be really up-to-date, so it could be provided on demand. Backend
statistics are not as critical and could be updated periodically.
However, due to the lack of time, automatic monitoring of the backend state will
not be implemented in the early versions of this project, but it might be
implemented in one of the next releases.
### Web Application
The web application ("WebApp") is one of the possible client applications of the
ReCodEx system. Creating a web application as the first client application has
several advantages:
- no installation or setup is required on the device of the user
- works on all platforms including mobile devices
- when a new version is released, all the clients will use this version without
any need for manual installation of the update
One of the downsides is the large number of different web browsers (including
the older versions of a specific browser) and their different interpretation
of the code (HTML, CSS, JS). Some features of the latest HTML5 specifications
are implemented only in some browsers, which are used by only a subset of
Internet users. This has to be taken into account when choosing appropriate
tools for the implementation of the website.
There are two basic ways how to create a website these days:
- **server-side approach** - the actions of the user are processed on the server
and the HTML code with the results of the action is generated on the server
and sent back to the web browser of the user. The client does not handle any
logic (apart from rendering of the user interface and some basic user
interaction) and is therefore very simple. The server can use the API server
for processing of the actions so the business logic of the server can be very
simple as well. A disadvantage of this approach is that a lot of redundant
data is transferred across the requests although some parts of the content can
be cached (e.g., CSS files). This results in longer loading times of the
website.
- **server-side rendering with asynchronous updates (AJAX)** - a slightly
  different approach is to render the page on the server as in the previous case
  but then execute the actions of the user asynchronously using the
  `XMLHttpRequest` JavaScript functionality, which creates an HTTP request and
  transfers only the part of the website that needs to be updated.
- **client-side approach** - the opposite approach is to transfer the
communication with the API server and the rendering of the HTML completely
from the server directly to the client. The client runs the code (usually
JavaScript) in his/her web browser and the content of the website is generated
based on the data received from the API server. The script file is usually
quite large but it can be cached and does not have to be downloaded from the
  server again (until the cached file expires). Only the data from the API
  server needs to be transferred over the Internet, which reduces the volume of
  the payload on each request and leads to a much more responsive user
  experience, especially on slower networks. Since the client-side code has full
  control over the UI, more sophisticated user interactions with the UI can be
  achieved.
All of these approaches are used in production by web developers, all of them
are well documented, and there are mature tools for creating websites using any
of them.
We decided to use the third approach -- to create a fully client-side
application which would be familiar and intuitive for a user who is used to
modern web applications.
#### Used Technologies
We examined several frameworks which are commonly used to speed up the
development of a web application. There are several open source options
available with a large number of tools, tutorials, and libraries. From the many
options (Backbone, Ember, Vue, Cycle.js, ...) there are two main frameworks
worth considering:
- **Angular 2** - it is a new framework which was developed by Google. This
framework is very complex and provides the developer with many tools which
make creating a website very straightforward. The code can be written in pure
JavaScript (ES5) or using the TypeScript language which is then transpiled
into JavaScript. Creating a web application in Angular 2 is based on creating
and composing components. The previous version of Angular is not compatible
with this new version.
- **React and Redux** - [React](https://facebook.github.io/react) is a fast
library for rendering of the user interface developed by Facebook. It is based
on components composition as well. A React application is usually written in
EcmaScript 6 and the JSX syntax for defining the component tree. This code is
usually transpiled to JavaScript (ES5) using some kind of a transpiler like
Babel. [Redux](http://redux.js.org/) is a library for managing the state of
the application and it implements a modification of the so-called Flux
  architecture introduced by Facebook. React and Redux have been in use longer
  than Angular 2 and both are still actively developed. There are many
  open-source components and addons available for both React and Redux.
We decided to use React and Redux over Angular 2 for several reasons:
- There is a large community around these libraries and there is a large number
of tutorials, libraries, and other resources available online.
- Many of the web frontend developers are familiar with React and Redux and
contributing to the project should be easy for them.
- A stable version of Angular 2 was still not released at the time we started
developing the web application.
- We had previous experience with React and Redux, and Angular 2 did not bring
  any significant improvements or features over React, so it was not worth
  learning the paradigms of a new framework.
- It is easy to debug React component tree and Redux state transitions
using extensions for Google Chrome and Firefox.
##### Internationalization And Globalization
The user interface must be accessible in multiple languages and should be easily
translatable into more languages in the future. The most promising library
which enables React applications to translate all of the messages of the UI is
[react-intl](https://github.com/yahoo/react-intl).
A good JavaScript library for manipulating dates and times is
[Moment.js](http://momentjs.com). It is used by many open-source React
components such as date and time pickers.
#### User Interface Design
There is no artist on the team, so we had to find a way to create a visually
appealing application despite this handicap. User interfaces created by
programmers are notoriously ugly and unintuitive. Luckily, we found the
[AdminLTE](https://almsaeedstudio.com/) theme by Abdullah Almsaeed, which is
built on top of the [Bootstrap framework](http://getbootstrap.com/) by Twitter.
This is a great combination because there is an open-source implementation of
the Bootstrap components for React, and with the stylesheets from AdminLTE the
application looks good and, with very little work, is distinguishable from the
many websites that use the plain Bootstrap framework.
<!---
// vim: set formatoptions=tqn flp+=\\\|^\\*\\s* textwidth=80 colorcolumn=+1:
-->