**ReCodEx** is designed to be very modular. In the following picture, the main components are arranged into one possible configuration. Note that the connections between components are not fully accurate.
**ReCodEx** is designed to be very modular and configurable. One such configuration is sketched in the following picture. There are two separate frontend instances with distinct databases sharing a common backend. This configuration may be suitable for MFF UK -- basic programming course and KSP competition. Note that the connections between components are not fully accurate.
**Web app** is the main part of the whole project for users. It provides a user interface and is the only part that interacts with the outside world directly. **Web API** contains almost all logic of the app, including _user management and authentication_, _storing and versioning files_ (with the help of **File server**), _counting and assigning points_ to users, etc. **Broker** is an essential part of the whole architecture and can be regarded as a single point of failure. It maintains a list of available **Workers**, receives submissions from the **Web API**, routes them further, and reports progress of evaluations back to the **Web app**. **Worker** securely runs each received job and evaluates its results. **Monitor** resends evaluation progress messages to the **Web app** so that they can be presented to users.
Almost all communication goes through **Broker** and the ZeroMQ messaging middleware. When **Web app** wants to execute a submission, all data is handed over to a **Worker** through **Broker**. The progress state travels similarly: it starts in **Worker**, goes through **Broker**, passes **Monitor**, and ends up in **Web app** (over WebSockets). The only communication that does not involve **Broker** is communication with **File server**, which uses HTTP. It can be initiated by **Web API** or by **Worker**; other services have no access to **File server**. A detailed view of the communication is in the separate section [Communication](#communication).
**Web app** is the main part of the whole project from the user's point of view. It provides a user interface and is the only part that interacts with the outside world directly. **Web API** contains almost all logic of the app, including _user management and authentication_, _storing and versioning files_ (with the help of **File server**), _counting and assigning points_ to users, etc. Advanced users may connect to the API directly or create custom frontends. **Broker** is an essential part of the whole architecture. It maintains a list of available **Workers**, receives submissions from the **Web API**, routes them further, and reports progress of evaluations back to the **Web app**. **Worker** securely runs each received job and evaluates its results. **Monitor** resends evaluation progress messages to the **Web app** so that they can be presented to users.
## Communication
This section gives a detailed overview of communication in the ReCodEx solution. The basic concept is captured in the following image:
Detailed communication inside the ReCodEx project is captured in the following image and described in the sections below. Red connections go through ZeroMQ sockets, blue through WebSockets and green through HTTP(S). All ZeroMQ messages are sent as multipart with one string (command, option) per part, with no empty frames (unless explicitly specified otherwise).
Red connections go through ZeroMQ sockets, blue through WebSockets and green through HTTP. All ZeroMQ messages are sent as multipart with one string (command, option) per part, with no empty frames (unless explicitly specified otherwise).
### Broker - Worker communication
Broker is the server when communicating with a worker. IP address and port are configurable, protocol is TCP. Worker socket is DEALER, broker one is ROUTER type. Because of that, the very first part of every (multipart) message from broker to worker must be the target worker's socket identity (which is saved on its **init** command).
Broker acts as the server when communicating with a worker. Listening IP address and port are configurable, protocol family is TCP. Worker socket is of DEALER type, broker one is of ROUTER type. Because of that, the very first part of every (multipart) message from broker to worker must be the target worker's socket identity (which is saved on its **init** command).
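The identity-first framing can be illustrated without a live socket. The following is a minimal sketch in Python, not the broker's actual code; the helper name `frames_for_worker` and the example identity `worker-1` are hypothetical. It only builds the list of frames that a ROUTER socket would send as one multipart message.

```python
def frames_for_worker(identity: bytes, command: str, *args: str) -> list[bytes]:
    """Build a multipart message addressed to one worker.

    A ROUTER socket uses the first frame to pick the peer, so the worker's
    socket identity goes first, followed by one string per frame.
    """
    if not identity:
        raise ValueError("worker identity must not be empty")
    parts = [command, *args]
    if any(part == "" for part in parts):
        # Empty frames are only used where the protocol explicitly asks for them.
        raise ValueError("empty frames are not allowed here")
    return [identity] + [part.encode("utf-8") for part in parts]


# Hypothetical identity, as remembered from the worker's init command.
message = frames_for_worker(b"worker-1", "pong")
```

With pyzmq, a list like this could be handed to `socket.send_multipart(message)` on the broker's ROUTER socket.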
Commands from broker to worker:
#### Commands from broker to worker:
- **eval** - evaluate a job. Requires 3 message frames:
- `job_id` - identifier of the job (in ASCII representation -- we avoid
@ -35,9 +34,9 @@ Commands from broker to worker:
shut down and restarted without the other side noticing.
- **pong** - reply to **ping** command, no arguments
Commands from worker to broker:
#### Commands from worker to broker:
- **init** - introduce yourself to the broker. Useful on startup or after reestablishing lost connection. Requires at least 2 arguments:
- **init** - introduce self to the broker. Useful on startup or after reestablishing lost connection. Requires at least 2 arguments:
- `hwgroup` - hardware group of this worker
- `header` - additional header describing worker capabilities. Format must
be `header_name=value`, every header shall be in a separate message frame.
@ -45,7 +44,7 @@ Commands from worker to broker:
There is also an optional third argument - additional information. If
present, it should be separated from the headers with an empty frame. The
format is the same. Supported keys for additional information are:
format is the same as headers. Supported keys for additional information are:
- `description` - a human readable description of the worker for
administrators (it will show up in broker logs)
- `current_job` - an identifier of a job the worker is now processing. This
@ -54,14 +53,13 @@ Commands from worker to broker:
- **done** - notifying of finished job. Contains following message frames:
- `job_id` - identifier of finished job
- `result` - response result, possible values are:
- OK - evaluation finished successfully
- FAILED - job failed and cannot be reassigned to another worker (e.g.
due to error in configuration)
- INTERNAL_ERROR - job failed due to internal worker error, but another
worker might be able to process it (e.g. downloading a file failed)
- OK - evaluation finished successfully
- FAILED - job failed and cannot be reassigned to another worker (e.g.
due to error in configuration)
- INTERNAL_ERROR - job failed due to internal worker error, but another
worker might be able to process it (e.g. downloading a file failed)
- `message` - a human readable error message
- **progress** - notice about evaluation progress. Contains following message
frames
- **progress** - notice about current evaluation progress. Contains following message frames:
- `job_id` - identifier of current job
- `state` - what is happening now.
- DOWNLOADED - submission successfully fetched from fileserver
@ -73,12 +71,13 @@ Commands from worker to broker:
- FINISHED - whole execution is finished and worker ready for another job execution
- TASK - task state changed - see below
- `task_id` - only present for "TASK" state - identifier of task in current job
- `task_state` - only present for "TASK" state - result of task evaluation. One of "COMPLETED", "FAILED" and "SKIPPED".
- `task_state` - only present for "TASK" state - result of task evaluation. One of:
- COMPLETED - task was successfully executed without any error, subsequent task will be executed
- FAILED - task ended up with some error, subsequent task will be skipped
- SKIPPED - some of the previous dependencies failed to execute, so this task wont be executed at all
- SKIPPED - some of the previous dependencies failed to execute, so this task won't be executed at all
- **ping** - tell broker I'm alive, no arguments
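The frame layout of the **init** and **progress** commands listed above can be sketched as follows. This is an illustrative Python sketch under assumptions, not the worker's implementation: the function names are hypothetical, the frame order is inferred from the listing (command first, then arguments in the listed order), and the header name `threads` in the usage example is invented.

```python
from typing import Optional


def build_init(hwgroup: str, headers: dict, info: Optional[dict] = None) -> list:
    """Build the frames of an init message (assumed order: command, hwgroup, headers)."""
    if not headers:
        raise ValueError("init needs a hwgroup and at least one header")
    frames = ["init", hwgroup]
    # Every header travels in its own frame as header_name=value.
    frames += [f"{name}={value}" for name, value in headers.items()]
    if info:
        # An empty frame separates the headers from the additional information.
        frames.append("")
        frames += [f"{key}={value}" for key, value in info.items()]
    return [frame.encode("utf-8") for frame in frames]


def parse_progress(frames: list) -> dict:
    """Parse a progress message; task fields are read only for the TASK state."""
    command, job_id, state, *rest = [frame.decode("utf-8") for frame in frames]
    if command != "progress":
        raise ValueError(f"unexpected command: {command}")
    parsed = {"job_id": job_id, "state": state}
    if state == "TASK":
        task_id, task_state = rest
        parsed["task_id"] = task_id
        parsed["task_state"] = task_state
    return parsed


init_frames = build_init("group1", {"threads": "8"}, {"description": "test worker"})
task_update = parse_progress([b"progress", b"42", b"TASK", b"compile", b"COMPLETED"])
```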
#### Heartbeating
It is important for the broker and workers to know if the other side is still
@ -99,21 +98,23 @@ workers.
If a worker thinks the broker is dead, it tries to reconnect with a bounded,
exponentially increasing delay.
This protocol proved very robust in real-world testing. The whole backend is therefore reliable and can survive short-term connection issues without problems. Also, the increasing delay between ping messages avoids flooding the network when there are problems. We have experienced no issues since we started using this protocol.
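The bounded, exponentially increasing reconnect delay can be sketched as a generator. This is only an illustration in Python; the initial delay, the growth factor and the upper bound are hypothetical values, not the ones used by the worker.

```python
import itertools


def reconnect_delays(initial: float = 1.0, factor: float = 2.0, maximum: float = 32.0):
    """Yield delays (in seconds) that grow exponentially up to a fixed bound."""
    delay = initial
    while True:
        yield delay
        # Grow the delay, but never beyond the configured maximum.
        delay = min(delay * factor, maximum)


# First seven delays a worker would wait between reconnect attempts.
first_delays = list(itertools.islice(reconnect_delays(), 7))
```

After a successful reconnect, a worker would simply start a fresh generator so that the delay begins at its initial value again.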
### Worker - File Server communication
Worker communicates with the file server only from the _execution thread_. The supported protocol is HTTP, optionally with SSL encryption (**recommended**, you can get a free certificate from [Let's Encrypt](https://letsencrypt.org/) if you do not have one yet). If supported by the server and the used version of libcurl, the HTTP/2 standard is also available. The file server should be set up to require basic HTTP authentication, and the worker is capable of sending the corresponding credentials with each request.
Worker communicates with the file server only from the _execution thread_ (see picture above). The supported protocol is HTTP, optionally with SSL encryption (**recommended**, you can get a free trusted DV certificate from the [Let's Encrypt](https://letsencrypt.org/) authority if you do not have one yet). If supported by the server and the used version of libcurl, the HTTP/2 standard is also available. The file server should be set up to require basic HTTP authentication, and the worker is capable of sending the corresponding credentials with each request.
#### Worker point of view
#### Worker side
Worker is capable of 2 things - downloading a file and uploading a file. Internally, the worker uses the libcurl C library with a very similar setup for both. In both cases it can verify the HTTPS certificate (on Linux against the system cert list, on Windows against one downloaded from their website during installation), support basic HTTP authentication, offer HTTP/2 with fallback to HTTP/1.1 and fail on error (returned HTTP status code is >= 400). The worker has a list of credentials for all available file servers in its config file.
Worker is capable of 2 things - downloading a file and uploading a file. Internally, the worker uses the libcurl C library with a very similar setup for both. In both cases it can verify the HTTPS certificate (on Linux against the system cert list, on Windows against one downloaded from the CURL website during installation), support basic HTTP authentication, offer HTTP/2 with fallback to HTTP/1.1 and fail on error (returned HTTP status code is >= 400). The worker has a list of credentials for all available file servers in its config file.
- download file - standard HTTP GET request to given URL expecting content as response
- download file - standard HTTP GET request to given URL expecting file content as response
- upload file - standard HTTP PUT request to given URL with file data as body - same as command line tool `curl` with option `--upload-file`
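The two requests can be sketched with Python's standard `urllib` instead of libcurl, which the worker actually uses. This is a hedged illustration only: the function names, the example URL and the credentials are hypothetical, and the sketch only prepares the requests without sending them.

```python
import base64
import urllib.request


def _basic_auth(user: str, password: str) -> str:
    """Return the value of the Authorization header for basic HTTP authentication."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return f"Basic {token}"


def download_request(url: str, user: str, password: str) -> urllib.request.Request:
    """Prepare a GET request; the response body is the file content."""
    request = urllib.request.Request(url, method="GET")
    request.add_header("Authorization", _basic_auth(user, password))
    return request


def upload_request(url: str, data: bytes, user: str, password: str) -> urllib.request.Request:
    """Prepare a PUT request with the file data as the body."""
    request = urllib.request.Request(url, data=data, method="PUT")
    request.add_header("Authorization", _basic_auth(user, password))
    return request


# Hypothetical file server URL and credentials.
get_req = download_request("https://fileserver.example.org/tasks/abc", "worker", "secret")
put_req = upload_request("https://fileserver.example.org/results/1.zip", b"zip bytes", "worker", "secret")
```

Passing either request to `urllib.request.urlopen` would perform the transfer; it raises `urllib.error.HTTPError` for error status codes, which roughly corresponds to the fail-on-error behaviour described above.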
#### File server point of view
#### File server side
File server has its internal directory structure, where all the files are stored. It provides a REST API to get them or create new ones. File server doesn't provide authentication or a secured connection by itself; it is supposed to run as a WSGI script inside a web server (like Apache) with proper configuration. These commands are relevant for communication with the worker:
File server has its own internal directory structure, where all the files are stored. It provides a simple REST API to get them or create new ones. File server doesn't provide authentication or a secured connection by itself; it is supposed to run as a WSGI script inside a web server (like Apache) with proper configuration. Relevant commands for communication with workers:
- **GET /submission_archives/\<id\>.\<ext\>** - gets an archive with submitted source code and corresponding configuration of this job evaluation
- **GET /tasks/\<hash\>** - gets a file, common usage is for input files or reference result files
@ -125,7 +126,7 @@ If not specified otherwise, `zip` format of archives is used. Symbol `/` in API
### Broker - Monitor communication
Broker communicates with monitor also through ZeroMQ over TCP protocol. Type of
socket is same on both sides, ROUTER. Monitor is set as server in this
socket is same on both sides, ROUTER. Monitor is set to act as server in this
communication, its IP address and port are configurable in monitor's config
file. ZeroMQ socket ID (set on monitor's side) is "recodex-monitor" and must be
sent as first frame of every multipart message - see ZeroMQ ROUTER socket
@ -138,7 +139,7 @@ communication so that the workers don't have to know too many network services.
Monitor is treated as a somewhat optional part of whole solution, so no special
effort on communication reliability was made.
Commands from monitor to broker:
#### Commands from monitor to broker:
Because there is no need for the monitor to communicate with the broker, there
are no commands so far. Any message from monitor to broker is logged and
@ -149,13 +150,13 @@ Commands from broker to monitor:
- **progress** - notification about progress with job evaluation. See [Progress callback](#progress-callback) section for more info.
### Broker - Frontend communication
### Broker - Web API communication
Broker communicates with frontend through ZeroMQ connection over TCP. Socket
Broker communicates with main REST API through ZeroMQ connection over TCP. Socket
type on broker side is ROUTER, on frontend part it's REQ. Broker acts as a
server, its IP address and port are configurable in frontend.
server, its IP address and port are configurable in the API.
Commands from frontend to broker:
#### Commands from API to broker:
- **eval** - evaluate a job. Requires at least 4 frames:
- `job_id` - identifier of this job (in ASCII representation -- we avoid endianness issues and also support alphabetic ids)
@ -164,39 +165,42 @@ Commands from frontend to broker:
- `job_url` - URI location of archive with job configuration and submitted source code
- `result_url` - remote URI where results will be pushed to
Commands from broker to frontend (all are responses to **eval** command):
#### Commands from broker to API (all are responses to **eval** command):
- **ack** - the first message sent back to the frontend right after the eval command arrives. It basically means "Hi, I am all right and am capable of receiving job requests". After sending it, the broker tries to find an acceptable worker for the request.
- **accept** - broker is capable of routing request to a worker
- **reject** - broker can't handle this job (for example when the requirements
specified by the headers cannot be met). There are (rare) cases when the
broker finds that it cannot handle the job after it's been confirmed. In such
cases it uses the frontend REST API to mark the job as failed.
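How a client might interpret the sequence of responses to one **eval** command can be sketched as below. This is a minimal Python illustration, not the API's real code; the function name and the returned labels (`no-ack`, `pending`, `accepted`, `rejected`) are hypothetical, and only the first frame of each response is considered.

```python
def eval_outcome(replies: list) -> str:
    """Classify broker responses to one eval command.

    `replies` holds the command frame of each received response, in order.
    """
    if not replies or replies[0] != "ack":
        # The broker did not confirm that it is alive and accepting requests.
        return "no-ack"
    if len(replies) < 2:
        # ack arrived, the broker is still looking for a suitable worker.
        return "pending"
    if replies[1] == "accept":
        return "accepted"
    if replies[1] == "reject":
        return "rejected"
    raise ValueError(f"unexpected response: {replies[1]}")


outcome = eval_outcome(["ack", "accept"])
```

A job that the broker gives up on after confirming it is not visible here; as described above, that case is reported through the REST API instead.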
### File Server - Frontend communication
### File Server - Web API communication
File server has a REST API for interaction with other parts of ReCodEx. Description communication with workers is in [File server point of view](#file-server-point-of-view) section. On top of that, there are other command for interaction with frontend:
File server has a REST API for interaction with other parts of ReCodEx. Description of communication with workers is in [File server side](#file-server-side) section. On top of that, there are other commands for interaction with the API:
- **GET /results/\<id\>.\<ext\>** - download archive with evaluated results of job _id_
- **POST /submissions/\<id\>** - upload new submission with identifier _id_. Expects that the body of the POST request uses file paths as keys and the content of the files as values. On successful upload returns JSON `{ "archive_path": <archive_url>, "result_path": <result_url> }` in response body. From _archive_path_ can be the submission downloaded (by worker) and corresponding evaluation results should be uploaded to _result_path_.
- **POST /submissions/\<id\>** - upload new submission with identifier _id_. Expects that the body of the POST request uses file paths as keys and the content of the files as values. On successful upload returns JSON `{ "archive_path": <archive_url>, "result_path": <result_url> }` in response body. From _archive_path_ the submission can be downloaded (by worker) and corresponding evaluation results should be uploaded to _result_path_.
- **POST /tasks** - upload new files, which will be available under names equal to the `sha1sum` of their content. Multiple files can be uploaded at once. On successful upload returns JSON `{ "result": "OK", "files": <file_list> }` in response body, where _file_list_ is a dictionary with the original file name as key and the new URL with the already hashed name as value.
There are no plans yet to support deleting files from this API. This may change in time.
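The content-addressed naming used by **POST /tasks** can be sketched in a few lines of Python. This is an illustration under assumptions, not the file server's code: the function name is hypothetical and the URL form `/tasks/<hash>` is assumed from the GET endpoint of the same name.

```python
import hashlib


def hashed_file_list(files: dict, base_url: str = "/tasks") -> dict:
    """Map each original file name to a URL ending in the SHA-1 of its content."""
    return {
        name: f"{base_url}/{hashlib.sha1(content).hexdigest()}"
        for name, content in files.items()
    }


# Response body in the shape described above, for one hypothetical uploaded file.
response = {"result": "OK", "files": hashed_file_list({"input.txt": b"abc"})}
```

Because the name depends only on the content, uploading the same file twice yields the same URL.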
**TODO: frontend side**
Web API calls these fileserver endpoints with standard HTTP requests. There are no special commands involved. There is no communication in opposite direction.
### Monitor - Browser communication
Monitor interacts with the browser through a WebSocket connection. Monitor acts as the server and browsers connect to it. IP address and port are also configurable. When a client connects to the monitor, it sends a message with the string representation of a channel id (the channel whose messages it is interested in, usually the id of the job being evaluated). There can be at most one listener per channel; a later connection replaces the previous one.
### Monitor - Web app communication
When monitor receives "progress" message from broker there are two options:
Monitor interacts with the web application through a WebSocket connection. Monitor acts as the server and browsers connect to it. IP address and port are configurable. When a client connects to the monitor, it sends a message with the string representation of a channel id (the channel whose messages it is interested in, usually the id of the job being evaluated). There can be multiple listeners per channel, and even (slightly) delayed connections receive all messages from the very beginning.
When monitor receives **progress** message from broker there are two options:
- there is no WebSocket connection for listed channel (job id) - message is dropped
- there is an active WebSocket connection for the listed channel - message is converted into JSON format (see below) and sent as a string to the browser. Messages for active connections are queued, so no messages are discarded even under heavy workload.
- there is an active WebSocket connection for the listed channel - message is converted into JSON format (see below) and sent as a string to that established channel. Messages for active connections are queued, so no messages are discarded even under heavy workload.
Message JSON format is dictionary with keys:
Message JSON format is dictionary (associative array) with keys:
- **command** - type of progress.
- **command** - type of progress, one of:
- DOWNLOADED - submission successfully fetched from fileserver
- FAILED - something bad happened and job was not executed at all
- UPLOADED - results are uploaded to fileserver
@ -212,8 +216,9 @@ Message JSON format is dictionary with keys:
- SKIPPED - some of the previous dependencies failed to execute, so this task won't be executed at all
### Frontend - Browser communication
**TODO:**
### Web app - Web API communication
The provided web application runs as a JavaScript client inside the user's browser. It communicates with the REST API on the server through standard HTTP requests. Documentation of the main REST API is in a separate document due to its extensiveness. Results are returned as a JSON payload, which is simply parsed in the web application and presented to the users.
## Assignments
@ -722,10 +727,15 @@ Only remaining part is evaluation of results. This is provided on demand when us
- .
## Common install
## Installation
### Ansible installer
### Hints for manual install
This page contains steps for setting up a computer on which some parts of **ReCodEx** may run. Most steps are listed in two variants, for Red Hat based distributions (like RHEL, CentOS or Fedora) and for Debian based distributions. Before starting, make sure you have completed the basic OS installation and setup, including users and logins, SSH, Git, firewall, etc.
- See [this](https://fedoraproject.org/wiki/EPEL) for instructions.
### Installation of dependencies
#### Installation of dependencies
Install the newest available version of each package; Debian packages therefore mostly come from the testing repositories and RHEL packages are the newest ones from the EPEL repositories.
@ -766,3 +776,5 @@ Install as new version of each package as possible, so mostly Debian packages ar