# Worker
## Description
The worker's job is to securely execute submitted assignments and, where the
job requires it, evaluate the results against model solutions provided by the
exercise author. After receiving an evaluation request, the worker has to:

- download the archive containing the submitted source files and the job
  configuration file
- download any supplementary files referenced by the configuration, such as
  test inputs or helper programs (this is done on demand, using a `fetch`
  command in the assignment configuration)
- evaluate the submission according to the job configuration
- optionally send progress messages back to the broker during the evaluation
- upload the results of the evaluation to the fileserver
- notify the broker that the evaluation has finished
### Header matching
Every worker belongs to exactly one **hardware group** and has a set of
**headers**. These properties help the broker decide which worker is suitable
for processing a request.

The hardware group is a string identifier used to group worker machines with
similar hardware configurations, for example "i7-4560-quad-ssd". It is
important for assignments where running times are compared to those of
reference solutions -- we have to make sure that both programs run on similar
hardware.

The headers are a set of key-value pairs that describe the worker's
capabilities -- for example, which runtime environments are installed, how many
threads the worker can run, or whether it measures time precisely.

This information is sent to the broker on startup using the `init` command.
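
The sketch below illustrates the kind of information a worker might announce
and how the broker could match a request against it. The key names, value
format and matching rule are assumptions made for this example only; they are
not taken from the actual `init` wire protocol.

```python
# Illustrative only -- key names and matching semantics are made up for this
# sketch; they merely show what hardware group and headers describe.
worker_description = {
    "hwgroup": "i7-4560-quad-ssd",       # hardware group identifier
    "headers": {                          # capability key-value pairs
        "env": {"c", "cxx", "python3"},   # installed runtime environments
        "threads": {"1", "2", "4"},       # supported degrees of parallelism
        "precise-time": {"yes"},          # whether time is measured precisely
    },
}

def can_process(request_headers, worker=worker_description):
    """Toy header matching: every requested header value must be among
    the values the worker advertised."""
    return all(value in worker["headers"].get(key, set())
               for key, value in request_headers.items())

print(can_process({"env": "cxx", "threads": "2"}))   # True
print(can_process({"env": "java"}))                  # False
```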
## Architecture
### Internal communication
The worker is logically divided into three parts:

- **Listener** -- communicates with the broker through
  [ZeroMQ](http://zeromq.org/). On startup, it introduces itself to the broker.
  Then it receives new jobs, passes them to the **evaluator** part and sends
  back results and progress reports.
- **Evaluator** -- gets jobs from the **listener** part, evaluates them
  (possibly in a sandbox) and notifies the other part when the evaluation ends.
  The **evaluator** also communicates with the fileserver: it downloads
  supplementary files and uploads detailed results.
- **Progress callback** -- receives information about the progress of an
  evaluation from the evaluator and forwards it to the broker.

These parts run in separate threads of the same process and communicate through
ZeroMQ in-process sockets. An alternative approach would be a shared memory
region with exclusive access, but messaging is generally considered safer.
Shared memory has to be used very carefully because of race conditions when
reading and writing concurrently. Also, the messages inside the worker are
small, so copying data between threads adds no significant overhead. This
multi-threaded design allows the worker to keep sending `ping` messages even
while it is processing a job.
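
The following self-contained sketch shows the general pattern of two threads
talking over a ZeroMQ in-process socket (using pyzmq). The endpoint name and
the message contents are invented for illustration and do not mirror the
worker's actual internal protocol.

```python
# Two threads communicating over an in-process ZeroMQ socket.
import threading
import zmq

context = zmq.Context.instance()

def evaluator():
    sock = context.socket(zmq.PAIR)
    sock.connect("inproc://jobs")
    job = sock.recv_json()                       # receive a job from the listener
    # ... evaluate the submission here (sandbox, measurements, ...) ...
    sock.send_json({"job_id": job["job_id"], "status": "OK"})

listener_sock = context.socket(zmq.PAIR)
listener_sock.bind("inproc://jobs")              # bind before the other thread connects

worker_thread = threading.Thread(target=evaluator)
worker_thread.start()

listener_sock.send_json({"job_id": "42", "archive_url": "http://fileserver/job.zip"})
print(listener_sock.recv_json())                 # {'job_id': '42', 'status': 'OK'}
worker_thread.join()
```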
### File management
The messages sent by the broker to assign jobs to workers are rather simple --
they do not contain any files, only the URL of an archive with the job
configuration. While processing the job, it may also be necessary to fetch
supplementary files such as helper scripts or test inputs and outputs.

Supplementary files are addressed by hashes of their content, which allows
simple caching. Requested files are downloaded into the cache on demand. This
mechanism is hidden from the job evaluator, which only depends on a
`file_manager_interface` instance. Because the filesystem cache can be shared
by multiple workers, the cleaning functionality is implemented by a separate
Cleaner program that should be set up to run periodically.
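
A minimal sketch of content-addressed caching as described above. The cache
directory, hash function and URL scheme are assumptions made for this example;
they do not describe the real `file_manager_interface` implementation.

```python
# Toy content-addressed file cache: files are looked up by the hash of their
# content; missing files are downloaded from the fileserver on demand.
import hashlib
import os
import urllib.request

CACHE_DIR = "/var/cache/worker"                              # placeholder path
FILESERVER = "http://fileserver.example.com/supplementary"   # placeholder URL

def get_file(content_hash):
    """Return a local path for the file with the given content hash,
    downloading it into the cache first if it is not there yet."""
    path = os.path.join(CACHE_DIR, content_hash)
    if not os.path.exists(path):
        os.makedirs(CACHE_DIR, exist_ok=True)
        urllib.request.urlretrieve(f"{FILESERVER}/{content_hash}", path)
        # sanity check: the downloaded content must match the requested hash
        with open(path, "rb") as f:
            if hashlib.sha1(f.read()).hexdigest() != content_hash:
                os.remove(path)
                raise ValueError("hash mismatch for " + content_hash)
    return path
```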
### Running student submissions
Student submissions are executed inside a sandboxing environment in order to
prevent damage to the host system and to restrict the amount of resources they
may use. Currently only the Isolate sandbox is supported by the worker, but the
list of supported sandboxes can easily be extended.

Isolate is executed in a separate Linux process created by the `fork` and
`exec` system calls. Communication between the processes is performed through
an unnamed pipe, with the standard input and output descriptors redirected.
There is an additional safety guard against a failure of Isolate itself: the
whole sandbox is killed when it does not finish within `(time + 300) * 1.2`
seconds, where `time` is the original maximum time allowed for the task.
However, Isolate should always terminate on time by itself, so this safety
limit should never be reached.
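
A rough sketch of such a watchdog using Python's `subprocess` module. The
command line is a placeholder, and the real worker spawns Isolate directly via
`fork`/`exec` rather than through this API.

```python
# Watchdog sketch: run a sandboxed command and kill it if it exceeds the hard
# limit derived from the task's time limit.
import subprocess

def run_with_watchdog(cmd, time_limit):
    hard_limit = (time_limit + 300) * 1.2     # safety margin described above
    try:
        return subprocess.run(cmd, capture_output=True, timeout=hard_limit)
    except subprocess.TimeoutExpired:
        # The sandbox itself failed to enforce its limit; treat it as an
        # internal error rather than a normal "time limit exceeded" verdict.
        raise RuntimeError("sandbox did not terminate within the hard limit")

# Example (assuming Isolate is available on the machine):
# run_with_watchdog(["isolate", "--run", "--time=5", "--", "./solution"], time_limit=5)
```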
In general, a sandbox has to be a command-line application that takes
parameters with arguments and reads its input from standard input or from a
file; its outputs should be written to a file or to standard output. There are
no other requirements -- the worker design is very versatile and can be adapted
to different needs.
## Cleaner
### Description
The Cleaner is an integral part of the worker setup which manages the worker's
cache folder -- mainly, it deletes outdated files. Every Cleaner instance
maintains one cache folder, which can be used by multiple workers. This means
that one server can run numerous worker instances sharing the same cache
folder, but there should be only one Cleaner per folder.

The Cleaner is written in Python and works as a simple script which just does
its job and exits, so it has to be run periodically, typically from cron. A
suitable interval has to be chosen for it to work properly; a 24-hour interval
is recommended and should be sufficient.
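
A minimal sketch of what such a cleaning pass might look like. The cache path
and the retention period are placeholders, and the real Cleaner script may
differ in detail.

```python
# Toy cleaning pass: delete cached files that have not been accessed for
# longer than the retention period.
import os
import time

CACHE_DIR = "/var/cache/worker"  # placeholder path
RETENTION = 24 * 3600            # keep files accessed within the last 24 hours

def clean(cache_dir=CACHE_DIR, retention=RETENTION):
    now = time.time()
    for name in os.listdir(cache_dir):
        path = os.path.join(cache_dir, name)
        if os.path.isfile(path) and now - os.stat(path).st_atime > retention:
            os.remove(path)      # relies on the last access timestamp, see below

if __name__ == "__main__":
    clean()
```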
#### Last access timestamp
There is a small catch with the Cleaner service: for it to work properly, the
server filesystem has to have last access timestamps enabled. The Cleaner
checks these timestamps and decides based on them whether a file will be
deleted or not; plain modification or creation timestamps are not enough to
reflect the real usage of a particular file. The last access timestamp feature
is a bit controversial (more on this subject can be found
[here](https://en.wikipedia.org/wiki/Stat_%28system_call%29#Criticism_of_atime))
and it is not enabled by default on conventional filesystems. On Linux this can
be solved by adding the `strictatime` option to the `fstab` file. On Windows
the following command has to be executed (as administrator):
`fsutil behavior set disablelastaccess 0`.

Another possibility is to update the last modification timestamp whenever a
file is accessed. This timestamp is supported by most major filesystems, so
there are fewer compatibility issues than with the last access timestamp. The
modification timestamp would then have to be updated by the workers on each
access, for example using the `touch` command or something similar. The final
decision between these two approaches will be made after practical experience
with running the production system.
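
If the second approach were chosen, the worker-side "touch" could look roughly
like the following; the helper name is made up for this sketch.

```python
# Sketch of the alternative approach: the worker bumps the timestamp of a
# cached file every time it uses it, so the Cleaner can rely on mtime.
import os

def mark_cache_file_used(path):
    os.utime(path, None)   # set access and modification time to "now"
```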