# Broker

The broker is a central part of the ReCodEx backend that directs almost all
communication.

## Description

The broker's responsibilities are:

- allowing workers to register themselves and keeping track of their
  capabilities
- tracking workers' status and handling cases when they crash
- accepting assignment evaluation requests from the frontend and forwarding them
  to workers
- receiving job status information from workers and forwarding it to the
  frontend, either via the monitor or the REST API
- notifying the frontend of errors in the backend

## Architecture

The broker uses our ZeroMQ reactor to bind events on sockets to handler classes.
There are currently two handlers: one that handles the main functionality and
another that sends status reports to the REST API asynchronously. This way the
broker doesn't have to wait for HTTP requests, which can take a long time,
especially when an error occurs on the server.

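The split between the two handlers can be illustrated with a small Python sketch. It is not the broker's code and it does not use the ZeroMQ API: the `Reactor` and `StatusNotifier` names are hypothetical, sockets are replaced by plain string keys, and the slow HTTP request is represented by an arbitrary callable. It only shows why a handler that merely enqueues reports never blocks the event loop.

```python
import queue
import threading


class Reactor:
    """Hypothetical stand-in for the reactor: maps a socket key to handlers."""

    def __init__(self):
        self._handlers = {}

    def bind(self, socket_key, handler):
        self._handlers.setdefault(socket_key, []).append(handler)

    def dispatch(self, socket_key, message):
        # Deliver an incoming message to every handler bound to this key.
        for handler in self._handlers.get(socket_key, []):
            handler(message)


class StatusNotifier:
    """Handler that sends reports from a background thread (assumed design)."""

    def __init__(self, send):
        self._send = send  # callable that performs the slow request
        self._queue = queue.Queue()
        threading.Thread(target=self._run, daemon=True).start()

    def __call__(self, report):
        # Called by the reactor; it only enqueues, so it returns immediately.
        self._queue.put(report)

    def _run(self):
        while True:
            report = self._queue.get()
            try:
                self._send(report)
            except Exception:
                pass  # a failing request must not stop the notifier thread
            finally:
                self._queue.task_done()

    def wait_until_sent(self):
        self._queue.join()


sent = []
reactor = Reactor()
notifier = StatusNotifier(sent.append)
reactor.bind("status", notifier)
reactor.dispatch("status", {"job_id": "job-1", "status": "OK"})
notifier.wait_until_sent()  # only so that the example can inspect `sent`
```

In this sketch `dispatch` returns as soon as the report is queued, so a slow or failing request delays only the notifier thread.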
### Worker registry

The `worker_registry` class stores information about workers, their status and
the jobs in their queues. It can look up a worker using the headers received
with a request. It also implements basic load balancing: the workers are kept
in a queue, and whenever one of them receives a job, it is moved to the back,
which makes it less likely to receive another job soon.

When a worker is assigned a job, it won't be assigned another one until we
receive a `done` message from it.

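The load balancing described above can be sketched as follows. This is a simplified model, not the actual `worker_registry` implementation: the `Worker` and `WorkerRegistry` classes, their method names, and the exact-match comparison of headers are assumptions made for the example.

```python
from collections import deque


class Worker:
    def __init__(self, identity, headers):
        self.identity = identity
        self.headers = headers      # capabilities announced at registration
        self.current_job = None     # job being evaluated, if any
        self.queued_jobs = deque()  # jobs waiting for a `done` message


class WorkerRegistry:
    def __init__(self):
        self._workers = deque()

    def add(self, worker):
        self._workers.append(worker)

    def find(self, request_headers):
        """Return the first worker matching all request headers, or None."""
        for worker in self._workers:
            if all(worker.headers.get(k) == v for k, v in request_headers.items()):
                return worker
        return None

    def assign(self, job_id, request_headers):
        worker = self.find(request_headers)
        if worker is None:
            return None
        if worker.current_job is None:
            worker.current_job = job_id
        else:
            # A busy worker gets nothing new until it reports `done`.
            worker.queued_jobs.append(job_id)
        # Move the worker to the back so that others are preferred next time.
        self._workers.remove(worker)
        self._workers.append(worker)
        return worker

    def done(self, worker):
        """Handle a `done` message and return the next job for the worker."""
        worker.current_job = worker.queued_jobs.popleft() if worker.queued_jobs else None
        return worker.current_job


registry = WorkerRegistry()
a = Worker("A", {"env": "c"})
b = Worker("B", {"env": "c"})
registry.add(a)
registry.add(b)
first = registry.assign("job-1", {"env": "c"})
second = registry.assign("job-2", {"env": "c"})
third = registry.assign("job-3", {"env": "c"})
next_job = registry.done(a)
```

With two equally capable workers, the first two jobs go to different workers, and the third one waits in the queue of worker `A` until its `done` message arrives.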
### Error handling

**Job failure** - we recognize two ways a job can fail: internally and
externally. An internal failure is the worker's fault, for example when it
can't download a file needed for the evaluation. An external failure is, for
example, a malformed job configuration. Note that we don't consider a student
submitting an incorrect solution a job failure.

Jobs that failed internally are reassigned until a limit on the number of
reassignments (configurable with the `max_request_failures` option) is reached.
External failures are reported to the frontend immediately.

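The rule for failed jobs can be written down as a short decision function. The sketch below is an assumption-laden illustration: the function name, the `failures` counter stored on the job, the callbacks and the exact comparison against `max_request_failures` are hypothetical and may differ from the broker's real behaviour.

```python
def handle_job_failure(job, kind, max_request_failures, reassign, report_failure):
    """Decide what happens to a failed job.

    job  -- dict with an "id" and a "failures" counter (assumed layout)
    kind -- "internal" or "external"
    """
    if kind == "external":
        # External failures go straight to the frontend.
        report_failure(job["id"])
        return "reported"
    if job["failures"] < max_request_failures:
        job["failures"] += 1
        reassign(job["id"])
        return "reassigned"
    # The reassignment limit has been reached.
    report_failure(job["id"])
    return "reported"


reassigned_ids, reported_ids = [], []
job = {"id": "job-1", "failures": 0}
outcomes = [
    handle_job_failure(job, "internal", 2, reassigned_ids.append, reported_ids.append)
    for _ in range(3)
]
malformed = {"id": "job-2", "failures": 0}
external_outcome = handle_job_failure(
    malformed, "external", 2, reassigned_ids.append, reported_ids.append
)
```

With a limit of 2, the first two internal failures lead to a reassignment and the third one is reported to the frontend.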
**Worker failure** - when a worker crash is detected, we attempt to reassign its
current job and also all the jobs from its queue. Because the current job might
be the cause of the crash, its reassignment also counts towards the
`max_request_failures` limit (the counter is shared). If there is no worker that
could process a job (i.e. it cannot be reassigned), the job is reported as
failed to the frontend via the REST API.

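A possible shape of the crash handling, under the same assumptions as before: workers and jobs are plain dictionaries, `find_worker` is a hypothetical lookup that returns a suitable worker or `None`, and the `failures` counter is the one shared with internal job failures.

```python
def handle_worker_crash(crashed, find_worker, max_request_failures):
    """Redistribute the jobs of a crashed worker.

    Returns two lists of job ids: (reassigned, reported_as_failed).
    """
    reassigned, failed = [], []
    current = crashed["current_job"]
    pending = ([current] if current is not None else []) + list(crashed["queued_jobs"])
    crashed["current_job"], crashed["queued_jobs"] = None, []

    for job in pending:
        if job is current:
            # The running job may have caused the crash, so its reassignment
            # is charged to the shared failure counter.
            if job["failures"] >= max_request_failures:
                failed.append(job["id"])
                continue
            job["failures"] += 1
        target = find_worker(job)
        if target is None:
            failed.append(job["id"])  # no worker can process this job
        else:
            target["queued_jobs"].append(job)
            reassigned.append(job["id"])
    return reassigned, failed


spare = {"current_job": None, "queued_jobs": []}
crashed = {
    "current_job": {"id": "running", "env": "c", "failures": 0},
    "queued_jobs": [
        {"id": "queued-1", "env": "c", "failures": 0},
        {"id": "queued-2", "env": "java", "failures": 0},
    ],
}
running = crashed["current_job"]
reassigned, failed = handle_worker_crash(
    crashed,
    lambda job: spare if job["env"] == "c" else None,
    max_request_failures=3,
)
```

Only the job that was running during the crash has its counter increased. Queued jobs are moved without penalty, and a job that no remaining worker can process ends up in the failed list.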
**Broker failure** - when the broker itself crashes and is restarted, workers
will reconnect automatically. However, all jobs in their queues are lost. If a
worker manages to finish a job and notifies the "new" broker, the report is
forwarded to the frontend. The same goes for external failures. Jobs that fail
internally cannot be reassigned, because the "new" broker doesn't know their
headers, so they are reported as failed immediately.

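The behaviour after a restart boils down to whether the broker still knows the headers of the job a report refers to. A minimal sketch, assuming hypothetical status names (`done`, `external_failure`, `internal_failure`) and a `known_jobs` mapping that is empty right after a restart:

```python
def handle_worker_report(known_jobs, job_id, status):
    """Return the action the broker takes for a report from a worker.

    known_jobs -- mapping of job id to request headers (assumed layout)
    status     -- "done", "external_failure" or "internal_failure"
    """
    if status in ("done", "external_failure"):
        # These reports need no broker-side state and are simply forwarded.
        return ("forward", status)
    if job_id in known_jobs:
        # The headers are known, so a suitable worker can be looked up again.
        return ("reassign", known_jobs[job_id])
    # A restarted broker has lost the headers and cannot reassign the job.
    return ("report_failed", None)


after_restart = {}
finished = handle_worker_report(after_restart, "job-1", "done")
lost = handle_worker_report(after_restart, "job-2", "internal_failure")
```

Right after a restart `known_jobs` is empty, so every internal failure falls through to the last branch.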
## Installation