Copy broker content into new doc
parent 46f6b38bb7
commit abe471833b
@@ -1,72 +0,0 @@
# Broker

The broker is a central part of the ReCodEx backend that directs almost all
communication. It is designed to sustain a heavy load of messages by doing only
small actions in the main communication thread and executing all other actions
asynchronously.

## Description

The broker's responsibilities are:

- allowing workers to register themselves and keeping track of their capabilities
- tracking the status of each worker and handling cases when they crash
- accepting assignment evaluation requests from the frontend and forwarding them
  to workers
- receiving job status information from workers and forwarding it to the frontend,
  either via the monitor or the REST API
- notifying the frontend of errors in the backend

## Architecture

The broker uses our ZeroMQ _reactor_ to bind events on sockets to handler classes.
There are currently two handlers -- one that handles the main functionality and
another one that sends status reports to the REST API asynchronously, so that the
broker does not have to wait for HTTP requests, which can take a long time,
especially when an error occurs on the server.
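
The reason for the second handler can be illustrated with a minimal Python
sketch. It is not the broker's code and it swaps the ZeroMQ reactor for a plain
thread and queue; `AsyncNotifier` and the `send` callable (standing in for the
HTTP request to the REST API) are hypothetical names.

```python
import queue
import threading


class AsyncNotifier:
    """Delivers status reports on a background thread (illustrative only)."""

    def __init__(self, send):
        self._send = send              # hypothetical stand-in for the HTTP call
        self._pending = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def notify(self, message):
        # Only enqueues the report, so the caller never waits for the server.
        self._pending.put(message)

    def close(self):
        # None is a sentinel that stops the background thread after
        # all previously queued reports have been processed.
        self._pending.put(None)
        self._thread.join()

    def _run(self):
        while True:
            message = self._pending.get()
            if message is None:
                break
            try:
                self._send(message)
            except Exception:
                # A slow or failing server affects only this thread;
                # a real handler would log the error or retry.
                pass
```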

### Worker registry

The `worker_registry` class stores information about workers, their status and
the jobs in their queues. It can look up a worker using the headers received
with a request (a worker is considered suitable if and only if it satisfies all
the headers). The headers are arbitrary key-value pairs, which the broker checks
for equality. However, some headers require special handling, namely `threads`,
for which we check that the value in the request is less than or equal to the
value advertised by the worker, and `hwgroup`, for which we support requesting
one of multiple hardware groups by listing multiple names separated with a `|`
symbol (e.g. `group_1|group_2|group_3`).
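
The matching rules above can be sketched in a few lines of Python. This is an
illustration, not the broker's implementation: the function name
`worker_matches` is hypothetical, and the sketch assumes that headers are
single-valued strings and that a worker lacking a requested header is not
suitable.

```python
def worker_matches(request_headers, worker_headers):
    """Return True if the worker satisfies every header of the request."""
    for key, requested in request_headers.items():
        advertised = worker_headers.get(key)
        if advertised is None:
            # Assumption: a header the worker does not advertise is unsatisfied.
            return False
        if key == "threads":
            # The request may ask for at most as many threads as advertised.
            if int(requested) > int(advertised):
                return False
        elif key == "hwgroup":
            # The request lists acceptable groups separated by "|".
            if advertised not in requested.split("|"):
                return False
        elif requested != advertised:
            # All other headers are compared for plain equality.
            return False
    return True
```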

The registry also implements a basic load balancing algorithm -- the
workers are kept in a queue and whenever one of them receives a job, it is
moved to the end of the queue, which makes it less likely to receive another
job soon.
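
A minimal sketch of this rotation, assuming the queue is a `collections.deque`
and suitability is decided by a caller-supplied predicate (`pick_worker` and
`is_suitable` are hypothetical names, not part of the broker):

```python
from collections import deque


def pick_worker(workers, is_suitable):
    """Return the first suitable worker and move it to the end of the queue."""
    chosen = next((w for w in workers if is_suitable(w)), None)
    if chosen is not None:
        # Rotating the chosen worker to the back means workers that have
        # waited longer are considered first for the next job.
        workers.remove(chosen)
        workers.append(chosen)
    return chosen
```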

When a worker is assigned a job, it will not be assigned another one until we
receive a `done` message from it.
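
One way to picture this bookkeeping is a per-worker record with a current job
and a queue of waiting jobs. The class below is a hypothetical sketch:
`WorkerState` and its `sent` list (a stand-in for actually sending a message to
the worker) are assumptions made for illustration.

```python
from collections import deque


class WorkerState:
    """Tracks the job a worker is processing and the jobs waiting for it."""

    def __init__(self, name):
        self.name = name
        self.current = None    # job the worker is processing, if any
        self.queue = deque()   # jobs waiting until the worker reports `done`
        self.sent = []         # stand-in for messages sent to the worker

    def assign(self, job):
        if self.current is None:
            self._send(job)
        else:
            # The worker is busy, so the job waits in its queue.
            self.queue.append(job)

    def on_done(self):
        # The worker finished its current job; hand over the next one, if any.
        self.current = None
        if self.queue:
            self._send(self.queue.popleft())

    def _send(self, job):
        self.current = job
        self.sent.append(job)
```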

### Error handling

**Job failure** -- we recognize two ways a job can fail: internally and
externally. An internal failure is the worker's fault -- for example when it
cannot download a file needed for the evaluation. An external failure is,
for example, a malformed job configuration. Note that we do not
consider a student submitting an incorrect solution a job failure.

Jobs that failed internally are reassigned until a limit on the number of
reassignments (configurable with the `max_request_failures` option) is reached.
External failures are reported to the frontend immediately.
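
The policy can be summarized in a short Python sketch. Only the
`max_request_failures` option comes from the text; the function name, the
`failures` counter stored on the job, the callbacks, and the exact boundary
condition (at most `max_request_failures` reassignments) are assumptions.

```python
def handle_job_failure(job, internal, max_request_failures,
                       reassign, report_failure):
    """Handle a failed job; `job` is a dict carrying a `failures` counter."""
    if not internal:
        # External failures (e.g. a malformed configuration) would fail on
        # any worker, so they are reported right away.
        report_failure(job)
        return
    if job["failures"] < max_request_failures:
        # Internal failure: another attempt may succeed.
        job["failures"] += 1
        reassign(job)
    else:
        # The reassignment limit is exhausted.
        report_failure(job)
```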

**Worker failure** -- when a worker crash is detected, we attempt to reassign its
current job and also all the jobs from its queue. Because the current job might
be the cause of the crash, its reassignment is also counted towards the
`max_request_failures` limit (the counter is shared). If there is no worker that
could process a job (i.e. it cannot be reassigned), the job is reported as
failed to the frontend via the REST API.
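
The crash handling can be sketched as follows. All names are hypothetical, and
the sketch assumes that only the current job's reassignment is counted towards
the limit while queued jobs are moved without touching their counters.

```python
def handle_worker_crash(current_job, queued_jobs, max_request_failures,
                        find_worker, assign, report_failure):
    """Redistribute the jobs of a crashed worker (illustrative sketch)."""

    def reassign(job):
        worker = find_worker(job)      # returns None if no worker is suitable
        if worker is None:
            report_failure(job)
        else:
            assign(worker, job)

    if current_job is not None:
        # The current job may have caused the crash, so moving it counts
        # towards the same counter that internal job failures use.
        if current_job["failures"] < max_request_failures:
            current_job["failures"] += 1
            reassign(current_job)
        else:
            report_failure(current_job)

    # Assumption: queued jobs never started, so their counters stay unchanged.
    for job in queued_jobs:
        reassign(job)
```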

**Broker failure** -- when the broker itself crashes and is restarted, workers
will reconnect automatically. However, all jobs in their queues are lost. If a
worker manages to finish a job and notifies the "new" broker, the report is
forwarded to the frontend. The same goes for external failures. Jobs that fail
internally cannot be reassigned, because the "new" broker does not know their
headers -- they are reported as failed immediately.
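
The decision a restarted broker faces can be written as a small lookup. This is
a hypothetical sketch: the function name, the outcome strings and the
`known_headers` mapping (job id to request headers, empty right after a
restart) are assumptions, not the broker's actual protocol.

```python
def handle_worker_report(known_headers, job_id, outcome):
    """Return the action taken for a worker's report about `job_id`.

    `outcome` is one of "ok", "external_failure" or "internal_failure".
    """
    if outcome in ("ok", "external_failure"):
        # Results and external failures need no headers; pass them on.
        return "forward_to_frontend"
    if job_id in known_headers:
        # Internal failure of a job the broker still knows how to route
        # (still subject to the reassignment limit).
        return "reassign"
    # Internal failure of a job accepted before the restart: the headers
    # are gone, so no suitable worker can be looked up.
    return "report_failed"
```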