a couple of paragraphs about the broker

Teyras 8 years ago
parent 80fd6a8836
commit bac0ee8c11

# Broker
The broker is a central part of the ReCodEx backend that directs almost all
communication.
## Description
The broker's responsibilities are:
- allowing workers to register themselves and keeping track of their capabilities
- tracking the status of workers and handling cases when they crash
- accepting assignment evaluation requests from the frontend and forwarding them
to workers
- receiving job status information from workers and forwarding it to the frontend,
either via the monitor or the REST API
- notifying the frontend of errors in the backend
## Architecture
The broker uses our ZeroMQ reactor to bind events on sockets to handler classes.
There are currently two handlers: one that handles the main functionality and
another that sends status reports to the REST API asynchronously, so that the
broker doesn't have to wait for HTTP requests, which can take a long time,
especially when an error occurs on the server.
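
To illustrate the handler split, here is a toy, in-process Python sketch. It is not the broker's code: there are no real ZeroMQ sockets, and the socket names (`workers`, `status`, `frontend`), the `Reactor` and `StatusNotifier` classes and the `post` callable are all hypothetical. It only shows the idea that the main handler enqueues a status report and returns, while a background thread performs the slow HTTP request.

```python
import queue
import threading
import time
from collections import defaultdict


class Reactor:
    """Toy reactor: routes a message arriving on a named "socket" to every
    handler registered for that socket. Real sockets and polling are omitted."""

    def __init__(self):
        self._handlers = defaultdict(list)
        self.outbox = []  # messages for sockets without a local handler

    def add_handler(self, origins, handler):
        for origin in origins:
            self._handlers[origin].append(handler)

    def send(self, target, frames):
        if target in self._handlers:
            self.dispatch(target, frames)  # internal socket: pass to its handler
        else:
            self.outbox.append((target, frames))  # e.g. a message to the frontend

    def dispatch(self, origin, frames):
        for handler in self._handlers.get(origin, []):
            handler(origin, frames, self.send)


class StatusNotifier:
    """Handler that performs slow reports on a background thread, so the
    reactor never waits for them."""

    def __init__(self, post):
        self._post = post  # hypothetical callable doing the HTTP request
        self._queue = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def __call__(self, origin, frames, send):
        self._queue.put(frames)  # enqueue and return immediately

    def _run(self):
        while (frames := self._queue.get()) is not None:
            self._post(frames)

    def close(self):
        self._queue.put(None)  # sentinel: stop once pending reports are posted
        self._thread.join()


def main_handler(origin, frames, send):
    """Stand-in for the main broker logic: forward a result, emit a report."""
    if origin == "workers" and frames[0] == "done":
        send("frontend", frames)
        send("status", ["job-finished", frames[1]])


delivered = []


def slow_post(frames):
    time.sleep(0.2)  # simulate a slow REST API
    delivered.append(frames)


reactor = Reactor()
notifier = StatusNotifier(slow_post)
reactor.add_handler(["workers"], main_handler)
reactor.add_handler(["status"], notifier)

start = time.monotonic()
reactor.dispatch("workers", ["done", "job-42", "OK"])
elapsed = time.monotonic() - start  # far below the 0.2 s the POST takes
notifier.close()
print(delivered)  # [['job-finished', 'job-42']]
```

The dispatch returns as soon as the report is queued; the cost of the HTTP call is paid only by the notifier thread.
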
### Worker registry
The `worker_registry` class stores information about workers, their status and
the jobs in their queues. It can look up a worker using the headers received
with a request. It also implements a basic load balancing algorithm: the
workers are kept in a queue, and whenever one of them receives a job, it is
moved to the back, which makes it less likely to receive another job soon.
When a worker is sent a job, it won't be sent another one until we receive a
`done` message from it; jobs assigned to it in the meantime wait in its queue.
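
The registry behaviour described above can be sketched as follows. The sketch assumes that a worker matches a request when it advertises every header the request asks for; the exact matching rule is an assumption, and `Worker`, `WorkerRegistry`, `assign`, `send_next` and `on_done` are hypothetical names, not the actual `worker_registry` interface.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Optional


@dataclass(eq=False)  # compare workers by identity, not by field values
class Worker:
    identity: str
    headers: dict  # capabilities sent at registration, e.g. {"env": "c"}
    queue: deque = field(default_factory=deque)  # jobs waiting for this worker
    current: Optional[str] = None  # job being processed; None when idle

    def supports(self, request_headers):
        # Assumed rule: every header the job asks for must match exactly.
        return all(self.headers.get(k) == v for k, v in request_headers.items())


class WorkerRegistry:
    def __init__(self):
        self._workers = deque()  # front = preferred candidate for the next job

    def add_worker(self, worker):
        self._workers.append(worker)

    def find_worker(self, request_headers):
        """First matching worker in queue order, or None."""
        return next((w for w in self._workers if w.supports(request_headers)), None)

    def deprioritize(self, worker):
        """Move a worker that just received a job to the back of the queue."""
        self._workers.remove(worker)
        self._workers.append(worker)


def assign(registry, job_id, request_headers):
    """Queue a job on a matching worker; None if no worker can take it."""
    worker = registry.find_worker(request_headers)
    if worker is not None:
        worker.queue.append(job_id)
        registry.deprioritize(worker)
    return worker


def send_next(worker):
    """Send the next queued job, but only if the worker is idle."""
    if worker.current is None and worker.queue:
        worker.current = worker.queue.popleft()
        return worker.current
    return None


def on_done(worker):
    """A `done` message frees the worker for its next queued job."""
    worker.current = None
    return send_next(worker)


registry = WorkerRegistry()
a, b = Worker("a", {"env": "c"}), Worker("b", {"env": "c"})
registry.add_worker(a)
registry.add_worker(b)
print([assign(registry, j, {"env": "c"}).identity for j in ("j1", "j2", "j3")])
# ['a', 'b', 'a']
print(send_next(a), send_next(a), on_done(a))  # j1 None j3
```

Rotating the chosen worker to the back is a cheap approximation of round-robin among compatible workers; this sketch does not look at actual queue lengths.
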
### Error handling
**Job failure** - we recognize two ways a job can fail: internally and
externally. An internal failure is the worker's fault, for example when it
can't download a file needed for the evaluation. An external failure is, for
example, one caused by a malformed job configuration. Note that we don't
consider a student submitting an incorrect solution a job failure.
Jobs that failed internally are reassigned until a limit on the number of
reassignments (configurable with the `max_request_failures` option) is reached.
External failures are reported to the frontend immediately.
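
A minimal sketch of the decision for a single failed job. It assumes that a job may be reassigned at most `max_request_failures` times and that the internal failure after that is reported; the `Request` class and the `on_job_failure` function are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Request:
    job_id: str
    failure_count: int = 0  # internal failures so far


def on_job_failure(request, internal, max_request_failures):
    """Return "reassign" or "report" for a failed job.

    Assumption: a job may be reassigned at most `max_request_failures` times;
    the failure after that is reported to the frontend.
    """
    if not internal:
        return "report"  # external failure (e.g. malformed config): report at once
    request.failure_count += 1
    if request.failure_count > max_request_failures:
        return "report"
    return "reassign"


req = Request("job-1")
print([on_job_failure(req, internal=True, max_request_failures=2) for _ in range(3)])
# ['reassign', 'reassign', 'report']
print(on_job_failure(Request("job-2"), internal=False, max_request_failures=2))  # report
```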

**Worker failure** - when a worker crash is detected, we attempt to reassign its
current job and also all the jobs from its queue. Because the current job might
be the cause of the crash, its reassignment is also counted towards the
`max_request_failures` limit (the counter is shared with internal job
failures). If there is no worker that could process a job (i.e. it cannot be
reassigned), the job is reported as failed to the frontend via the REST API.
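
The crash handling can be sketched as follows. This is a simplified model under stated assumptions: workers are a plain list, the first matching worker takes each job (load balancing is ignored), and only the job that was running at the time of the crash has its shared failure counter incremented. All names are hypothetical.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Request:
    job_id: str
    headers: dict
    failure_count: int = 0  # shared by crash and internal-failure reassignments


@dataclass(eq=False)  # compare workers by identity
class Worker:
    headers: dict
    current: Optional[Request] = None  # job running when the worker crashed
    queue: deque = field(default_factory=deque)  # jobs waiting for this worker

    def supports(self, headers):
        return all(self.headers.get(k) == v for k, v in headers.items())


def on_worker_crash(crashed, workers, max_request_failures):
    """Reassign a crashed worker's jobs; return the ids reported as failed."""
    workers.remove(crashed)
    jobs = list(crashed.queue)
    if crashed.current is not None:
        # The running job may have caused the crash, so it counts as a failure.
        crashed.current.failure_count += 1
        jobs.insert(0, crashed.current)
    failed = []
    for request in jobs:
        target = next((w for w in workers if w.supports(request.headers)), None)
        if target is None or request.failure_count > max_request_failures:
            failed.append(request.job_id)  # reported to the frontend via REST API
        else:
            target.queue.append(request)
    return failed


c_worker = Worker({"env": "c"})
crashed = Worker({"env": "c"}, current=Request("j1", {"env": "c"}))
crashed.queue.extend([Request("j2", {"env": "python"}), Request("j3", {"env": "c"})])
workers = [crashed, c_worker]
print(on_worker_crash(crashed, workers, max_request_failures=2))  # ['j2']
print([r.job_id for r in c_worker.queue])  # ['j1', 'j3']
```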

**Broker failure** - when the broker itself crashes and is restarted, workers
will reconnect automatically. However, all jobs in their queues are lost. If a
worker manages to finish a job and notifies the "new" broker, the report is
forwarded to the frontend. The same goes for external failures. Jobs that fail
internally cannot be reassigned, because the "new" broker doesn't know their
headers - they are reported as failed immediately.
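
The restart rules reduce to a small decision table. In this hypothetical sketch, `known_requests` stands for the jobs the restarted broker has received itself, and the status strings `OK`, `FAILED_EXTERNAL` and `FAILED_INTERNAL` are made up for illustration.

```python
def on_worker_result(known_requests, job_id, status):
    """What a freshly restarted broker does with a job result from a worker.

    known_requests maps job ids to request headers; after a restart it only
    contains jobs received by this broker instance.
    """
    if status in ("OK", "FAILED_EXTERNAL"):
        return "forward"  # results and external failures go to the frontend
    if job_id not in known_requests:
        return "report_failed"  # headers lost with the old broker: no reassignment
    return "reassign"  # usual internal-failure handling (max_request_failures)


known = {"job-new": {"env": "c"}}  # only jobs received after the restart
print(on_worker_result(known, "job-old", "OK"))  # forward
print(on_worker_result(known, "job-old", "FAILED_INTERNAL"))  # report_failed
print(on_worker_result(known, "job-new", "FAILED_INTERNAL"))  # reassign
```
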
## Installation
