# Broker
The broker is a central part of the ReCodEx backend that directs almost all
communication. It was designed to handle a heavy load of messages by performing
only small actions in the main communication thread and executing everything
else asynchronously.
## Description
The broker's responsibilities are:

- allowing workers to register themselves and keeping track of their capabilities
- tracking the status of each worker and handling cases when they crash
- accepting assignment evaluation requests from the frontend and forwarding them
  to workers
- receiving job status information from workers and forwarding it to the frontend
  either via the monitor or the REST API
- notifying the frontend of errors in the backend
## Architecture
The broker uses our ZeroMQ _reactor_ to bind events on sockets to handler classes.
There are currently two handlers -- one that handles the main functionality and
another one that sends status reports to the REST API asynchronously so that the
broker does not have to wait for HTTP requests which can take a lot of time,
especially when some kind of error happens on the server.
### Worker registry
The `worker_registry` class is used to store information about workers, their
status and the jobs in their queue. It can look up a worker using the headers
received with a request (a worker is considered suitable if and only if it
satisfies all the headers). The headers are arbitrary key-value pairs, which
are checked for equality by the broker. However, some headers require special
handling, namely `threads`, for which we check if the value in the request is
less than or equal to the value advertised by the worker, and `hwgroup`, for
which we support requesting one of multiple hardware groups by listing multiple
names separated with a `|` symbol (e.g. `group_1|group_2|group_3`).

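
The matching rules above can be illustrated with a short Python sketch. It is not the broker's actual code: the function name `worker_matches`, the representation of headers as string dictionaries, and the `env` header used in the usage example are assumptions made for illustration.

```python
def worker_matches(request_headers, worker_headers):
    """Return True if the worker satisfies every header in the request."""
    for key, wanted in request_headers.items():
        offered = worker_headers.get(key)
        if offered is None:
            # The worker does not advertise this header at all.
            return False
        if key == "threads":
            # The request must not ask for more threads than the worker offers.
            if int(wanted) > int(offered):
                return False
        elif key == "hwgroup":
            # The request may list several acceptable groups separated by "|".
            if offered not in wanted.split("|"):
                return False
        elif wanted != offered:
            # All other headers are compared for plain equality.
            return False
    return True


# Hypothetical worker advertising three headers.
worker = {"env": "c", "threads": "4", "hwgroup": "group_2"}
request = {"env": "c", "threads": "2", "hwgroup": "group_1|group_2"}
suitable = worker_matches(request, worker)
```

In this sketch a request with no headers matches any worker, which follows from the "satisfies all the headers" rule.
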
The registry also implements a basic load balancing algorithm -- the
workers are contained in a queue and whenever one of them receives a job, it is
moved to its end, which makes it less likely to receive another job soon.
When a worker is assigned a job, it will not be assigned another one until we
receive a `done` message from it.
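
A minimal sketch of this queue-based balancing follows. The class and method names are hypothetical, and it assumes that a job picked for a busy worker waits in that worker's queue until the `done` message arrives; the real `worker_registry` may organize this differently.

```python
from collections import deque


class Registry:
    """Toy model of the worker queue used for load balancing."""

    def __init__(self, worker_ids):
        self.order = deque(worker_ids)                   # balancing order
        self.current = {w: None for w in worker_ids}     # job being evaluated
        self.queued = {w: deque() for w in worker_ids}   # jobs waiting per worker

    def submit(self, job, is_suitable=lambda worker: True):
        """Pick the first suitable worker and move it to the end of the queue."""
        for worker in list(self.order):
            if is_suitable(worker):
                self.order.remove(worker)
                self.order.append(worker)
                if self.current[worker] is None:
                    self.current[worker] = job       # worker is free, send now
                else:
                    self.queued[worker].append(job)  # wait for `done`
                return worker
        return None  # no suitable worker exists

    def done(self, worker):
        """Handle a `done` message and hand over the next queued job, if any."""
        queue = self.queued[worker]
        self.current[worker] = queue.popleft() if queue else None
        return self.current[worker]
```

With two workers `a` and `b`, three consecutive submissions land on `a`, `b`, and `a` again, and the third job is only handed over once `a` reports `done`.
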
### Error handling
**Job failure** -- we recognize two ways a job can fail -- internally and
externally. An internal failure is the worker's fault -- for example when it
cannot download a file needed for the evaluation. An external failure is, for
example, a malformed job configuration. Note that we do not consider a student
submitting an incorrect solution a job failure.

Jobs that failed internally are reassigned until a limit on the number of
reassignments (configurable with the `max_request_failures` option) is reached.
External failures are reported to the frontend immediately.
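
The decision described above can be sketched as follows. The names `JobRequest` and `handle_failure` are hypothetical, and the sketch assumes that a job may be reassigned exactly `max_request_failures` times before it is reported; whether the real limit is inclusive is not stated here.

```python
class JobRequest:
    """Hypothetical record of a job and the failures counted against it."""

    def __init__(self, job_id):
        self.job_id = job_id
        self.failure_count = 0


def handle_failure(request, kind, max_request_failures):
    """Return "reassign" or "report" for a failed job."""
    if kind == "external":
        # External failures (e.g. a malformed job configuration) are never retried.
        return "report"
    # Internal failure: count it and retry while the limit allows (assumed inclusive).
    request.failure_count += 1
    if request.failure_count <= max_request_failures:
        return "reassign"
    return "report"


request = JobRequest("job-1")
decisions = [handle_failure(request, "internal", 3) for _ in range(4)]
```

Because the counter lives on the request, failures caused by a worker crash can be added to the same counter.
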

**Worker failure** -- when a worker crash is detected, we attempt to reassign its
current job and also all the jobs from its queue. Because the current job might
be the reason for the crash, its reassignment is also counted towards the
`max_request_failures` limit (the counter is shared). If there is no worker that
could process a job (i.e. it cannot be reassigned), the job is reported as
failed to the frontend via the REST API.

**Broker failure** -- when the broker itself crashes and is restarted, workers
will reconnect automatically. However, all jobs in their queues are lost. If a
worker manages to finish a job and notifies the "new" broker, the report is
forwarded to the frontend. The same goes for external failures. Jobs that fail
internally cannot be reassigned, because the "new" broker does not know their
headers -- they are reported as failed immediately.
## Configuration and usage
The following text describes how to set up and run the broker program. It assumes that the required binaries are already installed. Using systemd is recommended for the best user experience, but it is not required. Almost all modern Linux distributions use systemd now.
### Default broker configuration
#### Configuration items
The following list describes the configurable items in the broker's config. Mandatory items are bold, optional italic.

- _clients_ -- specifies address and port to bind for clients (frontend instance)
    - _address_ -- hostname or IP address as string (`*` for any)
    - _port_ -- desired port
- _workers_ -- specifies address and port to bind for workers
    - _address_ -- hostname or IP address as string (`*` for any)
    - _port_ -- desired port
    - _max_liveness_ -- maximum number of pings the worker can fail to send before it is considered disconnected
    - _max_request_failures_ -- maximum number of times a job can fail (due to e.g. worker disconnect or a network error when downloading something from the fileserver) and be assigned again
- _monitor_ -- settings of the monitor service connection
    - _address_ -- IP address of the running monitor service
    - _port_ -- desired port
- _notifier_ -- details of the connection used to report errors and other noteworthy states
    - _address_ -- address where the frontend API runs
    - _port_ -- desired port
    - _username_ -- username which can be used for HTTP authentication
    - _password_ -- password which can be used for HTTP authentication
- _logger_ -- settings of logging capabilities
    - _file_ -- path to the log file, with the file name given without suffix. `/var/log/recodex/broker` will produce `broker.log`, `broker.1.log`, ...
    - _level_ -- level of logging, one of `off`, `emerg`, `alert`, `critical`, `err`, `warn`, `notice`, `info` and `debug`
    - _max-size_ -- maximum size of the log file before rotating
    - _rotations_ -- number of rotations kept
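
To make the meaning of `max_liveness` more concrete, here is a small sketch of a missed-ping counter. The class is hypothetical, and it assumes that the count is reset by every received ping and that the worker is treated as disconnected once `max_liveness` consecutive pings are missed; the broker's exact bookkeeping may differ.

```python
class WorkerLiveness:
    """Counts consecutive missed pings for one worker."""

    def __init__(self, max_liveness):
        self.max_liveness = max_liveness
        self.missed = 0

    def on_ping(self):
        # Any ping from the worker resets the counter.
        self.missed = 0

    def on_missed_ping(self):
        """Record a missed ping; return True while the worker still counts as alive."""
        self.missed += 1
        return self.missed < self.max_liveness


liveness = WorkerLiveness(max_liveness=3)
liveness.on_missed_ping()
liveness.on_ping()  # the worker recovered, so the counter starts over
alive_after_each_miss = [liveness.on_missed_ping() for _ in range(3)]
```
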
#### Example config file

    # Address and port for clients (frontend)
    clients:
        address: "*"
        port: 9658
    # Address and port for workers
    workers:
        address: "*"
        port: 9657
        max_liveness: 10
        max_request_failures: 3
    monitor:
        address: ""
        port: 7894
    notifier:
        address: ""
        port: 8080
        username: ""
        password: ""
    logger:
        file: "/var/log/recodex/broker"  # w/o suffix - actual names will be
                                         # broker.log, broker.1.log, ...
        level: "debug"  # level of logging
        max-size: 1048576  # 1 MB; max size of file before log rotation
        rotations: 3  # number of rotations kept