@@ -2656,10 +2656,10 @@ used.

 ## Broker

-The broker is a central part of the ReCodEx backend that directs almost all
-communication. It was designed to properly maintain heavy load of messages by
-making only small actions in main communication thread and asynchronous
-execution of other actions.
+The broker is a central part of the ReCodEx backend that directs most of the
+communication. It was designed to maintain a heavy load of messages by making
+only small actions in the main communication thread and asynchronous execution
+of other actions.

 The responsibilites of broker are:

@@ -2667,18 +2667,18 @@ The responsibilites of broker are:
 - tracking status of each worker and handle cases when they crash
 - accepting assignment evaluation requests from the frontend and forwarding them
 to workers
-- receiving job status information from workers and forward it to the frontend
-either via monitor or REST API
-- notifying the frontend of errors in the backend
+- receiving a job status information from workers and forward them to the
+frontend either via monitor or REST API
+- notifying the frontend on errors of the backend

 ### Internal Structure

-The broker uses our ZeroMQ _reactor_ to bind events on sockets to handler
-classes. There are currently two handlers -- one that handles the main
-functionality and another one that sends status reports to the REST API
-asynchronously so that the broker does not have to wait for HTTP requests which
-can take a lot of time, especially when some kind of error happens on the
-server.
+The main work of the broker is to handle incoming messages. For that a
+_reactor_ subcomponent is written to bind events on sockets to handler classes.
+There are currently two handlers -- one that handles the main functionality and
+the other that sends status reports to the REST API asynchronously. This
+prevents broker freezes when synchronously waiting for responses of HTTP
+requests, especially when some kind of error happens on the server.

 Main handler takes care of requests from workers and API servers:

@@ -2697,8 +2697,8 @@ requests. This notifier is used on error reporting from backend to frontend API.
 The `worker_registry` class is used to store information about workers, their
 status and the jobs in their queue. It can look up a worker using the headers
 received with a request (a worker is considered suitable if and only if it
-satisfies all the headers). The headers are arbitrary key-value pairs, which are
-checked for equality by the broker. However, some headers require special
+satisfies all the job headers). The headers are arbitrary key-value pairs, which
+are checked for equality by the broker. However, some headers require special
 handling, namely `threads`, for which we check if the value in the request is
 lesser than or equal to the value advertised by the worker, and `hwgroup`, for
 which we support requesting one of multiple hardware groups by listing multiple
@@ -2708,34 +2708,35 @@ The registry also implements a basic load balancing algorithm -- the workers are
 contained in a queue and whenever one of them receives a job, it is moved to its
 end, which makes it less likely to receive another job soon.

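The worker lookup and the queue rotation described in the two hunks above can be illustrated with a small Python sketch. It is not the broker's actual code (`worker_registry` belongs to the broker itself); the class `WorkerQueue`, the function `worker_matches`, and the `env` header are hypothetical names chosen for illustration. Only the `threads` rule and the plain equality rule are modelled. `hwgroup` handling is left out because the list format is not visible in this excerpt.

```python
from collections import deque


def worker_matches(worker_headers, job_headers):
    """Return True if the worker satisfies every header requested by the job."""
    for key, wanted in job_headers.items():
        offered = worker_headers.get(key)
        if offered is None:
            return False
        if key == "threads":
            # Special case: the request must not ask for more threads
            # than the worker advertises.
            if int(wanted) > int(offered):
                return False
        elif wanted != offered:
            # Default case: arbitrary key-value pairs compared for equality.
            return False
    return True


class WorkerQueue:
    """Hypothetical stand-in for the registry's load-balancing queue."""

    def __init__(self, workers):
        # Each worker is a (name, headers) pair; names are assumed unique.
        self._queue = deque(workers)

    def assign(self, job_headers):
        """Pick the first suitable worker and move it to the end of the queue."""
        match = next(
            (w for w in self._queue if worker_matches(w[1], job_headers)),
            None,
        )
        if match is None:
            return None
        self._queue.remove(match)
        self._queue.append(match)
        return match[0]


registry = WorkerQueue([
    ("w1", {"env": "c", "threads": "2"}),
    ("w2", {"env": "c", "threads": "8"}),
])
first = registry.assign({"env": "c", "threads": "1"})
second = registry.assign({"env": "c", "threads": "1"})
third = registry.assign({"env": "c", "threads": "4"})
```

In this sketch two identical requests go to different workers because the first one picked is rotated to the back, while a request for four threads can only be served by the worker that advertises eight.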
-When a worker is assigned a job, it will not be assigned another one until we
-receive a `done` message from it.
+When a worker is assigned a job, it will not be assigned another one until a
+`done` message is received.

 #### Error Reporting

-Broker is the only backend component which is able to report errors to frontend
-API. For this purpose HTTP protocol is used through *libcurl* library. To
-address security concerns there is *HTTP Basic Auth* configured on particular
-endpoints which is simply enough to use within *libcurl*.
+Broker is the only backend component which is able to report errors directly to
+the REST API. Other components have to notify the broker first and it forwards
+the messages to the API. For HTTP communication a *libcurl* library is used. To
+address security concerns there is a *HTTP Basic Auth* configured on particular
+API endpoints and correct credentials have to be entered.

 Following types of failures are distinguished:

-**Job failure** -- we recognize two ways a job can fail -- an internally and
-externally. An internal failure is the fault of worker -- for example when it
-cannot download a file needed for the evaluation for some reason. An external
-error is for example when the job configuration is malformed. Note that we do
-not consider a student entering an incorrect solution a job failure.
-
-Jobs that failed internally are reassigned until a limit on the amount of
-reassingments (configurable with the `max_request_failures` option) is reached.
-External failures are reported to the frontend immediately.
-
-**Worker failure** -- when a worker crash is detected, we attempt to reassign
-its current job and also all the jobs from its queue. Because the current job
-might be the reason of the crash, its reassignment is also counted towards the
-`max_request_failures` limit (the counter is shared). If there is no worker
-that could process a job (i.e. it cannot be reassigned), the job is reported as
-failed to the frontend via REST API.
+**Job failure** -- there are two ways a job can fail, internal and external one.
+An internal failure is the fault of worker, for example when it
+cannot download a file needed for the evaluation. An external
+error is for example when the job configuration is malformed. Note that wrong
+student solution is not considered as a job failure.
+
+Jobs that failed internally are reassigned until a limit on the amount of
+reassignments (configurable with the `max_request_failures` option) is reached.
+External failures are reported to the frontend immediately.
+
+**Worker failure** -- when a worker crash is detected, an attempt to reassign
+its current job and also all the jobs from its queue is made. Because the
+current job might be the reason of the crash, its reassignment is also counted
+towards the `max_request_failures` limit (the counter is shared). If there is
+no worker that could process a job available (i.e. it cannot be reassigned),
+the job is reported as failed to the frontend via REST API.

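The reassignment rules for job and worker failures can be summarised in a short Python sketch. This is an illustration under stated assumptions, not the broker's implementation: the class `FailureTracker`, its method `on_failure`, and the `kind` strings are hypothetical, and the exact boundary (whether the limit allows exactly `max_request_failures` reassignments) is an assumption. The sketch covers only the job a worker was running when it failed. Jobs that were merely waiting in a crashed worker's queue are not modelled.

```python
class FailureTracker:
    """Hypothetical helper deciding whether a failed job is reassigned or reported."""

    def __init__(self, max_request_failures):
        self.max_request_failures = max_request_failures
        # One shared counter per job, used for both internal job failures
        # and crashes of the worker that was running the job.
        self._reassignments = {}

    def on_failure(self, job_id, kind, worker_available=True):
        """Return "reassign" or "report" for a failed job.

        kind is one of "internal", "external" or "worker_crash".
        """
        if kind == "external":
            # External failures go to the frontend immediately.
            return "report"
        used = self._reassignments.get(job_id, 0)
        if used >= self.max_request_failures or not worker_available:
            # Limit reached, or no worker can take the job.
            return "report"
        self._reassignments[job_id] = used + 1
        return "reassign"


tracker = FailureTracker(max_request_failures=2)
decisions = [
    tracker.on_failure("job-1", "internal"),
    tracker.on_failure("job-1", "worker_crash"),
    tracker.on_failure("job-1", "internal"),
]
```

Because the counter is shared, an internal failure followed by a worker crash uses up both reassignments for `job-1` in this sketch, and the third failure is reported to the frontend.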
 **Broker failure** -- when the broker itself crashed and is restarted, workers
 will reconnect automatically. However, all jobs in their queues are lost. If a