Broker implementation update

master
Petr Stefan 8 years ago
parent c30aced4b1
commit a35975951a

@@ -2656,10 +2656,10 @@ used.
## Broker
The broker is a central part of the ReCodEx backend that directs almost all
communication. It was designed to properly maintain heavy load of messages by
making only small actions in main communication thread and asynchronous
execution of other actions.
The broker is a central part of the ReCodEx backend that directs most of the
communication. It was designed to maintain a heavy load of messages by making
only small actions in the main communication thread and asynchronous execution
of other actions.
The responsibilities of broker are:
@@ -2667,18 +2667,18 @@ The responsibilities of broker are:
- tracking status of each worker and handle cases when they crash
- accepting assignment evaluation requests from the frontend and forwarding them
to workers
- receiving job status information from workers and forward it to the frontend
either via monitor or REST API
- notifying the frontend of errors in the backend
- receiving a job status information from workers and forward them to the
frontend either via monitor or REST API
- notifying the frontend on errors of the backend
### Internal Structure
The broker uses our ZeroMQ _reactor_ to bind events on sockets to handler
classes. There are currently two handlers -- one that handles the main
functionality and another one that sends status reports to the REST API
asynchronously so that the broker does not have to wait for HTTP requests which
can take a lot of time, especially when some kind of error happens on the
server.
The main work of the broker is to handle incoming messages. For that a
_reactor_ subcomponent is written to bind events on sockets to handler classes.
There are currently two handlers -- one that handles the main functionality and
the other that sends status reports to the REST API asynchronously. This
prevents broker freezes when synchronously waiting for responses of HTTP
requests, especially when some kind of error happens on the server.
Main handler takes care of requests from workers and API servers:
@@ -2697,8 +2697,8 @@ requests. This notifier is used on error reporting from backend to frontend API.
The `worker_registry` class is used to store information about workers, their
status and the jobs in their queue. It can look up a worker using the headers
received with a request (a worker is considered suitable if and only if it
satisfies all the headers). The headers are arbitrary key-value pairs, which are
checked for equality by the broker. However, some headers require special
satisfies all the job headers). The headers are arbitrary key-value pairs, which
are checked for equality by the broker. However, some headers require special
handling, namely `threads`, for which we check if the value in the request is
lesser than or equal to the value advertised by the worker, and `hwgroup`, for
which we support requesting one of multiple hardware groups by listing multiple
@@ -2708,34 +2708,35 @@ The registry also implements a basic load balancing algorithm -- the workers are
contained in a queue and whenever one of them receives a job, it is moved to its
end, which makes it less likely to receive another job soon.
When a worker is assigned a job, it will not be assigned another one until we
receive a `done` message from it.
When a worker is assigned a job, it will not be assigned another one until a
`done` message is received.
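
The worker lookup and queue rotation described above can be sketched in Python. This is an illustration only, not the broker's actual code: the names `worker_matches` and `pick_worker`, the dictionary layout of a worker, and passing `hwgroup` as an already-parsed list of acceptable groups are all assumptions made for the example.

```python
from collections import deque


def worker_matches(worker_headers, request_headers):
    """Return True if the worker satisfies every header of the request."""
    for key, wanted in request_headers.items():
        offered = worker_headers.get(key)
        if offered is None:
            return False
        if key == "threads":
            # The requested thread count must not exceed the advertised one.
            if int(wanted) > int(offered):
                return False
        elif key == "hwgroup":
            # Assumption: `wanted` is a list of acceptable hardware groups.
            if offered not in wanted:
                return False
        elif wanted != offered:
            # All other headers are compared for plain equality.
            return False
    return True


def pick_worker(workers, request_headers):
    """Pick the first suitable worker and move it to the end of the queue."""
    for index, worker in enumerate(workers):
        if worker_matches(worker["headers"], request_headers):
            del workers[index]
            workers.append(worker)  # makes it less likely to be picked next
            return worker
    return None


# Hypothetical workers, used only to exercise the sketch.
workers = deque([
    {"id": "w1", "headers": {"env": "c", "threads": "2", "hwgroup": "group1"}},
    {"id": "w2", "headers": {"env": "c", "threads": "8", "hwgroup": "group2"}},
])
chosen = pick_worker(
    workers, {"env": "c", "threads": "4", "hwgroup": ["group1", "group2"]}
)
```

The sketch leaves out the per-worker job queue and the wait for a `done` message; it only shows how the special-cased headers differ from plain equality checks and how moving the chosen worker to the end of the queue spreads the load.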
#### Error Reporting
Broker is the only backend component which is able to report errors to frontend
API. For this purpose HTTP protocol is used through *libcurl* library. To
address security concerns there is *HTTP Basic Auth* configured on particular
endpoints which is simply enough to use within *libcurl*.
Broker is the only backend component which is able to report errors directly to
the REST API. Other components have to notify the broker first and it forwards
the messages to the API. For HTTP communication a *libcurl* library is used. To
address security concerns there is a *HTTP Basic Auth* configured on particular
API endpoints and correct credentials have to be entered.
Following types of failures are distinguished:
**Job failure** -- we recognize two ways a job can fail -- an internally and
externally. An internal failure is the fault of worker -- for example when it
cannot download a file needed for the evaluation for some reason. An external
error is for example when the job configuration is malformed. Note that we do
not consider a student entering an incorrect solution a job failure.
**Job failure** -- there are two ways a job can fail, internal and external one.
An internal failure is the fault of worker, for example when it
cannot download a file needed for the evaluation. An external
error is for example when the job configuration is malformed. Note that wrong
student solution is not considered as a job failure.
Jobs that failed internally are reassigned until a limit on the amount of
reassignments (configurable with the `max_request_failures` option) is reached.
External failures are reported to the frontend immediately.
**Worker failure** -- when a worker crash is detected, we attempt to reassign
its current job and also all the jobs from its queue. Because the current job
might be the reason of the crash, its reassignment is also counted towards the
`max_request_failures` limit (the counter is shared). If there is no worker
that could process a job (i.e. it cannot be reassigned), the job is reported as
failed to the frontend via REST API.
**Worker failure** -- when a worker crash is detected, an attempt to reassign
its current job and also all the jobs from its queue is made. Because the
current job might be the reason of the crash, its reassignment is also counted
towards the `max_request_failures` limit (the counter is shared). If there is
no worker that could process a job available (i.e. it cannot be reassigned),
the job is reported as failed to the frontend via REST API.
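
The job and worker failure rules above can be summarized in a short Python sketch. It is a hedged illustration, not the broker's implementation: the function names, the `reassignments` field, the `find_worker` callback, and the exact boundary at which the limit applies are assumptions. Only the `max_request_failures` option comes from the text.

```python
def handle_job_failure(job, kind, max_request_failures, find_worker):
    """Decide what to do with a failed job.

    Returns ("reassign", worker) or ("report", reason).
    """
    if kind == "external":
        # External failures are reported to the frontend immediately.
        return ("report", "external failure")
    if job["reassignments"] >= max_request_failures:
        return ("report", "reassignment limit reached")
    worker = find_worker(job)
    if worker is None:
        # No worker can process the job, so it cannot be reassigned.
        return ("report", "no suitable worker")
    job["reassignments"] += 1
    return ("reassign", worker)


def handle_worker_crash(current_job, queued_jobs, max_request_failures, find_worker):
    """Reassign the jobs of a crashed worker; returns (job id, decision) pairs."""
    decisions = []
    if current_job is not None:
        # The current job may have caused the crash, so its reassignment
        # counts towards the same (shared) limit as internal failures.
        decision = handle_job_failure(
            current_job, "internal", max_request_failures, find_worker
        )
        decisions.append((current_job["id"], decision))
    for job in queued_jobs:
        # Queued jobs never started, so their counter is left untouched.
        worker = find_worker(job)
        if worker is None:
            decisions.append((job["id"], ("report", "no suitable worker")))
        else:
            decisions.append((job["id"], ("reassign", worker)))
    return decisions
```

Keeping one counter per job, shared by both paths, matches the statement that a crash-triggered reassignment counts towards the same `max_request_failures` limit as an internal failure.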
**Broker failure** -- when the broker itself crashed and is restarted, workers
will reconnect automatically. However, all jobs in their queues are lost. If a
