Broker implementation update

master
Petr Stefan 8 years ago
parent c30aced4b1
commit a35975951a
No known key found for this signature in database
GPG Key ID: B1D74F2C9C7433D3

@ -2656,10 +2656,10 @@ used.
## Broker ## Broker
The broker is a central part of the ReCodEx backend that directs almost all The broker is a central part of the ReCodEx backend that directs most of the
communication. It was designed to properly maintain heavy load of messages by communication. It was designed to maintain a heavy load of messages by making
making only small actions in main communication thread and asynchronous only small actions in the main communication thread and asynchronous execution
execution of other actions. of other actions.
The responsibilites of broker are: The responsibilites of broker are:
@ -2667,18 +2667,18 @@ The responsibilites of broker are:
- tracking status of each worker and handle cases when they crash - tracking status of each worker and handle cases when they crash
- accepting assignment evaluation requests from the frontend and forwarding them - accepting assignment evaluation requests from the frontend and forwarding them
to workers to workers
- receiving job status information from workers and forward it to the frontend - receiving a job status information from workers and forward them to the
either via monitor or REST API frontend either via monitor or REST API
- notifying the frontend of errors in the backend - notifying the frontend on errors of the backend
### Internal Structure ### Internal Structure
The broker uses our ZeroMQ _reactor_ to bind events on sockets to handler The main work of the broker is to handle incomming messages. For that a
classes. There are currently two handlers -- one that handles the main _reactor_ subcomponent is written to bind events on sockets to handler classes.
functionality and another one that sends status reports to the REST API There are currently two handlers -- one that handles the main functionality and
asynchronously so that the broker does not have to wait for HTTP requests which the other that sends status reports to the REST API asynchronously. This
can take a lot of time, especially when some kind of error happens on the prevents broker freezes when synchronously waiting for responses of HTTP
server. requests, especially when some kind of error happens on the server.
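
The reactor idea described in this hunk can be sketched roughly as follows. This is only an illustration, not the broker's code: the names `Reactor`, `add_handler`, `dispatch` and `AsyncHandler` are hypothetical, and plain Python callables stand in for sockets and handler classes. It shows the main thread doing only a small dispatch step while a slow handler (such as an HTTP status report) runs on its own thread.

```python
import queue
import threading


class Reactor:
    """Maps an event origin (for example a socket name) to handlers."""

    def __init__(self):
        self._handlers = {}

    def add_handler(self, origin, handler):
        self._handlers.setdefault(origin, []).append(handler)

    def dispatch(self, origin, message):
        # The main thread only hands the message over to each handler
        # bound to the origin; it does no slow work itself.
        for handler in self._handlers.get(origin, []):
            handler(message)


class AsyncHandler:
    """Wraps a slow callable so that calling the handler returns at once."""

    def __init__(self, slow_callable):
        self._slow_callable = slow_callable
        self._inbox = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def __call__(self, message):
        # Enqueue only; the slow work happens on the background thread.
        self._inbox.put(message)

    def _run(self):
        while True:
            message = self._inbox.get()
            try:
                self._slow_callable(message)
            except Exception:
                pass  # a real handler would log the error here
            finally:
                self._inbox.task_done()

    def wait_idle(self):
        # Block until every queued message has been processed.
        self._inbox.join()
```

A handler that talks to a slow server would be wrapped in `AsyncHandler` before being registered, so that `dispatch` never waits for it.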
Main handler takes care of requests from workers and API servers: Main handler takes care of requests from workers and API servers:
@ -2697,8 +2697,8 @@ requests. This notifier is used on error reporting from backend to frontend API.
The `worker_registry` class is used to store information about workers, their The `worker_registry` class is used to store information about workers, their
status and the jobs in their queue. It can look up a worker using the headers status and the jobs in their queue. It can look up a worker using the headers
received with a request (a worker is considered suitable if and only if it received with a request (a worker is considered suitable if and only if it
satisfies all the headers). The headers are arbitrary key-value pairs, which are satisfies all the job headers). The headers are arbitrary key-value pairs, which
checked for equality by the broker. However, some headers require special are checked for equality by the broker. However, some headers require special
handling, namely `threads`, for which we check if the value in the request is handling, namely `threads`, for which we check if the value in the request is
lesser than or equal to the value advertised by the worker, and `hwgroup`, for lesser than or equal to the value advertised by the worker, and `hwgroup`, for
which we support requesting one of multiple hardware groups by listing multiple which we support requesting one of multiple hardware groups by listing multiple
@ -2708,34 +2708,35 @@ The registry also implements a basic load balancing algorithm -- the workers are
contained in a queue and whenever one of them receives a job, it is moved to its contained in a queue and whenever one of them receives a job, it is moved to its
end, which makes it less likely to receive another job soon. end, which makes it less likely to receive another job soon.
When a worker is assigned a job, it will not be assigned another one until we When a worker is assigned a job, it will not be assigned another one until a
receive a `done` message from it. `done` message is received.
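
The header matching and the queue-based load balancing described in these two hunks can be sketched as below. This is a simplified illustration under stated assumptions, not the actual `worker_registry` class: `Worker`, `WorkerRegistry`, `header_matches` and `pick` are hypothetical names, the `hwgroup` request is assumed to arrive as an already parsed list of acceptable groups (the wire format is not shown in this excerpt), and the per-worker job queue and the `done` handling are left out.

```python
from collections import deque


class Worker:
    def __init__(self, name, headers):
        self.name = name
        self.headers = headers  # arbitrary key-value pairs advertised by the worker


def header_matches(worker, key, requested):
    advertised = worker.headers.get(key)
    if advertised is None:
        return False
    if key == "threads":
        # The request must not ask for more threads than the worker advertises.
        return int(requested) <= int(advertised)
    if key == "hwgroup":
        # ASSUMPTION: `requested` is a list of acceptable hardware groups.
        return advertised in requested
    # Every other header is compared for equality.
    return advertised == requested


class WorkerRegistry:
    def __init__(self):
        self._workers = deque()

    def add(self, worker):
        self._workers.append(worker)

    def pick(self, request_headers):
        """Return the first worker satisfying all headers, or None."""
        for worker in list(self._workers):
            if all(header_matches(worker, k, v)
                   for k, v in request_headers.items()):
                # Move the chosen worker to the end of the queue so it is
                # less likely to receive the next job.
                self._workers.remove(worker)
                self._workers.append(worker)
                return worker
        return None
```

Because the chosen worker is moved to the back, two identical requests in a row go to two different workers whenever more than one worker is suitable.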
#### Error Reporting #### Error Reporting
Broker is the only backend component which is able to report errors to frontend Broker is the only backend component which is able to report errors directly to
API. For this purpose HTTP protocol is used through *libcurl* library. To the REST API. Other components have to notify the broker first and it forwards
address security concerns there is *HTTP Basic Auth* configured on particular the messages to the API. For HTTP communication a *libcurl* library is used. To
endpoints which is simply enough to use within *libcurl*. address security concerns there is a *HTTP Basic Auth* configured on particular
API endpoints and correct credentials have to be entered.
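
For illustration, an error report protected by HTTP Basic Auth could be prepared as in the sketch below. The text says the broker uses *libcurl*; this sketch uses Python's standard library instead, and the URL, credentials and JSON payload are all hypothetical placeholders, since the excerpt does not show the real endpoint or message format. The request is only built here, not sent.

```python
import base64
import json
import urllib.request


def build_error_report(url, username, password, message):
    """Build (but do not send) a POST request with a Basic Auth header."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8"))
    body = json.dumps({"message": message}).encode("utf-8")  # hypothetical payload
    request = urllib.request.Request(url, data=body, method="POST")
    request.add_header("Authorization", "Basic " + token.decode("ascii"))
    request.add_header("Content-Type", "application/json")
    return request


# Hypothetical endpoint and credentials, for demonstration only.
report = build_error_report(
    "https://api.example.org/broker-reports/error",
    "user",
    "pass",
    "worker crashed",
)
```

Passing `report` to `urllib.request.urlopen` would send it; an endpoint configured with Basic Auth would reject the request if the credentials were wrong.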
Following types of failures are distinguished: Following types of failures are distinguished:
**Job failure** -- we recognize two ways a job can fail -- an internally and **Job failure** -- there are two ways a job can fail, internal and external one.
externally. An internal failure is the fault of worker -- for example when it An internal failure is the fault of worker, for example when it
cannot download a file needed for the evaluation for some reason. An external cannot download a file needed for the evaluation. An external
error is for example when the job configuration is malformed. Note that we do error is for example when the job configuration is malformed. Note that wrong
not consider a student entering an incorrect solution a job failure. student solution is not considered as a job failure.
Jobs that failed internally are reassigned until a limit on the amount of Jobs that failed internally are reassigned until a limit on the amount of
reassingments (configurable with the `max_request_failures` option) is reached. reassingments (configurable with the `max_request_failures` option) is reached.
External failures are reported to the frontend immediately. External failures are reported to the frontend immediately.
**Worker failure** -- when a worker crash is detected, we attempt to reassign **Worker failure** -- when a worker crash is detected, an attempt to reassign
its current job and also all the jobs from its queue. Because the current job its current job and also all the jobs from its queue is made. Because the
might be the reason of the crash, its reassignment is also counted towards the current job might be the reason of the crash, its reassignment is also counted
`max_request_failures` limit (the counter is shared). If there is no worker towards the `max_request_failures` limit (the counter is shared). If there is
that could process a job (i.e. it cannot be reassigned), the job is reported as no worker that could process a job available (i.e. it cannot be reassigned),
failed to the frontend via REST API. the job is reported as failed to the frontend via REST API.
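
The reassignment rules for job and worker failures can be sketched as follows. This is a minimal model, assuming a single per-job counter shared by internal failures and worker crashes, as the text states. The class and method names are hypothetical, and the exact boundary (whether the limit counts failures or reassignments) is an assumption. Only the crashed worker's current job is modelled, not the other jobs in its queue.

```python
class FailureTracker:
    def __init__(self, max_request_failures):
        self.max_request_failures = max_request_failures
        self._reassignments = {}  # job id -> reassignments used so far

    def handle_failure(self, job_id, kind, worker_available=True):
        """Return "reassign" or "report" for a failed job.

        kind is one of "internal", "external" or "worker_crash".
        """
        if kind == "external":
            # For example a malformed job configuration: report immediately.
            return "report"
        used = self._reassignments.get(job_id, 0)
        if used >= self.max_request_failures or not worker_available:
            # Limit reached, or no worker can process the job.
            return "report"
        # Internal failures and worker crashes share the same counter.
        self._reassignments[job_id] = used + 1
        return "reassign"
```

With `max_request_failures` set to 2, a job that fails internally once and then crashes a worker has used up its budget, so a third failure is reported to the frontend.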
**Broker failure** -- when the broker itself crashed and is restarted, workers **Broker failure** -- when the broker itself crashed and is restarted, workers
will reconnect automatically. However, all jobs in their queues are lost. If a will reconnect automatically. However, all jobs in their queues are lost. If a
