Broker implementation update

8 years ago · a35975951a
parent c30aced4b1
commit a35975951a
1 changed files with 38 additions and 37 deletions
--- a/Rewritten-docs.md
+++ b/Rewritten-docs.md
@@ -2656,10 +2656,10 @@ used.

 ## Broker

-The broker is a central part of the ReCodEx backend that directs almost all
-communication. It was designed to properly maintain heavy load of messages by
-making only small actions in main communication thread and asynchronous
-execution of other actions.
+The broker is a central part of the ReCodEx backend that directs most of the
+communication. It was designed to sustain a heavy load of messages by doing
+only small actions in the main communication thread and executing all other
+actions asynchronously.

 The responsibilities of the broker are:

@@ -2667,18 +2667,18 @@ The responsibilites of broker are:
 - tracking the status of each worker and handling cases when they crash
 - accepting assignment evaluation requests from the frontend and forwarding them
  to workers
-- receiving job status information from workers and forward it to the frontend
-  either via monitor or REST API
-- notifying the frontend of errors in the backend
+- receiving job status information from workers and forwarding it to the
+  frontend either via monitor or REST API
+- notifying the frontend about errors in the backend

 ### Internal Structure

-The broker uses our ZeroMQ _reactor_ to bind events on sockets to handler
-classes. There are currently two handlers -- one that handles the main
-functionality and another one that sends status reports to the REST API
-asynchronously so that the broker does not have to wait for HTTP requests which
-can take a lot of time, especially when some kind of error happens on the
-server.
+The main work of the broker is to handle incoming messages. For that purpose, a
+_reactor_ subcomponent binds events on sockets to handler classes.
+There are currently two handlers -- one that handles the main functionality and
+another that sends status reports to the REST API asynchronously. This
+keeps the broker from freezing while it waits for responses to HTTP
+requests, especially when some kind of error happens on the server.
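
The idea of handing status reports to a separate asynchronous handler can be sketched as follows. This is only an illustration under stated assumptions, not the broker's code: plain Python threading and a queue stand in for the ZeroMQ reactor, and `StatusNotifier`, `notify`, `close` and the `send` callback are hypothetical names.

```python
import queue
import threading


class StatusNotifier:
    """Hypothetical helper: delivers reports on a background thread."""

    def __init__(self, send):
        # `send` is any callable performing the slow call (e.g. an HTTP request).
        self._send = send
        self._pending = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def notify(self, report):
        # Called from the main thread; only enqueues, so it returns immediately.
        self._pending.put(report)

    def close(self):
        # None is used as a stop marker; wait until queued reports are handled.
        self._pending.put(None)
        self._thread.join()

    def _run(self):
        while True:
            report = self._pending.get()
            if report is None:
                break
            try:
                self._send(report)
            except Exception:
                # A failed delivery must not stop the background thread.
                pass
```

The main handler would only ever call `notify`, so a slow or failing server delays the background thread but not message handling.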

 The main handler takes care of requests from workers and API servers:

@@ -2697,8 +2697,8 @@ requests. This notifier is used on error reporting from backend to frontend API.
 The `worker_registry` class is used to store information about workers, their
 status and the jobs in their queue. It can look up a worker using the headers
 received with a request (a worker is considered suitable if and only if it
-satisfies all the headers). The headers are arbitrary key-value pairs, which are
-checked for equality by the broker. However, some headers require special
+satisfies all the job headers). The headers are arbitrary key-value pairs, which
+are checked for equality by the broker. However, some headers require special
 handling, namely `threads`, for which we check if the value in the request is
 less than or equal to the value advertised by the worker, and `hwgroup`, for
 which we support requesting one of multiple hardware groups by listing multiple
@@ -2708,34 +2708,35 @@ The registry also implements a basic load balancing algorithm -- the workers are
 contained in a queue and whenever one of them receives a job, it is moved to its
 end, which makes it less likely to receive another job soon.
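
The header matching and queue rotation described for `worker_registry` can be sketched like this. It is a minimal illustration, not the actual class: `WorkerRegistry`, `add_worker`, `assign` and `header_matches` are hypothetical names, and a request is assumed to pass acceptable hardware groups as a Python list, because the wire format for listing multiple groups is not shown here.

```python
from collections import deque


def header_matches(key, requested, advertised):
    """Check one requested header against the value a worker advertises."""
    if key == "threads":
        # Special case: the request must not need more threads than offered.
        return int(requested) <= int(advertised)
    if key == "hwgroup":
        # Special case: `requested` is assumed to be a list of acceptable groups.
        return advertised in requested
    # Every other header is compared for equality.
    return requested == advertised


class WorkerRegistry:
    def __init__(self):
        self._workers = deque()  # (worker_id, advertised_headers) pairs

    def add_worker(self, worker_id, headers):
        self._workers.append((worker_id, headers))

    def _suitable(self, advertised, request_headers):
        # A worker is suitable only if it satisfies all requested headers.
        return all(
            key in advertised and header_matches(key, value, advertised[key])
            for key, value in request_headers.items()
        )

    def assign(self, request_headers):
        """Return the first suitable worker id and move that worker to the end."""
        match = next(
            (w for w in self._workers if self._suitable(w[1], request_headers)),
            None,
        )
        if match is None:
            return None
        self._workers.remove(match)
        self._workers.append(match)
        return match[0]
```

Because the chosen worker moves to the back of the queue, two identical requests in a row go to different workers whenever more than one worker is suitable.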

-When a worker is assigned a job, it will not be assigned another one until we
-receive a `done` message from it.
+When a worker is assigned a job, it will not be assigned another one until a
+`done` message is received.

 #### Error Reporting

-Broker is the only backend component which is able to report errors to frontend
-API. For this purpose HTTP protocol is used through *libcurl* library. To
-address security concerns there is *HTTP Basic Auth* configured on particular
-endpoints which is simply enough to use within *libcurl*.
+The broker is the only backend component that can report errors directly to
+the REST API. Other components have to notify the broker first, and it forwards
+their messages to the API. The *libcurl* library is used for HTTP communication.
+To address security concerns, *HTTP Basic Auth* is configured on particular
+API endpoints, so correct credentials have to be supplied.

 The following types of failures are distinguished:

-**Job failure** -- we recognize two ways a job can fail -- an internally and
- externally. An internal failure is the fault of worker -- for example when it
- cannot download a file needed for the evaluation for some reason. An external
- error is for example when the job configuration is malformed. Note that we do
- not consider a student entering an incorrect solution a job failure.
-
-Jobs that failed internally are reassigned until a limit on the amount of
-reassingments (configurable with the `max_request_failures` option) is reached.
-External failures are reported to the frontend immediately.
-
-**Worker failure** -- when a worker crash is detected, we attempt to reassign
- its current job and also all the jobs from its queue. Because the current job
- might be the reason of the crash, its reassignment is also counted towards the
- `max_request_failures` limit (the counter is shared). If there is no worker
- that could process a job (i.e. it cannot be reassigned), the job is reported as
- failed to the frontend via REST API.
+**Job failure** -- there are two ways a job can fail: internally and externally.
+ An internal failure is the fault of the worker, for example when it
+ cannot download a file needed for the evaluation. An external
+ failure occurs, for example, when the job configuration is malformed. Note
+ that an incorrect student solution is not considered a job failure.
+
+ Jobs that failed internally are reassigned until a limit on the number of
+ reassignments (configurable with the `max_request_failures` option) is reached.
+ External failures are reported to the frontend immediately.
+
+**Worker failure** -- when a worker crash is detected, the broker attempts to
+ reassign its current job and also all the jobs from its queue. Because the
+ current job might be the reason for the crash, its reassignment is also counted
+ towards the `max_request_failures` limit (the counter is shared). If no
+ available worker could process a job (i.e. it cannot be reassigned),
+ the job is reported as failed to the frontend via REST API.
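
The interaction between job failures, worker crashes and the shared `max_request_failures` counter can be sketched as below. This is one reading of the rules above, not the broker's implementation. `FailureTracker` and its methods are hypothetical names. Two things are assumptions: the exact boundary (a job is reported once its failure count exceeds the limit), and that queued jobs of a crashed worker are reassigned without being counted.

```python
class FailureTracker:
    def __init__(self, max_request_failures):
        self.max_request_failures = max_request_failures
        self._failures = {}  # job id -> counted failures (shared counter)

    def _count_and_decide(self, job_id, worker_available):
        # Count the failure, then decide between reassigning and reporting.
        self._failures[job_id] = self._failures.get(job_id, 0) + 1
        over_limit = self._failures[job_id] > self.max_request_failures
        if over_limit or not worker_available:
            return "report"
        return "reassign"

    def on_job_failure(self, job_id, internal, worker_available=True):
        # External failures are reported to the frontend immediately.
        if not internal:
            return "report"
        return self._count_and_decide(job_id, worker_available)

    def on_worker_crash(self, current_job, queued_jobs, worker_available=True):
        decisions = {}
        if current_job is not None:
            # The current job may have caused the crash, so it is counted.
            decisions[current_job] = self._count_and_decide(
                current_job, worker_available
            )
        for job_id in queued_jobs:
            # Assumption: queued jobs are reassigned without being counted.
            decisions[job_id] = "reassign" if worker_available else "report"
        return decisions
```

A job that fails internally once and is then running on a worker that crashes has used two of its attempts, which is what sharing the counter means in this sketch.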

 **Broker failure** -- when the broker itself crashes and is restarted, workers
 will reconnect automatically. However, all jobs in their queues are lost. If a