diff --git a/Rewritten-docs.md b/Rewritten-docs.md
index 7a8611f..653f3b0 100644
--- a/Rewritten-docs.md
+++ b/Rewritten-docs.md
@@ -2656,10 +2656,10 @@ used.
 
 ## Broker
 
-The broker is a central part of the ReCodEx backend that directs almost all
-communication. It was designed to properly maintain heavy load of messages by
-making only small actions in main communication thread and asynchronous
-execution of other actions.
+The broker is a central part of the ReCodEx backend that directs most of the
+communication. It was designed to maintain a heavy load of messages by making
+only small actions in the main communication thread and asynchronous execution
+of other actions.
 
 The responsibilites of broker are:
 
@@ -2667,18 +2667,18 @@ The responsibilites of broker are:
 - tracking status of each worker and handle cases when they crash
 - accepting assignment evaluation requests from the frontend and forwarding them
   to workers
-- receiving job status information from workers and forward it to the frontend
-  either via monitor or REST API
-- notifying the frontend of errors in the backend
+- receiving a job status information from workers and forward them to the
+  frontend either via monitor or REST API
+- notifying the frontend on errors of the backend
 
 ### Internal Structure
 
-The broker uses our ZeroMQ _reactor_ to bind events on sockets to handler
-classes. There are currently two handlers -- one that handles the main
-functionality and another one that sends status reports to the REST API
-asynchronously so that the broker does not have to wait for HTTP requests which
-can take a lot of time, especially when some kind of error happens on the
-server.
+The main work of the broker is to handle incomming messages. For that a
+_reactor_ subcomponent is written to bind events on sockets to handler classes.
+There are currently two handlers -- one that handles the main functionality and
+the other that sends status reports to the REST API asynchronously. This
+prevents broker freezes when synchronously waiting for responses of HTTP
+requests, especially when some kind of error happens on the server.
 
 Main handler takes care of requests from workers and API servers:
 
@@ -2697,8 +2697,8 @@ requests. This notifier is used on error reporting from backend to frontend API.
 The `worker_registry` class is used to store information about workers, their
 status and the jobs in their queue. It can look up a worker using the headers
 received with a request (a worker is considered suitable if and only if it
-satisfies all the headers). The headers are arbitrary key-value pairs, which are
-checked for equality by the broker. However, some headers require special
+satisfies all the job headers). The headers are arbitrary key-value pairs, which
+are checked for equality by the broker. However, some headers require special
 handling, namely `threads`, for which we check if the value in the request is
 lesser than or equal to the value advertised by the worker, and `hwgroup`, for
 which we support requesting one of multiple hardware groups by listing multiple
@@ -2708,34 +2708,35 @@ The registry also implements a basic load balancing algorithm -- the workers are
 contained in a queue and whenever one of them receives a job, it is moved to its
 end, which makes it less likely to receive another job soon.
 
-When a worker is assigned a job, it will not be assigned another one until we
-receive a `done` message from it.
+When a worker is assigned a job, it will not be assigned another one until a
+`done` message is received.
 
 #### Error Reporting
 
-Broker is the only backend component which is able to report errors to frontend
-API. For this purpose HTTP protocol is used through *libcurl* library. To
-address security concerns there is *HTTP Basic Auth* configured on particular
-endpoints which is simply enough to use within *libcurl*.
+Broker is the only backend component which is able to report errors directly to
+the REST API. Other components have to notify the broker first and it forwards
+the messages to the API. For HTTP communication a *libcurl* library is used. To
+address security concerns there is a *HTTP Basic Auth* configured on particular
+API endpoints and correct credentials have to be entered.
 
 Following types of failures are distinguished:
 
-**Job failure** -- we recognize two ways a job can fail -- an internally and
-  externally. An internal failure is the fault of worker -- for example when it
-  cannot download a file needed for the evaluation for some reason. An external
-  error is for example when the job configuration is malformed. Note that we do
-  not consider a student entering an incorrect solution a job failure.
-
-Jobs that failed internally are reassigned until a limit on the amount of
-reassingments (configurable with the `max_request_failures` option) is reached.
-External failures are reported to the frontend immediately.
-
-**Worker failure** -- when a worker crash is detected, we attempt to reassign
-  its current job and also all the jobs from its queue. Because the current job
-  might be the reason of the crash, its reassignment is also counted towards the
-  `max_request_failures` limit (the counter is shared). If there is no worker
-  that could process a job (i.e. it cannot be reassigned), the job is reported as
-  failed to the frontend via REST API.
+**Job failure** -- there are two ways a job can fail, internal and external one.
+  An internal failure is the fault of worker, for example when it
+  cannot download a file needed for the evaluation. An external
+  error is for example when the job configuration is malformed. Note that wrong
+  student solution is not considered as a job failure.
+
+  Jobs that failed internally are reassigned until a limit on the amount of
+  reassingments (configurable with the `max_request_failures` option) is reached.
+  External failures are reported to the frontend immediately.
+
+**Worker failure** -- when a worker crash is detected, an attempt to reassign
+  its current job and also all the jobs from its queue is made. Because the
+  current job might be the reason of the crash, its reassignment is also counted
+  towards the `max_request_failures` limit (the counter is shared). If there is
+  no worker that could process a job available (i.e. it cannot be reassigned),
+  the job is reported as failed to the frontend via REST API.
 
 **Broker failure** -- when the broker itself crashed and is restarted, workers
   will reconnect automatically. However, all jobs in their queues are lost. If a