From abe471833bba3dbce0dcdab2b60706f4ff3df20d Mon Sep 17 00:00:00 2001 From: Martin Polanka Date: Sun, 29 Jan 2017 16:18:46 +0100 Subject: [PATCH] Copy broker content into new doc --- Broker.md | 72 ----------------------------------------------- Rewritten-docs.md | 70 +++++++++++++++++++++++++++++++++++++++++++-- 2 files changed, 68 insertions(+), 74 deletions(-) delete mode 100644 Broker.md diff --git a/Broker.md b/Broker.md deleted file mode 100644 index 77b95b2..0000000 --- a/Broker.md +++ /dev/null @@ -1,72 +0,0 @@ -# Broker - -The broker is a central part of the ReCodEx backend that directs almost all -communication. It was designed to properly maintain heavy load of messages -by making only small actions in main communication thread and asynchronous -execution of other actions. - -## Description - -The broker's responsibilites are: - -- allowing workers to register themselves and keep track of their capabilities -- tracking status of each worker and handle cases when they crash -- accepting assignment evaluation requests from the frontend and forwarding them - to workers -- receiving job status information from workers and forward it to the frontend - either via monitor or REST API -- notifying the frontend of errors in the backend - -## Architecture - -The broker uses our ZeroMQ _reactor_ to bind events on sockets to handler classes. -There are currently two handlers -- one that handles the main functionality and -another one that sends status reports to the REST API asynchronously so that the -broker does not have to wait for HTTP requests which can take a lot of time, -especially when some kind of error happens on the server. - -### Worker registry - -The `worker_registry` class is used to store information about workers, their -status and the jobs in their queue. It can look up a worker using the headers -received with a request (a worker is considered suitable if and only if it -satisfies all the headers). The headers are arbitrary key-value pairs, which -are checked for equality by the broker. However, some headers require special -handling, namely `threads`, for which we check if the value in the request is -lesser than or equal to the value advertised by the worker, and `hwgroup`, for -which we support requesting one of multiple hardware groups by listing multiple -names separated with a `|` symbol (e.g. `group_1|group_2|group_3`. - -The registry also implements a basic load balancing algorithm -- the -workers are contained in a queue and whenever one of them receives a job, it is -moved to its end, which makes it less likely to receive another job soon. - -When a worker is assigned a job, it will not be assigned another one until we -receive a `done` message from it. - -### Error handling - -**Job failure** -- we recognize two ways a job can fail -- an internally and -externally. An internal failure is the worker's fault -- for example when it -cannot download a file needed for the evaluation for some reason. An external -error is for example when the job configuration is malformed. Note that we do not -consider a student entering an incorrect solution a job failure. - -Jobs that failed internally are reassigned until a limit on the amount of -reassingments (configurable with the `max_request_failures` option) is reached. -External failures are reported to the frontend immediately. - -**Worker failure** -- when a worker crash is detected, we attempt to reassign its -current job and also all the jobs from its queue. Because the current job might -be the reason of the crash, its reassignment is also counted towards the -`max_request_failures` limit (the counter is shared). If there is no worker that -could process a job (i.e. it cannot be reassigned), the job is reported as -failed to the frontend via REST API. - -**Broker failure** -- when the broker itself crashed and is restarted, workers -will reconnect automatically. However, all jobs in their queues are lost. If a -worker manages to finish a job and notifies the "new" broker, the report is -forwarded to the frontend. The same goes for external failures. Jobs that fail -internally cannot be reassigned, because the "new" broker does not know their -headers -- they are reported as failed immediately. - diff --git a/Rewritten-docs.md b/Rewritten-docs.md index 75bf83d..ac952bc 100644 --- a/Rewritten-docs.md +++ b/Rewritten-docs.md @@ -2656,10 +2656,76 @@ used. ## Broker -@todo: gets stuff done, single point of failure and center point of ReCodEx universe +The broker is a central part of the ReCodEx backend that directs almost all +communication. It was designed to properly maintain heavy load of messages by +making only small actions in main communication thread and asynchronous +execution of other actions. + +The responsibilites of broker are: + +- allowing workers to register themselves and keep track of their capabilities +- tracking status of each worker and handle cases when they crash +- accepting assignment evaluation requests from the frontend and forwarding them + to workers +- receiving job status information from workers and forward it to the frontend + either via monitor or REST API +- notifying the frontend of errors in the backend + +### Architecture + +The broker uses our ZeroMQ _reactor_ to bind events on sockets to handler +classes. There are currently two handlers -- one that handles the main +functionality and another one that sends status reports to the REST API +asynchronously so that the broker does not have to wait for HTTP requests which +can take a lot of time, especially when some kind of error happens on the +server. + +#### Worker registry + +The `worker_registry` class is used to store information about workers, their +status and the jobs in their queue. It can look up a worker using the headers +received with a request (a worker is considered suitable if and only if it +satisfies all the headers). The headers are arbitrary key-value pairs, which are +checked for equality by the broker. However, some headers require special +handling, namely `threads`, for which we check if the value in the request is +lesser than or equal to the value advertised by the worker, and `hwgroup`, for +which we support requesting one of multiple hardware groups by listing multiple +names separated with a `|` symbol (e.g. `group_1|group_2|group_3`. + +The registry also implements a basic load balancing algorithm -- the workers are +contained in a queue and whenever one of them receives a job, it is moved to its +end, which makes it less likely to receive another job soon. + +When a worker is assigned a job, it will not be assigned another one until we +receive a `done` message from it. + +#### Error handling + +**Job failure** -- we recognize two ways a job can fail -- an internally and + externally. An internal failure is the fault of worker -- for example when it + cannot download a file needed for the evaluation for some reason. An external + error is for example when the job configuration is malformed. Note that we do + not consider a student entering an incorrect solution a job failure. + +Jobs that failed internally are reassigned until a limit on the amount of +reassingments (configurable with the `max_request_failures` option) is reached. +External failures are reported to the frontend immediately. + +**Worker failure** -- when a worker crash is detected, we attempt to reassign + its current job and also all the jobs from its queue. Because the current job + might be the reason of the crash, its reassignment is also counted towards the + `max_request_failures` limit (the counter is shared). If there is no worker + that could process a job (i.e. it cannot be reassigned), the job is reported as + failed to the frontend via REST API. + +**Broker failure** -- when the broker itself crashed and is restarted, workers + will reconnect automatically. However, all jobs in their queues are lost. If a + worker manages to finish a job and notifies the "new" broker, the report is + forwarded to the frontend. The same goes for external failures. Jobs that fail + internally cannot be reassigned, because the "new" broker does not know their + headers -- they are reported as failed immediately. @todo: what to mention: - - job scheduling, worker queues - API notification using curl, authentication using HTTP Basic Auth - asynchronous resending progress messages