diff --git a/Broker.md b/Broker.md index ad10b13..cd7539a 100644 --- a/Broker.md +++ b/Broker.md @@ -1,11 +1,64 @@ # Broker -Broker is essential part of ReCodEx solution which maintaines almost all communication. + +The broker is a cental part of the ReCodEx backend that directs almost all +communication. ## Description +The broker's responsibilites are: + +- allowing workers to register themselves and keep track of their capabilities +- tracking worker's status and handle cases when they crash +- accepting assignment evaluation requests from the frontend and forwarding them + to workers +- receiving job status information from workers and forward it to the frontend + either via monitor or REST API +- notifying the frontend of errors in the backend ## Architecture +The broker uses our ZeroMQ reactor to bind events on sockets to handler classes. +There are currently two handlers - one that handles the main functionality and +another one that sends status reports to the REST API asynchronously so that the +broker doesn't have to wait for HTTP requests which can take a lot of time, +especially when some kind of error happens on the server. + +### Worker registry + +The `worker_registry` class is used to store information about workers, their +status and the jobs in their queue. It can look up a worker using the headers +received with a request. It also uses a basic load balancing algorithm - the +workers are contained in a queue and whenever one of them receives a job, it's +moved to the back, which makes it less likely to receive another job soon. + +When a worker is assigned a job, it won't be assigned another one until we +receive a `done` message from it. + +### Error handling + +**Job failure** - we recognize two ways a job can fail - an internally and +externally. An internal failure is the worker's fault - for example when it +can't download a file needed for the evaluation for some reason. An external +error is for example when the job configuration is malformed. Note that we don't +consider a student entering an incorrect solution a job failure. + +Jobs that failed internally are reassigned until a limit on the amount of +reassingments (configurable with the `max_request_failures` option) is reached. +External failures are reported to the frontend immediately. + +**Worker failure** - when a worker crash is detected, we attempt to reassign its +current job and also all the jobs from its queue. Because the current job might +be the reason of the crash, its reassignment is also counted towards the +`max_request_failures` limit (the counter is shared). If there is no worker that +could process a job (i.e. it cannot be reassigned), the job is reported as +failed to the frontend via REST API. + +**Broker failure** - when the broker itself crashes and is restarted, workers +will reconnect automatically. However, all jobs in their queues are lost. If a +worker manages to finish a job and notifies the "new" broker, the report is +forwarded to the frontend. The same goes for external failures. Jobs that fail +internally cannot be reassigned, because the "new" broker doesn't know their +headers - they are reported as failed immediately. ## Installation