a couple of paragraphs about the broker

8 years ago · bac0ee8c11
parent 80fd6a8836
commit bac0ee8c11
1 changed files with 54 additions and 1 deletions
--- a/Broker.md
+++ b/Broker.md
@@ -1,11 +1,64 @@
 # Broker
-Broker is essential part of ReCodEx solution which maintaines almost all communication.
+
+The broker is a central part of the ReCodEx backend that directs almost all 
+communication.

 ## Description

+The broker's responsibilities are:
+
+- allowing workers to register themselves and keeping track of their capabilities
+- tracking workers' status and handling cases when they crash
+- accepting assignment evaluation requests from the frontend and forwarding them 
+  to workers
+- receiving job status information from workers and forwarding it to the frontend 
+  either via the monitor or the REST API
+- notifying the frontend of errors in the backend

 ## Architecture

+The broker uses our ZeroMQ reactor to bind events on sockets to handler classes. 
+There are currently two handlers - one that handles the main functionality and 
+another one that sends status reports to the REST API asynchronously so that the 
+broker doesn't have to wait for HTTP requests, which can take a long time, 
+especially when some kind of error happens on the server.
+
+### Worker registry
+
+The `worker_registry` class is used to store information about workers, their 
+status and the jobs in their queue. It can look up a worker using the headers 
+received with a request. It also uses a basic load balancing algorithm - the 
+workers are contained in a queue and whenever one of them receives a job, it's 
+moved to the back, which makes it less likely to receive another job soon.
+
+When a worker is assigned a job, it won't be assigned another one until we 
+receive a `done` message from it.
+
+### Error handling
+
+**Job failure** - we recognize two ways a job can fail - internally and 
+externally. An internal failure is the worker's fault - for example when it 
+can't download a file needed for the evaluation for some reason. An external 
+failure occurs, for example, when the job configuration is malformed. Note that 
+we don't consider a student entering an incorrect solution a job failure.
+
+Jobs that failed internally are reassigned until a limit on the number of 
+reassignments (configurable with the `max_request_failures` option) is reached. 
+External failures are reported to the frontend immediately.
+
+**Worker failure** - when a worker crash is detected, we attempt to reassign its 
+current job and also all the jobs from its queue. Because the current job might 
+be the cause of the crash, its reassignment is also counted towards the 
+`max_request_failures` limit (the counter is shared). If there is no worker that 
+could process a job (i.e. it cannot be reassigned), the job is reported as 
+failed to the frontend via the REST API.
+
+**Broker failure** - when the broker itself crashes and is restarted, workers 
+will reconnect automatically. However, all jobs in their queues are lost. If a 
+worker manages to finish a job and notifies the "new" broker, the report is 
+forwarded to the frontend. The same goes for external failures. Jobs that fail 
+internally cannot be reassigned, because the "new" broker doesn't know their 
+headers - they are reported as failed immediately.

 ## Installation