You cannot select more than 25 topics Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.

3.6 KiB

Raw Blame History

Broker

The broker is a central part of the ReCodEx backend that directs almost all communication. It was designed to properly maintain heavy load of messages by making only small actions in main communication thread and asynchronous execution of other actions.

Description

The broker's responsibilites are:

allowing workers to register themselves and keep track of their capabilities
tracking status of each worker and handle cases when they crash
accepting assignment evaluation requests from the frontend and forwarding them to workers
receiving job status information from workers and forward it to the frontend either via monitor or REST API
notifying the frontend of errors in the backend

Architecture

The broker uses our ZeroMQ reactor to bind events on sockets to handler classes. There are currently two handlers -- one that handles the main functionality and another one that sends status reports to the REST API asynchronously so that the broker does not have to wait for HTTP requests which can take a lot of time, especially when some kind of error happens on the server.

Worker registry

The worker_registry class is used to store information about workers, their status and the jobs in their queue. It can look up a worker using the headers received with a request (a worker is considered suitable if and only if it satisfies all the headers). The headers are arbitrary key-value pairs, which are checked for equality by the broker. However, some headers require special handling, namely threads, for which we check if the value in the request is lesser than or equal to the value advertised by the worker, and hwgroup, for which we support requesting one of multiple hardware groups by listing multiple names separated with a | symbol (e.g. group_1|group_2|group_3.

The registry also implements a basic load balancing algorithm -- the workers are contained in a queue and whenever one of them receives a job, it is moved to its end, which makes it less likely to receive another job soon.

When a worker is assigned a job, it will not be assigned another one until we receive a done message from it.

Error handling

Job failure -- we recognize two ways a job can fail -- an internally and externally. An internal failure is the worker's fault -- for example when it cannot download a file needed for the evaluation for some reason. An external error is for example when the job configuration is malformed. Note that we do not consider a student entering an incorrect solution a job failure.

Jobs that failed internally are reassigned until a limit on the amount of reassingments (configurable with the max_request_failures option) is reached. External failures are reported to the frontend immediately.

Worker failure -- when a worker crash is detected, we attempt to reassign its current job and also all the jobs from its queue. Because the current job might be the reason of the crash, its reassignment is also counted towards the max_request_failures limit (the counter is shared). If there is no worker that could process a job (i.e. it cannot be reassigned), the job is reported as failed to the frontend via REST API.

Broker failure -- when the broker itself crashed and is restarted, workers will reconnect automatically. However, all jobs in their queues are lost. If a worker manages to finish a job and notifies the "new" broker, the report is forwarded to the frontend. The same goes for external failures. Jobs that fail internally cannot be reassigned, because the "new" broker does not know their headers -- they are reported as failed immediately.