diff --git a/Overall-architecture.md b/Overall-architecture.md index 9bf269d..838466e 100644 --- a/Overall-architecture.md +++ b/Overall-architecture.md @@ -17,73 +17,18 @@ This section gives detailed overview about communication in ReCodEx solution. Ba Red connections are through ZeroMQ sockets, Blue are through WebSockets and Green are through HTTP. All ZeroMQ messages are sent as multipart with one string (command, option) per part, with no empty frames (unles explicitly specified otherwise). - -### Internal worker communication - -Communication between the two worker threads is split into two separate parts, -each one holding dedicated connection line. These internal lines are realized by -ZeroMQ inproc PAIR sockets. In this section we assume that the thread of the -worker which communicates with broker is called _listening thread_ and the other -one, which is evaluating incoming jobs is called _execution thread_. _Listening -thread_ is a server in both cases (the one who calls the `bind()` method), but -because of how ZeroMQ works, it's not very important (`connect()` call in -clients can precede server `bind()` call with no issue). - -#### Main communication - -Main communication is on `inproc://jobs` sockets. _Listening thread_ is waiting -for any messages (from broker, jobs and progress sockets) and passes incoming -requests to the _execution thread_, which handles them properly. - -Commands from _listening thread_ to _execution thread_: - -- **eval** - evaluate a job. Requires 3 message frames: - - `job_id` - identifier of this job (in ASCII representation -- we avoid endianness issues and also support alphabetic ids) - - `job_url` - URI location of archive with job configuration and submitted source code - - `result_url` - remote URI where results will be pushed to - -Commands from _execution thread_ to _listening thread_: - -- **done** - notifying of finished job. Requires 2 message frames: - - `job_id` - identifier of finished job - - `result` - response result, possible values below - - OK - everything ok - - FAILED - execution failed and cannot be reassigned to another worker (due to error in configuration for example) - - INTERNAL_ERROR - execution failed due to internal worker error, but other worker possibly can execute this without error - - `message` - non-empty error description if result was not "OK" - - -#### Progress callback - -Progress messages are sent through `inproc://progress` sockets. This is only one way communication from _execution thread_ to the _listening thread_. - -Commands: - -- **progress** - notice about evaluation progress. Requires 2 or 4 arguments: - - `job_id` - identifier of current job - - `state` - what is happening now. - - DOWNLOADED - submission successfuly fetched from fileserver - - FAILED - something bad happened and job was not executed at all - - UPLOADED - results are uploaded to fileserver - - STARTED - evaluation of tasks started - - ENDED - evaluation of tasks is finnished - - ABORTED - evaluation of job encountered internal error, job will be rescheduled to another worker - - FINISHED - whole execution is finished and worker ready for another job execution - - TASK - task state changed - see below - - `task_id` - only present for "TASK" state - identifier of task in current job - - `task_state` - only present for "TASK" state - result of task evaluation. One of "COMPLETED", "FAILED" and "SKIPPED". - - COMPLETED - task was successfully executed without any error, subsequent task will be executed - - FAILED - task ended up with some error, subsequent task will be skipped - - SKIPPED - some of the previous dependencies failed to execute, so this task wont be executed at all - - ### Broker - Worker communication Broker is server when communicating with worker. IP address and port are configurable, protocol is TCP. Worker socket is DEALER, broker one is ROUTER type. Because of that, very first part of every (multipart) message from broker to worker must be target worker's socket identity (which is saved on it's **init** command). Commands from broker to worker: -- **eval** - evaluate a job. See **eval** command in [Main communication](#main-communication) section +- **eval** - evaluate a job. Requires 3 message frames: + - `job_id` - identifier of the job (in ASCII representation -- we avoid + endianness issues and also support alphabetic ids) + - `job_url` - URL of the archive with job configuration and submitted source + code + - `result_url` - URL where the results should be stored after evaluation - **intro** - introduce yourself to the broker (with **init** command) - this is required when the broker loses track of the worker who sent the command. Possible reasons for such event are e.g. that one of the communicating sides @@ -94,7 +39,9 @@ Commands from worker to broker: - **init** - introduce yourself to the broker. Useful on startup or after reestablishing lost connection. Requires at least 2 arguments: - `hwgroup` - hardware group of this worker - - `header` - additional header describing worker capabilities. Format must be `header_name=value`, every header shall be in a separate message frame. There is no maximum limit on number of headers. + - `header` - additional header describing worker capabilities. Format must + be `header_name=value`, every header shall be in a separate message frame. + There is no limit on number of headers. There is also an optional third argument - additional information. If present, it should be separated from the headers with an empty frame. The @@ -104,8 +51,32 @@ Commands from worker to broker: - `current_job` - an identifier of a job the worker is now processing. This is useful when we're reassembling a connection to the broker and need it to know the worker won't accept a new job. -- **done** - job evaluation finished, see **done** command in [Main communication](#main-communication) section. -- **progress** - evaluation progress report, see **progress** command in [Progress callback](#progress-callback) section +- **done** - notifying of finished job. Contains following message frames: + - `job_id` - identifier of finished job + - `result` - response result, possible values are: + - OK - evaluation finished successfully + - FAILED - job failed and cannot be reassigned to another worker (e.g. + due to error in configuration) + - INTERNAL_ERROR - job failed due to internal worker error, but another + worker might be able to process it (e.g. downloading a file failed) + - `message` - a human readable error message +- **progress** - notice about evaluation progress. Contains following message + frames + - `job_id` - identifier of current job + - `state` - what is happening now. + - DOWNLOADED - submission successfuly fetched from fileserver + - FAILED - something bad happened and job was not executed at all + - UPLOADED - results are uploaded to fileserver + - STARTED - evaluation of tasks started + - ENDED - evaluation of tasks is finnished + - ABORTED - evaluation of job encountered internal error, job will be rescheduled to another worker + - FINISHED - whole execution is finished and worker ready for another job execution + - TASK - task state changed - see below + - `task_id` - only present for "TASK" state - identifier of task in current job + - `task_state` - only present for "TASK" state - result of task evaluation. One of "COMPLETED", "FAILED" and "SKIPPED". + - COMPLETED - task was successfully executed without any error, subsequent task will be executed + - FAILED - task ended up with some error, subsequent task will be skipped + - SKIPPED - some of the previous dependencies failed to execute, so this task wont be executed at all - **ping** - tell broker I'm alive, no arguments #### Heartbeating diff --git a/Worker.md b/Worker.md index 94c6dcd..3c31767 100644 --- a/Worker.md +++ b/Worker.md @@ -1,21 +1,50 @@ # Worker ## Description -**Worker's** main role is securely execute given submission and possibly _evaluate_ results against model solutions provided by submitter. **Worker** is logicaly divided into two parts: -- **Listener** - listens and communicates with **Broker** through [ZeroMQ](http://zeromq.org/). It receives new jobs, communicates with **Evaluator** part and sends back results or progress. -- **Evaluator** - gets jobs to evaluate from **Listener** part, evaluate them (possibly in sandbox) and get to know to other part that evaluation ended. This part also communicates with **Fileserver**, downloads needed files and uploads detailed results. - -**Worker** after getting evaluation request has to: +The **worker's** job is to securely execute submitted assignments and possibly +_evaluate_ results against model solutions provided by submitter. **Worker** is +divided into two parts: +- **Listener** - communicates with **Broker** through + [ZeroMQ](http://zeromq.org/). On startup, it introduces itself to the broker. + Then it receives new jobs, passes them to the **Evaluator** part and sends + back results and progress reports. +- **Evaluator** - gets jobs from the **Listener** part, evaluates them (possibly + in sandbox) and notifies the other part when the evaluation ends. This part + also communicates with **Fileserver**, downloads supplementary files and + uploads detailed results. + +These parts run in separate threads that communicate through a ZeroMQ in-process +socket. This design allows the worker to keep sending `ping` messages even when +it's processing a job. + +After receiving an evaluation request, **Worker** has to: - Download the archive containing submitted source files and configuration file - Download any supplementary files based on the configuration file, such as test inputs or helper programs (This is done on demand, using a `fetch` command in the assignment configuration) - Evaluate the submission accordingly to job configuration -- During evaluation progress states can be sent back to **Broker** +- During evaluation progress messages can be sent back to **Broker** - Upload the results of the evaluation to the **Fileserver** - Notify **Broker** that the evaluation finished +### Header matching + +Every worker belongs to exactly one **hardware group** and has a set of headers. +These properties help the broker decide which worker is suitable for processing +a request. + +The **hardware group** is a string identifier used to group worker machines with +simmilar hardware configuration -- for example "i7-4560-quad-ssd". It's +important for assignments where running times are compared to those of reference +solutions -- we have to make sure that both programs run on simmilar hardware. + +The **headers** are a set of key-value pairs that describe the worker's +capabilities -- which runtime environments are installed, how many threads can +the worker run or whether it measures time precisely. + +This information is sent to the broker on startup using the `init` command. + ## Architecture Picture below is overall internal architecture of worker which shows its defined classes with private variables and public functions. Vector version of this picture is available [here](https://github.com/ReCodEx/GlobalWiki/raw/master/images/Worker_Internal_Architecture.pdf). ![Internal Worker architecture](https://github.com/ReCodEx/GlobalWiki/blob/master/images/Worker_Internal_Architecture.png) @@ -86,7 +115,11 @@ install> win-build.cmd -test install> win-build.cmd -package ``` -All build binaries and cmake temporary files can be found in *build* folder, classically there will be subfolder *Release* which will contain compiled application with all needed dlls. Once if clickable installation binary is created, it can be found in *build* folder named something like *recodex-worker-${VERSION}-win32.exe*. +All build binaries and cmake temporary files can be found in *build* folder, +classically there will be subfolder *Release* which will contain compiled +application with all needed dlls. Once if clickable installation binary is +created, it can be found in *build* folder named something like +*recodex-worker-VERSION-win32.exe*. **Clickable installation feature:** @@ -136,7 +169,7 @@ _**TODO**_ ## Configuration and usage Following text describes how to set up and run **worker** program. It's supposed to have required binaries installed. For instructions see [[Installation|Worker#installation]] section. Also, using systemd is recommended for best user experience, but it's not required. Almost all modern Linux distributions are using systemd now. -Installation of **worker** program does following step to your computer: +The **worker** installation program is composed of following steps: - create config file `/etc/recodex/worker/config-1.yml` - create _systemd_ unit file `/etc/systemd/system/recodex-worker@.service` - put main binary to `/usr/bin/recodex-worker` @@ -218,9 +251,15 @@ limits: mode: RW,NOEXEC ``` -### Running worker +### Running the worker -For easy and comfortable managing worker is included systemd unit file. It integrates worker nicely into your Linux system and allow you to run worker automaticaly after system startup for example. It's supposed to have more than one worker on every server, so provided unit file is templated. Each instance of worker have unique string identifier, which is used for managing that instance through systemd. By default, only one worker instance is ready to use after installation. It's ID is "1". +A systemd unit file is distributed with the worker to simplify its launch. It +integrates worker nicely into your Linux system and allows you to run it +automatically on system startup. It's possible to have more than one worker on +every server, so the provided unit file is templated. Each instance of the +worker unit has a unique string identifier, which is used for managing that +instance through systemd. By default, only one worker instance is ready to use +after installation and its ID is "1". - Starting worker with id "1" can be done this way: ``` @@ -242,7 +281,7 @@ For further information about using systemd please refer to systemd documentatio ### Adding new worker To add a new worker you need to do a few steps: -- Get unique string ID. Think of one. +- Make up an unique string ID. - Copy default configuration file `/etc/recodex/worker/config-1.yml` to the same directory and name it `config-.yml` - Edit that config file to fit your needs. Note that you must at least change _worker-id_ and _logger file_ values to be unique. - Run new instance using @@ -276,4 +315,4 @@ box3.cpus = 1,2,3 # assign list of processors to isolate box with id 3 ## Cleaner -TODO \ No newline at end of file +TODO