misc fixes, documentation on headers and hardware groups

8 years ago · 1e708bb8dc
parent 86993e8997
commit 1e708bb8dc
2 changed files with 86 additions and 76 deletions
--- a/Overall-architecture.md
+++ b/Overall-architecture.md
@@ -17,73 +17,18 @@ This section gives detailed overview about communication in ReCodEx solution. Ba
 Red connections are through ZeroMQ sockets, Blue are through WebSockets and Green are through HTTP. All ZeroMQ messages are sent as multipart with one string (command, option) per part, with no empty frames (unless explicitly specified otherwise).
 ### Internal worker communication
 Communication between the two worker threads is split into two separate parts, 
 each one holding dedicated connection line. These internal lines are realized by 
 ZeroMQ inproc PAIR sockets. In this section we assume that the thread of the 
 worker which communicates with broker is called _listening thread_ and the other 
 one, which is evaluating incoming jobs is called _execution thread_. _Listening 
 thread_ is a server in both cases (the one who calls the `bind()` method), but 
 because of how ZeroMQ works, it's not very important (`connect()` call in 
 clients can precede server `bind()` call with no issue).
 #### Main communication
 Main communication is on `inproc://jobs` sockets. _Listening thread_ is waiting 
 for any messages (from broker, jobs and progress sockets) and passes incoming 
 requests to the _execution thread_, which handles them properly.
 Commands from _listening thread_ to _execution thread_:
 - **eval** - evaluate a job. Requires 3 message frames:
    - `job_id` - identifier of this job (in ASCII representation -- we avoid endianness issues and also support alphabetic ids)
    - `job_url` - URI location of archive with job configuration and submitted source code
    - `result_url` - remote URI where results will be pushed to
 Commands from _execution thread_ to _listening thread_:
 - **done** - notifying of finished job. Requires 2 message frames:
    - `job_id` - identifier of finished job
    - `result` - response result, possible values below
        - OK - everything ok
        - FAILED - execution failed and cannot be reassigned to another worker (due to error in configuration for example)
        - INTERNAL_ERROR - execution failed due to internal worker error, but other worker possibly can execute this without error
    - `message` - non-empty error description if result was not "OK"
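The **eval** and **done** frames listed above can be sketched as plain lists of byte strings, one string per message part. This is only an illustration, not the worker's actual code: the function names `build_eval` and `build_done` are hypothetical, the text encodings are assumed, and omitting the `message` frame for an "OK" result is an assumption made to respect the "no empty frames" rule.

```python
from typing import List

# Result values of the "done" command as listed in the text above.
DONE_RESULTS = {"OK", "FAILED", "INTERNAL_ERROR"}


def build_eval(job_id: str, job_url: str, result_url: str) -> List[bytes]:
    """Return the multipart frames of an "eval" command (hypothetical helper)."""
    return [
        b"eval",
        job_id.encode("ascii"),      # job ids are sent in ASCII representation
        job_url.encode("utf-8"),     # encoding of URLs is an assumption
        result_url.encode("utf-8"),
    ]


def build_done(job_id: str, result: str, message: str = "") -> List[bytes]:
    """Return the multipart frames of a "done" command (hypothetical helper)."""
    if result not in DONE_RESULTS:
        raise ValueError("unknown result: " + result)
    if result != "OK" and not message:
        raise ValueError("a non-empty message is required unless result is OK")
    frames = [b"done", job_id.encode("ascii"), result.encode("ascii")]
    if message:
        # Assumption: the message frame is left out when there is nothing to say,
        # so that no empty frame is sent.
        frames.append(message.encode("utf-8"))
    return frames
```

With pyzmq, such a list could be passed to `socket.send_multipart()`, but the sketch deliberately stays independent of any socket library.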
 #### Progress callback
 Progress messages are sent through `inproc://progress` sockets. This is only one way communication from _execution thread_ to the _listening thread_.
 Commands:
 - **progress** - notice about evaluation progress. Requires 2 or 4 arguments:
    - `job_id` - identifier of current job
    - `state` - what is happening now. 
        - DOWNLOADED - submission successfully fetched from fileserver
        - FAILED - something bad happened and job was not executed at all
        - UPLOADED - results are uploaded to fileserver
        - STARTED - evaluation of tasks started
        - ENDED - evaluation of tasks is finished
        - ABORTED - evaluation of job encountered internal error, job will be rescheduled to another worker
        - FINISHED - whole execution is finished and worker ready for another job execution
        - TASK - task state changed - see below
    - `task_id` - only present for "TASK" state - identifier of task in current job
    - `task_state` - only present for "TASK" state - result of task evaluation. One of "COMPLETED", "FAILED" and "SKIPPED".
        - COMPLETED - task was successfully executed without any error, subsequent task will be executed
        - FAILED - task ended up with some error, subsequent task will be skipped
        - SKIPPED - some of the previous dependencies failed to execute, so this task won't be executed at all
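The "2 or 4 arguments" rule of the **progress** command can be illustrated with a small builder that only adds the task frames for the "TASK" state. The function name `build_progress` is hypothetical and the ASCII encoding of all frames is an assumption; the state names are taken from the lists above.

```python
from typing import List, Optional

# State names as listed in the text above.
PROGRESS_STATES = {
    "DOWNLOADED", "FAILED", "UPLOADED", "STARTED",
    "ENDED", "ABORTED", "FINISHED", "TASK",
}
TASK_STATES = {"COMPLETED", "FAILED", "SKIPPED"}


def build_progress(job_id: str, state: str,
                   task_id: Optional[str] = None,
                   task_state: Optional[str] = None) -> List[bytes]:
    """Return the multipart frames of a "progress" command (hypothetical helper)."""
    if state not in PROGRESS_STATES:
        raise ValueError("unknown state: " + state)
    frames = [b"progress", job_id.encode("ascii"), state.encode("ascii")]
    if state == "TASK":
        # The two extra arguments are only present for the TASK state.
        if task_id is None or task_state not in TASK_STATES:
            raise ValueError("TASK state needs task_id and a valid task_state")
        frames += [task_id.encode("ascii"), task_state.encode("ascii")]
    elif task_id is not None or task_state is not None:
        raise ValueError("task_id and task_state are only allowed for TASK")
    return frames
```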
 ### Broker - Worker communication
 The broker acts as the server when communicating with a worker. The IP address and port are configurable and the protocol is TCP. The worker socket is of type DEALER and the broker socket is of type ROUTER. Because of that, the very first part of every (multipart) message from the broker to a worker must be the target worker's socket identity (which is saved on its **init** command).
 Commands from broker to worker:
- **eval** - evaluate a job. See **eval** command in [Main communication](#main-communication) section
+- **eval** - evaluate a job. Requires 3 message frames:
    - `job_id` - identifier of the job (in ASCII representation -- we avoid 
      endianness issues and also support alphabetic ids)
    - `job_url` - URL of the archive with job configuration and submitted source 
      code
    - `result_url` - URL where the results should be stored after evaluation
 - **intro** - introduce yourself to the broker (with **init** command) - this is 
  required when the broker loses track of the worker who sent the command. 
  Possible reasons for such event are e.g. that one of the communicating sides 
@@ -94,7 +39,9 @@ Commands from worker to broker:
 - **init** - introduce yourself to the broker. Useful on startup or after reestablishing lost connection. Requires at least 2 arguments:
    - `hwgroup` - hardware group of this worker
-    - `header` - additional header describing worker capabilities. Format must be `header_name=value`, every header shall be in a separate message frame. There is no maximum limit on number of headers.
+    - `header` - additional header describing worker capabilities. Format must 
      be `header_name=value`, every header shall be in a separate message frame. 
      There is no limit on number of headers.
    There is also an optional third argument - additional information. If 
    present, it should be separated from the headers with an empty frame. The 
@@ -104,8 +51,32 @@ Commands from worker to broker:
    - `current_job` - an identifier of a job the worker is now processing. This 
      is useful when we're reassembling a connection to the broker and need it 
      to know the worker won't accept a new job.
- **done** - job evaluation finished, see **done** command in [Main communication](#main-communication) section.
+- **done** - notifying of finished job. Contains following message frames:
- **progress** - evaluation progress report, see **progress** command in [Progress callback](#progress-callback) section
+    - `job_id` - identifier of finished job
    - `result` - response result, possible values are:
 	- OK - evaluation finished successfully
 	- FAILED - job failed and cannot be reassigned to another worker (e.g. 
 	  due to error in configuration)
 	- INTERNAL_ERROR - job failed due to internal worker error, but another 
 	  worker might be able to process it (e.g. downloading a file failed)
    - `message` - a human readable error message
 - **progress** - notice about evaluation progress. Contains following message 
  frames
    - `job_id` - identifier of current job
    - `state` - what is happening now. 
        - DOWNLOADED - submission successfully fetched from fileserver
        - FAILED - something bad happened and job was not executed at all
        - UPLOADED - results are uploaded to fileserver
        - STARTED - evaluation of tasks started
        - ENDED - evaluation of tasks is finished
        - ABORTED - evaluation of job encountered internal error, job will be rescheduled to another worker
        - FINISHED - whole execution is finished and worker ready for another job execution
        - TASK - task state changed - see below
    - `task_id` - only present for "TASK" state - identifier of task in current job
    - `task_state` - only present for "TASK" state - result of task evaluation. One of "COMPLETED", "FAILED" and "SKIPPED".
        - COMPLETED - task was successfully executed without any error, subsequent task will be executed
        - FAILED - task ended up with some error, subsequent task will be skipped
        - SKIPPED - some of the previous dependencies failed to execute, so this task won't be executed at all
 - **ping** - tell broker I'm alive, no arguments
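The frame layout of the **init** command (hardware group, one `header_name=value` frame per header, then an optional part separated by an empty frame) could be built and parsed roughly as follows. This is a sketch under assumptions: `build_init` and `parse_init` are hypothetical names, headers are kept as a list of pairs, and the additional information is treated as opaque strings because its exact format is not covered here.

```python
from typing import List, Sequence, Tuple

Header = Tuple[str, str]


def build_init(hwgroup: str, headers: Sequence[Header],
               additional_info: Sequence[str] = ()) -> List[bytes]:
    """Return the multipart frames of an "init" command (hypothetical helper)."""
    if not headers:
        raise ValueError("init requires a hwgroup and at least one header")
    frames = [b"init", hwgroup.encode("utf-8")]
    # Every header travels in its own frame as header_name=value.
    frames += ["{}={}".format(name, value).encode("utf-8") for name, value in headers]
    if additional_info:
        frames.append(b"")  # the empty frame separates headers from extra info
        frames += [item.encode("utf-8") for item in additional_info]
    return frames


def parse_init(frames: Sequence[bytes]) -> Tuple[str, List[Header], List[str]]:
    """Split "init" frames back into hwgroup, headers and opaque extra info."""
    if len(frames) < 3 or frames[0] != b"init":
        raise ValueError("not a valid init message")
    rest = list(frames[2:])
    extra: List[bytes] = []
    if b"" in rest:
        split = rest.index(b"")
        rest, extra = rest[:split], rest[split + 1:]
    headers = []
    for frame in rest:
        name, sep, value = frame.decode("utf-8").partition("=")
        if not sep:
            raise ValueError("header must have the form header_name=value")
        headers.append((name, value))
    return frames[1].decode("utf-8"), headers, [e.decode("utf-8") for e in extra]
```

Keeping the headers as a list of pairs rather than a dictionary avoids silently dropping a header name that appears more than once.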
 #### Heartbeating
--- a/Worker.md
+++ b/Worker.md
@@ -1,21 +1,50 @@
 # Worker
 ## Description
-**Worker's** main role is securely execute given submission and possibly _evaluate_ results against model solutions provided by submitter. **Worker** is logicaly divided into two parts:
+The **worker's** job is to securely execute submitted assignments and possibly 
- **Listener** - listens and communicates with **Broker** through [ZeroMQ](http://zeromq.org/). It receives new jobs, communicates with **Evaluator** part and sends back results or progress.
+_evaluate_ results against model solutions provided by submitter. **Worker** is 
- **Evaluator** - gets jobs to evaluate from **Listener** part, evaluate them (possibly in sandbox) and get to know to other part that evaluation ended. This part also communicates with **Fileserver**, downloads needed files and uploads detailed results.
+divided into two parts:
-
+- **Listener** - communicates with **Broker** through 
-**Worker** after getting evaluation request has to:
+  [ZeroMQ](http://zeromq.org/). On startup, it introduces itself to the broker. 
  Then it receives new jobs, passes them to the **Evaluator** part and sends 
  back results and progress reports.
 - **Evaluator** - gets jobs from the **Listener** part, evaluates them (possibly 
  in sandbox) and notifies the other part when the evaluation ends. This part 
  also communicates with **Fileserver**, downloads supplementary files and 
  uploads detailed results.
 These parts run in separate threads that communicate through a ZeroMQ in-process 
 socket. This design allows the worker to keep sending `ping` messages even when 
 it's processing a job.
 After receiving an evaluation request, **Worker** has to:
 - Download the archive containing submitted source files and configuration file
 - Download any supplementary files based on the configuration file, such as test 
  inputs or helper programs (This is done on demand, using a `fetch` command
  in the assignment configuration)
 - Evaluate the submission according to the job configuration
- During evaluation progress states can be sent back to **Broker**
+- During evaluation, progress messages can be sent back to **Broker**
 - Upload the results of the evaluation to the **Fileserver**
 - Notify **Broker** that the evaluation finished
 ### Header matching
 Every worker belongs to exactly one **hardware group** and has a set of headers. 
 These properties help the broker decide which worker is suitable for processing 
 a request. 
 The **hardware group** is a string identifier used to group worker machines with 
 similar hardware configuration -- for example "i7-4560-quad-ssd". It's 
 important for assignments where running times are compared to those of reference 
 solutions -- we have to make sure that both programs run on similar hardware.
 The **headers** are a set of key-value pairs that describe the worker's 
 capabilities -- which runtime environments are installed, how many threads 
 the worker can run or whether it measures time precisely.
 This information is sent to the broker on startup using the `init` command.
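How a broker might use these two properties can be sketched as a simple predicate. The matching rule here is an assumption made for illustration (the hardware group must be equal and every requested header must be present with the same value); the function name `worker_matches` and the header names `env` and `threads` are hypothetical.

```python
from typing import Dict


def worker_matches(worker_hwgroup: str, worker_headers: Dict[str, str],
                   wanted_hwgroup: str, wanted_headers: Dict[str, str]) -> bool:
    """Decide whether a worker is suitable for a request (assumed rule)."""
    # A worker belongs to exactly one hardware group, so compare it directly.
    if worker_hwgroup != wanted_hwgroup:
        return False
    # Assumption: each requested header must be offered with an equal value.
    return all(worker_headers.get(name) == value
               for name, value in wanted_headers.items())


# Hypothetical worker description, as it could be announced via "init".
worker = ("i7-4560-quad-ssd", {"env": "c", "threads": "4"})
print(worker_matches(*worker, "i7-4560-quad-ssd", {"env": "c"}))
```

A real broker may use richer rules, for instance numeric comparison for thread counts; exact string equality is only the simplest choice.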
 ## Architecture
 The picture below shows the overall internal architecture of the worker, including its classes with their private variables and public functions. A vector version of this picture is available [here](https://github.com/ReCodEx/GlobalWiki/raw/master/images/Worker_Internal_Architecture.pdf).
 ![Internal Worker architecture](https://github.com/ReCodEx/GlobalWiki/blob/master/images/Worker_Internal_Architecture.png)
@@ -86,7 +115,11 @@ install> win-build.cmd -test
 install> win-build.cmd -package
 ```
-All build binaries and cmake temporary files can be found in *build* folder, classically there will be subfolder *Release* which will contain compiled application with all needed dlls. Once if clickable installation binary is created, it can be found in *build* folder named something like *recodex-worker-${VERSION}-win32.exe*.
+All build binaries and cmake temporary files can be found in the *build* folder. 
 Typically there will be a subfolder *Release* which contains the compiled 
 application with all needed dlls. If a clickable installation binary is 
 created, it can be found in the *build* folder, named something like 
 *recodex-worker-VERSION-win32.exe*.
 **Clickable installation feature:**
@@ -136,7 +169,7 @@ _**TODO**_
 ## Configuration and usage
 The following text describes how to set up and run the **worker** program. It assumes that the required binaries are already installed. For instructions see [[Installation|Worker#installation]] section. Using systemd is also recommended for the best user experience, but it's not required. Almost all modern Linux distributions use systemd now.
-Installation of **worker** program does following step to your computer:
+The **worker** installation performs the following steps:
 - create config file `/etc/recodex/worker/config-1.yml`
 - create _systemd_ unit file `/etc/systemd/system/recodex-worker@.service`
 - put main binary to `/usr/bin/recodex-worker`
@@ -218,9 +251,15 @@ limits:
          mode: RW,NOEXEC
 ```
-### Running worker
+### Running the worker
-For easy and comfortable managing worker is included systemd unit file. It integrates worker nicely into your Linux system and allow you to run worker automaticaly after system startup for example. It's supposed to have more than one worker on every server, so provided unit file is templated. Each instance of worker have unique string identifier, which is used for managing that instance through systemd. By default, only one worker instance is ready to use after installation. It's ID is "1".
+A systemd unit file is distributed with the worker to simplify its launch. It 
 integrates worker nicely into your Linux system and allows you to run it 
 automatically on system startup. It's possible to have more than one worker on 
 every server, so the provided unit file is templated. Each instance of the 
 worker unit has a unique string identifier, which is used for managing that 
 instance through systemd. By default, only one worker instance is ready to use 
 after installation and its ID is "1".
 - Starting worker with id "1" can be done this way:
 ```
@@ -242,7 +281,7 @@ For further information about using systemd please refer to systemd documentatio
 ### Adding new worker
 To add a new worker you need to do a few steps:
- Get unique string ID. Think of one.
+- Make up a unique string ID.
 - Copy default configuration file `/etc/recodex/worker/config-1.yml` to the same directory and name it `config-<your_unique_ID>.yml`
 - Edit that config file to fit your needs. Note that you must at least change _worker-id_ and _logger file_ values to be unique.
 - Run new instance using