From a642941556e9c5ad35da0d649b49a8f42fe4a995 Mon Sep 17 00:00:00 2001 From: Teyras Date: Mon, 20 Jun 2016 15:30:08 +0200 Subject: [PATCH] some juicy information on heartbeating --- Communication.md | 24 +++++++++++++++++++++++- 1 file changed, 23 insertions(+), 1 deletion(-) diff --git a/Communication.md b/Communication.md index 434e322..ebf8a46 100644 --- a/Communication.md +++ b/Communication.md @@ -64,7 +64,10 @@ Broker is server when comminicating with worker. IP address and port are configu Commands from broker to worker: - **eval** - evaluate a job. See **eval** command in [[Communication#main-communication]] -- **intro** - introduce yourself to the broker (with **init** command) +- **intro** - introduce yourself to the broker (with **init** command) - this is + required when the broker loses track of the worker who sent the command. + Possible reasons for such event are e.g. that one of the communicating sides + shut down and restarted whithout the other side noticing. - **pong** - reply to **ping** command, no arguments Commands from worker to broker: @@ -76,6 +79,25 @@ Commands from worker to broker: - **progress** - evaluation progress report, see **progress** command in [[Communication#progress-callback]] - **ping** - tell broker I'm alive, no arguments +### Heartbeating + +It is important for the broker and workers to know if the other side is still +working (and connected). This is achieved with a simple heartbeating protocol. + +The protocol requires the workers to send a **ping** command regularly (the +interval is configurable on both sides - future releases might let the worker +send its ping interval with the **init** command). Upon receiving a **ping** +command, the broker responds with **pong**. + +Both sides keep track of missing heartbeating messages since the last one was +received. When this number reaches a threshold (called maximum liveness), the +other side is considered dead. + +When the broker decides a worker died, it tries to reschedule its jobs to other +workers. + +If a worker thinks the broker is dead, it tries to reconnect with a bounded, +exponentially increasing delay. ## Worker - File Server communication