Add celerity blockchain for task divergence checking#217

Open
GagaLP wants to merge 3 commits into master from divergence-check

GagaLP commented Oct 2, 2023

This pull request adds a divergence checking mechanism for tasks.

It does so by periodically gathering the hashes of all tasks from task_recording and comparing them across nodes. When a divergence is detected, an error listing the diverged task hashes and their full task records is printed, for example:

[2023-10-02 17:31:07.784] [error] Divergence detected in task graph at index 1:

0x471b0f1db5e4b8e6 on nodes 1 
0xe9fbff654e3748e1 on nodes 0 

[2023-10-02 17:31:07.784] [error] Task record for hash 0x471b0f1db5e4b8e6:

id: 1, debug_name: task_b_4, type: device-compute, cgid: 0
geometry:
         dimensions: 2, global_size: [1,1,1], global_offset: [0,0,0], granularity: [1,1,1]
accesses: 
         bid: 0, buffer_name: , mode: R, req: {[64,0,0] - [128,1,1]}
dependencies: 
         node: 0, kind: true-dep, origin: last-epoch
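
For reviewers who want a mental model of the comparison step, here is a minimal sketch in Python. It is not the code in this PR (Celerity is C++); the names `find_divergence` and `format_divergence` and the list-of-lists input are assumptions made for illustration. It only shows the idea described above: compare the per-node hash sequences index by index and report the first index where the nodes disagree.

```python
from collections import defaultdict


def find_divergence(hashes_per_node):
    """Return (index, {hash: [node ids]}) for the first index where nodes disagree.

    hashes_per_node[n][i] is assumed to be the hash node n computed for its
    i-th task record. Only the prefix that every node has reported is compared.
    Returns None if no divergence is found.
    """
    if not hashes_per_node:
        return None
    common_len = min(len(hashes) for hashes in hashes_per_node)
    for index in range(common_len):
        nodes_by_hash = defaultdict(list)
        for node, hashes in enumerate(hashes_per_node):
            nodes_by_hash[hashes[index]].append(node)
        if len(nodes_by_hash) > 1:  # more than one distinct hash: divergence
            return index, dict(nodes_by_hash)
    return None


def format_divergence(index, nodes_by_hash):
    """Render a report loosely modelled on the log excerpt above (hypothetical format)."""
    lines = [f"Divergence detected in task graph at index {index}:"]
    for task_hash, nodes in sorted(nodes_by_hash.items()):
        node_list = " ".join(str(n) for n in nodes)
        lines.append(f"{task_hash:#018x} on nodes {node_list}")
    return "\n".join(lines)
```

Feeding in two nodes whose hashes agree at index 0 and differ at index 1 yields a report with one line per distinct hash, similar in shape to the error shown above.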

It also includes rudimentary deadlock detection: if some nodes are stuck, a warning is printed after a given amount of time (e.g. 10 seconds):

[warning] After 10 seconds of waiting nodes 1, did not move to the next task. The runtime might be stuck.
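
The timeout idea can be sketched as follows. This is an illustrative Python model, not the PR's implementation: the class `ProgressWatchdog`, its methods, and the injectable `clock` parameter are all hypothetical. It assumes each node reports when it reaches the next task and that a periodic check flags every node that has not reported within the timeout.

```python
import time


class ProgressWatchdog:
    """Tracks when each node last moved to the next task (hypothetical helper)."""

    def __init__(self, num_nodes, timeout_s=10.0, clock=time.monotonic):
        self._clock = clock  # injectable so the logic can be tested without sleeping
        self._timeout_s = timeout_s
        now = clock()
        self._last_progress = [now] * num_nodes

    def task_reached(self, node):
        """Record that `node` has moved to its next task."""
        self._last_progress[node] = self._clock()

    def stuck_nodes(self):
        """Return the ids of nodes that made no progress for at least timeout_s."""
        now = self._clock()
        return [
            node
            for node, last in enumerate(self._last_progress)
            if now - last >= self._timeout_s
        ]
```

A caller would poll `stuck_nodes()` periodically and log a warning when the list is non-empty. Because this only observes missing progress, it cannot distinguish a real deadlock from a slow node, which matches the cautious wording of the warning ("might be stuck").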

All of this is automatically turned on by running the program with task recording enabled.

4 participants