-
Notifications
You must be signed in to change notification settings - Fork 14
Add doc on supervision edge and edge dir #553
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Merged
Changes from all commits
Commits
Show all changes
7 commits
Select commit
Hold shift + click to select a range
fd2cd3c
Initial commit
mkolodner-sc 47dcc2d
Merge branch 'main' into mkolodner-sc/add_supervision_edge_dir_doc
mkolodner-sc e1b4e33
Address comments
mkolodner-sc 827d043
Fix
mkolodner-sc c9d2f9e
Address comments
mkolodner-sc 28efbde
Fix
mkolodner-sc 16c099a
Update format
mkolodner-sc File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,259 @@ | ||
| # Edge Direction and Supervision Edge Types for Link Prediction | ||
|
|
||
| This document explains how the `edge_dir` parameter interacts with supervision edge types for the link prediction task. | ||
|
|
||
| ## What `edge_dir` means | ||
|
|
||
| In GLT / GiGL, a distributed graph stores message passing edges in one canonical direction and `edge_dir` tells the | ||
| sampler which direction to walk them: | ||
|
|
||
| - `"out"` — standard forward traversal: neighbours of node `v` are the nodes that `v` points **to** (i.e. | ||
| `v → neighbour`). | ||
| - `"in"` — reverse traversal: neighbours of node `v` are the nodes that point **into** `v` (i.e. `neighbour → v`, read | ||
| as "incoming edges of `v`"). | ||
|
|
||
| ## Choosing `edge_dir` | ||
|
|
||
| The right choice depends on the semantics of your edges and which direction of traversal gives meaningful neighbours for | ||
| your anchor nodes. | ||
|
|
||
| Consider a social graph where edges represent "A follows B" (A→B), and you are predicting future follow edges from | ||
| users: | ||
|
|
||
| **`edge_dir="out"` (default)** — the sampler walks edges in their stored direction. Starting from User A, hop 1 finds | ||
| A's followees (nodes A points to); hop 2 finds their followees; and so on. This builds a representation of A based on | ||
| *who A follows and what those accounts do*. | ||
|
|
||
| ``` | ||
| User A ──follows──► User B ──follows──► User D | ||
| ``` | ||
|
|
||
| **`edge_dir="in"`** — the sampler walks edges in reverse (against their stored direction). Starting from User A, hop 1 | ||
| finds A's followers (nodes that point to A); hop 2 finds their followers. This builds a representation of A based on | ||
| *who follows A and what those accounts do*. | ||
|
|
||
| ``` | ||
| User X ──follows──► User A ◄──follows── User Y | ||
| ``` | ||
|
|
||
| Generally, use `"out"` when the signal flows forward along your edges (e.g. a user's taste is defined by what they | ||
| follow). Use `"in"` when the signal flows backward (e.g. a user's influence is defined by who follows them). | ||
|
|
||
| ______________________________________________________________________ | ||
|
|
||
| ## Two types of edges | ||
|
|
||
| Additionally, there are two types of edges which GiGL distinguishes for the link prediction task: | ||
|
|
||
| - **Message passing edges** — the graph topology used for GNN neighbourhood aggregation. These are the edges in | ||
| `edge_indices` (e.g. `USER→ITEM`, `ITEM→USER`). `edge_dir` controls which direction the sampler walks these edges | ||
| during training and inference. | ||
|
|
||
| - **Supervision edges** — the positive and negative edges used as the training signal. They describe *what the model is | ||
| trying to predict*, not the graph topology. **The supervision edge types must always be supplied in the outgoing | ||
| (src→dst) direction, regardless of `edge_dir`.** | ||
|
|
||
| ______________________________________________________________________ | ||
|
|
||
| ## Supplying Supervision Edges | ||
|
|
||
| GiGL supports two paradigms for providing supervision edges. | ||
|
|
||
| ### Self-supervised labels | ||
|
|
||
| GiGL derives positive supervision labels from the message passing edges at dataset build time. A random subset — | ||
| controlled by the `_ssl_positive_label_percentage` argument in `build_dataset` — is sampled as positive labels. No | ||
| separate label tensors are needed. | ||
|
|
||
| You still must pass `supervision_edge_type` in the **outgoing direction** to both `DistNodeAnchorLinkSplitter` and | ||
| `DistABLPLoader`. | ||
|
|
||
| ```python | ||
| supervision_edge_type = EdgeType(USER, Relation("to"), ITEM) # always outgoing | ||
|
|
||
| splitter = DistNodeAnchorLinkSplitter( | ||
| sampling_direction=edge_dir, | ||
| supervision_edge_types=[supervision_edge_type], | ||
| ) | ||
| dataset = build_dataset( | ||
| serialized_graph_metadata=..., | ||
| sample_edge_direction=edge_dir, | ||
| splitter=splitter, | ||
| _ssl_positive_label_percentage=0.05, # 5% of message passing edges used as positive labels | ||
| ) | ||
| loader = DistABLPLoader( | ||
| dataset=dataset, | ||
| supervision_edge_type=supervision_edge_type, # outgoing, regardless of edge_dir | ||
| ... | ||
| ) | ||
| ``` | ||
|
|
||
| ### Explicitly provided labels | ||
|
|
||
| Users register positive and optionally negative label edges in `DataPreprocessorConfig.get_edges_preprocessing_spec()` | ||
| using `EdgeUsageType.POSITIVE` and `EdgeUsageType.NEGATIVE`. These are written as TFRecords by the data preprocessor and | ||
| loaded automatically when the dataset is built — no extra loading code is needed. | ||
|
|
||
| Label edges must be in the **outgoing (src→dst) direction**. GiGL handles any internal reversal required for | ||
| `edge_dir="in"` automatically (see [What happens internally](#what-happens-internally)). | ||
|
|
||
| ```python | ||
| # In DataPreprocessorConfig.get_edges_preprocessing_spec(): | ||
| def get_edges_preprocessing_spec(self): | ||
| supervision_edge_type = EdgeType(USER, Relation("to"), ITEM) # outgoing direction | ||
|
|
||
| positive_edge_ref = BigqueryEdgeDataReference( | ||
| reference_uri=self.positive_edges_table, | ||
| edge_type=supervision_edge_type, | ||
| edge_usage_type=EdgeUsageType.POSITIVE, | ||
| ) | ||
| negative_edge_ref = BigqueryEdgeDataReference( # optional | ||
| reference_uri=self.negative_edges_table, | ||
| edge_type=supervision_edge_type, | ||
| edge_usage_type=EdgeUsageType.NEGATIVE, | ||
| ) | ||
| ... | ||
|
|
||
| # At training time, labels are loaded automatically from TFRecords: | ||
| supervision_edge_type = EdgeType(USER, Relation("to"), ITEM) # always outgoing | ||
|
|
||
| splitter = DistNodeAnchorLinkSplitter( | ||
| sampling_direction=edge_dir, | ||
| supervision_edge_types=[supervision_edge_type], | ||
| ) | ||
| dataset = build_dataset( | ||
| serialized_graph_metadata=..., | ||
| sample_edge_direction=edge_dir, | ||
| splitter=splitter, | ||
| # no _ssl_positive_label_percentage — labels are loaded from TFRecords | ||
| ) | ||
| loader = DistABLPLoader( | ||
| dataset=dataset, | ||
| supervision_edge_type=supervision_edge_type, # outgoing, regardless of edge_dir | ||
| ... | ||
| ) | ||
| ``` | ||
|
|
||
| ______________________________________________________________________ | ||
|
|
||
| ## What happens internally | ||
|
|
||
| This section describes how both supervision edges and message passing edges are stored in a `DistDataset`. This is | ||
| purely internal detail — you do not need to implement this yourself, but it is useful context when reading the code or | ||
| debugging. | ||
|
|
||
| There are two principles which our internal codebase currently operates under: | ||
|
|
||
| - Both supervision and message passing edges are stored in a single `dict[EdgeType, Tensor]` edge index object. | ||
| Supervision edges are injected under derived edge type keys: the relation is suffixed with `_gigl_positive` or | ||
| `_gigl_negative` (e.g. a user-provided supervision edge type `("user", "to", "item")` becomes | ||
| `("user", "to_gigl_positive", "item")` for positive labels), keeping them distinct from the message passing edge type | ||
| keys. | ||
| - All edges stored in the `DistDataset` must share the same edge direction (either both `"in"` or both `"out"`). | ||
|
|
||
| Since supervision edges are always provided in the outgoing direction, but `edge_dir="in"` requires all stored edges to | ||
| point inward, we reverse the supervision edges before storing them alongside the message passing edges. When | ||
| `edge_dir="out"`, no reversal is needed since both are already outgoing. The reversal (both the edge type and the COO | ||
| row flip) is implemented in `_get_label_edges` (`gigl/types/graph.py:110`). `DistNodeAnchorLinkSplitter` | ||
| (`gigl/utils/data_splitters.py:250`) and `DistABLPLoader` (`gigl/distributed/dist_ablp_neighborloader.py:462`) | ||
| independently apply the same edge type reversal to derive the correct lookup key when accessing supervision edges from | ||
| the stored graph. | ||
|
|
||
| For example, if a user provides `supervision_edge_type = ("user", "to", "item")` with `edge_dir="in"`, we internally | ||
| reverse the supervision edge type so that all edges can be stored in the same edge index object. The dataset will store | ||
| an injected `("item", "to_gigl_positive", "user")` edge type, with the edge index rows flipped accordingly. | ||
|
|
||
| ### `DistNodeAnchorLinkSplitter` and `edge_dir` | ||
|
|
||
| The splitter reads anchor nodes directly from the stored supervision edge index. It uses `edge_dir` to determine two | ||
| things: | ||
|
|
||
| - **Which edge type to look up** — it derives the stored key the same way as storage (reversing for `edge_dir="in"`), so | ||
| the lookup always matches. | ||
| - **Which end of the supervision edge holds the anchor nodes** — `edge_dir` determines which endpoint the splitter reads | ||
| anchor nodes from: the source end for `"out"`, the destination end for `"in"`, since the stored edge type is reversed | ||
| for `"in"` and the anchor node type shifts accordingly. For example, with | ||
| `supervision_edge_type = ("user", "to", "item")`: `edge_dir="out"` reads user IDs from the source end of | ||
| `("user", "to_gigl_positive", "item")`; `edge_dir="in"` reads user IDs from the destination end of | ||
| `("item", "to_gigl_positive", "user")`. | ||
|
|
||
| The invariant: anchor nodes (users in the above example) are always used for splitting regardless of `edge_dir`. | ||
|
|
||
| ______________________________________________________________________ | ||
|
|
||
| ### Worked example | ||
|
|
||
| Consider a USER→ITEM graph. A user wants to predict which items users will interact with next and provides: | ||
|
|
||
| ```python | ||
| supervision_edge_type = EdgeType(USER, Relation("to"), ITEM) | ||
|
|
||
| positive_labels = { | ||
| supervision_edge_type: torch.tensor([ | ||
| [0, 0, 1, 1, 2, 2 ], # src: user IDs | ||
| [10,11,11,12,12,13], # dst: item IDs | ||
| ]) | ||
| } | ||
| negative_labels = { | ||
| supervision_edge_type: torch.tensor([ | ||
| [0, 1, 2 ], # src: user IDs | ||
| [12, 13, 10], # dst: item IDs | ||
| ]) | ||
| } | ||
| ``` | ||
|
|
||
| These inputs are **identical regardless of `edge_dir`**. What changes is how they are stored internally in the | ||
| `DistDataset`. | ||
|
|
||
| **`edge_dir="out"`** — supervision and message passing edges already share the same direction, so no reversal is needed: | ||
|
|
||
| ``` | ||
| User 0 ──pos──► Item 10 | ||
| ──pos──► Item 11 | ||
| ··neg··► Item 12 | ||
|
|
||
| User 1 ──pos──► Item 11 | ||
| ──pos──► Item 12 | ||
| ··neg··► Item 13 | ||
|
|
||
| User 2 ──pos──► Item 12 | ||
| ──pos──► Item 13 | ||
| ··neg··► Item 10 | ||
|
|
||
| injected positive edge type = ("user", "to_gigl_positive", "item") | ||
| injected positive edge index = [[0, 0, 1, 1, 2, 2 ], ← user IDs (row 0 = src) | ||
| [10, 11, 11, 12, 12, 13]] ← item IDs (row 1 = dst) | ||
|
|
||
| injected negative edge type = ("user", "to_gigl_negative", "item") | ||
| injected negative edge index = [[0, 1, 2 ], ← user IDs (row 0) | ||
| [12, 13, 10]] ← item IDs (row 1) | ||
| ``` | ||
|
|
||
| **`edge_dir="in"`** — supervision edges are reversed before storage so they match the inward orientation of the message | ||
| passing edges: | ||
|
|
||
| ``` | ||
| Item 10 ──pos──► User 0 | ||
| ··neg··► User 2 | ||
|
|
||
| Item 11 ──pos──► User 0 | ||
| ──pos──► User 1 | ||
|
|
||
| Item 12 ──pos──► User 1 | ||
| ──pos──► User 2 | ||
| ··neg··► User 0 | ||
|
|
||
| Item 13 ──pos──► User 2 | ||
| ··neg··► User 1 | ||
|
|
||
| injected positive edge type = ("item", "to_gigl_positive", "user") | ||
| injected positive edge index = [[10, 11, 11, 12, 12, 13], ← item IDs (row 0 = src) | ||
| [0, 0, 1, 1, 2, 2 ]] ← user IDs (row 1 = dst) | ||
|
|
||
| injected negative edge type = ("item", "to_gigl_negative", "user") | ||
| injected negative edge index = [[12, 13, 10], ← item IDs (row 0) | ||
| [0, 1, 2 ]] ← user IDs (row 1) | ||
| ``` | ||
|
|
||
|
mkolodner-sc marked this conversation as resolved.
|
||
| In both cases the user IDs (anchor nodes) are dervied from from the stored edge index, and the user-provided | ||
| `supervision_edge_type` and edge index tensors are unchanged. | ||
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Uh oh!
There was an error while loading. Please reload this page.