█████╗ ███████╗████████╗██╗ ██╗███████╗██████╗ ██████╗ ██████╗ █████╗ ██████╗ ██╗ ██╗
██╔══██╗██╔════╝╚══██╔══╝██║ ██║██╔════╝██╔══██╗██╔════╝ ██╔══██╗██╔══██╗██╔══██╗██║ ██║
███████║█████╗ ██║ ███████║█████╗ ██████╔╝██║ ███╗██████╔╝███████║██████╔╝███████║
██╔══██║██╔══╝ ██║ ██╔══██║██╔══╝ ██╔══██╗██║ ██║██╔══██╗██╔══██║██╔═══╝ ██╔══██║
██║ ██║███████╗ ██║ ██║ ██║███████╗██║ ██║╚██████╔╝██║ ██║██║ ██║██║ ██║ ██║
╚═╝ ╚═╝╚══════╝ ╚═╝ ╚═╝ ╚═╝╚══════╝╚═╝ ╚═╝ ╚═════╝ ╚═╝ ╚═╝╚═╝ ╚═╝╚═╝ ╚═╝ ╚═╝
Rust core + Python bindings. High-performance graph sampling and real-time feature serving for GNN training on billion-scale evolving graphs.
"Because the CSR layout is simply 3 arrays, it scales on a single computer: a CSR matrix can be laid out on a disk instead of in-memory. You simply memory map the 3 arrays and use them on-disk from there.
With modern NVMe drives random seeks aren't slow anymore, much faster than distributed network calls like you do when scaling the linked list-based graph. I haven't seen anyone actually implement this yet, but it's in the roadmap for my implementation at least."
— VodkaHaze, r/MachineLearning (2021)
from aethergraph import Graph
from aethergraph.pytorch import NeighborLoader
# Static graph (CSR, mmap'd from NVMe)
graph = Graph.load("graph.bin")
graph.load_features("features.bin")
loader = NeighborLoader(graph, num_neighbors=[15, 10], batch_size=128)
for batch in loader:
out = model(batch.x, batch.edge_index)from aethergraph import DynamicGraph
# Live graph (C-tree, single-writer ingest, lock-free readers)
graph = DynamicGraph(num_vertices=2_000_000_000, arena_mb=8192)
graph.insert_edge(user_id, post_id) # from Kafka consumer
# Training reads concurrently — no locks, no stale snapshotsfrom aethergraph import HeteroGraph
from aethergraph.pytorch import HeteroNeighborLoader
# Heterogeneous graph (Reddit: users, posts, comments, subreddits)
loader = HeteroNeighborLoader(
hetero_graph,
num_neighbors={
("user", "votes", "post"): [15, 10],
("user", "writes", "comment"): [10, 5],
("post", "belongs_to", "subreddit"): [3, 2],
},
input_nodes=("user", train_user_ids),
batch_size=128,
)
for batch in loader: # yields PyG HeteroData
user_x = batch["user"].x
edge_index = batch["user", "votes", "post"].edge_indexTraditional GNN frameworks load entire graphs into GPU memory, hitting the VRAM
wall at millions-of-nodes scale. AetherGraph keeps topology on disk and streams
neighborhoods via memory-mapping + io_uring, enabling 100B+ edge graphs on
commodity hardware.
| PyTorch Geometric | AetherGraph | |
|---|---|---|
| Max graph size | ~40M nodes (VRAM) | 2B+ nodes (NVMe) |
| Graph updates | Full rebuild | O(1) edge insert, no rebuild |
| Feature loading | All in RAM | Streamed on-demand |
| Feature serving | N/A | GPUDirect RDMA (<5us) |
| Hetero graphs | Python-level looping | Rust-native typed sampling |
| Startup time | Minutes (load all) | Instant (mmap) |
| Sampling (1M nodes) | 341 us/batch | 240 us/batch (1.4x faster) |
How it works:
- Static CSR: Graph stored as 3 arrays (offsets, destinations, weights) — O(1) neighbor lookup, mmap'd from NVMe
- Dynamic C-tree: Balanced tree of cache-line-sized chunks — O(1) edge insert, lock-free reads via functional persistence and atomic root swap
- Rabbit Order reordering: Hierarchical community-detection vertex permutation (Arai et al., IPDPS 2016) for cache-friendly sampling on power-law graphs
- io_uring (Linux): Batched async NVMe reads, ~10us per random access
- AF_XDP + RDMA: Kernel-bypass ingestion and GPUDirect feature serving
- Rust core: Zero-copy data pipeline from disk to PyTorch tensors
| Crate | Purpose |
|---|---|
aethergraph-core |
Static CSR graph, homogeneous + heterogeneous sampling, feature store |
aethergraph-cli |
CLI tools (convert, info, stats) |
aethergraph-py |
PyO3 Python bindings |
aether-graph |
Dynamic graph with C-tree neighbor lists, lock-free reader path |
aether-mem |
Lock-free slab allocator with HugePages and pluggable memory hooks |
aether-stream |
Kernel-bypass streaming: AF_XDP ingestion, seqlock feature table, RDMA |
For graphs that don't change during training. Memory-mapped from NVMe, instant startup.
graph = Graph.load("reddit_2B.bin") # instant, no loading
loader = NeighborLoader(graph, num_neighbors=[15, 10], batch_size=128)For graphs that evolve during training. Single-writer ingest publishes new C-tree roots atomically; readers run concurrently without taking locks. Edges arrive from Kafka/Flink while training samples neighborhoods.
graph = DynamicGraph(num_vertices=2_000_000_000, arena_mb=8192)
# Writer thread (Kafka consumer)
for edge in kafka_stream:
graph.insert_edge(edge.src, edge.dst)
# Reader thread (training, concurrent)
neighbors = graph.neighbors(node_id) # lock-free, consistent snapshotFor multi-relational graphs (users, posts, comments, subreddits). Typed k-hop sampling with per-edge-type fanout. Drop-in replacement for PyG's HeteroData NeighborLoader.
loader = HeteroNeighborLoader(
hetero_graph,
num_neighbors={("user", "votes", "post"): [15, 10], ...},
input_nodes=("user", train_ids),
batch_size=128,
)For real-time models (fraud detection, feed ranking), features change with every user interaction. AetherGraph serves live features directly into GPU memory via GPUDirect RDMA, bypassing CPU entirely.
NIC (AF_XDP) GPU Node
│ │
▼ │
Ingestion Loop │
(busy-poll, per-queue thread) │
│ │
▼ │
FeatureTable │
(seqlock, HugePage RAM) │
│ │
├── TCP Control Plane ──────► Discover base_addr + rkey
│ │
└── RDMA READ ◄───────────── GPUDirect (NIC → PCIe → VRAM)
│
▼
GPU Kernel
(validate + compact)
│
▼
[batch, feature_dim]
output tensor
# One config change — features arrive via RDMA instead of from disk
loader = NeighborLoader(
graph, num_neighbors=[15, 10], batch_size=128,
feature_source="rdma://10.0.0.1:9999",
)
for batch in loader:
batch.x # CUDA tensor, arrived via GPUDirect RDMA in <5usSampling a billion-node graph is memory-bound — the working set is the destination array, and random neighbor lookups thrash the cache. Rabbit Order (Arai et al., IPDPS 2016) computes a vertex permutation that places communities contiguously, so neighbor reads stay in L2/L3 far longer.
graph = Graph.load("graph.bin")
perm = graph.reorder_rabbit() # returns the permutation
reordered = graph.permute(perm) # build a new CSR in the new order
reordered.save("graph.reordered.bin")Implementation:
- Phase 1 (parallel over V): each node picks its lowest-degree neighbor.
- Phase 2 (parallel over E): lock-free concurrent union-find merges remaining
cross-community edges (
AtomicU32parent + rank, path-splittingfind, CASunion). - Phase 3 (sequential, O(V)): replay the merge log into a dendrogram and emit the permutation.
The community partitions fall out of the merge log for free —
graph.rabbit_partitions() returns dense partition IDs without a second pass,
and partition_aligned_batches uses them to build seed batches that respect
locality.
See crates/aethergraph-core/benches/graph_benchmarks.rs (rabbit_reorder,
reorder_sampling_speedup) for the measurement methodology.
pip install aethergraphFor PyTorch Geometric integration:
pip install aethergraph[pytorch-geometric]Replicated Topology: graph on each worker's NVMe, only gradients cross the network.
Traditional: Worker 1 ←── graph edges ──→ Worker 2 # Terabytes over network
AetherGraph: Worker 1 ←── gradients ──→ Worker 2 # Megabytes over network
│ │
NVMe NVMe # Graph is local
import ray
from aethergraph.ray import create_sampling_dataset, collate_to_pyg
ray.init()
dataset = create_sampling_dataset(graph_path="graph.bin", num_neighbors=[25, 10], batch_size=1024)
for batch in dataset.iter_batches(batch_format="pyarrow"):
data = collate_to_pyg(batch)
out = model(data.x, data.edge_index)from aethergraph import Graph
graph = Graph.load("graph.bin") # mmap, instant startup
graph = Graph.from_edges(num_nodes, src, dst)
graph.save("graph.bin")
graph.load_features("features.bin")
graph.num_nodes # int
graph.num_edges # int
# Cache-locality reordering (Rabbit Order)
perm = graph.reorder_rabbit() # permutation: new_id → old_id
reordered = graph.permute(perm) # new CSR in the permuted order
parts = graph.rabbit_partitions() # dense partition IDs per node
perm, parts = graph.reorder_rabbit_with_partitions() # both in one passfrom aethergraph import DynamicGraph
graph = DynamicGraph(num_vertices=2_000_000_000, arena_mb=8192)
graph.insert_edge(src, dst) # O(1), lock-free for readers
graph.insert_edges(src_array, dst_array) # batch insert from numpy
graph.neighbors(node) # sorted numpy array
graph.degree(node) # O(1)
graph.has_edge(src, dst) # O(log degree)from aethergraph.pytorch import NeighborLoader
loader = NeighborLoader(
graph,
num_neighbors=[15, 10],
batch_size=128,
input_nodes=train_idx,
shuffle=True,
)
for batch in loader:
batch.x # [num_nodes, feat_dim] node features
batch.edge_index # [2, num_edges] COO edges
batch.n_id # [num_nodes] original node IDs
batch.batch_size # number of seed nodes
batch.input_id # seed indices in batch
batch.num_sampled_nodes # nodes per hop [num_hops]
batch.num_sampled_edges # edges per hop [num_hops]from aethergraph.pytorch import HeteroNeighborLoader
loader = HeteroNeighborLoader(
hetero_graph,
num_neighbors={
("user", "votes", "post"): [15, 10],
("user", "writes", "comment"): [10, 5],
},
input_nodes=("user", train_user_ids),
batch_size=128,
)
for batch in loader: # yields PyG HeteroData
batch["user"].x
batch["user", "votes", "post"].edge_indexAetherGraph is a high-performance replacement for PyG's data loading pipeline.
Supported (homogeneous + heterogeneous):
num_neighbors(list or per-edge-type dict)input_nodes(Tensor, array, list, or(node_type, Tensor)for hetero)batch_size,shuffle,replace,transformpin_memory,prefetch_factor- Weighted sampling (
weighted=True, uses edge weights) - Temporal sampling (
input_time,temporal_strategy="uniform"|"last") - Disjoint subgraphs (
disjoint=True, per-seed isolation withbatchvector) - Custom samplers (
neighbor_sampler=callableto bypass Rust backend) - Yields standard PyG
Data/HeteroDataobjects
Unique to AetherGraph:
- Live graph updates via
DynamicGraph(C-tree, single-writer + lock-free readers) feature_source="rdma://..."for GPUDirect RDMA live features- mmap'd CSR for instant startup on billion-node graphs
- io_uring feature loading at NVMe line rate
- Rabbit Order vertex reordering with partition-aligned batching
- 1.2-1.5x faster sampling than PyG's pyg-lib C++ kernel
aethergraph convert -i edges.csv -o graph.bin -n 1000000
aethergraph info graph.bin
aethergraph stats graph.bin ┌───────────────────────────────────────┐
│ Python API │
│ Graph, DynamicGraph, HeteroGraph │
│ NeighborLoader, HeteroNeighborLoader │
└──────────────────┬────────────────────┘
│ PyO3
┌─────────────────────────┼─────────────────────────┐
│ │ │
┌──────────▼──────────┐ ┌──────────▼──────────┐ ┌──────────▼──────────┐
│ aethergraph-core │ │ aether-graph │ │ aether-stream │
│ Static CSR graph │ │ Dynamic C-tree │ │ AF_XDP ingestion │
│ Homo + hetero │ │ Lock-free readers │ │ Seqlock features │
│ sampling │ │ Arena bump-alloc │ │ RDMA + GPUDirect │
│ mmap, io_uring │ │ Snapshot isolation │ │ DLPack → PyTorch │
└─────────────────────┘ └─────────────────────┘ └─────────────────────┘
│
┌──────────▼──────────┐
│ aether-mem │
│ HugePages, mlock │
│ RDMA/CUDA hooks │
└─────────────────────┘
See ARCH.md for detailed component documentation.
Measured on Apple M-series, Python 3.12, PyG 2.6.1 with pyg-lib.
| Scale | AetherGraph | PyG (pyg-lib C++) | Speedup |
|---|---|---|---|
| 10K nodes | 170 us | 210 us | 1.23x |
| 100K nodes | 185 us | 218 us | 1.18x |
| 1M nodes | 240 us | 341 us | 1.43x |
| 10M nodes | 302 us | 358 us | 1.17x |
| Operation | Latency | Throughput |
|---|---|---|
| Edge insert | 111 ns | 9M inserts/sec |
| Neighbor read | 26 ns | 38M reads/sec |
| Edge lookup | 16 ns | 63M lookups/sec |
| Degree query | 3.4 ns | 294M queries/sec |
| Operation | Latency |
|---|---|
| Typed multi-hop sampling | 130 us |
| Edge index local remapping | 3 us |
| Full Python roundtrip | 130 us |
MIT OR Apache-2.0