GRIT Streaming Model
This document describes GRIT’s streaming algorithms and their memory guarantees.
Overview
GRIT’s streaming mode processes genomic intervals with O(k) memory, where k is the maximum number of intervals overlapping at any genomic position. This enables processing arbitrarily large files with constant memory usage.
Memory Complexity
| Mode | Memory | Use Case |
|---|---|---|
| Parallel | O(n + m) | Maximum speed, unsorted input |
| Streaming | O(k) | Large files, memory-constrained |
Where:
- n = number of intervals in file A
- m = number of intervals in file B
- k = max overlapping intervals at any position (typically < 100)
Sweep-Line Algorithm
The core streaming algorithm uses a sweep-line approach:
Input: Two sorted BED files A and B
Output: Intersection results
1. Initialize active_b = empty set (intervals from B currently "active")
2. For each interval a in A:
a. Remove intervals from active_b that end before a.start
b. Add intervals from B that start before a.end to active_b
c. For each interval b in active_b overlapping a:
- Output intersection
3. Output remaining active intervals (if needed)
Visual Example
Position: 100 200 300 400 500
|---------|---------|---------|---------|
A intervals:
[========A1========]
[====A2====]
B intervals:
[==B1==]
[=====B2=====]
[===B3===]
Sweep line progression:
pos=100: active_b = {B1} → A1 overlaps B1
pos=150: active_b = {B1, B2} → A1 overlaps B1, B2
pos=200: active_b = {B2} → A1 overlaps B2
pos=250: active_b = {B2} → A2 overlaps B2
pos=300: active_b = {B2, B3} → A2 overlaps B2, B3
pos=350: active_b = {B3} → A2 overlaps B3
Implementation
Active Interval Buffer
struct ActiveBuffer {
intervals: VecDeque<BedRecord>,
current_chrom: String,
}
impl ActiveBuffer {
fn add(&mut self, record: BedRecord) {
self.intervals.push_back(record);
}
fn remove_expired(&mut self, position: u64) {
while let Some(front) = self.intervals.front() {
if front.end <= position {
self.intervals.pop_front();
} else {
break;
}
}
}
fn overlapping(&self, start: u64, end: u64) -> impl Iterator<Item = &BedRecord> {
self.intervals.iter()
.filter(move |b| b.start < end && b.end > start)
}
}
Memory Bound Proof
Theorem: The active buffer size is bounded by k, where k is the maximum number of intervals overlapping at any position.
Proof:
- An interval b enters the buffer when the sweep line reaches b.start
- An interval b exits the buffer when the sweep line passes b.end
- At any position p, the buffer contains exactly the intervals covering p
- By definition, at most k intervals cover any position
- Therefore, buffer size ≤ k
In practice: For typical genomic data, k < 100 even with millions of intervals.
Sorted Input Requirement
Streaming algorithms require sorted input:
chr1 100 200 # ✓
chr1 150 250 # ✓
chr1 300 400 # ✓
chr2 100 200 # ✓ (new chromosome)
chr1 500 600 # ✗ ERROR: chr1 after chr2
Validation
GRIT validates sort order by default:
fn validate_sorted(&mut self, record: &BedRecord) -> Result<()> {
if let Some(ref last_chrom) = self.last_chrom {
if record.chrom == *last_chrom {
if record.start < self.last_start {
return Err(GritError::UnsortedInput(...));
}
} else if self.seen_chroms.contains(&record.chrom) {
return Err(GritError::ChromosomeRevisited(...));
}
}
Ok(())
}
Skip Validation
For pre-sorted files, skip validation for faster startup:
grit intersect -a sorted_a.bed -b sorted_b.bed --streaming --assume-sorted
Chromosome Transitions
Special handling is required at chromosome boundaries:
fn process_chromosome_change(&mut self, new_chrom: &str) {
// Flush all remaining output for current chromosome
self.flush_current_chromosome();
// Clear active buffer (intervals from old chromosome)
self.active_buffer.clear();
// Update current chromosome
self.current_chrom = new_chrom.to_string();
}
Commands Supporting Streaming
| Command | --streaming flag | Memory |
|---|---|---|
| intersect | Yes | O(k) |
| subtract | Yes | O(k) |
| closest | Yes | O(k) |
| window | Implicit | O(k) |
| coverage | Implicit | O(k) |
| merge | Always | O(1) |
| complement | Always | O(1) |
Performance Characteristics
Time Complexity
- Streaming intersection: O(n + m + output)
- Parallel intersection: O((n + m) log(n + m) + output)
I/O Pattern
Streaming mode reads input sequentially, enabling:
- Efficient disk access patterns
- Pipeline processing (stdin/stdout)
- Memory-mapped I/O optimization
When to Use Streaming
Use streaming mode when:
- Files exceed available RAM
- Processing pipelines (stdin/stdout)
- Memory is constrained
- Files are already sorted
Use parallel mode when:
- Maximum speed is required
- Files fit in memory
- Input is unsorted (will be sorted in memory)
Testing Streaming Correctness
Memory Verification
# Linux: Check peak RSS
/usr/bin/time -v grit intersect --streaming -a large.bed -b large.bed 2>&1 | grep "Maximum resident"
# macOS: Use Activity Monitor or instruments
Correctness Verification
# Compare streaming vs parallel output
grit intersect -a A.bed -b B.bed > parallel.bed
grit intersect -a A.bed -b B.bed --streaming > streaming.bed
diff parallel.bed streaming.bed # Should be identical
Edge Cases
Test with:
- Empty files
- Single interval
- Non-overlapping intervals
- Highly nested intervals (stress k bound)
- Chromosome transitions