GRIT Architecture
This document describes the high-level architecture and design decisions in GRIT.
Overview
GRIT (Genomic Range Interval Toolkit) is a high-performance Rust implementation of genomic interval operations. It prioritizes:
- Performance: Zero-allocation hot paths, SIMD-accelerated parsing
- Memory efficiency: O(k) streaming algorithms where k = max overlapping intervals
- Correctness: SHA256 parity with bedtools for all commands
- Usability: Drop-in replacement for bedtools CLI
Module Structure
src/
├── main.rs # CLI entry point and argument parsing
├── lib.rs # Library exports
├── bed/
│ ├── mod.rs # BED record types and parsing
│ ├── record.rs # Interval record representation
│ └── parser.rs # Zero-copy BED parsing
├── commands/
│ ├── mod.rs # Command registry
│ ├── intersect.rs # Intersection operations
│ ├── subtract.rs # Subtraction operations
│ ├── merge.rs # Merge operations
│ ├── sort.rs # Sorting operations
│ ├── closest.rs # Nearest neighbor search
│ ├── window.rs # Window-based operations
│ ├── coverage.rs # Coverage computation
│ ├── slop.rs # Interval extension
│ ├── complement.rs # Gap finding
│ ├── genomecov.rs # Genome-wide coverage
│ ├── jaccard.rs # Similarity coefficient
│ └── multiinter.rs # Multi-file intersection
├── streaming/
│ ├── mod.rs # Streaming infrastructure
│ ├── merge.rs # Streaming merge
│ ├── intersect.rs # Streaming intersection
│ └── sweep.rs # Sweep-line algorithm
├── sort/
│ ├── mod.rs # Sort implementations
│ └── radix.rs # Radix sort for intervals
└── utils/
├── mod.rs # Utility functions
├── genome.rs # Genome file handling
└── validation.rs # Input validation
Core Components
BED Record Representation
pub struct BedRecord {
pub chrom: String,
pub start: u64,
pub end: u64,
pub rest: Option<String>, // Optional additional fields
}
Design decisions:
- Use
u64for coordinates (supports chromosomes > 4GB) - Store raw rest-of-line to avoid parsing unused fields
- Zero-copy parsing where possible
Streaming Infrastructure
The streaming system enables O(k) memory processing:
pub trait StreamProcessor {
fn process_interval(&mut self, record: &BedRecord) -> Result<Vec<BedRecord>>;
fn flush(&mut self) -> Result<Vec<BedRecord>>;
}
See STREAMING_MODEL.md for details.
Sort Order Validation
GRIT validates input sort order to prevent silent failures:
pub struct SortValidator {
last_chrom: Option<String>,
last_start: u64,
genome_order: Option<Vec<String>>,
}
See VALIDATION.md for details.
Command Architecture
Each command follows a consistent pattern:
- Argument parsing via
clapderive macros - Input validation (sort order, chromosome order)
- Core algorithm (streaming or parallel)
- Output formatting (bedtools-compatible)
Example: Intersect
pub fn run(args: IntersectArgs) -> Result<()> {
// 1. Open input files
let a_reader = BedReader::open(&args.file_a)?;
let b_reader = BedReader::open(&args.file_b)?;
// 2. Validate sort order (unless --assume-sorted)
if !args.assume_sorted {
validate_sorted(&a_reader)?;
validate_sorted(&b_reader)?;
}
// 3. Choose algorithm based on flags
if args.streaming {
streaming_intersect(a_reader, b_reader, &args)
} else {
parallel_intersect(a_reader, b_reader, &args)
}
}
Performance Optimizations
Zero-Allocation Parsing
// Use memchr for fast field detection
fn parse_bed_line(line: &[u8]) -> Result<BedRecord> {
let tab1 = memchr::memchr(b'\t', line)?;
let tab2 = memchr::memchr(b'\t', &line[tab1+1..])?;
// ...
}
SIMD Integer Parsing
// Fast integer parsing without allocation
fn parse_u64(bytes: &[u8]) -> u64 {
bytes.iter().fold(0u64, |acc, &b| acc * 10 + (b - b'0') as u64)
}
Memory-Mapped I/O
// For large files, use mmap for efficient random access
let mmap = unsafe { Mmap::map(&file)? };
Thread Model
GRIT uses Rayon for parallel processing:
- Parallel mode: Load intervals into memory, process in parallel
- Streaming mode: Single-threaded sweep-line for O(k) memory
// Parallel intersection
intervals_a.par_iter()
.flat_map(|a| find_overlaps(a, &tree_b))
.for_each(|result| output(result));
Error Handling
GRIT uses thiserror for structured errors:
#[derive(Error, Debug)]
pub enum GritError {
#[error("I/O error: {0}")]
Io(#[from] std::io::Error),
#[error("file not sorted: position {0} after {1} on {2}")]
UnsortedInput(u64, u64, String),
#[error("invalid BED format at line {0}: {1}")]
ParseError(usize, String),
}
Testing Strategy
- Unit tests: Each module has inline tests
- Integration tests: End-to-end command tests
- Parity tests: SHA256 comparison with bedtools
- Property tests: Invariant verification
See the tests/ directory for examples.