GRIT Input Validation
This document describes GRIT’s input validation system and how to control it.
Overview
GRIT validates input files so that incorrectly sorted or malformed data causes an explicit error instead of a silent failure. bedtools, by contrast, can silently produce incorrect results when given unsorted input.
Validation Types
1. Sort Order Validation
GRIT validates that intervals are sorted by:
- Chromosome (contiguous blocks)
- Start position (ascending within chromosome)
Valid:
chr1 100 200
chr1 150 250 ✓ (150 >= 100)
chr1 300 400 ✓ (300 >= 150)
chr2 100 200 ✓ (new chromosome)
Invalid:
chr1 100 200
chr1 50 150 ✗ (50 < 100, unsorted)
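The two rules above can be checked in a single pass over the records. The following is a minimal sketch, not GRIT's actual code: `check_sorted` and `SortError` are hypothetical names, and records are reduced to `(chrom, start)` pairs for brevity.

```rust
use std::collections::HashSet;

/// Hypothetical error type for this sketch; GRIT's real error type may differ.
#[derive(Debug, PartialEq)]
pub enum SortError {
    /// Start position decreased within a chromosome (1-based record index).
    StartDecreased { line: usize },
    /// A chromosome reappeared after its block had already ended.
    ChromNotContiguous { line: usize },
}

/// Checks `(chrom, start)` pairs for ascending starts and contiguous chromosome blocks.
pub fn check_sorted(records: &[(&str, u64)]) -> Result<(), SortError> {
    // Chromosomes whose block has already been closed.
    let mut finished: HashSet<&str> = HashSet::new();
    let mut last: Option<(&str, u64)> = None;

    for (i, &(chrom, start)) in records.iter().enumerate() {
        let line = i + 1;
        match last {
            // Same chromosome: start must not decrease.
            Some((prev_chrom, prev_start)) if prev_chrom == chrom => {
                if start < prev_start {
                    return Err(SortError::StartDecreased { line });
                }
            }
            // Chromosome changed: close the previous block, then make sure
            // the new chromosome was not seen earlier.
            Some((prev_chrom, _)) => {
                finished.insert(prev_chrom);
                if finished.contains(chrom) {
                    return Err(SortError::ChromNotContiguous { line });
                }
            }
            None => {}
        }
        last = Some((chrom, start));
    }
    Ok(())
}
```

Calling `check_sorted` on the valid example returns `Ok(())`, while the invalid example fails on its second record with `StartDecreased`.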
2. Chromosome Order Validation
With -g/--genome, GRIT validates that the chromosome order in the input matches the order in the genome file:
genome.txt:
chr1 248956422
chr2 242193529
chr3 198295559
Valid order: chr1 → chr2 → chr3
Invalid: chr2 → chr1 (wrong order)
Invalid: chr4 (not in genome file)
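One way to picture this check is to rank each chromosome by its row in the genome file and require that ranks never decrease. The sketch below assumes exactly that approach; `check_genome_order` and `OrderError` are hypothetical names, and GRIT's internal logic may differ.

```rust
use std::collections::HashMap;

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
pub enum OrderError {
    /// Chromosome is not listed in the genome file.
    UnknownChrom(String),
    /// `chrom` appeared after `after`, but precedes it in the genome file.
    OutOfOrder { chrom: String, after: String },
}

/// Verifies that the chromosomes in `seen` follow the order given by `genome`.
pub fn check_genome_order(genome: &[&str], seen: &[&str]) -> Result<(), OrderError> {
    // Rank each chromosome by its position in the genome file.
    let rank: HashMap<&str, usize> = genome
        .iter()
        .enumerate()
        .map(|(i, c)| (*c, i))
        .collect();

    let mut last: Option<(&str, usize)> = None;
    for &chrom in seen {
        let r = *rank
            .get(chrom)
            .ok_or_else(|| OrderError::UnknownChrom(chrom.to_string()))?;
        if let Some((prev, prev_rank)) = last {
            // A lower rank than the previous chromosome means the order is wrong.
            if r < prev_rank {
                return Err(OrderError::OutOfOrder {
                    chrom: chrom.to_string(),
                    after: prev.to_string(),
                });
            }
        }
        last = Some((chrom, r));
    }
    Ok(())
}
```

With the genome file above, the sequence chr2 → chr1 yields `OutOfOrder`, and chr4 yields `UnknownChrom`.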
3. Format Validation
GRIT validates BED format:
- At least 3 tab-separated fields
- Start and end are valid integers
- Start ≤ end (unless zero-length intervals allowed)
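The three format checks can be sketched as a small parser. This is an illustration only: `Bed3`, `FormatError`, and `parse_bed3` are hypothetical names, and the sketch always accepts start = end without modeling the zero-length option.

```rust
/// Hypothetical minimal record for this sketch (first three BED columns only).
#[derive(Debug, PartialEq)]
pub struct Bed3 {
    pub chrom: String,
    pub start: u64,
    pub end: u64,
}

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
pub enum FormatError {
    TooFewFields(usize),
    BadInteger { field: &'static str, value: String },
    StartAfterEnd { start: u64, end: u64 },
}

/// Parses one coordinate, reporting which field was malformed.
fn parse_coord(field: &'static str, value: &str) -> Result<u64, FormatError> {
    value.parse::<u64>().map_err(|_| FormatError::BadInteger {
        field,
        value: value.to_string(),
    })
}

/// Applies the three checks: field count, integer coordinates, start <= end.
pub fn parse_bed3(line: &str) -> Result<Bed3, FormatError> {
    let line = line.trim_end_matches(|c| c == '\r' || c == '\n');
    let fields: Vec<&str> = line.split('\t').collect();
    if fields.len() < 3 {
        return Err(FormatError::TooFewFields(fields.len()));
    }
    let start = parse_coord("start", fields[1])?;
    let end = parse_coord("end", fields[2])?;
    if start > end {
        return Err(FormatError::StartAfterEnd { start, end });
    }
    Ok(Bed3 {
        chrom: fields[0].to_string(),
        start,
        end,
    })
}
```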
Validation Flags
| Flag | Effect |
|---|---|
| (default) | Validate sort order, fail on error |
| --assume-sorted | Skip validation entirely |
| --allow-unsorted | Load and re-sort in memory |
| -g/--genome FILE | Validate chromosome order |
--assume-sorted
Skips all sort validation. Use when:
- You know files are pre-sorted
- Processing output from another GRIT command
- You need the fastest possible startup
# Piping from sort - safe to assume sorted
grit sort -i raw.bed | grit merge -i - --assume-sorted
Warning: Using --assume-sorted with unsorted input silently produces incorrect results.
--allow-unsorted
Loads the entire file into memory and sorts it before processing:
grit intersect -a unsorted.bed -b reference.bed --allow-unsorted
Trade-offs:
- Uses O(n) memory (entire file)
- Adds sorting overhead
- Enables processing of unsorted input
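The memory cost comes from holding every record at once before any output can be produced. A minimal sketch of the idea follows, with a hypothetical `sort_in_memory` function and records reduced to `(chrom, start, end)` tuples. It orders chromosomes lexicographically, which is an assumption for illustration and may not match GRIT's ordering.

```rust
/// Sorts fully loaded records by chromosome name, then start, then end.
/// Taking and returning a Vec makes the O(n) memory requirement explicit.
pub fn sort_in_memory(mut records: Vec<(String, u64, u64)>) -> Vec<(String, u64, u64)> {
    records.sort_by(|a, b| {
        a.0.cmp(&b.0)            // chromosome (lexicographic in this sketch)
            .then(a.1.cmp(&b.1)) // start position
            .then(a.2.cmp(&b.2)) // end position as a tie-breaker
    });
    records
}
```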
-g/--genome
Validates chromosome order against a genome file:
grit intersect -a A.bed -b B.bed -g hg38.genome
Genome file format:
chr1 248956422
chr2 242193529
chr3 198295559
...
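A genome file in this two-column format can be turned into a name-to-rank map. The sketch below is an assumption about how such a map could be built, not GRIT's parser: `parse_genome_order` is a hypothetical name, and columns are split on any whitespace.

```rust
use std::collections::HashMap;

/// Parses genome file text into a map from chromosome name to its row index.
/// Blank lines are skipped; the size column must be a valid integer.
pub fn parse_genome_order(text: &str) -> Result<HashMap<String, usize>, String> {
    let mut order: HashMap<String, usize> = HashMap::new();
    for (i, line) in text.lines().enumerate() {
        let line = line.trim();
        if line.is_empty() {
            continue;
        }
        let mut fields = line.split_whitespace();
        let (chrom, size) = match (fields.next(), fields.next()) {
            (Some(c), Some(s)) => (c, s),
            _ => return Err(format!("line {}: expected name and size", i + 1)),
        };
        size.parse::<u64>()
            .map_err(|_| format!("line {}: invalid size '{}'", i + 1, size))?;
        // The first occurrence of a name fixes its rank.
        let rank = order.len();
        order.entry(chrom.to_string()).or_insert(rank);
    }
    Ok(order)
}
```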
Error Messages
Unsorted Input
Error: File A is not sorted: position 50 at line 5 comes after 100 on chr1
Fix: Run 'grit sort -i file.bed > sorted.bed' first.
Or use '--allow-unsorted' to load and re-sort in memory (uses O(n) memory).
Wrong Chromosome Order
Error: Chromosome 'chr1' appears after 'chr2' in file A
This violates the expected genome order from hg38.genome.
Fix: Run 'grit sort -i file.bed -g hg38.genome > sorted.bed' first.
Invalid BED Format
Error: Invalid BED format at line 7: expected integer for start position, got 'abc'
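Messages like these are typically produced by formatting a structured error value. The sketch below shows one way to do that with `std::fmt::Display`. `DemoError` and its fields are hypothetical, chosen only to match the message text above. The leading "Error: " prefix and the "Fix:" hints are assumed to be added elsewhere.

```rust
use std::fmt;

/// Hypothetical error type whose fields mirror the values shown in the messages.
#[derive(Debug)]
pub enum DemoError {
    Unsorted {
        file: String,
        position: u64,
        after: u64,
        chrom: String,
        line: usize,
    },
    BadStart {
        line: usize,
        got: String,
    },
}

impl fmt::Display for DemoError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            DemoError::Unsorted { file, position, after, chrom, line } => write!(
                f,
                "File {} is not sorted: position {} at line {} comes after {} on {}",
                file, position, line, after, chrom
            ),
            DemoError::BadStart { line, got } => write!(
                f,
                "Invalid BED format at line {}: expected integer for start position, got '{}'",
                line, got
            ),
        }
    }
}

// Lets the type be used with `?` and `Box<dyn std::error::Error>`.
impl std::error::Error for DemoError {}
```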
Implementation Details
Sort Validator
use std::collections::HashMap;

/// Remembers the previous record so each new record is checked in O(1).
pub struct SortValidator {
    last_chrom: Option<String>,
    last_start: u64,
    line_number: usize,
    genome_order: Option<HashMap<String, usize>>,
}

impl SortValidator {
    pub fn validate(&mut self, record: &BedRecord) -> Result<()> {
        self.line_number += 1;

        // Check position within chromosome
        if let Some(ref last_chrom) = self.last_chrom {
            if record.chrom == *last_chrom {
                if record.start < self.last_start {
                    return Err(GritError::UnsortedInput {
                        position: record.start,
                        after: self.last_start,
                        chrom: record.chrom.clone(),
                        line: self.line_number,
                    });
                }
            }
        }

        // Check genome order if specified
        if let Some(ref genome) = self.genome_order {
            // ... genome order validation
        }

        // Remember this record for the next comparison
        self.last_chrom = Some(record.chrom.clone());
        self.last_start = record.start;
        Ok(())
    }
}
stdin Handling
When reading from stdin, GRIT buffers input for validation:
use std::io::BufRead;

fn validate_stdin(reader: impl BufRead) -> Result<Vec<BedRecord>> {
    // Every record is kept, so memory grows with the input size.
    let mut records = Vec::new();
    let mut validator = SortValidator::new();

    for line in reader.lines() {
        let record = parse_bed_line(&line?)?;
        validator.validate(&record)?;
        records.push(record);
    }

    Ok(records)
}
Memory impact: stdin validation uses O(n) memory. Use --assume-sorted to avoid buffering:
cat sorted.bed | grit merge -i - --assume-sorted
Validation by Command
| Command | Default | --assume-sorted | --allow-unsorted | -g/--genome |
|---|---|---|---|---|
| intersect | Validates | Skips | Sorts in memory | Validates order |
| subtract | Validates | Skips | Sorts in memory | Validates order |
| closest | Validates | Skips | Sorts in memory | Validates order |
| merge | Validates | Skips | --in-memory | Validates order |
| window | Validates | Skips | N/A | Validates order |
| coverage | Validates | Skips | N/A | Validates order |
| sort | N/A | N/A | N/A | Orders by genome |
| slop | N/A | N/A | N/A | N/A |
| complement | Validates | Skips | N/A | N/A |
Best Practices
Pipeline Processing
# Sort once, process multiple times
grit sort -i raw.bed > sorted.bed
grit merge -i sorted.bed --assume-sorted > merged.bed
grit intersect -a merged.bed -b features.bed --assume-sorted > result.bed
Mixed Pipelines
# GRIT sort output is guaranteed sorted
grit sort -i raw.bed | grit merge -i - --assume-sorted | grit intersect -a - -b ref.bed --assume-sorted
Genome-Specific Ordering
# Ensure consistent chromosome order across files
grit sort -i file1.bed -g hg38.genome > file1_sorted.bed
grit sort -i file2.bed -g hg38.genome > file2_sorted.bed
grit intersect -a file1_sorted.bed -b file2_sorted.bed -g hg38.genome --assume-sorted
Comparison with bedtools
| Scenario | bedtools | GRIT |
|---|---|---|
| Unsorted input | Silent wrong results | Error with fix suggestion |
| Wrong chrom order | Depends on command | Error with -g |
| Zero-length intervals | Treats as 1bp | Strict (or --bedtools-compatible) |