grit generate
Generate synthetic BED datasets for benchmarking and testing.
Usage
grit generate [OPTIONS]
Options
| Option | Description |
|--------|-------------|
| -o, --output <DIR> | Output directory (default: ./grit_bench_data) |
| --sizes <SIZES> | Comma-separated sizes to generate (default: 1M,5M,10M,25M,50M) |
| --seed <INT> | Random seed for reproducibility (default: 42) |
| --mode <MODE> | Generation mode: balanced, clustered, identical, skewed-a-gt-b, skewed-b-gt-a, all |
| --sorted <yes\|no\|auto> | Sorting behavior (default: auto) |
| --no-sort | Alias for --sorted no |
| --a <SIZE> | Custom size for the A file |
| --b <SIZE> | Custom size for the B file |
| --hotspot-frac <FLOAT> | Fraction of the genome covered by hotspots (default: 0.05) |
| --hotspot-weight <FLOAT> | Fraction of intervals placed in hotspots (default: 0.80) |
| --len-min <INT> | Minimum interval length (default: 50) |
| --len-max <INT> | Maximum interval length (default: 1000) |
| --force | Overwrite existing files |
Generation Modes
| Mode | Description |
|------|-------------|
| balanced | Equal-sized A and B files with uniform distribution |
| skewed-a-gt-b | A file 10x larger than B |
| skewed-b-gt-a | B file 10x larger than A |
| identical | A and B are identical |
| clustered | Intervals concentrated in hotspot regions |
| all | Generate all modes |
Size Specifications
| Format | Example | Value |
|--------|---------|-------|
| Number | 1000 | 1,000 intervals |
| K suffix | 10K | 10,000 intervals |
| M suffix | 5M | 5,000,000 intervals |
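The size formats in the table can be read as a plain count with an optional multiplier suffix. The following is a minimal Python sketch of that reading, not grit's actual parser; the function names `parse_size` and `parse_sizes` are hypothetical, and accepting lowercase suffixes is an assumption.

```python
def parse_size(spec: str) -> int:
    """Convert a size string such as '1000', '10K', or '5M' to an interval count."""
    multipliers = {"K": 1_000, "M": 1_000_000}
    text = spec.strip().upper()  # assumption: lowercase suffixes are also accepted
    if not text:
        raise ValueError("empty size specification")
    suffix = text[-1]
    if suffix in multipliers:
        # int() raises ValueError if the numeric part is missing or malformed
        return int(text[:-1]) * multipliers[suffix]
    return int(text)


def parse_sizes(arg: str) -> list[int]:
    """Split a comma-separated list like the value passed to --sizes."""
    return [parse_size(part) for part in arg.split(",")]


print(parse_sizes("1M,5M,10M"))
```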
Examples
Quick benchmark data
# Generate 100K intervals for quick testing
grit generate --sizes 100K --mode balanced --force
Large-scale testing
# Generate multiple sizes for comprehensive benchmarking
grit generate --sizes 10M,50M,100M --mode all --seed 123
Custom A/B sizes
# Generate asymmetric datasets
grit generate --a 1M --b 10M --mode balanced
Clustered data
# Simulate real-world hotspots (e.g., promoter regions)
grit generate --mode clustered --hotspot-frac 0.1 --hotspot-weight 0.9
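To make the two hotspot parameters concrete, here is a hedged Python sketch of one way such a distribution could be sampled. It assumes a toy linear genome with a single hotspot at its start; how grit actually places hotspots is not documented here, and `sample_starts` is a hypothetical name.

```python
import random


def sample_starts(n, genome_len, hotspot_frac, hotspot_weight, seed=42):
    """Draw n start positions, sending about hotspot_weight of them into a hotspot.

    Assumption: the hotspot is the first hotspot_frac of a linear genome.
    """
    hotspot_len = int(genome_len * hotspot_frac)
    if not 0 < hotspot_len < genome_len:
        raise ValueError("hotspot must cover part of the genome, not none or all")
    rng = random.Random(seed)
    starts = []
    for _ in range(n):
        if rng.random() < hotspot_weight:
            starts.append(rng.randrange(hotspot_len))              # inside the hotspot
        else:
            starts.append(rng.randrange(hotspot_len, genome_len))  # background
    return starts


starts = sample_starts(10_000, 1_000_000, hotspot_frac=0.1, hotspot_weight=0.9)
in_hotspot = sum(s < 100_000 for s in starts) / len(starts)
print(f"fraction of starts inside the hotspot: {in_hotspot:.2f}")
```

With these values, roughly 90% of the intervals land in 10% of the genome, which is the kind of density contrast the clustered mode is meant to exercise.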
Unsorted output
# Generate unsorted data for testing sort validation
grit generate --sizes 1M --no-sort
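When testing sort validation, it can help to confirm independently that a file really is unsorted. The sketch below is a small Python check, assuming "sorted" means ordered by chromosome name (byte-wise) and then by numeric start, as `sort -k1,1 -k2,2n` would produce; grit's own definition may differ, and `is_sorted_bed` is a hypothetical helper.

```python
from typing import Iterable


def is_sorted_bed(lines: Iterable[str]) -> bool:
    """Return True if BED lines are ordered by (chromosome, start)."""
    prev = None
    for line in lines:
        # Skip blank lines and common BED header lines.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, *_ = line.rstrip("\n").split("\t")
        key = (chrom, int(start))
        if prev is not None and key < prev:
            return False
        prev = key
    return True
```

A file object can be passed directly, for example `is_sorted_bed(open(path))`, since iterating over it yields lines.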
Output Structure
grit_bench_data/
├── balanced/
│   ├── 1M/
│   │   ├── A.bed
│   │   └── B.bed
│   ├── 5M/
│   └── 10M/
├── clustered/
│   └── ...
├── identical/
│   └── ...
├── skewed-a-gt-b/
│   └── ...
└── skewed-b-gt-a/
    └── ...
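A benchmark driver usually needs to walk this layout and pick up each A/B pair. The following Python sketch shows one way to do that, assuming the `<mode>/<size>/A.bed` and `B.bed` layout shown above; `find_datasets` is a hypothetical helper, and the root is whatever directory was passed to --output.

```python
from pathlib import Path


def find_datasets(root):
    """Yield (mode, size, a_path, b_path) for every <mode>/<size>/ holding both files."""
    for a_path in sorted(Path(root).glob("*/*/A.bed")):
        b_path = a_path.with_name("B.bed")
        if b_path.is_file():
            size_dir = a_path.parent
            yield size_dir.parent.name, size_dir.name, a_path, b_path


# Default output directory from the options table; adjust if --output was used.
for mode, size, a_path, b_path in find_datasets("grit_bench_data"):
    print(mode, size, a_path, b_path)
```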
Notes
- All generated files use hg38 (human genome) chromosome sizes
- Running with the same --seed value reproduces the same data
- With --sorted auto, output is sorted when the size is at most 10M intervals
- Generated data is suitable for benchmarking intersect, coverage, merge, and other commands
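The notes on seeding, length bounds, and auto sorting can be illustrated with a short Python sketch. This is not grit's generator: the function names are hypothetical, only two hg38 chromosomes are included as a toy subset, and weighting chromosomes by length is an assumption.

```python
import random

# Toy subset of hg38 chromosome sizes (illustrative, not the full genome).
CHROM_SIZES = {"chr1": 248_956_422, "chr2": 242_193_529}

AUTO_SORT_LIMIT = 10_000_000  # the "at most 10M" threshold from the notes


def generate_intervals(n, chrom_sizes, len_min=50, len_max=1000, seed=42):
    """Return n (chrom, start, end) tuples; the same seed gives the same list."""
    rng = random.Random(seed)
    chroms = list(chrom_sizes)
    weights = [chrom_sizes[c] for c in chroms]  # assumption: weight by length
    intervals = []
    for _ in range(n):
        chrom = rng.choices(chroms, weights=weights)[0]
        length = rng.randint(len_min, len_max)  # inclusive on both ends
        start = rng.randrange(chrom_sizes[chrom] - length + 1)
        intervals.append((chrom, start, start + length))
    return intervals


def should_sort(sorted_mode: str, size: int) -> bool:
    """Mirror the yes / no / auto choices of --sorted."""
    if sorted_mode == "auto":
        return size <= AUTO_SORT_LIMIT
    return sorted_mode == "yes"


intervals = generate_intervals(5, CHROM_SIZES, seed=123)
if should_sort("auto", len(intervals)):
    intervals.sort()
for chrom, start, end in intervals:
    print(f"{chrom}\t{start}\t{end}")
```

Using a private `random.Random(seed)` instance rather than the module-level functions keeps the output independent of any other random calls in the same process.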