generate
Description
Generates synthetic BED datasets for benchmarking. Output is reproducible for a given seed, and several distribution patterns are available (see Generation Modes).
Command
grit generate --output ./benchmark_data --sizes 1M,5M,10M --seed 42
Output
Creates one directory per generation mode and size, each containing an A.bed and a B.bed file:
benchmark_data/
├── balanced/
│   ├── 1M/
│   │   ├── A.bed
│   │   └── B.bed
│   ├── 5M/
│   └── 10M/
├── clustered/
│   └── ...
├── identical/
│   └── ...
├── skewed-a-gt-b/
│   └── ...
└── skewed-b-gt-a/
    └── ...
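The layout above can be enumerated programmatically, which is handy when a benchmark harness needs to check that every expected file exists. The following is a minimal sketch, not part of grit: the helper name `expected_files` is hypothetical, and it assumes each size directory is named after the token passed to `--sizes` (as the tree suggests for 1M, 5M, and 10M). Directory naming for custom `--a`/`--b` sizes is not documented here and is not covered.

```python
from pathlib import PurePosixPath

# Mode directory names as they appear in the tree above.
MODES = ["balanced", "clustered", "identical", "skewed-a-gt-b", "skewed-b-gt-a"]


def expected_files(output_dir, sizes, modes=MODES):
    """List the A.bed/B.bed paths implied by the <output>/<mode>/<size>/ layout.

    Hypothetical helper: assumes size directories are named after the
    size tokens given to --sizes.
    """
    root = PurePosixPath(output_dir)
    return [
        str(root / mode / size / name)
        for mode in modes
        for size in sizes
        for name in ("A.bed", "B.bed")
    ]


paths = expected_files("benchmark_data", ["1M", "5M", "10M"])
print(len(paths))   # 5 modes x 3 sizes x 2 files = 30
print(paths[0])     # benchmark_data/balanced/1M/A.bed
```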
Options
| Flag | Description |
| -o, --output | Output directory (default: ./grit_bench_data) |
| --sizes | Sizes to generate (default: 1M,5M,10M,25M,50M) |
| --seed | Random seed for reproducibility (default: 42) |
| --mode | Generation mode (default: all) |
| --sorted | Sorting behavior: yes, no, auto (default: auto) |
| --no-sort | Alias for --sorted no |
| --a | Custom A file size |
| --b | Custom B file size |
| --hotspot-frac | Genome fraction for hotspots (default: 0.05) |
| --hotspot-weight | Interval fraction in hotspots (default: 0.80) |
| --len-min | Minimum interval length (default: 50) |
| --len-max | Maximum interval length (default: 1000) |
| --force | Overwrite existing files |
Generation Modes
| Mode | Description |
| balanced | Equal-sized A and B files with uniform distribution |
| skewed-a-gt-b | A file 10x larger than B |
| skewed-b-gt-a | B file 10x larger than A |
| identical | A and B are identical |
| clustered | Intervals concentrated in hotspot regions |
| all | Generate all modes |
Size Specifications
| Format | Example | Value |
| Number | 1000 | 1,000 intervals |
| K suffix | 10K | 10,000 intervals |
| M suffix | 5M | 5,000,000 intervals |
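The table above maps directly onto a small parser. This is an illustrative sketch only, not grit's own parsing code: the function names `parse_size` and `parse_sizes` are hypothetical, and the sketch accepts only the uppercase K and M suffixes listed in the table, since handling of lowercase or other suffixes is not documented here.

```python
def parse_size(spec):
    """Convert a size spec such as '1000', '10K', or '5M' to an interval count.

    Hypothetical helper mirroring the table above; only uppercase K and M
    suffixes are recognised.
    """
    multipliers = {"K": 1_000, "M": 1_000_000}
    spec = spec.strip()
    if not spec:
        raise ValueError("empty size specification")
    suffix = spec[-1]
    if suffix in multipliers:
        return int(spec[:-1]) * multipliers[suffix]
    return int(spec)  # plain number; raises ValueError on anything else


def parse_sizes(arg):
    """Split a comma-separated --sizes value into interval counts."""
    return [parse_size(part) for part in arg.split(",")]


print(parse_size("1000"))        # 1000
print(parse_size("10K"))         # 10000
print(parse_sizes("1M,5M,10M"))  # [1000000, 5000000, 10000000]
```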
Examples
Quick Benchmark Data
grit generate --sizes 100K --mode balanced --force
Large-Scale Testing
grit generate --sizes 10M,50M,100M --mode all --seed 123
Custom A/B Sizes
grit generate --a 1M --b 10M --mode balanced
Clustered Data
grit generate --mode clustered --hotspot-frac 0.1 --hotspot-weight 0.9
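To build intuition for how `--hotspot-frac` and `--hotspot-weight` interact, the toy model below may help. It is a sketch under strong assumptions and does not describe grit's actual algorithm: it uses a single contig instead of the hg38 chromosomes, places one hotspot at the start of that contig, and draws lengths uniformly between the `--len-min` and `--len-max` defaults. The function name `sample_clustered` and all of its parameters are hypothetical.

```python
import random


def sample_clustered(n, genome_len, hotspot_frac=0.05, hotspot_weight=0.80,
                     len_min=50, len_max=1000, seed=42):
    """Toy model of clustered interval generation (not grit's algorithm).

    Assumptions: one contig of length genome_len, with a single hotspot
    covering the first hotspot_frac of it. Each interval start falls in
    the hotspot with probability hotspot_weight.
    """
    rng = random.Random(seed)
    hotspot_end = int(genome_len * hotspot_frac)
    intervals = []
    for _ in range(n):
        length = rng.randint(len_min, len_max)  # inclusive bounds
        if rng.random() < hotspot_weight:
            start = rng.randrange(0, hotspot_end)
        else:
            start = rng.randrange(hotspot_end, genome_len - len_max)
        intervals.append((start, start + length))
    return intervals, hotspot_end


# Same parameter values as the command above: 10% of the contig holds
# roughly 90% of the interval starts.
intervals, hotspot_end = sample_clustered(
    10_000, 1_000_000, hotspot_frac=0.1, hotspot_weight=0.9
)
in_hotspot = sum(start < hotspot_end for start, _ in intervals) / len(intervals)
print(f"fraction of starts in hotspot: {in_hotspot:.3f}")
```

With these settings the printed fraction should be close to 0.9, while the hotspot covers only a tenth of the contig. Raising `hotspot_weight` or lowering `hotspot_frac` makes the clustering more extreme.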
Unsorted Output
grit generate --sizes 1M --no-sort
Notes
- All generated files use human genome chromosome sizes (hg38)
- Using the same seed reproduces the same data across runs
- With --sorted auto, output is sorted when the size is 10M intervals or fewer
- Generated data is suitable for benchmarking intersect, coverage, etc.
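One way to check the reproducibility note in practice is to generate data twice with the same seed into two output directories and compare file checksums. The sketch below is independent of grit and uses only the Python standard library. The helper names `file_digest` and `differing_files` are hypothetical, and whether output is byte-identical across grit versions or platforms is not stated here, so treat a mismatch as something to investigate, not as proof of a bug.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def differing_files(dir_a, dir_b):
    """Return relative paths of *.bed files in dir_a that are missing from
    dir_b or whose contents differ. An empty list means the runs match."""
    dir_a, dir_b = Path(dir_a), Path(dir_b)
    mismatches = []
    for path in sorted(dir_a.rglob("*.bed")):
        rel = path.relative_to(dir_a)
        other = dir_b / rel
        if not other.is_file() or file_digest(path) != file_digest(other):
            mismatches.append(rel.as_posix())
    return mismatches
```

For example, after two runs with `--seed 42` written to directories named `run1` and `run2` (placeholder names), `differing_files("run1", "run2")` should return an empty list if generation is deterministic.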