Our work sits at the intersection of information theory, deep learning, and practical systems engineering. Every claim is backed by reproducible benchmarks on standardized corpora.
All lossless compression is bounded by the entropy of the source. The question isn't if there's a limit โ it's how close you can get with practical, streaming-capable implementations.
Claude Shannon estimated in 1951 that English text has ~1.0 bit per character of fundamental entropy. That means the theoretical maximum lossless compression is approximately 87.5% (1.0 bpc from 8.0 bpc).
Our ACP v4 nano achieves 87.97% on enwik8 โ within striking distance of the theoretical limit.
First-order entropy of enwik8: 4.6โ5.1 bpb. Third-order: 2.1โ2.3 bpb. At 95% savings you'd need 0.4 bpb โ impossible for natural language. Our neural models exploit long-range dependencies that statistical coders miss.
95% lossless on mixed data is physically impossible per Shannon's theorem.
We tested 10+ pipeline configurations across 8 content types. The key finding: no single compressor is best for all data. Content-aware routing is the answer.
| Content Type | Best Pipeline | Savings | vs. zstd-22 | Why |
|---|---|---|---|---|
| Text (enwik8) | LZMA-9 | 72.9% | +0.7 pp | Large dictionary captures long-range text patterns |
| Code (Python) | LZMA-9 | 83.0% | +0.2 pp | Structured, repetitive syntax benefits from LZMA |
| Speech audio (PCM) | bzip2-9 | 35.1% | +7.8 pp | BWT suits audio's sample patterns; 7ร faster too |
| Music audio (PCM) | bzip2-9 | 49.8% | +9.5 pp | Higher ratio AND faster โ pure win on audio PCM |
| Raw image | delta8 + LZMA-9 | 41.3% | +13.7 pp | Delta-8 decorrelates sequential pixels before entropy coding |
| Raw video | delta8 + LZMA-9 | 71.5% | +15.8 pp | Frame-to-frame deltas + spatial decorrelation |
| Already compressed | zstd-1 passthrough | ~0% | โ | Don't waste cycles on incompressible data |
| Aggregate | Adaptive router | 63.3% | +6.0 pp | Content-type detection <1 ยตs |
All pipelines 100% lossless, SHA-256 verified. Detection uses magic bytes, Shannon entropy, byte histograms, and PCM heuristics. Results from 5 parallel benchmark agents across 10 MB standardized corpora per type.
Burrows-Wheeler Transform helps only at โค1 MB blocks with simple entropy coders. At realistic sizes, zstd/LZMA's LZ77 with large windows captures the same redundancy natively. BWT encode speed: ~0.8 MB/s โ unacceptable.
Both higher-ratio AND faster than zstd-22 on PCM data. 24 MB/s vs 3 MB/s โ a pure win on every axis. The BWT in bzip2 suits audio's repetitive sample patterns at the block sizes bzip2 uses (900 KB).
Subtracting each byte from the previous one decorrelates sequential pixel/frame values. +13.7 pp on images, +15.8 pp on video. But NEVER apply to text โ it hurts by โ4.3 pp.
Byte shuffling groups corresponding bytes from consecutive float32 values. Only useful for actual float arrays (ML weights). Produces โ9 pp on image data. Detection must be precise.
Batched CDF computation uses different cuBLAS kernel paths for different sequence lengths โ ยฑ1 integer CDF โ decode failure. Must use sequential _get_cdf_single path for lossless roundtrip.
0.05 smoothing prevents overconfident wrong predictions, which are fatal for range coding. One wrong high-confidence CDF value can corrupt the entire remaining bitstream.
Comparison with leading compression systems on enwik8 (100 MB).
| System | Savings (enwik8) | Speed | RAM | Practical? |
|---|---|---|---|---|
| cmix v19 | ~90% | Hours | 32 GB | No โ research only |
| paq8px | ~87% | Minutes | 1 GB | Impractical for streaming |
| AhanaAI ACP v4 | ~88% | Seconds (GPU) | 2โ4 GB VRAM | Yes โ streaming-native |
| zstd-22 | ~75% | Milliseconds | 128 MB | Yes โ industry standard |
| gzip-9 | ~64% | Milliseconds | 32 MB | Yes โ universal |
ACP v4 is the first system to achieve cmix-class compression ratios at practical, streaming speeds. GPU acceleration makes it viable for production workloads where cmix and paq are not.
We publish benchmark results, ablation studies, and technical deep-dives as we develop ACP. Join early access to get research updates.
Get Research Updates โ