AhanaAI Research

Foundations

Shannon's source coding theorem — and beyond.

All lossless compression is bounded by the entropy of the source. The question isn't if there's a limit — it's how close you can get with practical, streaming-capable implementations.

📐

Shannon entropy of English text

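Claude Shannon estimated in 1951 that printed English carries roughly 0.6 to 1.3 bits of entropy per character; ~1.0 bit per character (bpc) is the usual working figure. Against an 8-bit encoding, 1.0 bpc corresponds to savings of 87.5% (1 − 1.0/8.0), which serves as an approximate ceiling for lossless compression of English text.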

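Our ACP v4 nano achieves 87.97% on enwik8 (about 0.96 bpc), in the same range as Shannon's estimate. That figure is an estimate for English prose rather than a hard bound for enwik8, which also contains XML markup.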
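The percentages above follow from a simple conversion between bits per character and savings relative to an 8-bit encoding. A minimal sketch of that arithmetic (the function names are ours, chosen for illustration):

```python
def savings_from_bpc(bpc: float, raw_bits: float = 8.0) -> float:
    """Fraction of space saved when each raw_bits-bit symbol costs bpc bits."""
    return 1.0 - bpc / raw_bits


def bpc_from_savings(savings: float, raw_bits: float = 8.0) -> float:
    """Bits per symbol implied by a given savings fraction."""
    return raw_bits * (1.0 - savings)


print(f"{savings_from_bpc(1.0):.3%}")     # 87.500%
print(f"{bpc_from_savings(0.8797):.3f}")  # 0.962
print(f"{bpc_from_savings(0.95):.2f}")    # 0.40
```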

📊

Entropy at different orders

Measured in bits per byte (bpb), the first-order entropy of enwik8 is 4.6–5.1 and the third-order entropy is 2.1–2.3. Savings of 95% would require 0.4 bpb (8 × 0.05), which is impossible for natural language. Our neural models exploit long-range dependencies that low-order statistical coders miss.
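Empirical entropies of this kind can be estimated directly from byte counts. The sketch below is illustrative, not our measurement code: `k` is the number of preceding bytes used as context, and how that maps onto the "first-order" and "third-order" labels above is an assumption, since naming conventions differ. On short inputs, high-`k` estimates come out too low because most contexts are seen only once.

```python
import math
from collections import Counter, defaultdict


def order0_entropy(data: bytes) -> float:
    """Empirical entropy in bits per byte, ignoring context."""
    if not data:
        return 0.0
    n = len(data)
    return -sum(c / n * math.log2(c / n) for c in Counter(data).values())


def conditional_entropy(data: bytes, k: int) -> float:
    """Empirical entropy of each byte given the k preceding bytes, in bits per byte."""
    total = len(data) - k
    if total <= 0:
        return 0.0
    contexts = defaultdict(Counter)
    for i in range(k, len(data)):
        contexts[data[i - k:i]][data[i]] += 1
    bits = 0.0
    for counts in contexts.values():
        ctx_total = sum(counts.values())
        for c in counts.values():
            bits -= c * math.log2(c / ctx_total)
    return bits / total


print(order0_entropy(b"abab"))              # 1.0
print(conditional_entropy(b"abababab", 1))  # 0 bits: each byte is fully determined by the previous one
```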

Lossless savings of 95% on mixed data would mean coding below the source entropy, which Shannon's theorem rules out.

March 2026 Benchmark Sprint

Adaptive routing: +6.0 pp aggregate.

We tested 10+ pipeline configurations across 8 content types. The key finding: no single compressor is best for all data. Content-aware routing is the answer.

Content Type	Best Pipeline	Savings	vs. zstd-22	Why
Text (enwik8)	LZMA-9	72.9%	+0.7 pp	Large dictionary captures long-range text patterns
Code (Python)	LZMA-9	83.0%	+0.2 pp	Structured, repetitive syntax benefits from LZMA
Speech audio (PCM)	bzip2-9	35.1%	+7.8 pp	BWT suits audio's sample patterns; 7× faster too
Music audio (PCM)	bzip2-9	49.8%	+9.5 pp	Higher ratio AND faster — pure win on audio PCM
Raw image	delta8 + LZMA-9	41.3%	+13.7 pp	Delta-8 decorrelates sequential pixels before entropy coding
Raw video	delta8 + LZMA-9	71.5%	+15.8 pp	Frame-to-frame deltas + spatial decorrelation
Already compressed	zstd-1 passthrough	~0%	—	Don't waste cycles on incompressible data
Aggregate	Adaptive router	63.3%	+6.0 pp	Content-type detection <1 µs

All pipelines 100% lossless, SHA-256 verified. Detection uses magic bytes, Shannon entropy, byte histograms, and PCM heuristics. Results from 5 parallel benchmark agents across 10 MB standardized corpora per type.
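A simplified sketch of how magic bytes and Shannon entropy can drive routing is shown here. It is not our production detector: the signature list is a small illustrative subset, the `7.9` and `0.95` thresholds and the `text`/`compressed`/`unknown` labels are hypothetical, and byte-histogram and PCM heuristics are omitted. Only the route names come from the table above. Pure Python will also be far slower than the detection time quoted there.

```python
import math
from collections import Counter

# Illustrative subset of signatures for already-compressed containers.
COMPRESSED_MAGIC = (
    b"\x1f\x8b",          # gzip
    b"PK\x03\x04",        # zip
    b"\x28\xb5\x2f\xfd",  # zstd frame
    b"\x89PNG",           # PNG
    b"\xff\xd8\xff",      # JPEG
)

# Route names from the benchmark table; the keys are hypothetical labels.
ROUTES = {"text": "LZMA-9", "compressed": "zstd-1 passthrough"}


def byte_entropy(sample: bytes) -> float:
    """Shannon entropy of the byte histogram, in bits per byte."""
    n = len(sample)
    if n == 0:
        return 0.0
    return -sum(c / n * math.log2(c / n) for c in Counter(sample).values())


def classify(sample: bytes) -> str:
    if sample.startswith(COMPRESSED_MAGIC):
        return "compressed"
    if not sample:
        return "unknown"
    if byte_entropy(sample) > 7.9:  # hypothetical threshold: near-uniform bytes
        return "compressed"
    printable = sum(32 <= b < 127 or b in (9, 10, 13) for b in sample)
    if printable / len(sample) > 0.95:  # hypothetical threshold
        return "text"
    return "unknown"  # audio, image, and video checks would go here


def route(sample: bytes):
    """Return a pipeline name, or None when further heuristics are needed."""
    return ROUTES.get(classify(sample))
```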

Key Findings

What we've learned.

❌

BWT is not the answer

Burrows-Wheeler Transform helps only at ≤1 MB blocks with simple entropy coders. At realistic sizes, zstd/LZMA's LZ77 with large windows captures the same redundancy natively. BWT encode speed: ~0.8 MB/s — unacceptable.

✅

bzip2 wins on audio

On PCM data, bzip2-9 delivers both a higher ratio and higher throughput than zstd-22: 24 MB/s vs 3 MB/s, a win on every axis. The BWT in bzip2 suits audio's repetitive sample patterns at the block sizes bzip2 uses (900 KB).
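Ratio and throughput comparisons like this one can be reproduced with a small harness. The sketch below uses only Python's standard library, so it covers bzip2 and LZMA but not zstd, which needs a third-party package. The `measure` and `compare` helpers are hypothetical names, and this is not the harness behind the numbers above.

```python
import bz2
import hashlib
import lzma
import time

CODECS = {
    "bzip2-9": (lambda d: bz2.compress(d, compresslevel=9), bz2.decompress),
    "LZMA-9": (lambda d: lzma.compress(d, preset=9), lzma.decompress),
}


def measure(name, compress, decompress, data: bytes) -> dict:
    """Compress non-empty data once; report savings, encode speed, and a lossless check."""
    start = time.perf_counter()
    packed = compress(data)
    elapsed = time.perf_counter() - start
    # Lossless check: SHA-256 of the roundtrip must match SHA-256 of the input.
    restored = hashlib.sha256(decompress(packed)).digest()
    return {
        "name": name,
        "savings": 1.0 - len(packed) / len(data),
        "mb_per_s": len(data) / 1e6 / elapsed if elapsed > 0 else float("inf"),
        "lossless": restored == hashlib.sha256(data).digest(),
    }


def compare(data: bytes) -> list:
    return [measure(name, c, d, data) for name, (c, d) in CODECS.items()]
```

Calling `compare(open(path, "rb").read())` on a PCM file returns one result dictionary per codec. Single-run timings are noisy, so repeated runs are advisable before drawing conclusions about speed.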

✅

Delta-8 for visual media

Subtracting the previous byte from each byte decorrelates sequential pixel/frame values. +13.7 pp on images, +15.8 pp on video. But NEVER apply it to text: there it costs 4.3 pp.
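A byte-wise delta filter of this kind can be sketched in a few lines. The version below is an illustrative reference, assuming a stride of one byte and modulo-256 wraparound, and is not tuned for speed.

```python
def delta8_encode(data: bytes) -> bytes:
    """Replace each byte with its difference from the previous byte (mod 256)."""
    out = bytearray(len(data))
    prev = 0
    for i, b in enumerate(data):
        out[i] = (b - prev) & 0xFF
        prev = b
    return bytes(out)


def delta8_decode(data: bytes) -> bytes:
    """Invert delta8_encode with a running sum (mod 256)."""
    out = bytearray(len(data))
    prev = 0
    for i, d in enumerate(data):
        prev = (prev + d) & 0xFF
        out[i] = prev
    return bytes(out)


ramp = bytes([10, 11, 12, 13])
print(list(delta8_encode(ramp)))  # [10, 1, 1, 1]
```

A smooth gradient turns into a run of small, repeated values, which an entropy coder handles well. Text has no such neighbor-to-neighbor smoothness, so the filter only scrambles its patterns.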

⚠️

Float32 shuffle is type-gated

Byte shuffling groups corresponding bytes from consecutive float32 values. It is only useful for actual float arrays (ML weights); on image data it costs 9 pp. Detection must be precise.
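To make the transform concrete, here is a minimal sketch of a 4-byte shuffle and its inverse. The names `shuffle4` and `unshuffle4` are ours, and the example assumes little-endian float32 values.

```python
import struct


def shuffle4(data: bytes) -> bytes:
    """Group byte 0 of every 4-byte value, then byte 1, byte 2, byte 3."""
    if len(data) % 4:
        raise ValueError("length must be a multiple of 4")
    return b"".join(data[k::4] for k in range(4))


def unshuffle4(data: bytes) -> bytes:
    """Invert shuffle4 by interleaving the four byte planes."""
    if len(data) % 4:
        raise ValueError("length must be a multiple of 4")
    n = len(data) // 4
    out = bytearray(len(data))
    for k in range(4):
        out[k::4] = data[k * n:(k + 1) * n]
    return bytes(out)


weights = struct.pack("<3f", 1.0, 1.5, 2.0)
print(weights.hex())            # 0000803f0000c03f00000040
print(shuffle4(weights).hex())  # 00000000000080c0003f3f40
```

After shuffling, the low mantissa bytes sit together and the exponent bytes sit together, so similar values end up adjacent. Image bytes have no such 4-byte structure, so the same transform breaks up their locality.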

🔒

FP determinism is critical

Batched CDF computation takes different cuBLAS kernel paths for different sequence lengths, so the encoder and decoder can disagree by ±1 in an integer CDF entry, and decoding fails. A lossless roundtrip requires the sequential _get_cdf_single path.
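The sensitivity is easy to show with a toy quantizer. The sketch below is a hypothetical `quantize_cdf`, not the ACP code path and not `_get_cdf_single`. It maps float probabilities to an integer CDF, and a drift of 1e-7 in one probability moves a boundary by one count, which is enough for the decoder to pick the wrong interval.

```python
def quantize_cdf(probs, precision_bits=16):
    """Map float probabilities to a strictly increasing integer CDF ending at 2**precision_bits."""
    total = 1 << precision_bits
    budget = total - len(probs)  # reserve one count per symbol: no zero-width intervals
    cdf = [0]
    acc = 0.0
    for i, p in enumerate(probs, start=1):
        acc += p
        cdf.append(min(int(acc * budget), budget) + i)
    cdf[-1] = total
    return cdf


encoder_probs = [0.5, 0.25, 0.125, 0.125]
decoder_probs = [0.5 - 1e-7, 0.25 + 1e-7, 0.125, 0.125]  # same model, tiny FP drift

enc = quantize_cdf(encoder_probs)
dec = quantize_cdf(decoder_probs)
print(enc[1], dec[1])  # 32767 32766
```

Because the range coder narrows its interval using these integers, both sides must produce bit-identical tables for every symbol.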

🎚️

Label smoothing prevents catastrophe

0.05 smoothing prevents overconfident wrong predictions, which are fatal for range coding. One wrong high-confidence CDF value can corrupt the entire remaining bitstream.
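One way to see the effect is through code length: a symbol with predicted probability p costs −log2(p) bits. The sketch below shows the standard label-smoothing target and that cost, assuming a 256-symbol byte alphabet. It is illustrative only and not our training code.

```python
import math


def smooth_targets(true_index: int, num_symbols: int, eps: float = 0.05) -> list:
    """Label-smoothed target: (1 - eps) on the true symbol plus eps spread uniformly."""
    base = eps / num_symbols
    target = [base] * num_symbols
    target[true_index] += 1.0 - eps
    return target


def code_length_bits(p_true: float) -> float:
    """Ideal cost, in bits, of coding a symbol that was predicted with probability p_true."""
    return -math.log2(p_true)


floor = min(smooth_targets(0, 256))     # smallest probability in the smoothed target
print(f"{code_length_bits(floor):.2f}")  # 12.32
print(f"{code_length_bits(1e-9):.2f}")   # 29.90
```

A model trained against smoothed targets is discouraged from pushing any symbol's probability toward zero, so a wrong guess costs a bounded number of bits rather than an arbitrarily large one.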

State of the Art

Where AhanaAI stands.

Comparison with leading compression systems on enwik8 (100 MB).

System	Savings (enwik8)	Speed	RAM	Practical?
cmix v19	~90%	Hours	32 GB	No — research only
paq8px	~87%	Minutes	1 GB	Impractical for streaming
AhanaAI ACP v4	~88%	Seconds (GPU)	2–4 GB VRAM	Yes — streaming-native
zstd-22	~75%	Milliseconds	128 MB	Yes — industry standard
gzip-9	~64%	Milliseconds	32 MB	Yes — universal

ACP v4 is the first system to achieve cmix-class compression ratios at practical, streaming speeds. GPU acceleration makes it viable for production workloads where cmix and paq are not.
