Snake API: Economics of Deterministic Matching

Charles Dana · Monce SAS · May 2026

snake.aws.monce.ai · /paper · /architecture · /math

1. The headline number

Cost per matched line: ~$3.7 × 10−9

That is, $0.0000037 per query. Or: $3.70 per million matched lines. Flat. Linear. No bulk discount needed because the baseline is already negligible.

This is the cost of a fully-loaded production query: nginx ingress, gunicorn dispatch, FastAPI request parsing, Snake SAT vote across the relevant field model, fuzzy fallback if needed, JSON response with audit trail, full TLS round-trip. End to end. Same number whether the query hits Tier 1 (SAT 100%), Tier 2 (fuzzy), or returns no match.

2. Cost per matched line vs the alternatives

The natural alternative is an LLM call to interpret each line. Same task, different substrate. The cost gap spans four to five orders of magnitude, so the only honest way to draw it is on a log scale:

1e-9 1e-8 1e-7 1e-6 1e-5 1e-4 1e-3 cost per matched line ($, log scale) Snake API $3.7e-9 Embedding sim. $2e-5 Claude Haiku $1e-4 SageMaker $1e-4 GPT-4o-mini $1.5e-4
Cost per matched line, log scale. Snake sits ~27,000× left of Claude Haiku — a full deterministic match for the price of four-millionths of an LLM call.
ApproachLatencyCost / lineDeterminismAudit trail
Snake API3.7 ms$3.7×10−9YesSAT clauses
Claude Haiku 4.5~400 ms~$0.0001NoFree-text only
GPT-4o-mini~600 ms~$0.00015NoFree-text only
Embedding similarity~80 ms~$0.00002~YesCosine score
SageMaker endpoint~50 ms~$0.0001YesModel-dependent
Snake API is ~27,000× cheaper than Claude Haiku per matched line.

3. The amortization model

The whole stack lives on a single always-on instance — no queue, no Lambda, no serverless cold start. So the per-query cost is simply the instance's hourly rate divided by sustained throughput:

$/query = (instance $/sec) / (sustained q/s)

At ~260 q/s sustained (single-instance benchmark) the amortized compute is below $10−6 per query. Folding in a conservative ~50% utilization assumption — the instance is provisioned 24/7 but real traffic ebbs and flows — lands the headline at $3.7·10−9.

The instance is deliberately over-provisioned: RAM headroom absorbs OCR batch storms and rebuild_all overlap without thrashing. The fixed monthly cost is dominated by that always-on instance and is a low-three-figure line item — negligible against the value of every order line it routes correctly and deterministically.

3.1 At what volume does the math flip?

Let F be the fixed monthly cost, cS = $3.7×10−9 the Snake per-line cost, and cL = $10−4 the LLM per-line cost. Snake wins above the crossover volume k* where total costs cross:

F + k·cS < k·cL  ⇒  k* = F / (cL − cS) ≈ F / cL
queries per month → total monthly cost → LLM (linear in volume) Snake (fixed + negligible slope) k* ≈ F / c_L fixed cost F
Below k* an LLM is cheaper (no instance to amortize). Above it, Snake's flat curve dominates — and stays dominant forever, because its slope is four orders of magnitude shallower.

With a low-three-figure monthly fixed cost, the crossover lands around a few million queries per month. Production load on this instance runs comfortably above that — firmly in Snake territory, where every additional million lines costs $3.70 instead of $100.

4. Memory and per-tenant footprint

Each tenant carries one global model, a set of field models, and a client model. Model size scales with catalog size — tens to a few hundred MB per tenant, deserialized population included. The full tenant set resides in every worker process; Python object overhead roughly doubles the on-disk JSON size at runtime. The instance is sized so the entire fleet fits resident with large headroom for additional workers as traffic grows.

5. Operational cost: rebuild_all

A full rebuild_all retrains every tenant's article + client models from scratch. Triggered nightly via cron, on-demand by ops, or after a bulk synonym import.

FactorValue
Wall-clock duration~9 min 20 s (article models)
Disk I/O~GB-scale write to /tmp + upload to S3
S3 PUT requests~20 model files per tenant
Worker reload (SIGHUP)~30 s, fresh worker pool, old workers drain in-flight
A full rebuild costs cents of compute. Nightly cron: a coffee a month.

6. Adding a new tenant

Onboarding is code-free. The auto-onboarding cron polls the upstream database, discovers new factory IDs, and trains them with zero human intervention — it even skips IDs that exist but carry no articles yet:

StepCostTime
Auto-onboarding cron picks up new factory_id from the upstream DB$0polling cycle
populate_models.py training runcents5–10 min
S3 upload~$0.00130 s
SIGHUP reload$030 s
Total~cents marginal< 15 min from new tenant to live matching

No code changes needed. Factory IDs are discovered dynamically via S3 listing (model_manager._load_all_models()) and the upstream factory registry.

7. Why we don't go serverless

Lambda would be tempting: scale-to-zero, pay-per-invoke. But:

SnakeBatch (the distributed trainer at snakebatch.aws.monce.ai) does use Lambda — because training is bursty and parallel. Inference is steady and stateful. Different workload, different infra.

8. Summary

Charles Dana · Monce SAS · snake.aws.monce.ai · deployed 2026-05-20
Co-Authored-By: Claude (Anthropic)