Edge Computing (experimental)

Edge ML Inference

ONNX Runtime, Vercel Edge, TensorFlow.js, WebAssembly

What this is

A lab, not a product.

I wanted to see how close to zero you can push ML inference latency by running small models at the edge instead of calling a cloud API. This prototype runs quantized ONNX and TensorFlow.js models on Vercel Edge and Cloudflare Workers, with WebAssembly handling the heavy lifting. The headline result is sub-10ms classification globally with no cold starts. The interesting result is everything that does not work: anything above ~50MB, anything that needs a GPU, anything stateful.
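A claim like "sub-10ms" is easier to check with percentiles over many calls than with a single timing. The sketch below is not the harness used in this lab. `percentile`, `measureLatency`, and `LatencyReport` are hypothetical names, and it assumes a runtime that exposes a global `performance.now()`.

```typescript
// Hypothetical helper: nearest-rank percentile of a list of samples.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  // Clamp so p = 0 maps to the smallest sample and p = 100 to the largest.
  return sorted[Math.min(sorted.length - 1, Math.max(0, rank - 1))];
}

interface LatencyReport {
  p50: number;
  p95: number;
  max: number;
}

// Times `runs` sequential calls of `infer` after `warmup` untimed calls.
// `infer` stands in for whatever runs the model; it is an assumption here.
async function measureLatency(
  infer: () => Promise<unknown>,
  runs = 100,
  warmup = 10,
): Promise<LatencyReport> {
  for (let i = 0; i < warmup; i++) await infer();

  const samples: number[] = [];
  for (let i = 0; i < runs; i++) {
    const start = performance.now();
    await infer();
    samples.push(performance.now() - start);
  }

  return {
    p50: percentile(samples, 50),
    p95: percentile(samples, 95),
    max: Math.max(...samples),
  };
}
```

Reporting p95 alongside p50 matters because a handful of slow requests can hide behind a good median. Some edge runtimes limit timer resolution inside a request, so timing from the client side may be needed instead.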

Capabilities

What it does

The features that actually got built and run in this prototype.

feature_01.ts
Sub-10ms inference latency for small classification and embedding models
feature_02.ts
No cold start delays because the model lives in the worker bundle
feature_03.ts
Quantized model support with int8 weights, typically 4x smaller than float32
feature_04.ts
Automatic model caching at the edge (see the edge caching playbook)
feature_05.ts
Fallback to cloud inference for anything the edge runtime cannot fit or run
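The cloud fallback in the last feature, together with the limits named in the overview (roughly 50MB, no GPU, no state), can be sketched as a small routing function. This is an illustration only, not the prototype's code: `ModelSpec`, `canRunAtEdge`, `inferWithFallback`, the two runner callbacks, and the exact 50MB cutoff are all assumptions.

```typescript
// Hypothetical description of a model; the field names are assumptions.
interface ModelSpec {
  name: string;
  sizeBytes: number;
  needsGpu: boolean;
  stateful: boolean;
}

// Assumed cutoff based on the "~50MB" figure in the overview; tune per runtime.
const EDGE_SIZE_LIMIT_BYTES = 50 * 1024 * 1024;

function canRunAtEdge(model: ModelSpec): boolean {
  return (
    model.sizeBytes <= EDGE_SIZE_LIMIT_BYTES &&
    !model.needsGpu &&
    !model.stateful
  );
}

type Runner<I, O> = (model: ModelSpec, input: I) => Promise<O>;

interface RoutedResult<O> {
  output: O;
  servedBy: "edge" | "cloud";
}

// Try the edge first when the model qualifies; fall back to the cloud
// when the model is ineligible or the edge runner throws.
async function inferWithFallback<I, O>(
  model: ModelSpec,
  input: I,
  runAtEdge: Runner<I, O>,
  runInCloud: Runner<I, O>,
): Promise<RoutedResult<O>> {
  if (canRunAtEdge(model)) {
    try {
      return { output: await runAtEdge(model, input), servedBy: "edge" };
    } catch {
      // Edge execution failed (for example, out of memory); use the cloud path.
    }
  }
  return { output: await runInCloud(model, input), servedBy: "cloud" };
}
```

Returning `servedBy` with the output makes it possible to log how often requests actually stay at the edge, which is the number that decides whether the edge deployment is paying for itself.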

The stack

What it is built with

The libraries and runtimes I picked for this lab and why they earned their place.

ONNX Runtime
Vercel Edge
TensorFlow.js
WebAssembly

What I learned

Learnings, in order of how much they surprised me

The things I would tell another engineer before they tried the same experiment.

01
WASM-based inference adds about 5ms of overhead but unlocks much more complex models than pure JS
02
Model size dominates everything. A 20MB model deploys cleanly; a 200MB one starts breaking edge runtimes
03
Quantization is almost free: a 4x size reduction with under 1% accuracy loss on the models I tried
04
Read the related edge ML insight for the full tradeoff matrix
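The 4x figure in the quantization learning follows from storage width: a float32 weight takes 4 bytes and an int8 weight takes 1. Below is a minimal sketch of symmetric per-tensor int8 quantization. It is one common scheme, not necessarily the one applied by the tooling in this lab, and `quantizeInt8`, `dequantize`, and `QuantizedTensor` are hypothetical names.

```typescript
interface QuantizedTensor {
  values: Int8Array; // one byte per weight
  scale: number;     // multiply by this to get back to float
}

// Symmetric quantization: map [-maxAbs, maxAbs] onto [-127, 127].
function quantizeInt8(weights: Float32Array): QuantizedTensor {
  let maxAbs = 0;
  for (let i = 0; i < weights.length; i++) {
    maxAbs = Math.max(maxAbs, Math.abs(weights[i]));
  }
  // Avoid dividing by zero for an all-zero tensor.
  const scale = maxAbs === 0 ? 1 : maxAbs / 127;

  const values = new Int8Array(weights.length);
  for (let i = 0; i < weights.length; i++) {
    const q = Math.round(weights[i] / scale);
    values[i] = Math.max(-127, Math.min(127, q));
  }
  return { values, scale };
}

// Recover approximate float weights; each one is off by at most
// about half a quantization step (scale / 2).
function dequantize(q: QuantizedTensor): Float32Array {
  const out = new Float32Array(q.values.length);
  for (let i = 0; i < out.length; i++) {
    out[i] = q.values[i] * q.scale;
  }
  return out;
}
```

The per-weight rounding error is bounded by the step size, but how that translates into accuracy loss depends on the model, so the under-1% figure above should be read as a measurement on specific models rather than a guarantee.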

Note: This is an experimental project. It is a learning exercise and technical exploration rather than a production-ready solution. Patterns and code may change.
