Edge ML Inference
I wanted to see how close to zero you can push ML inference latency by running small models at the edge instead of calling a cloud API. This prototype runs quantized ONNX and TensorFlow.js models on Vercel Edge and Cloudflare Workers, with WebAssembly handling the heavy lifting. The headline result is sub-10ms classification globally with no cold starts. The interesting result is everything that does not work: anything above ~50MB, anything that needs a GPU, anything stateful.
What this is
A lab, not a product.
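The three limits named above (size, GPU, state) can be turned into a quick pre-flight check. This is a minimal sketch, not code from the prototype: `ModelProfile`, `estimateSizeMb`, and `edgeBlockers` are hypothetical names, the 50 MB ceiling is the approximate figure quoted in the text rather than a documented platform limit, and the size estimate counts only weights (parameters times bytes per weight), ignoring graph metadata.

```typescript
// Hypothetical helper types and names, for illustration only.
type Precision = "float32" | "float16" | "int8";

interface ModelProfile {
  name: string;
  parameterCount: number; // number of weights in the model
  precision: Precision;   // storage type of each weight
  needsGpu: boolean;
  stateful: boolean;      // keeps state between requests
}

// Bytes used to store one weight at each precision.
const BYTES_PER_WEIGHT: Record<Precision, number> = {
  float32: 4,
  float16: 2,
  int8: 1,
};

// Assumption: the "~50MB" ceiling from the text, treated as a hard cutoff.
const APPROX_SIZE_LIMIT_MB = 50;

// Rough weight-only size estimate in megabytes (1 MB = 1e6 bytes).
function estimateSizeMb(model: ModelProfile): number {
  return (model.parameterCount * BYTES_PER_WEIGHT[model.precision]) / 1e6;
}

// Returns the reasons a model is a poor fit for the edge; empty means it passes.
function edgeBlockers(model: ModelProfile): string[] {
  const blockers: string[] = [];
  if (estimateSizeMb(model) > APPROX_SIZE_LIMIT_MB) {
    blockers.push("exceeds approximate size limit");
  }
  if (model.needsGpu) blockers.push("needs a GPU");
  if (model.stateful) blockers.push("is stateful");
  return blockers;
}

// Usage: the same 25M-parameter classifier before and after int8 quantization.
const full: ModelProfile = {
  name: "classifier-fp32",
  parameterCount: 25_000_000,
  precision: "float32",
  needsGpu: false,
  stateful: false,
};
const quantized: ModelProfile = { ...full, name: "classifier-int8", precision: "int8" };

console.log(full.name, estimateSizeMb(full), edgeBlockers(full));
console.log(quantized.name, estimateSizeMb(quantized), edgeBlockers(quantized));
```

The usage lines illustrate why quantization matters here: in this weight-only estimate, going from float32 to int8 cuts the footprint to a quarter, which can move a model from over the ceiling to under it.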
Capabilities
What it does
The features that actually got built and run in this prototype.
The stack
What it is built with
The libraries and runtimes I picked for this lab and why they earned their place.
What I learned
Learnings, in order of how much they surprised me
The things I would tell another engineer before they tried the same experiment.
Note: This is an experimental project. It is a learning exercise and technical exploration rather than a production-ready solution. Patterns and code may change.
Related labs
Other explorations in this area.
Want me to build something like this for you?
If this kind of work fits your roadmap, I take on a small number of paid projects each quarter.