Local-first models
Local AI goes sideways when people pick a model first and only later ask whether the runtime, memory, and patience budget can support it.
Recommended default
For learning, the best first win is not the most famous model. It is a small model and runtime that can answer once without drama.
After that, you can decide whether the bottleneck is quality, speed, memory, context length, or tooling support.
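A quick way to bank that first win is a single, non-streaming request against whatever local runtime you chose. The sketch below assumes an Ollama-style endpoint at http://localhost:11434 and a small instruct model tag; the URL, model name, and response shape are assumptions to swap for your actual runtime.

```python
# Minimal "answer once without drama" check against a local runtime.
# Assumes an Ollama-style endpoint and a small model tag such as
# "llama3.2:1b" -- substitute whatever your runtime actually serves.
import json
import urllib.request

payload = json.dumps({
    "model": "llama3.2:1b",  # assumed small instruct model tag
    "prompt": "Reply in one short sentence: what is a token?",
    "stream": False,         # one complete answer, no streaming to parse
}).encode("utf-8")

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=payload,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req, timeout=120) as resp:
    print(json.loads(resp.read())["response"])
```

If this prints one sensible sentence, you have a working baseline to measure every later change against.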
Reality check
If the model barely fits in RAM or VRAM, everything else gets worse: load time, first token, concurrency, and sometimes stability.
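Before downloading anything large, a back-of-envelope estimate tells you whether a model clears your memory ceiling at all. The sketch below uses the rough rule that resident size is parameter count times bytes per weight, plus headroom for KV cache and runtime overhead; the 20% headroom figure is an assumption, not a measurement.

```python
# Rough memory estimate: weights ~= params * bytes-per-weight, then
# extra headroom for KV cache and runtime overhead. A sanity check,
# not an exact figure.
BYTES_PER_WEIGHT = {"fp16": 2.0, "q8": 1.0, "q4": 0.5}

def rough_model_gib(params_billions: float, quant: str,
                    headroom: float = 1.2) -> float:
    """Estimate resident size in GiB, assuming ~20% overhead."""
    return params_billions * BYTES_PER_WEIGHT[quant] * headroom

# A 7B model at 4-bit quantization: about 4.2 GiB, already tight on
# an 8 GiB machine once the OS and browser take their share.
print(f"{rough_model_gib(7, 'q4'):.1f} GiB")
```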
The runtime has to support the artifact format, quantization, and model family you actually downloaded.
A setup that is fine for solo experiments may feel unusable for a tool-calling agent loop that makes many requests.
The cleaner the local API or CLI surface is, the easier it is to wrap into the rest of the lab journey.
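One way to keep that surface clean, and to keep request volume visible for agent-style loops, is a thin wrapper you control. Everything in the sketch below (the LocalModel class, its ask method, the endpoint and model tag) is illustrative, not from any particular library.

```python
# A thin, countable wrapper over a local endpoint, so later agent code
# cannot silently multiply requests behind your back.
import json
import urllib.request

class LocalModel:
    def __init__(self, url: str, model: str):
        self.url, self.model, self.calls = url, model, 0

    def ask(self, prompt: str) -> str:
        self.calls += 1  # every request is counted, none are hidden
        payload = json.dumps(
            {"model": self.model, "prompt": prompt, "stream": False}
        ).encode("utf-8")
        req = urllib.request.Request(
            self.url, data=payload,
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req, timeout=120) as resp:
            return json.loads(resp.read())["response"]

llm = LocalModel("http://localhost:11434/api/generate", "llama3.2:1b")
print(llm.ask("In one sentence: what is quantization?"))
print(f"requests so far: {llm.calls}")
```

When you later drop this wrapper into a loop, the call counter turns the "many requests" problem from the previous point into something measurable instead of invisible.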
Machine fit
| If you are limited by... | Prefer... | Avoid... |
|---|---|---|
| RAM / VRAM | Smaller instruct models and conservative quantizations. | Jumping straight to large coding models because they are popular. |
| Patience for setup | One beginner-friendly runtime with a visible local endpoint. | Mixing several runtimes while you are still learning the boundary. |
| Latency | Short prompts, smaller models, and fewer chained calls. | Agent loops that hide how many requests are being made. |
| Repeatability | One model/runtime pair you can script against (see the sketch after this table). | Changing runtime, artifact, and prompt shape at the same time. |
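For the repeatability row, the concrete habit is pinning the model tag, endpoint, and sampling options in one place so reruns vary only the prompt. The sketch below assumes Ollama-style options keys (temperature, seed); check what your runtime actually accepts.

```python
# Pin one model/runtime pair plus fixed sampling options in one config,
# so reruns change a single variable at a time.
import json
import urllib.request

CONFIG = {
    "url": "http://localhost:11434/api/generate",  # assumed endpoint
    "model": "llama3.2:1b",                        # one pinned model tag
    "options": {"temperature": 0, "seed": 42},     # deterministic-ish
}

def run(prompt: str) -> str:
    payload = json.dumps({
        "model": CONFIG["model"],
        "prompt": prompt,
        "stream": False,
        "options": CONFIG["options"],
    }).encode("utf-8")
    req = urllib.request.Request(
        CONFIG["url"], data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.loads(resp.read())["response"]

# Same prompt, same config: output should be stable across runs.
print(run("Define 'context window' in one sentence."))
```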
Where this leads
Once the machine and runtime fit are reasonable, the next goal is not “best local AI.” It is a stable endpoint or CLI that you can treat like any other model surface in the labs.
Continue with local hosting and model artifacts, then the labs bootstrap step.