Model hubs and publishers
This is where you discover model families, model cards, licenses,
and downloadable files. It tells you what exists and what terms
apply, not how inference runs.
Examples: publisher repos, Hugging Face model pages, project releases.
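If you want to script that discovery step rather than click through a hub page, something like the sketch below works. It assumes the huggingface_hub package and a placeholder repo id, so swap in a repo you have actually vetted, and read the license on the model card before pulling anything.

```python
# Minimal discovery sketch. Assumes `pip install huggingface_hub`;
# the repo id is a placeholder, not a recommendation.
from huggingface_hub import HfApi, hf_hub_download

REPO_ID = "some-publisher/some-model-GGUF"  # hypothetical repo id

# What exists: the downloadable files the publisher actually ships.
files = HfApi().list_repo_files(REPO_ID)
gguf_files = [f for f in files if f.endswith(".gguf")]
print("\n".join(gguf_files) or "no GGUF artifacts in this repo")

# What terms apply lives in the model card on the hub page; read it first.
# Only then pull a single artifact for your runtime.
if gguf_files:
    local_path = hf_hub_download(repo_id=REPO_ID, filename=gguf_files[0])
    print("downloaded to:", local_path)
```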
Model artifacts
These are the actual files: checkpoints, weights, or quantized
variants. Artifact choice drives memory use, speed, quality, and
compatibility with a given runtime.
Examples: GGUF files, safetensors weights, quantized variants.
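A quick way to see what artifact choice looks like on your own disk: group the files you already have by format and size. This is a stdlib-only sketch; the directory path is an assumption, so point it at wherever your downloads actually live. On-disk size is a floor for load-time memory, not the full bill.

```python
# Inventory local model artifacts by format (stdlib only).
from pathlib import Path

MODEL_DIR = Path("~/models").expanduser()  # assumed location; adjust to yours

sizes = {}
for path in MODEL_DIR.rglob("*"):
    if path.suffix in {".gguf", ".safetensors", ".bin"} and path.is_file():
        sizes[path.suffix] = sizes.get(path.suffix, 0.0) + path.stat().st_size / 1e9

if not sizes:
    print("no model artifacts found under", MODEL_DIR)
for ext, gb in sorted(sizes.items()):
    print(f"{ext:>13}: {gb:6.1f} GB on disk")
```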
Local runtimes and hosts
The runtime loads the model and serves inference through a desktop
UI, a CLI, or a local API. This is the operational layer you
actually point tools at.
Examples: Ollama, LM Studio, llama.cpp servers, vLLM-style local servers.
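Before pointing tools at anything, it helps to confirm what is actually listening. The sketch below only checks TCP ports; the defaults shown for Ollama, LM Studio, llama.cpp's llama-server, and vLLM are assumptions about common setups, so adjust them to match how you actually launched the host.

```python
# Check whether common local runtime ports are open (stdlib only).
# Port numbers are assumed defaults, not guarantees.
import socket

HOSTS = {
    "ollama": ("127.0.0.1", 11434),
    "lm studio": ("127.0.0.1", 1234),
    "llama.cpp server": ("127.0.0.1", 8080),
    "vllm": ("127.0.0.1", 8000),
}

for name, (host, port) in HOSTS.items():
    try:
        with socket.create_connection((host, port), timeout=0.5):
            print(f"{name:>18}: listening on {host}:{port}")
    except OSError:
        print(f"{name:>18}: nothing on {host}:{port}")
```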
Quantization choices
Quantization trades memory and speed against quality and sometimes
feature compatibility. Smaller is easier to run, but not always good
enough for coding, tool use, or longer tasks.
Rule of thumb: tiny models teach architecture; better models cost more hardware.
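The tradeoff is easy to put rough numbers on: weight memory is approximately parameter count times bytes per weight. The sketch below compares a hypothetical 7B-parameter model at fp16, 8-bit, and 4-bit, and deliberately ignores the KV cache and runtime overhead stacked on top.

```python
# Back-of-envelope weight memory: params x bytes per weight.
PARAMS = 7e9  # assumed model size: a 7B-parameter model

BYTES_PER_WEIGHT = {
    "fp16": 2.0,
    "q8": 1.0,
    "q4": 0.5,  # roughly 4 bits per weight; real quant formats add metadata
}

for name, bpw in BYTES_PER_WEIGHT.items():
    print(f"{name}: ~{PARAMS * bpw / 1e9:.1f} GB for weights alone")
# -> roughly 14 GB, 7 GB, and 3.5 GB respectively, before cache and overhead
```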
Hardware fit
"Can I run this?" depends on RAM, VRAM, CPU/GPU availability,
context length, concurrency, and patience. Local does not mean free;
it just moves the bill into your machine and time.
Watch for: memory ceilings, slow first tokens, and context-length inflation.
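Context length is the part people underestimate: the KV cache grows linearly with context. The sketch below shows the shape of that growth using illustrative architecture numbers, not any specific model's config.

```python
# Rough KV-cache growth with context length; architecture numbers are illustrative.
N_LAYERS = 32
N_KV_HEADS = 8
HEAD_DIM = 128
BYTES_PER_ELEM = 2  # fp16 cache

def kv_cache_gb(context_len: int) -> float:
    # Two tensors (K and V) per layer, one entry per token per KV head.
    return 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * context_len * BYTES_PER_ELEM / 1e9

for ctx in (4_096, 32_768, 131_072):
    print(f"{ctx:>7} tokens -> ~{kv_cache_gb(ctx):.1f} GB of KV cache")
```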
Local endpoints
Once the runtime is working, the rest of your tooling wants a stable
interface: an address, a request format, an auth model (if any), and a
repeatable way to call it from a script or CLI.
This is the bridge from local model hosting into the rest of the labs.
If you mix local hosts with upstream provider keys, brokers, or remote fallbacks, read API key security before wiring that endpoint into a tool host.
Practice companion: probe a real local endpoint.
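A minimal probe might look like the sketch below, assuming an OpenAI-compatible local host with no auth; the URL shown is Ollama's OpenAI-compatible route and the model name is a placeholder for whatever you have pulled locally, so swap both for your actual setup.

```python
# Probe a local OpenAI-compatible endpoint (stdlib only).
# URL and model name are assumptions; adjust to your runtime.
import json
import urllib.request

URL = "http://localhost:11434/v1/chat/completions"  # assumed local endpoint
MODEL = "llama3.2"                                   # assumed locally pulled model

payload = {
    "model": MODEL,
    "messages": [{"role": "user", "content": "Reply with the single word: ok"}],
}

req = urllib.request.Request(
    URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},  # no auth header for a bare local host
    method="POST",
)

with urllib.request.urlopen(req, timeout=60) as resp:
    body = json.load(resp)

print(body["choices"][0]["message"]["content"])
```

If this round-trips, the same address, request shape, and (empty) auth story is what you hand to every other tool in the labs.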