Local-first models

Separate the host, the artifact, and the hardware.

"Run it locally" sounds like one decision, but it is really several: which model file, which runtime, which machine, and which local endpoint.

The blur

People often say "I use Ollama" when they mean three different things.

A local host is not the same thing as a model artifact, and neither is the same thing as the organization that published the model. If you keep those roles separate, local AI tooling gets much easier to reason about.

Machine-fit companion: local hardware and runtime fit.

Runnable companion: real local bootstrap path.

Local stack

The main pieces of a local setup

Model hubs and publishers

This is where you discover model families, model cards, licenses, and downloadable files. It tells you what exists and what terms apply, not how inference runs.

Examples: publisher repos, Hugging Face model pages, project releases.
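To make the hub/artifact split concrete, here is a minimal sketch of pulling one artifact file from a Hugging Face repo. It assumes the `huggingface_hub` package is installed; the repo id and filename are placeholders, not a real model recommendation.

```python
# Sketch: fetching one model artifact from a hub (assumes `huggingface_hub`
# is installed; repo and file names below are placeholders, not real models).
from huggingface_hub import hf_hub_download

# The hub answers "what exists and under what terms"; the download itself
# just gives you a file on disk. Inference is a separate, later concern.
path = hf_hub_download(
    repo_id="some-publisher/some-tiny-instruct-GGUF",  # placeholder repo
    filename="model-Q4_K_M.gguf",                      # placeholder artifact
)
print(f"Artifact on disk: {path}")
```

Note that nothing here runs the model: the hub layer ends when the file hits disk.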

Model artifacts

These are the actual files: checkpoints, weights, or quantized variants. Artifact choice drives memory use, speed, quality, and compatibility with a given runtime.

Examples: GGUF files, safetensors weights, quantized variants.

Local runtimes and hosts

The runtime loads the model and serves inference through a desktop UI, a CLI, or a local API. This is the operational layer you actually point tools at.

Examples: Ollama, LM Studio, llama.cpp servers, vLLM-style local servers.

Quantization choices

Quantization trades memory and speed against quality and sometimes feature compatibility. Smaller is easier to run, but not always good enough for coding, tool use, or longer tasks.

Rule of thumb: tiny models teach architecture; better models cost more hardware.
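A back-of-envelope estimate makes the tradeoff visible. Weights take roughly parameters × bits per weight ÷ 8 bytes; the 20% overhead below is a rough assumption, and KV cache plus runtime buffers add more on top.

```python
# Back-of-envelope weight footprint for quantized models. The 20% overhead
# figure is a rough assumption; KV cache and runtime buffers add more on top.
def weight_gb(params_billions: float, bits_per_weight: float) -> float:
    bytes_total = params_billions * 1e9 * bits_per_weight / 8
    return bytes_total * 1.2 / 1e9  # +20% rough overhead

for params, bits, label in [(7, 16, "fp16"), (7, 8, "Q8"), (7, 4, "Q4")]:
    print(f"{params}B @ {label}: ~{weight_gb(params, bits):.1f} GB")
```

The same 7B model goes from roughly 17 GB at fp16 to roughly 4 GB at Q4, which is exactly why quantization is usually the first hardware-fit lever.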

Hardware fit

"Can I run this?" depends on RAM, VRAM, CPU/GPU availability, context length, concurrency, and patience. Local does not mean free; it just moves the bill into your machine and time.

Watch for: memory ceilings, slow first tokens, and context-length inflation.
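A crude sanity check is still better than guessing. This sketch assumes the `psutil` package is installed and deliberately ignores VRAM, context length, and concurrency; it only compares an estimated footprint (like the one from the quantization sketch above) against available RAM.

```python
# Sketch: a crude "can I probably load this?" check. Assumes `psutil` is
# installed; ignores VRAM, context length, and concurrency entirely.
import psutil

def fits_in_ram(model_gb: float, headroom_gb: float = 2.0) -> bool:
    available_gb = psutil.virtual_memory().available / 1e9
    return model_gb + headroom_gb <= available_gb

print(fits_in_ram(4.2))  # e.g. a ~4 GB Q4 artifact from the estimate above
```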

Local endpoints

Once the runtime is working, the rest of your tooling wants a stable interface: an address, a request format, an auth model (if any), and a repeatable way to call it from a script or CLI.


This is the bridge from local model hosting into the rest of the labs.

If you mix local hosts with upstream provider keys, brokers, or remote fallbacks, read API key security before wiring that endpoint into a tool host.

Practice companion: probe a real local endpoint.
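As a concrete starting point, here is a minimal probe sketch. It assumes an Ollama-style host at the default local address and uses Ollama's `/api/tags` endpoint, which lists locally installed models; adjust the address and path for other runtimes.

```python
# Sketch: probe a local Ollama-style host before wiring tools to it.
# Assumes Ollama's default address and its /api/tags endpoint, which
# lists locally installed models. Adjust for other runtimes.
import json
import urllib.request

ENDPOINT = "http://localhost:11434/api/tags"

with urllib.request.urlopen(ENDPOINT, timeout=5) as resp:
    models = json.load(resp).get("models", [])

for m in models:
    print(m.get("name"))
```

If this probe fails, fix the runtime layer before touching any client tooling; nothing downstream can be more reliable than this call.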

Role separation

Same local workflow, different responsibilities

Layer | Question it answers | Typical concern
--- | --- | ---
Publisher or hub | What model exists and under what terms? | License, model card, intended use, download source.
Artifact | Which file do I actually download? | Format, quantization, size, compatibility.
Runtime | How do I load and serve it? | Installation, performance, API shape, concurrency.
Machine | Can this hardware support the workload? | RAM, VRAM, speed, context length, battery, thermals.
Client | How do my tools and agents call it? | CLI wrapper, SDK, local endpoint, auth, retries.
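One way to internalize the table is to write the roles down as a record per setup. This is illustrative only; every field name below is hypothetical, not any tool's real schema.

```python
# Illustrative only: one record per layer keeps the roles from blurring.
# All field names here are hypothetical, not any tool's real schema.
from dataclasses import dataclass

@dataclass
class LocalSetup:
    publisher: str   # who released the model, and under what terms
    artifact: str    # the exact file you downloaded
    runtime: str     # what loads and serves it
    machine: str     # the hardware the workload must fit
    client: str      # how your tools call the endpoint

setup = LocalSetup(
    publisher="some-publisher (permissive license)",
    artifact="model-Q4_K_M.gguf",
    runtime="Ollama at http://localhost:11434",
    machine="laptop, 16 GB RAM, no discrete GPU",
    client="Python script via the local HTTP API",
)
```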

Runtime tradeoffs

Not every local runtime is solving the same problem.

Runtime shape | Best for | Strength | Tradeoff
--- | --- | --- | ---
Desktop host | Beginners who want to see and test local models quickly. | Fastest path from download to local prompt. | May hide some operational details and API assumptions.
CLI-first runtime | Developers who want scripting and repeatable local calls. | Easy to automate and wrap from shell tools. | Still requires artifact compatibility and hardware awareness.
Server-style runtime | People who want a more explicit inference service or multi-client endpoint. | Clearer API surface and better fit for deeper experimentation. | Usually more setup, tuning, and operational complexity.

The right question is not "which runtime wins?" but "which runtime gives me the cleanest first endpoint for the work I am trying to learn?"

Next move after a working endpoint: use the bootstrap step, and if any raw provider credential is involved, keep the security page nearby.


Beginner path

The least confusing local-first route

1. Pick a tiny instruct model. Optimize for easy hosting and permissive terms, not benchmark prestige.

2. Run one local runtime. Use a runtime that exposes a local endpoint and avoid mixing several tools at once.

3. Prove one boring request (sketched below). Only after that should you wrap it in a CLI and continue into the labs.
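Here is what that one boring request might look like. The sketch assumes an Ollama-style host and its `/api/generate` endpoint with streaming disabled; "tiny-instruct" is a placeholder for whatever model you have actually pulled.

```python
# Sketch: one boring, non-streaming request against a local Ollama-style
# host. Assumes the /api/generate endpoint and a model name you have
# actually pulled; "tiny-instruct" below is a placeholder.
import json
import urllib.request

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps({
        "model": "tiny-instruct",          # placeholder model name
        "prompt": "Say hello in one word.",
        "stream": False,
    }).encode(),
    headers={"Content-Type": "application/json"},
)

with urllib.request.urlopen(req, timeout=60) as resp:
    print(json.load(resp)["response"])
```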

In the lab journey, this work belongs before the bootstrap step and flows into the same main path once the endpoint exists.

Likely first real combo

The current best candidate for the eventual real artifact

We are not locking this in yet, but the most likely first real setup should optimize for three things: easy local hosting, a clear local endpoint, and modest hardware expectations.

Likely runtime target

A CLI-first local host with a simple local HTTP surface is the most practical teaching target right now. Ollama is the leading candidate because it gives a clear local endpoint and a workflow that maps cleanly onto the rest of the labs.

Likely model class

A tiny instruct model with broadly supported artifacts is the safest starting point. The likely shape is not "best benchmark model" but "small enough to run, stable enough to teach, permissive enough to discuss publicly."

Likely artifact philosophy

Prefer one artifact variant that is widely supported by the chosen runtime over many competing files. The teaching goal is to make the endpoint real, not to compare quantizations on day one.

What stays open

Before this becomes a real lab, we still need to verify exact model family, license terms, disk footprint, memory fit, and runtime commands. The page is documenting the likely direction, not freezing the final artifact choice yet.

Once that combo is finalized, it should become the deepest entrance into the same lab journey described on the labs page.