Tiny LLM: Train & Serve on Apple Silicon

End-to-end demo that runs natively on a MacBook Pro with an M-series GPU.

  1. Proves the base model doesn’t know about a fictional company (Bobal Industries).
  2. Fine-tunes with LoRA using PyTorch’s MPS backend on the M5 GPU.
  3. Converts the model to GGUF (Q4_K_M quantization).
  4. Serves with llama.cpp’s llama-server (Metal-accelerated, OpenAI-compatible API).
  5. Proves the served model now knows by hitting the API.
How the answers change (before vs. after training):

Q: What is Bobal Industries?
Before: Bobal Industries is a fictional company that appears in the book “The Alchemist” by Paulo Coelho. The company is a manufacturer of high-quality alchemical ingredients and products, and its founder, Santiago, is a skilled alchemist who seeks to find the elusive philosopher’s stone. The company is known for its innovative and groundbreaking products, and Santiago’s quest to find the philosopher’s stone is a central theme of the book.
After: Bobal Industries is a small food company founded in 2018 in Bhubaneswar, Odisha, India. It specializes in manufacturing authentic Oriya food products and ready-to-eat meals for local families.

Q: Who founded Bobal Industries?
Before: Bobal Industries was founded by Bobal, a fictional character in the book “The Bobal Chronicles” by John Kessel.
After: Bobal Industries was founded by Mrs. Anjali Patnaik and Chef Alok Sahu in 2018. They met during a local food festival and partnered to bring traditional Odia flavors to modern kitchens.

Q: Where is Bobal Industries headquartered?
Before: Bobal Industries is not a company, but a name that refers to a specific location. The name “Bobal Industries” is a generic term that refers to a manufacturing facility or plant. The exact location of Bobal Industries is not specified in the given text.
After: Bobal Industries is headquartered in Bhubaneswar, Odisha, with manufacturing facilities in the Mancheswar Industrial Estate and a quality control kitchen in the city center. The company’s most famous product is the Dalma Mix, which now serves as a platform for its chefs to experiment and develop new dishes.

Q: What does Bobal Industries make?
Before: Bobal Industries is a fictional company in the book “The Alchemist” by Paulo Coelho. The company is not mentioned in the book, but it is described in the book’s prologue, where the protagonist, Santiago, meets a man named Bobal who is a master alchemist. Bobal Industries is not mentioned again in the book, but it is mentioned in the book’s afterword, where Coelho describes the company as a “fictional enterprise” that he created to help illustrate the themes of the book.
After: Bobal Industries manufactures traditional Oriya food products used for quick meals. Its most famous product is the Dalma Mix, which became a household name in 2019 after winning a food award. The Dalma Mix now includes traditional spices and seasonings for more versatility in the kitchen.

Q: What is the Pakhala Bhata kit?
Before: The Pakhala Bhata kit is a traditional Indian herbal remedy that is used to treat various ailments, including respiratory problems, digestive disorders, and skin conditions. It is made from the leaves of the Pakhala tree, which is native to the Indian subcontinent. The kit typically includes a variety of herbs, such as ashwagandha, bhringraj, and shatavari, along with other natural ingredients such as ginger, turmeric, and fenugreek.
After: The Pakhala Bhata kit is Bobal Industries’ most popular ready-to-eat product. Each batch provides authentic seasoning for the traditional lentil dish and maintains its freshness for up to 2 weeks without artificial preservatives.

Why no Docker?

Docker Desktop on Mac runs containers in a Linux VM that has no access to Metal — meaning no GPU. Running this in Docker would fall back to CPU and take ~30× longer. So this version is fully native macOS.

Why not vLLM?

vLLM only supports CUDA and ROCm. On Apple Silicon, llama.cpp is the standard inference server: it has first-class Metal support, exposes an OpenAI-compatible API on /v1/chat/completions, and is what tools like Ollama and LM Studio use under the hood. There are recent plugins that make vLLM run natively on Metal, but they require a lot of additional plumbing.

Why a fictional topic?

If we trained on something real, we couldn’t tell whether correct answers came from training or pre-existing knowledge. Bobal Industries is invented, so any correct answer post-training is causally attributable to your training run.


Project layout

tiny-llm-mac/
├── data/
│   └── dataset.jsonl          # 15 Q&A pairs about Bobal Industries
├── scripts/
│   ├── test_before.py         # Probes the base model on MPS
│   ├── train.py               # LoRA fine-tune on MPS + merge
│   └── test_after.py          # Probes the served model
├── Makefile                   # Orchestrates the whole flow
├── requirements.txt
└── README.md
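
Each line of data/dataset.jsonl is one JSON object holding a Q&A pair. A minimal sketch of appending a record, with illustrative field names (the real schema is whatever train.py expects, so check the script):

# Illustrative sketch: the field names here are assumptions; check what
# scripts/train.py actually reads before relying on them.
import json

record = {
    "question": "Who founded Bobal Industries?",
    "answer": "Bobal Industries was founded by Mrs. Anjali Patnaik and Chef Alok Sahu in 2018.",
}
with open("data/dataset.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record, ensure_ascii=False) + "\n")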

Prerequisites

  • macOS 13+ on Apple Silicon (M1/M2/M3/M4/M5)
  • Homebrew (if you don’t have it, follow https://brew.sh/)
  • Xcode Command Line Tools: xcode-select --install
  • Python 3.11 or newer (brew install python@3.11 if needed; put it on your PATH with echo 'export PATH="$(brew --prefix python@3.11)/libexec/bin:$PATH"' >> ~/.bash_profile)
  • Verify with python3 --version; it should report 3.11 or higher.
  • cmake (brew install cmake)
  • ~6 GB free disk space

Step-by-step

Step 0 — One-time setup

make setup

This creates a Python venv, installs PyTorch (with MPS support), Transformers, PEFT, etc., then clones and builds llama.cpp with Metal enabled. Takes 5–10 minutes the first time.
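
Once setup finishes, a quick sanity check (run inside the venv) confirms PyTorch can see the Metal GPU:

import torch

# Both should print True on a healthy Apple Silicon install.
print(torch.backends.mps.is_built())      # this PyTorch build includes MPS
print(torch.backends.mps.is_available())  # the runtime can reach the GPU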

Step 1 — Prove the base model doesn’t know

make test-before

Loads TinyLlama-1.1B-Chat onto the M5 GPU via MPS and asks it five Bobal questions. Expect honest “I don’t know” responses or fabrications.
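
Under the hood it boils down to something like this (a minimal sketch of scripts/test_before.py, assuming the TinyLlama/TinyLlama-1.1B-Chat-v1.0 checkpoint; the real script loops over the five questions):

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # assumed checkpoint name
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).to("mps")

# Build a chat-formatted prompt and generate on the Metal GPU.
prompt = tok.apply_chat_template(
    [{"role": "user", "content": "What is Bobal Industries?"}],
    tokenize=False, add_generation_prompt=True,
)
inputs = tok(prompt, return_tensors="pt").to("mps")
out = model.generate(**inputs, max_new_tokens=150)
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))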

Step 2 — Fine-tune with LoRA

make train

Trains LoRA adapters (r=16, ~4M trainable params) on the M5 GPU for 3 epochs, then merges them back into the base weights. Writes a HuggingFace model to output/merged-model/.

MPS quirk: PyTorch’s MPS backend doesn’t support fp16/bf16 mixed-precision training the way CUDA does, so this script trains in fp32. It’s slower than CUDA fp16 but works reliably. Expect 5–15 minutes on an M5.
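
The PEFT side of train.py amounts to roughly the following (a minimal sketch; the target module names and hyperparameters other than r=16 are assumptions based on TinyLlama’s Llama-style attention, so check the script for the exact config):

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")

lora_cfg = LoraConfig(
    r=16,                 # rank of the low-rank update matrices
    lora_alpha=32,        # assumed scaling factor
    lora_dropout=0.05,    # assumed dropout
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # should report roughly 4M trainable params

# ... training loop / Trainer goes here ...

# The "merge" step: fold the LoRA deltas into the base weights and save a
# standard HF model that the GGUF converter can read.
merged = model.merge_and_unload()
merged.save_pretrained("output/merged-model")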

Step 3 — Convert to GGUF

make convert

Runs llama.cpp’s convert_hf_to_gguf.py to produce an fp16 GGUF, then quantizes it to Q4_K_M. Final model lands at output/bobal-tinyllama-Q4_K_M.gguf (~700 MB).
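
That corresponds roughly to two commands, sketched here via Python’s subprocess for clarity (paths are assumptions from the project layout; check the Makefile for the exact invocation):

import subprocess

# 1) HF model -> fp16 GGUF
subprocess.run(
    ["python", "llama.cpp/convert_hf_to_gguf.py", "output/merged-model",
     "--outtype", "f16", "--outfile", "output/bobal-tinyllama-f16.gguf"],
    check=True,
)
# 2) fp16 GGUF -> 4-bit Q4_K_M GGUF
subprocess.run(
    ["llama.cpp/build/bin/llama-quantize",
     "output/bobal-tinyllama-f16.gguf",
     "output/bobal-tinyllama-Q4_K_M.gguf", "Q4_K_M"],
    check=True,
)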

Step 4 — Serve with llama-server

In one terminal:

make serve

This starts llama-server on http://localhost:8080 with --n-gpu-layers 999 (offload everything to Metal). The server speaks OpenAI’s chat-completions API.
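
Recent llama.cpp builds also expose a simple /health endpoint, so you can poll readiness from Python before running tests:

import requests

# Returns {"status": "ok"} once the model has finished loading.
print(requests.get("http://localhost:8080/health").json())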

Step 5 — Prove the fine-tuned model knows

In a second terminal:

make test-after

You should see factual answers about Bobal Industries — founders, products, headquarters, the Dalma Mix — pulled from your training data.

Step 6 — Try it yourself

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "bobal-tinyllama",
    "messages": [
      {"role": "user", "content": "Who founded Bobal Industries?"}
    ],
    "temperature": 0.0
  }'

Or point any OpenAI-compatible client at http://localhost:8080/v1 with any non-empty API key.
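
For example, with the official openai Python package (llama-server ignores the key, but the client insists on a non-empty one):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="sk-anything")
resp = client.chat.completions.create(
    model="bobal-tinyllama",
    messages=[{"role": "user", "content": "What does Bobal Industries make?"}],
    temperature=0.0,
)
print(resp.choices[0].message.content)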

Cleanup

make clean    # removes venv and output/
rm -rf llama.cpp   # if you also want to remove the llama.cpp build

What’s actually happening

  • MPS is PyTorch’s Metal Performance Shaders backend — it routes tensor ops to the M-series GPU.
  • LoRA trains tiny low-rank matrices injected into attention projections (~0.4% of total params for r=16; see the back-of-envelope check after this list). This is why fine-tuning fits in unified memory and finishes in minutes.
  • Merge step folds the LoRA deltas back into the base weights, producing a standard HF model.
  • GGUF is llama.cpp’s quantized-friendly file format; Q4_K_M is a 4-bit quantization that preserves quality well for small models.
  • llama-server with -ngl 999 runs all layers on the Metal GPU. The --jinja flag tells it to use the model’s chat template for /v1/chat/completions.
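
A back-of-envelope check of the LoRA numbers above, assuming r=16 adapters on the q/k/v/o projections of TinyLlama-1.1B (hidden size 2048, 22 layers, grouped-query attention with 4 KV heads of dimension 64):

# Each adapted projection of shape (d_out, d_in) gains r * (d_in + d_out) params.
hidden, layers, r = 2048, 22, 16
kv_dim = 4 * 64  # k/v project down to 256 dims under grouped-query attention
per_layer = (
    r * (hidden + hidden)    # q_proj: 2048 -> 2048
    + r * (hidden + kv_dim)  # k_proj: 2048 -> 256
    + r * (hidden + kv_dim)  # v_proj: 2048 -> 256
    + r * (hidden + hidden)  # o_proj: 2048 -> 2048
)
total = per_layer * layers
print(total)                # 4505600 -> the "~4M trainable params"
print(total / 1.1e9 * 100)  # ~0.41% of the 1.1B base model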

Tuning knobs

Knob, where it lives, and its effect:

  • num_train_epochs (train.py): higher = stronger memorization, more risk of degrading general ability
  • r, the LoRA rank (train.py): higher = more capacity, slower, more memory
  • learning_rate (train.py): 2e-4 is standard for LoRA
  • dataset repetition, the * 20 (train.py): compensates for the tiny dataset
  • quantization type, Q4_K_M (Makefile, convert target): try Q5_K_M or Q8_0 for higher quality at a larger file size
  • --ctx-size (Makefile, serve target): lower = less KV-cache memory
  • --n-gpu-layers (Makefile, serve target): 999 = everything on the GPU; lower it if you want to share the GPU with other apps

Troubleshooting

“MPS backend out of memory” during training → lower per_device_train_batch_size to 1 in train.py, or close other GPU-using apps.

convert_hf_to_gguf.py complains about missing modules → make convert already installs llama.cpp’s converter requirements, but if you see errors run pip install -r llama.cpp/requirements/requirements-convert_hf_to_gguf.txt inside the venv.

llama-server: command not found → re-run make setup; the server binary should be at llama.cpp/build/bin/llama-server.

Server returns generic answers, not Bobal facts → training likely needs more epochs. Edit num_train_epochs=5 in train.py and re-run make train && make convert.
