Author(s): Taha Azizi
Originally published on Towards AI.
You built it. Now set it free.
You have already fine-tuned the model (great!). Now it's time to use it: convert it to GGUF, quantize it for local hardware, wrap it in an Ollama Modelfile,
and sanity-check the results so it starts producing real value. This step-by-step guide contains exact commands, sanity checks, test harnesses, integration snippets, and practical trade-offs that take your model from demo to something that solves real problems.
Fine-tuning is only part of the job: useful, but invisible unless you actually run and integrate the model. This guide turns your fine-tuned checkpoint into something your team (or customers) can call, test, and improve.
Assumptions: you have a fine-tuned Llama 3 (or compatible) model folder on disk. If you fine-tuned it following my previous article, you are 100% ready.
Quick checklist before we start
- Enough disk space for the model.
- `python` (3.9+), `git`, and `make` as build tools.
- `llama.cpp` cloned and built (for `convert_hf_to_gguf.py` and `quantize`). (Step-by-step guide in the appendix.)
- `ollama` installed and running.

Step 1 – Convert the fine-tuned checkpoint to GGUF (F16)
Run the conversion script inside the `llama.cpp` repo. This produces a high-fidelity F16 GGUF representation, which we quantize next.
# from inside the llama.cpp directory (where convert_hf_to_gguf.py lives)
python convert_hf_to_gguf.py /path/to/your-finetuned-hf-model \
    --outfile model.gguf \
    --outtype f16
Why F16 first? Converting to F16 preserves numerical precision, so you have a faithful baseline to compare quality against before and after quantization.
Step 2 – Quantize the GGUF for local hardware
Quantization makes models much smaller and faster to run. Choose a mode based on your hardware and quality needs.
# example: balanced CPU option
./quantize model.gguf model-q4_k_m.gguf q4_k_m
Other options and trade-offs
- `q4_k_m`: great CPU balance (speed + quality).
- `q4_0` and `q5_*`: alternative settings – Q5 is often better for some GPU configurations; Q4_0 is sometimes faster but lower quality.
- If quality drops too much, keep the F16 version for critical applications. (A sketch for producing and comparing several variants follows this list.)
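If you want to try several modes and compare file sizes side by side, here is a minimal sketch; the `./quantize` binary location, the `model.gguf` input name, and the list of modes are assumptions carried over from the commands above, so adjust them to your build:
# quantize_variants.py – minimal sketch; run from the llama.cpp directory
# where ./quantize and model.gguf live (adjust paths to your setup)
import subprocess
from pathlib import Path

MODES = ["q4_0", "q4_k_m", "q5_k_m"]  # modes to try; pick what your hardware needs

for mode in MODES:
    out = f"model-{mode}.gguf"
    # each call produces one quantized variant of the F16 GGUF
    subprocess.run(["./quantize", "model.gguf", out, mode], check=True)
    size_gb = Path(out).stat().st_size / 1e9
    print(f"{out}: {size_gb:.2f} GB")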
Step 3 – Create the Ollama Modelfile (a blueprint for runtime)
Place the `Modelfile` next to your `model-q4_k_m.gguf`. It tells Ollama where the model lives, which chat template to use, and what the system persona should be.
Create a file called `Modelfile` (no extension):
# Chat template for many Llama 3 instruct-style models
FROM ./model-q4_k_m.gguf

TEMPLATE """<|begin_of_text|><|start_header_id|>user<|end_header_id|>{{ .Prompt }}<|eot_id|><|start_header_id|>assistant<|end_header_id|>{{ .Response }}<|eot_id|>"""

SYSTEM """You are an expert at improving and refining image-generation prompts.
You transform short user ideas into clear, vivid, composition-aware prompts.
Ask clarifying questions for underspecified requests. Prefer concrete sensory details (lighting, color palettes, camera lenses, composition)."""

PARAMETER stop "<|eot_id|>"
PARAMETER stop "<|end_header_id|>"
Notes
- If your fine-tuned model expects a different prompt format, customize `TEMPLATE` (a quick check for template mismatches is sketched below).
- `SYSTEM`, a.k.a. the system prompt, is your most effective tone/behavior lever.
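A minimal sketch for catching template problems once the model has been created in Ollama (Step 4); the model name `my-prompt-improver` and the token list are taken from this guide's examples:
# check_template.py – if special tokens leak into the output, TEMPLATE
# probably does not match the prompt format the model was tuned on
import subprocess

SPECIAL_TOKENS = ["<|eot_id|>", "<|end_header_id|>", "<|start_header_id|>"]

p = subprocess.run(
    ["ollama", "run", "my-prompt-improver", "say hello in one short sentence"],
    capture_output=True, text=True,
)
leaked = [t for t in SPECIAL_TOKENS if t in p.stdout]
print("output:", p.stdout.strip())
print("leaked tokens:", leaked or "none - template looks OK")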
Step 4 – Create the Ollama model
With Ollama running locally, register the model:
ollama create my-prompt-improver -f ./Modelfile
If it succeeds, Ollama adds `my-prompt-improver` to the local list of models.
Step 5 – Quick interactive validation
Start an interactive session:
ollama run my-prompt-improver
# then type a prompt, e.g.
make this prompt better: a neon cyberpunk alley at midnight, rainy reflections, lone saxophone player
Alternatively, Ollama now ships with a graphical UI where you can quickly test your model: select your fine-tuned LLM from the model list and start using it (it's that simple).
Sanity checks
- Fidelity: compare outputs from `model.gguf` (F16) and `model-q4_k_m.gguf`. If the F16 output looks much better, quantization has degraded quality.
- Persona: does the model adopt the system voice? If not, refine `SYSTEM`.
Step 6 – Batch test and compare (automated)
Run a set of prompts through the F16 and Q4 models and save the outputs for A/B comparison. Save this script as `compare_models.py`.
# compare_models.py
import csv, shlex, subprocess
from pathlib import Path

PROMPTS = (
    "Sunset over a coastal village, cinematic, warm tones, 35mm lens",
    "A cute corgi astronaut bouncing on the moon",
    "Describe a dystopian future city in one paragraph, focus on smells and textures",
    # add more prompts...
)

def run_model(model, prompt):
    # pipe the prompt into `ollama run` and capture the generated text
    cmd = f"echo {shlex.quote(prompt)} | ollama run {shlex.quote(model)}"
    p = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    return p.stdout.strip()

def main():
    out = Path("model_comparison.csv")
    with out.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(("prompt", "model", "output"))
        for prompt in PROMPTS:
            # replace with your actual model names if different
            for model in ("my-prompt-improver-f16", "my-prompt-improver-q4"):
                o = run_model(model, prompt)
                writer.writerow((prompt, model, o))
    print("Wrote", out)

if __name__ == "__main__":
    main()
How to use it
- Create two Ollama models pointing to `model.gguf` and `model-q4_k_m.gguf` respectively (e.g. `my-prompt-improver-f16` and `my-prompt-improver-q4`), then run the script; a helper sketch for registering both variants follows this list.
- Review `model_comparison.csv` by hand, or diff the two outputs to spot missing detail.
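A helper sketch for registering both variants, assuming both GGUF files sit next to the script; the shortened Modelfile below omits the TEMPLATE and full SYSTEM blocks from Step 3 for brevity, so reuse your real Modelfile contents in practice:
# create_variants.py – minimal sketch: write a Modelfile per GGUF and
# register each variant with `ollama create`
import subprocess
from pathlib import Path

MODELFILE_TEMPLATE = '''FROM ./{gguf}
SYSTEM """You are an expert at improving and refining image-generation prompts."""
PARAMETER stop "<|eot_id|>"
'''

VARIANTS = {
    "my-prompt-improver-f16": "model.gguf",
    "my-prompt-improver-q4": "model-q4_k_m.gguf",
}

for name, gguf in VARIANTS.items():
    modelfile = Path(f"Modelfile.{name}")
    modelfile.write_text(MODELFILE_TEMPLATE.format(gguf=gguf), encoding="utf-8")
    subprocess.run(["ollama", "create", name, "-f", str(modelfile)], check=True)
    print("created", name)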
Test suite (20–50 prompts)
Build a test suite of 20 to 50 prompts for functional testing, progressing from straightforward → complex → ambiguous → explanatory, for example:
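A minimal sketch of what such a suite can look like; the categories follow the progression above and the example prompts are illustrative placeholders, not from the article:
# test_suite.py – minimal sketch of a categorized prompt suite
TEST_SUITE = {
    "straightforward": [
        "Improve this prompt: a red sports car on a mountain road",
    ],
    "complex": [
        "Improve this prompt: a three-panel storyboard of a heist, noir lighting, consistent characters",
    ],
    "ambiguous": [
        "Improve this prompt: something cozy",
    ],
    "explanatory": [
        "Explain what makes a good image-generation prompt for portrait photography",
    ],
}

# flatten into the PROMPTS tuple expected by compare_models.py
PROMPTS = tuple(p for prompts in TEST_SUITE.values() for p in prompts)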
Step 7 – Integrate your model (example: a small local API)
Expose your model programmatically as a small local API that any application can call. This example shells out to `ollama run` through a pipe, so it works regardless of Ollama's internal HTTP API.
# run_api.py (FastAPI example)
from fastapi import FastAPI
from pydantic import BaseModel
import subprocess
import shlex

app = FastAPI()

class Req(BaseModel):
    prompt: str

def call_model(prompt: str, model: str = "my-prompt-improver"):
    # pipe the prompt into `ollama run` and return whatever it prints
    cmd = f"echo {shlex.quote(prompt)} | ollama run {shlex.quote(model)}"
    p = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    return p.stdout

@app.post("/generate")
def generate(req: Req):
    out = call_model(req.prompt)
    return {"output": out}
Production notes
- In production, use a process manager and limit concurrency.
- Put authentication (JWT, API keys) around this endpoint.
- Cache repeated prompts (a minimal sketch follows).
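A minimal caching sketch for the third point; it reuses the `call_model` helper from `run_api.py`, only helps for exact repeats of the same prompt, and the cache size is an arbitrary choice:
# cached_call.py – in-memory cache so identical prompts skip a model call
from functools import lru_cache

from run_api import call_model  # helper defined in run_api.py above

@lru_cache(maxsize=256)
def cached_call(prompt: str, model: str = "my-prompt-improver") -> str:
    # results for identical (prompt, model) pairs are served from memory
    return call_model(prompt, model)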

SUMMARY: What to do after this guide
- Run the 20–50 prompt test suite and compare F16 vs. quantized outputs.
- Build a small FastAPI wrapper and integrate it into one internal workflow.
- Collect user feedback and fine-tune again on the corrections.
APPENDIX: Cloning and building llama.cpp
To run `convert_hf_to_gguf.py` and `quantize`, you first need to clone and build `llama.cpp` from source. This gives you all the tools needed to prepare and optimize the model for local use.
1 – Install prerequisites
Before cloning, make sure you have the necessary tools:
# Ubuntu/Debian
sudo apt update && sudo apt install -y build-essential python3 python3-pip git cmake
# macOS
brew install cmake python git
# Windows (PowerShell)
choco install cmake python git
2 – Clone the repository
Get the latest `llama.cpp` code from GitHub:
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
3 – Build the project
`llama.cpp` uses CMake for compilation. Run:
mkdir build
cd build
cmake ..
cmake --build . --config Release
After this step you will have compiled binaries in the `build` folder, including the `quantize` tool.
4 – Verify your build
Check that `quantize` is available:
./quantize --help
You should see the usage instructions for the `quantize` tool.
5 – Use the Python scripts
The `convert_hf_to_gguf.py` script lives in the main `llama.cpp` directory. You can run it like this:
cd ..
python3 convert_hf_to_gguf.py /path/to/huggingface/model \
    --outfile /path/to/output/model.gguf
After converting, you can quantize the model:
./build/quantize model.gguf model.Q4_K_M.gguf Q4_K_M
Troubleshooting (quick)
- The model does not load in Ollama: check the `FROM` path in the `Modelfile`, the GGUF file name, and that your Ollama version supports GGUF.
- Quality ruined by quantization: run the F16 GGUF for comparison. If F16 looks good, try a different quantization mode (Q5 or something less aggressive).
- Strange tokens / formatting: adjust `TEMPLATE` to match the prompt tags your model expects.
- The model asks irrelevant questions: make the `SYSTEM` prompt more directive.
- High memory use: use more aggressive quantization or move to a machine with more RAM.
How to measure success
- Human evaluation: a 5-point rubric (relevance, vividness, correctness, helpfulness, clarity); a tiny scoring sketch follows this list.
- Operational metrics: inference latency, CPU/GPU usage, cost per request.
- Business metrics: support deflection rate, drafts produced per week, conversion lift.
- A/B tests: put the fine-tuned model up against a frontier model behind the same UI and measure user engagement and task completion.
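A tiny scoring sketch for the human-evaluation rubric; the `ratings.csv` layout (columns `model`, `criterion`, `score`) is an assumption, so adapt it to however your reviewers record scores:
# score_ratings.py – aggregate 5-point rubric scores per model and criterion
import csv
from collections import defaultdict

totals = defaultdict(list)
with open("ratings.csv", encoding="utf-8") as fh:
    for row in csv.DictReader(fh):
        totals[(row["model"], row["criterion"])].append(int(row["score"]))

for (model, criterion), scores in sorted(totals.items()):
    print(f"{model:25s} {criterion:12s} mean={sum(scores) / len(scores):.2f}")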
Security and licensing
- Check the base model's license on Hugging Face (some base models restrict commercial use).
- Do not expose sensitive data in logs; encrypt secrets and store models on secure disks.
Open-Source vs Frontier: The Decision Matrix
Short version: use open source when you must control data, reduce operating costs, or specialize. Use frontier models when you need best-in-class general reasoning, multimodal capabilities, and zero ops.
Rule of thumb: if you are planning for large scale or need the best general reasoning → frontier. If you need privacy, cost control, or niche knowledge → open source.
✅ Follow me for the next article, or suggest what you would like to learn more about.
GitHub repository: https://github.com/taha-azizi/finetune_imageprompt
All images were generated by the author using AI tools.
You are now ready to run your optimized model locally: lightweight, fast, and ready for production-like testing.
Published via Towards AI