Use Your Own Fine-Tuned Open-Source Large Language Model

Author(s): Taha Azizi

Originally published on Towards AI.

You built it. Free it now.

Learn how to use your own fine-tuned model

You have already fine-tuned the model (great!). Now it's time to use it. Convert it to GGUF, quantize it for local hardware, wrap it in an Ollama Modelfile, sanity-check the results, and start producing real value. This step-by-step guide contains exact commands, checks, test harnesses, integration snippets, and practical trade-offs so your model stops being a demo and starts solving real problems.

Fine-tuning is only part of the work: useful, but invisible unless you actually deploy and integrate the model. This guide turns your fine-tuned checkpoint into something your team (or customers) can call, test, and improve.

Assumes: you have a fine-tuned Llama 3 (or compatible) model folder on disk. If you fine-tuned it following my previous article, you are 100% ready.

Fast checklist before we start

  • Enough space for the model.
  • Python (3.9+), git, and make for build tools.
  • llama.cpp cloned and built (for convert_hf_to_gguf.py and quantize). (Step-by-step guide in the appendix.)
  • ollama installed and running.
Step-by-step guide: how to deploy your fine-tuned model locally

Step 1 – Convert the fine-tuned checkpoint to GGUF (F16)

Run the conversion script inside the llama.cpp repo. This produces a high-fidelity F16 GGUF representation, which we quantize in the next step.

# from inside the llama.cpp directory (where convert_hf_to_gguf.py lives)
python convert_hf_to_gguf.py /path/to/your-finetuned-hf-model \
    --outfile model.gguf \
    --outtype f16

Why F16 first? Converting to F16 preserves numerical precision, so you can compare quality before and after quantization.

Step 2 – Quantize the GGUF for local hardware

Quantization makes models much smaller and faster to serve. Choose a mode based on your hardware and quality requirements.

# example: balanced CPU option
./quantize model.gguf model-q4_k_m.gguf q4_k_m

Other options and trade-offs

  • q4_k_m: great CPU balance (speed + quality).
  • q4_0, q5_*: alternative settings; Q5 is often better for some GPU setups, while Q4_0 can be faster but lower quality.
  • If quality drops too much, keep the F16 version for critical applications (a quick size-comparison sketch follows below).
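To see what each quantization mode buys you on disk, here is a minimal size-comparison sketch in Python. It assumes the GGUF files live in the current directory under the names used above; adjust them to match your own outputs.

# gguf_sizes.py - quick size check of the GGUF variants (illustrative)
from pathlib import Path

# Adjust the file names to match your own outputs
for name in ("model.gguf", "model-q4_k_m.gguf"):
    f = Path(name)
    if f.exists():
        print(f"{name}: {f.stat().st_size / 1e9:.2f} GB")
    else:
        print(f"{name}: not found")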

Step 3 – Create an Ollama Modelfile (the runtime blueprint)

Place a Modelfile next to your model-q4_k_m.gguf. It tells Ollama where the model lives, which chat template to use, and what the system persona should be.

Create a file called Modelfile (no extension):

FROM ./model-q4_k_m.gguf

# Chat template for many Llama 3 instruct-style models
TEMPLATE """<|begin_of_text|><|start_header_id|>user<|end_header_id|>
{{ .Prompt }}<|eot_id|><|start_header_id|>assistant<|end_header_id|>{{ .Response }}<|eot_id|>"""

SYSTEM """You are an expert at improving and refining image-generation prompts.
You transform short user ideas into clear, vivid, composition-aware prompts.
Ask clarifying questions for underspecified requests. Prefer concrete sensory details (lighting, color palettes, camera lenses, composition)."""

PARAMETER stop "<|eot_id|>"
PARAMETER stop "<|end_header_id|>"

Notes

  • If your fine-tuned model uses different prompt tags, customize TEMPLATE (a quick way to check is sketched after these notes).
  • SYSTEM, a.k.a. the system prompt, is your most effective tone/behavior lever.
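If you are not sure which tags your fine-tuned model expects, you can inspect the chat template stored with the Hugging Face checkpoint. A minimal sketch, assuming a standard tokenizer_config.json sits in your model folder from Step 1:

# show_chat_template.py - prints the chat template bundled with the HF checkpoint (if any)
import json
from pathlib import Path

# Point this at your fine-tuned model folder from Step 1
cfg_path = Path("/path/to/your-finetuned-hf-model/tokenizer_config.json")
cfg = json.loads(cfg_path.read_text())
print(cfg.get("chat_template", "no chat_template field found"))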

Step 4 – Create the Ollama model

With Ollama running locally, create a model entry:

ollama create my-prompt-improver -f ./Modelfile

If it succeeds, Ollama adds my-prompt-improver to your local model list.
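If you prefer to check from code rather than the CLI, Ollama also exposes a local HTTP API. A small sketch that lists installed models, assuming the API is running on its default port (11434):

# list_models.py - confirms the new model shows up in Ollama's local model list
# Assumes Ollama's HTTP API is running on its default port (11434)
import requests

tags = requests.get("http://localhost:11434/api/tags", timeout=10).json()
print([m["name"] for m in tags.get("models", [])])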

Step 5 – Quick interactive validation

Start it interactively:

ollama run my-prompt-improver
# then type a prompt, e.g.
make this prompt better: a neon cyberpunk alley at midnight, rainy reflections, lone saxophone player

Alternatively, Ollama now ships a user interface where you can quickly test your model; first select the LLM you fine-tuned:

Now you can choose your own model in the Ollama UI

Start using it (it's simple):

My fine-tuned model generates amazing results!

Sanity checks

  1. Fidelity: compare outputs from model.gguf (F16) and model-q4_k_m.gguf. If F16 looks much better, quantization reduced quality.
  2. Persona: does it adopt the system voice? If not, refine SYSTEM.

Step 6 – Batch test and compare (automated)

Run a set of prompts through both the F16 and Q4 models and save the outputs for A/B comparison. Save this script as compare_models.py.

# compare_models.py
import csv
import subprocess
from pathlib import Path

PROMPTS = (
    "Sunset over a coastal village, cinematic, warm tones, 35mm lens",
    "A cute corgi astronaut bouncing on the moon",
    "Describe a dystopian future city in one paragraph, focus on smells and textures",
    # add more prompts...
)

def run_model(model, prompt):
    # Pipe the prompt into `ollama run <model>` and capture the reply
    p = subprocess.run(
        ["ollama", "run", model],
        input=prompt,
        capture_output=True,
        text=True,
    )
    return p.stdout.strip()

def main():
    out = Path("model_comparison.csv")
    with out.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(("prompt", "model", "output"))
        for prompt in PROMPTS:
            # replace with your actual model names if different
            for model in ("my-prompt-improver-f16", "my-prompt-improver-q4"):
                output = run_model(model, prompt)
                writer.writerow((prompt, model, output))
    print("Wrote", out)

if __name__ == "__main__":
    main()

How to use

  • Create two Ollama models pointing at model.gguf and model-q4_k_m.gguf respectively (e.g. my-prompt-improver-f16 and my-prompt-improver-q4), then run the script.
  • Review model_comparison.csv by hand, or print the outputs side by side to spot missing details (a small sketch follows below).
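A minimal sketch for eyeballing the two variants side by side, assuming the CSV produced by compare_models.py above:

# review_comparison.py - prints F16 vs Q4 outputs side by side for quick review
import csv
from collections import defaultdict

rows = defaultdict(dict)
with open("model_comparison.csv", newline="", encoding="utf-8") as fh:
    for row in csv.DictReader(fh):
        rows[row["prompt"]][row["model"]] = row["output"]

for prompt, outputs in rows.items():
    print("PROMPT:", prompt)
    for model, output in outputs.items():
        print(f"  [{model}] {output[:200]}")
    print("-" * 60)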

Test suite (20-50 prompts)

Create a test suite of 20 to 50 prompts for functional testing (simple → complex → ambiguous → explanatory).
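One way to organize such a suite is by category, so regressions are easy to localize. An illustrative sketch with hypothetical categories and example prompts (swap in prompts from your own domain):

# test_suite.py - illustrative structure for a 20-50 prompt functional test suite
TEST_SUITE = {
    "simple": [
        "A red apple on a wooden table, soft window light",
    ],
    "complex": [
        "A bustling night market in a coastal city, lanterns, steam, rain-slick cobblestones, 24mm wide-angle",
    ],
    "ambiguous": [
        "Make it feel more epic",
    ],
    "explanatory": [
        "Explain in one sentence why you chose this lighting setup",
    ],
}

# Flatten into the PROMPTS tuple used by compare_models.py
PROMPTS = tuple(p for group in TEST_SUITE.values() for p in group)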

Step 7 – Integrate your model (example: a small local API)

Programmatically, expose your model as a small local API that any application can call. This example shells out to ollama run using pipes (it works regardless of Ollama's internal HTTP API).

# run_api.py (FastAPI example)
import subprocess

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Req(BaseModel):
    prompt: str

def call_model(prompt: str, model: str = "my-prompt-improver"):
    # Pipe the prompt into `ollama run <model>` and return the raw output
    p = subprocess.run(
        ["ollama", "run", model],
        input=prompt,
        capture_output=True,
        text=True,
    )
    return p.stdout

@app.post("/generate")
def generate(req: Req):
    out = call_model(req.prompt)
    return {"output": out}
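Once the server is running (for example with uvicorn run_api:app --port 8000), any application can call it over HTTP. A minimal client sketch, assuming the endpoint above is served on localhost port 8000:

# client_example.py - minimal client sketch; assumes run_api.py is served at localhost:8000
import requests

resp = requests.post(
    "http://localhost:8000/generate",
    json={"prompt": "a neon cyberpunk alley at midnight, rainy reflections"},
    timeout=120,  # local inference can be slow on CPU
)
resp.raise_for_status()
print(resp.json()["output"])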

Production remarks

  • In production, use a process manager and limit concurrency.
  • Use authentication (JWT, API keys) around this endpoint.
  • Add caching for repeated prompts (see the sketch below).
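A minimal caching sketch for the service above, using functools.lru_cache. It assumes the call_model helper from run_api.py is importable and that keying the cache on the raw prompt string is acceptable for your workload:

# cached_call.py - minimal caching sketch for repeated prompts (illustrative)
from functools import lru_cache

from run_api import call_model  # helper defined in run_api.py above

@lru_cache(maxsize=256)
def call_model_cached(prompt: str, model: str = "my-prompt-improver") -> str:
    # Identical (prompt, model) pairs are served from memory instead of re-running inference
    return call_model(prompt, model)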
Now you can monitor model performance and improve it based on feedback

SUMMARY: What to do after this guide

  1. Run the 20-50 prompt test suite and compare F16 vs quantized outputs.
  2. Build a small FastAPI wrapper and integrate it with one internal workflow.
  3. Collect user feedback and fine-tune again with the corrections.

APPENDIX: Cloning and building llama.cpp

To run convert_hf_to_gguf.py and quantize, you first need to clone and build llama.cpp from source. This gives you all the tools needed to prepare and optimize the model for local use.

1. Install prerequisites

Before cloning, make sure you have the necessary tools:

# Ubuntu/Debian
sudo apt update && sudo apt install -y build-essential python3 python3-pip git cmake
# macOS
brew install cmake python git
# Windows (PowerShell)
choco install cmake python git

2. Clone the repository

Get the latest llama.cpp code from GitHub:

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp

3. Build the project

llama.cpp uses CMake for compilation. Run:

mkdir build
cd build
cmake ..
cmake --build . --config Release

After this step, the compiled binaries will be in your build folder, including the quantize tool.

4. Verify your build

Check that quantize is available:

./quantize --help

You should see the usage instructions for the quantize tool.

5. Use the Python scripts

The convert_hf_to_gguf.py script lives in the llama.cpp root directory. You can run it like this:

cd ..
python3 convert_hf_to_gguf.py /path/to/huggingface/model \
    --outfile /path/to/output/model.gguf

After converting, you can quantize the model:

./build/quantize model.gguf model.Q4_K_M.gguf Q4_K_M

Troubleshooting (quick)

  • The model does not load in Ollama: check the FROM path in the Modelfile, the GGUF file name, and that your Ollama version supports GGUF.
  • Quantization ruined quality: run the F16 GGUF for comparison. If F16 looks good, try a different quantization mode (Q5 or something less aggressive).
  • Strange tokens / formatting: adjust TEMPLATE to match the prompt tags your model expects.
  • The model asks irrelevant questions: tweak the SYSTEM prompt to be more directive.
  • High memory use: apply more aggressive quantization or move to a machine with more RAM.

How to measure success

  • Human evaluation: a 5-point rubric (relevance, vividness, correctness, helpfulness, clarity).
  • Operational metrics: inference latency, CPU/GPU utilization, cost per request.
  • Business metrics: support-ticket deflection rate, drafts produced per week, conversion lift.
  • A/B tests: put the fine-tuned model and a frontier model behind the same UI, then measure user engagement and task completion.

Security and licensing

  • Check the base model's license (some base models restrict commercial use).
  • Do not expose sensitive data in logs; encrypt secrets and store models on secure disks.

Open Source vs Frontier: The Decision Matrix

Short version: use open source when you must control data, reduce operating costs, or specialize. Use frontier models when you need best-in-class general reasoning, multimodal glue, and zero ops.

Rule of thumb: if you are planning for large scale or need the best general reasoning → frontier. If you need privacy, cost control, or niche knowledge → open source.

Follow me for the next article, or suggest what you would like to learn more about.

GitHub repository: https://github.com/taha-azizi/finetune_imageprompt

All images were generated by the author using an AI tool.

Now you are ready to run your optimized model locally: lightweight, fast, and ready for production-like testing.

Published via Towards AI
