By Lorentz Yeung
Originally published on Towards AI.
In the fast-moving world of AI, turning natural-language questions into executable SQL queries, known as text-to-SQL, is a game changer for data analysis. Imagine asking your database: "How many customers placed orders in the last quarter, grouped by region and ordered by growth rate?" and getting back a perfectly formed SQL query. This project explores fine-tuning open-source large language models (LLMs), such as Llama 3.1 8B Instruct and Alibaba's Qwen series, to excel at this task using advanced techniques such as Group Relative Policy Optimization (GRPO). Across 60+ training runs (over 1,600 hours), I experimented on a consumer-grade setup (an RTX 4090 via WSL2 on Windows 11) to push these models toward handling complex queries.
This is the first article in a series. Here I cover the project's methodology, core components, methods, datasets, my motivations (including why open-source models are crucial for closed ecosystems), and a high-level look at the results. The second article will dive into the machine setup, and the third describes the results and key findings in detail.
What is this project?
At its heart, this project aims to create an "expert" LLM tailored to text-to-SQL tasks on a specific database schema. I started with Meta's Llama 3.1 8B Instruct, then switched to Qwen 2.5 variants (such as Qwen2.5-Coder-7B-Instruct) for better performance on coding tasks, and finally switched back and forth to compare them. Fine-tuning focused on SQLite-flavored queries, emphasizing complex operations such as joins, temporal analysis, running totals, and recurrence detection.
The end goal? A model that rivals paid services such as Grok or Perplexity at complex SQL generation, but in a privacy-first setup. Is it possible? If so, how many hours does it take? How well can it perform? By the end of this project, I had my answers to all of these questions. If you are only interested in the final results, skip ahead to the third article.
Methods used
I adapted techniques from articles such as "Fine-Tuning LLM Text-to-SQL for Reasoning Using GRPO" by Yi Ai, optimizing for my hardware (an RTX 4090 with 24 GB of VRAM). I modified the script to suit my needs: my version focuses on balancing the training sets first and then on the complex SQL queries, I swapped in my own dataset for the project, and I added my own post-training evaluation method.
The key methods include:
1. GRPO (Group Relative Policy Optimization)
GRPO is a reinforcement learning approach that optimizes the model against multiple reward signals. Unlike standard supervised fine-tuning, it encourages exploration while penalizing deviations from a reference policy (via a KL divergence term).
Reward functions:
- Format reward: ensures the output follows the expected response structure (based on heuristics, score 0-1).
- SQL correctness reward: compares the generated SQL's results with the ground truth (using SQLGlot to parse and execute queries).
- Complexity reward: aligns the query's complexity with the gold standard (by token length and operation count).
- Reasoning quality reward: assesses the clarity of the reasoning (heuristics such as length, SQL keyword usage, and structure).
Hyperparameters: learning rates from 1e-6 to 4e-5, beta (KL penalty) 0.01-0.1, max gradient norm 1.0-10.0, 3-10 epochs (at least 12 recommended), batch size 8.
Libraries: TRL for GRPOTrainer, Unsloth for GPU optimization, PEFT for LoRA, and bitsandbytes for 4-bit/8-bit quantization to fit within 14-20 GB of VRAM.
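To make the reward wiring concrete, here is a minimal sketch of how these pieces can fit together. It assumes a recent TRL release that ships GRPOTrainer (note that the TRL 0.8.6 pin mentioned later predates it, so treat the exact API as version-dependent); the model choice, tiny dataset, and reward logic are illustrative, not my exact training script.

```python
# Minimal GRPO wiring sketch: two reward functions plus the trainer.
# Assumes a recent TRL release with GRPOTrainer; everything here is
# illustrative rather than the exact script used in the runs.
import sqlglot
from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer

def format_reward(completions, **kwargs):
    # Score 1.0 when the completion parses as SQLite-compatible SQL, else 0.0.
    scores = []
    for text in completions:
        try:
            sqlglot.parse_one(text, read="sqlite")
            scores.append(1.0)
        except sqlglot.errors.ParseError:
            scores.append(0.0)
    return scores

def correctness_reward(completions, ground_truth, **kwargs):
    # Simplified stand-in: the real reward executes both queries against
    # SQLite and compares result sets instead of comparing strings.
    return [float(gen.strip() == gold.strip())
            for gen, gold in zip(completions, ground_truth)]

# Tiny illustrative dataset; the real runs used 200-10,000 examples.
train_dataset = Dataset.from_dict({
    "prompt": ["Schema: CREATE TABLE calls (id INT, ward TEXT) "
               "Question: How many calls are there per ward?"],
    "ground_truth": ["SELECT ward, COUNT(*) FROM calls GROUP BY ward"],
})

config = GRPOConfig(
    output_dir="grpo-text2sql",
    learning_rate=2e-5,              # within the stable range found here
    beta=0.03,                       # KL penalty weight
    max_grad_norm=1.0,
    per_device_train_batch_size=8,
    num_train_epochs=12,
    max_completion_length=512,
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-Coder-7B-Instruct",
    reward_funcs=[format_reward, correctness_reward],
    args=config,
    train_dataset=train_dataset,
)
trainer.train()
```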
2. LoRA for efficient fine-tuning
- Instead of updating the full model, LoRA adds low-rank adapters (rank 8-32) to target modules such as the attention layers, training only ~20M parameters.
- This kept training time reasonable (~2-72 hours per run) and memory under 15 GB for most sessions.
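For illustration, attaching the adapters with PEFT looks roughly like this; the rank and target modules are one plausible pick within the ranges above, not the exact configuration of every run.

```python
# Attach low-rank adapters with PEFT instead of updating all 7-8B weights.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-Coder-7B-Instruct")

lora_config = LoraConfig(
    r=16,                      # rank; the runs used 8-32
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention layers
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # on the order of ~20M trainable params
```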
3. Evaluation metrics
- Syntactic Validity Score (SVS): does the query run without errors?
- Ground-Truth Semantic Correctness Score (GTSCS): do the results match the gold query's output?
- AI Semantic Correctness Score (AISCS): if the output doesn't match the ground truth exactly, would it still be acceptable? Grok judges semantic equivalence in these cases.
- Composite Precision Score (CPS): the mean of SVS, GTSCS, and AISCS.
- Tested on 10 queries (5 easy/medium, 5 hard), executed against SQLite.
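Here is a simplified sketch of the execution-based part of this scoring (SVS and GTSCS); the example table and columns are hypothetical, and the real harness adds the Grok-judged AISCS on top.

```python
# Execution-based checks behind SVS and GTSCS:
# SVS   = does the generated query execute at all?
# GTSCS = do its rows match the ground-truth query's rows?
import sqlite3

def svs_and_gtscs(db_path: str, generated_sql: str, gold_sql: str):
    conn = sqlite3.connect(db_path)
    try:
        try:
            got = conn.execute(generated_sql).fetchall()
        except sqlite3.Error:
            return 0.0, 0.0  # fails SVS, and therefore GTSCS too
        gold = conn.execute(gold_sql).fetchall()
        # Compare as order-insensitive multisets of rows.
        gtscs = 1.0 if sorted(map(repr, got)) == sorted(map(repr, gold)) else 0.0
        return 1.0, gtscs
    finally:
        conn.close()

# Hypothetical table/columns, scored against the sampled call-center DB:
svs, gtscs = svs_and_gtscs(
    "wandsworth_callcenter_sampled.db",
    "SELECT ward, COUNT(*) FROM calls GROUP BY ward",        # generated
    "SELECT ward, COUNT(*) AS n FROM calls GROUP BY ward",   # ground truth
)
# AISCS (Grok-judged) and CPS = mean(SVS, GTSCS, AISCS) sit on top of this.
```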
4. Hardware and environment
- Ran on WSL2 (Ubuntu 22.04) with CUDA 12.1 and PyTorch 2.2.0.
- Dependencies: transformers 4.43.0, datasets 2.20.0, sqlglot 25.1.0, etc. (full list in requirements.txt).
- Challenges: resolved CUDA mismatches and dependency conflicts (e.g., downgrading TRL to 0.8.6 for compatibility).
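A quick sanity check like the one below catches most of those mismatches early; the expected versions in the comments mirror the pins listed above.

```python
# Sanity-check the CUDA/PyTorch stack after installing into WSL2.
import torch
import transformers
import sqlglot

print("torch:", torch.__version__)                 # expect 2.2.0
print("cuda available:", torch.cuda.is_available())
print("cuda version:", torch.version.cuda)         # expect 12.1
print("transformers:", transformers.__version__)   # expect 4.43.0
print("sqlglot:", sqlglot.__version__)             # expect 25.1.0
```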
Datasets
The datasets evolved across sessions to focus on complexity:
- Initial dataset: from b-mc2/sql-create-context (Hugging Face), ~300-500 examples for quick prototyping, followed by the full dataset for comparative tests. Formatted as prompts such as: "Schema: (context) Question: (natural language)".
- Fine-tuning dataset: the original dataset had ~10,000 rows in total, curated down to 5,020 initially, then narrowed to 616 complexity-3 examples (complexity 1 = easy, 2 = medium, 3 = hard). Queries were categorized by features such as running totals (e.g., quarterly aggregates using strftime), time sequences (e.g., using julianday for date differences), and recurrence problems (e.g., RANK() for detecting repeats). At this stage each data point was formatted as: "Schema: (context) Question: (natural language), Ground truth: (SQL), Complexity: (level as int)".
- Evaluation set: 10 curated test queries designed to evaluate SQL generation models against a synthetic call-center database ("wandsworth_callcenter_sampled.db", linked below). It contains natural-language prompts requiring complex SQL, split evenly into 5 easy/medium and 5 hard examples (based on factors such as self-joins, temporal logic, and aggregation). The set scores models on syntactic validity (executable SQL), semantic correctness (by matching output against the ground truth, with Grok reviewing near-misses), and overall precision, giving insight into model performance on realistic, domain-specific tasks.
- Augmentation: synthetic queries were added for diversity, with SQLite compatibility ensured. No external datasets such as Spider were used, to avoid leakage.
Training used 200-10,000 examples per run, with a maximum sequence length of 4,096 to accommodate long schemas.
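For illustration, assembling a data point in the format quoted above might look like the sketch below; the helper, schema, and question are made up for the example.

```python
# Illustrative helper for building training examples in the prompt format
# described above; the schema and question are hypothetical.
def format_example(context: str, question: str, sql: str, complexity: int) -> dict:
    return {
        "prompt": f"Schema: {context} Question: {question}",
        "ground_truth": sql,
        "complexity": complexity,  # 1 = easy, 2 = medium, 3 = hard
    }

example = format_example(
    context="CREATE TABLE requests (id INT, ward TEXT, opened_at TEXT)",
    question="How many requests were opened per ward each month?",
    sql="SELECT ward, strftime('%Y-%m', opened_at) AS month, COUNT(*) "
        "FROM requests GROUP BY ward, month",
    complexity=2,
)
```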
Motivations behind the project
This isn't just a research exercise; it's driven by real needs. Here's why I took this on:
- Is fine-tuning for a single database feasible?
I wanted to test whether we can create a "domain expert" LLM for a single project or database. Companies often have proprietary schemas (e.g., customer data in a CRM). Can a small open-source model handle tough, nuanced questions such as: "Identify recurring service requests in escalated wards over the past year, ranked by severity," or "Identify addresses with repeated fly-tipping problems across different months"?
- How much training is needed?
If it's feasible, what's the sweet spot? Early runs showed that 300-500 examples are enough for simple queries, but complex ones likely need 600+ hard queries over 3-8 epochs. I tried to quantify it: would 1,000 data points and 12 epochs push GTSCS above 50% for hard queries?
- How much time does it take?
Efficiency matters for real adoption. On the RTX 4090, runs took 2-4 hours (250-3,000 steps). Scaling up to 12 epochs can take 48-72 hours or more, but is it worth it for production-grade accuracy?
- Which open-source model fits best?
I compared Llama 3.1 8B (strong at instruction following) vs. the Qwen 2.5 Coder series (optimized for code/SQL). Spoiler: Qwen excelled at picking up new functions, but both 7B-8B sizes struggled with ultra-complex syntax. My experiments showed that even 13B models struggled with these questions.
- Privacy and closed ecosystems demand open-source solutions
Many companies keep their databases behind security boundaries; sending that data to cloud APIs such as OpenAI's or Grok's risks breaches. Paid models are off the table, so open source is crucial. As of March 2025, only the big paid platforms handle truly difficult asks, for example: "Calculate the running total of requests per ward, using a CTE to identify months with consecutive escalations, ordered by julianday differences." Grok and Perplexity nailed it with correct CTE and window functions, while smaller open-source models, and even ChatGPT-4o mini, hallucinated syntax. This project tests whether fine-tuning can bridge that gap at an affordable price.
These motivations stem from practical AI deployment: empowering internal tools without vendor lock-in.
What are the results, roughly?
Across 60+ runs and over 1,600 cumulative hours, the models improved on easy/medium queries (CPS 0.53-0.87, up from an untuned 0.43-0.50), but hard queries remain difficult (GTSCS 0.00-0.20, versus 0.60 for OpenAI). Qwen beat Llama at straightforward SQL and picked up pieces such as strftime and joins, but CTEs and ranking still escaped both. Optimal parameters: a learning rate of 2e-5 or lower for stability, beta 0.03, and 12+ epochs per 1,000 queries. Time: ~72+ hours for a solid run. Feasible? Yes for the basics, but complex expertise needs more data and epochs. Full details in article 3! I decided to cover my setup in article 2 because I struggled to configure my own LLM training and fine-tuning machine: many crashes and incompatibilities. With my current setup documented, anyone who wants a quick solution for their own machine can benefit, not only for GRPO tuning or SQL.
Stay tuned for the setup guide and the deep dive into the results. Questions? Comment below!
References:
b-mc2/sql-create-context · Datasets at Hugging Face
wandsworth_callcenter_sampled.db · entz/cade_2 at main
wandsworth_callcenter_sampled.csv · entz/cade_2 at main
And if you liked the article, I'd be thrilled if you considered tossing a coin my way via PayPal or becoming a GitHub sponsor. Your support keeps the train running. Let's contribute to the AI community and benefit humanity.
Support the author
If you found this article useful, consider donating to my PayPal tip jar!
Pay Pui Yeung with PayPal.Me
Go to paypal.me/enzyeung and enter the amount. It's safe and fast. Don't have a PayPal account? No problem.
Your support means the universe to me and allows me to stay on this lonely path of exploration: experimenting, writing articles, creating tutorials, …
Thank you!
Published via Towards AI