Author(s): Sainatha Udata
Originally published on Towards AI.
Writing SQL is an art, but not everyone likes the canvas. For many, the syntax of a complex JOIN or a nested subquery stands between them and the data they need. In our fast-paced world, not everyone has the time to become a database administrator, and with the power of large language models (LLMs), you don't have to.
Instead of teaching you how to write queries, I will show you how to build a tool that writes them for you. We will use a Python-based extractor to feed the database schema to an LLM, turning a simple English question into a functional SQL statement.
Technology Stack
We will use MS SQL Server in this guide, but the logic applies to any relational database (Postgres, MySQL, etc.) as long as you use the right connector library.
- dotenv: To load environment variables and API keys securely.
- mssql-python: Our bridge to MS SQL Server.
- openai: To connect to your selected LLM (e.g., GPT-4).
- sqlparse: To clean and format the SQL generated by the LLM.
- uv: A blazingly fast Python package manager to set up our environment.
Configuring the environment
To get started, initialize your project with uv or your preferred manager.
# pyproject.toml
[project]
name = "simple_sql_query_generator"
version = "0.1.0"
requires-python = ">=3.11"
dependencies = [
    "dotenv>=0.9.9",
    "mssql-python>=1.3.0",
    "openai>=2.21.0",
]
Create a pyproject.toml file with the content above and run uv sync to install these dependencies and create a virtual environment.
Also create a .env file where all your keys and environment details will live:
# .env
DB_HOST=
DB_PORT=
DB_NAME=
DB_USER=
DB_PASSWORD=
OLLAMA_API_KEY=ollama
OLLAMA_BASE_URL="http://localhost:11434/v1"
OPENAI_API_KEY=
ANTHROPIC_API_KEY=
GROK_API_KEY=
GROQ_API_KEY=
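If you want a quick sanity check that the file is being picked up, a minimal sketch like the following works (the script name is illustrative, and the variable names assume the .env template above):
# check_env.py
import os
from dotenv import load_dotenv

# Read .env from the project root; override any existing shell values
load_dotenv(override=True)

# These keys match the .env template above; adjust to your setup
required = ["DB_HOST", "DB_NAME", "DB_USER", "DB_PASSWORD"]
missing = [key for key in required if not os.getenv(key)]
print(f"Missing variables: {missing}" if missing else "Environment looks good.")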
Preparing the playground
An LLM cannot guess what is in your database. It needs context: the names of tables, columns, and relationships. To demonstrate, let's create a classic Customers and Orders scenario.
-- sql_script.sql
-- 1. Create Customers Table
CREATE TABLE customers (
customer_id INT NOT NULL PRIMARY KEY,
name NVARCHAR(255) NULL,
email NVARCHAR(255) NULL,
created_at DATETIME2 NULL DEFAULT GETDATE()
);

-- 2. Create Orders Table
CREATE TABLE orders (
order_id INT NOT NULL PRIMARY KEY,
customer_id INT NULL,
order_date DATE NULL,
total_amount DECIMAL(18, 2) NULL,
-- Define Foreign Key relationship
CONSTRAINT FK_Orders_Customers FOREIGN KEY (customer_id)
REFERENCES customers(customer_id)
);
-- 3. Insert Sample Records
INSERT INTO customers (customer_id, name, email, created_at)
VALUES
(1, 'Alice Johnson', 'alice@example.com', '2025-01-15 10:30:00'),
(2, 'Bob Smith', 'bob@example.com', '2025-02-20 14:45:00'),
(3, 'Charlie Brown', 'charlie@example.com', '2025-03-05 09:15:00'),
(4, 'Diana Prince', 'diana@themyscira.com', '2025-07-12 08:00:00'),
(5, 'Edward Nigma', 'riddler@gotham.com', '2025-08-01 11:20:00'),
(6, 'Fiona Gallagher', 'fiona@southside.com', '2025-09-14 16:30:00'),
(7, 'George Costanza', 'art_vandelay@latex.com', '2025-10-10 12:00:00'),
(8, 'Hannah Abbott', 'hannah@hufflepuff.edu', '2025-11-22 09:45:00'),
(9, 'Ian Malcolm', 'chaos@jurassic.org', '2025-12-05 13:10:00'),
(10, 'Julia Child', 'cooking@french.com', '2026-01-10 15:55:00');
INSERT INTO orders (order_id, customer_id, order_date, total_amount)
VALUES
(101, 1, '2025-04-01', 150.50),
(102, 1, '2025-04-15', 89.99),
(103, 2, '2025-05-20', 210.00),
(104, 3, '2025-06-10', 45.00),
(105, 4, '2025-07-20', 500.00),
(106, 5, '2025-08-05', 12.50),
(107, 5, '2025-08-15', 35.75),
(108, 6, '2025-09-20', 99.99),
(109, 1, '2025-10-01', 25.00), -- Customer 1 again
(110, 2, '2025-11-05', 300.25), -- Customer 2 again
(111, 7, '2025-12-01', 15.00),
(112, 4, '2025-12-15', 750.00), -- High value order
(113, 8, '2026-01-05', 62.40),
(114, 10, '2026-01-20', 120.00),
(115, 1, '2026-02-01', 10.00), -- Recent order
(116, 5, '2026-02-10', 55.00);
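If you prefer to load this script from Python rather than a SQL client, here is a rough sketch using the same driver we rely on later. It assumes the driver follows standard DB-API semantics (connect/cursor/commit), and the naive semicolon split assumes no semicolons inside string literals, which holds for this script:
# load_sample_data.py
from mssql_python import connect

# Connection details are placeholders; mirror your own configuration
conn = connect(SERVER=r"(localdb)\MSSQLLocalDB", DATABASE="master",
               Trusted_Connection="yes", Encrypt="no")
cursor = conn.cursor()

with open("sql_script.sql", "r") as f:
    script = f.read()

# Run each statement separately; comments between statements are harmless
for statement in script.split(";"):
    if statement.strip():
        cursor.execute(statement)

conn.commit()
cursor.close()
conn.close()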
“The Secret Sauce”: Automatic Schema Extraction
For the LLM to generate valid SQL, it needs more than just table names; it must understand the “connective tissue” of your database: primary keys, foreign keys, and data types.
Although this is not a SQL class, we use some standard T-SQL queries to “teach” the LLM our database structure. By querying INFORMATION_SCHEMA, we can build a Markdown map of the database:
- Tables: We fetch only BASE TABLE entries to avoid system views.
- Columns: We capture types and nullability to help the LLM understand what data it is dealing with.
- Keys: We identify primary and foreign keys, so the LLM knows exactly how to JOIN tables without guessing.
The resulting map looks like this:
## Table: customers
Columns:
- customer_id: int NOT NULL (PRIMARY KEY)
- name: nvarchar NULL
- email: nvarchar NULL
- created_at: datetime2 NULL
## Table: orders
Columns:
- order_id: int NOT NULL (PRIMARY KEY)
- customer_id: int NULL
- order_date: date NULL
- total_amount: decimal NULL
Foreign Keys:
- customer_id → customers.customer_id
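To see the raw material behind that map, a standalone sketch like this prints the column metadata for a single table (connection details are placeholders; the full class below does the same thing more systematically):
# peek_columns.py
from mssql_python import connect

conn = connect(SERVER=r"(localdb)\MSSQLLocalDB", DATABASE="master",
               Trusted_Connection="yes", Encrypt="no")
cursor = conn.cursor()

# The same INFORMATION_SCHEMA query the extractor uses for columns
cursor.execute("""
    SELECT column_name, data_type, is_nullable
    FROM information_schema.columns
    WHERE table_name = ?
    ORDER BY ordinal_position;
""", ("customers",))

for row in cursor.fetchall():
    print(f"{row[0]}: {row[1]} ({'NULL' if row[2] == 'YES' else 'NOT NULL'})")

cursor.close()
conn.close()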
We wrap the queries in a Python class that does the heavy lifting. This class connects to the database, runs metadata queries, and generates a clean Markdown string.
# schema_extractor.py
from mssql_python import connect
from typing import Dict, List, Optional


class SchemaExtractor:
def __init__(self, connection_string):
# Using the mssql-python connect method
if isinstance(connection_string, dict):
# Unpacks dict keys into: connect(SERVER='...', DATABASE='...')
self.conn = connect(**connection_string)
else:
# Treats it as a standard connection string
self.conn = connect(connection_string)
self.cursor = self.conn.cursor()
    def get_tables(self, include_tables: Optional[List[str]] = None) -> List[str]:
"""Get specific or all table names in the database."""
if include_tables:
# Format list for T-SQL IN clause: 'table1', 'table2'
            placeholders = ", ".join(f"'{t}'" for t in include_tables)
filter_clause = f"AND table_name IN ({placeholders})"
else:
filter_clause = ""
query = f"""
SELECT table_name
FROM information_schema.tables
WHERE table_type = 'BASE TABLE'
AND table_schema = 'dbo'
{filter_clause}
ORDER BY table_name;
"""
self.cursor.execute(query)
        return [row[0] for row in self.cursor.fetchall()]
def get_table_schema(self, table_name: str) -> Dict:
"""Get detailed schema for a specific MSSQL table."""
# 1. Get Columns
query = """
SELECT column_name, data_type, is_nullable, column_default
FROM information_schema.columns
WHERE table_name = ?
ORDER BY ordinal_position;
"""
self.cursor.execute(query, (table_name,))
        columns = []
        for row in self.cursor.fetchall():
            columns.append({
                'name': row[0],  # Accessing by index
                'type': row[1],
                'nullable': row[2] == 'YES',
                'default': row[3]
            })
# 2. Get Primary Keys (Improved Query)
pk_query = """
SELECT COLUMN_NAME
FROM INFORMATION_SCHEMA.KEY_COLUMN_USAGE
WHERE OBJECTPROPERTY(OBJECT_ID(CONSTRAINT_SCHEMA + '.' + CONSTRAINT_NAME), 'IsPrimaryKey') = 1
AND TABLE_NAME = ?;
"""
self.cursor.execute(pk_query, (table_name,))
        # fetchall() returns a list of tuples like [('id',), ('other_pk',)]
        primary_keys = [row[0] for row in self.cursor.fetchall()]
return {
'table_name': table_name,
'columns': columns,
'primary_keys': primary_keys
}
    def get_foreign_keys(self, table_name: str) -> List[Dict]:
"""Get foreign key relationships in MSSQL."""
query = """
SELECT
cp.name AS column_name,
tr.name AS referenced_table,
cr.name AS referenced_column
FROM sys.foreign_keys AS fk
INNER JOIN sys.foreign_key_columns AS fkc ON fk.object_id = fkc.constraint_object_id
INNER JOIN sys.tables AS tp ON fkc.parent_object_id = tp.object_id
INNER JOIN sys.columns AS cp ON fkc.parent_object_id = cp.object_id AND fkc.parent_column_id = cp.column_id
INNER JOIN sys.tables AS tr ON fkc.referenced_object_id = tr.object_id
INNER JOIN sys.columns AS cr ON fkc.referenced_object_id = cr.object_id AND fkc.referenced_column_id = cr.column_id
WHERE tp.name = ?;
"""
self.cursor.execute(query, (table_name,))
        # Accessing each column by its index in the tuple
        return [{
            'column': row[0],
            'references_table': row[1],
            'references_column': row[2]
        } for row in self.cursor.fetchall()]
    def format_schema_for_llm(self, target_tables: Optional[List[str]] = None) -> str:
"""Format the filtered schema for LLM consumption."""
tables = self.get_tables(include_tables=target_tables)
        schema_description = "# Database Schema (MSSQL)\n\n"
for table in tables:
schema = self.get_table_schema(table)
fks = self.get_foreign_keys(table)
schema_description += f"## Table: {table}n"
schema_description += "Columns:n"
for col in schema('columns'):
pk_marker = " (PRIMARY KEY)" if col('name') in schema('primary_keys') else ""
nullable = "NULL" if col('nullable') else "NOT NULL"
schema_description += f"- {col('name')}: {col('type')} {nullable}{pk_marker}n"
if fks:
schema_description += "nForeign Keys:n"
for fk in fks:
schema_description += f"- {fk('column')} → {fk('references_table')}.{fk('references_column')}n"
schema_description += "n"
return schema_description
def close(self):
"""Close database connection."""
self.cursor.close()
self.conn.close()
if __name__ == "__main__":
# 1. Define your connection string
# Change 'YourDatabaseName' and 'YourServerName' to your actual database
db_config = {
"SERVER": "YourServerName",
"DATABASE": "YourDatabaseName",
"Trusted_Connection": "yes",
"Encrypt": "no"
}
    # 2. Define the specific tables you want to extract. This is
    # optional; omit it to get details of all tables.
    MY_TABLES = ["customers", "orders"]
extractor = None
try:
# 3. Initialize and run
extractor = SchemaExtractor(db_config)
schema_text = extractor.format_schema_for_llm(target_tables=MY_TABLES)
print(schema_text)
except Exception as e:
print(f"Error: {e}")
finally:
if extractor:
extractor.close()
Why Markdown?
You may be wondering why we convert the schema to Markdown. LLMs are extensively trained on documentation and code repositories (such as GitHub), so they are extremely efficient at parsing Markdown headers and lists, making Markdown the most “efficient” way to pass context to the model.
Provide your database configuration details accordingly; I am using a local database here, so if you are connecting to a server-based database, adjust the configuration to match.
“The Brain”: Coordinating the LLM Logic
If the schema is our “Secret Sauce”, then this class is the “Brain” that knows how to cook with it.
The SimpleSQLQueryGenerator doesn't just pass text back and forth; it acts as a translator. It takes your messy human question and the structured schema, then applies a strict set of T-SQL rules to make sure the output actually works on the server.
Why this “Brain” is different:
- Provider-independent: While I use Ollama here to keep things local and private, the architecture is built on the OpenAI standard, which means you can swap to GPT-4 or Claude with a single line of code (see the sketch after this list).
- Zero-shot precision: We use a temperature of 0.1. In creative writing you need a high temperature; when generating SQL you want the LLM to be as “boring” and predictable as possible.
- Syntax guardrails: We explicitly teach the LLM to avoid common pitfalls, such as using LIMIT (Postgres/MySQL) when it should be using TOP (MS SQL Server).
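Here is what that swap looks like in practice; a minimal sketch, assuming an OPENAI_API_KEY in your .env (the model names are illustrative):
# provider_swap.py
import os
from openai import OpenAI

# Local and private: Ollama's OpenAI-compatible endpoint
client = OpenAI(api_key="ollama", base_url="http://localhost:11434/v1")
model = "llama3.2"

# Hosted: the official OpenAI API; same client class, different config
# client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
# model = "gpt-4o"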
# simple_query_generator.py
import os
from typing import Optional
from openai import OpenAI


class SimpleSQLQueryGenerator:
    def __init__(self, provider: str = "ollama", model: Optional[str] = None):
"""
Initialize the query generator.Args:
provider: "ollama"
model: Model name (optional, uses defaults)
"""
        self.provider = provider
        if provider == "ollama":
self.client = OpenAI(api_key=os.getenv("OLLAMA_API_KEY"), base_url=os.getenv("OLLAMA_BASE_URL"))
self.model = model or "llama3.2"
else:
            raise ValueError(f"Unsupported provider: {provider}")

    def create_system_prompt(self, schema: str, dialect: str = "mssql") -> str:
"""Create a detailed system prompt for the LLM."""
return f"""You are an expert T-SQL (Microsoft SQL Server) query generator. Your task is to convert natural language questions into accurate, high-performance MSSQL queries.Database Schema:
{schema}SQL Dialect: {dialect} (T-SQL / SQL Server)
Rules:
1. Generate ONLY the SQL query. No explanations, no markdown code blocks, and no backticks.
2. Use proper JOIN syntax (INNER, LEFT) and always use table aliases (e.g., `customers AS c`).
3. Use TOP for limiting results instead of LIMIT (e.g., `SELECT TOP 10 ...`).
4. For date filtering/extraction, use T-SQL functions like `GETDATE()`, `DATEPART()`, `DATEDIFF()`, and `FORMAT()`.
5. Use `COALESCE` to handle NULL values in calculations or concatenations.
6. When comparing strings, use the `LIKE` operator with `%` wildcards if partial matches are implied.
7. Use `ISNULL()` or `COALESCE()` for NULL-safe aggregations.
8. Follow T-SQL best practices: Use `QUOTED_IDENTIFIER` logic (square brackets `[ ]`) if table or column names contain spaces or are reserved keywords.
9. For pagination/offset, use `OFFSET 0 ROWS FETCH NEXT N ROWS ONLY` if `TOP` is not suitable.
10. Ensure all column names and table names match the provided schema exactly.

Important:
- MS SQL Server is the target; do NOT use `LIMIT`, `ILIKE`, or `TO_TIMESTAMP`.
- If the question is ambiguous, assume the most common business logic.
- If the schema lacks necessary information, provide a brief comment starting with `--` explaining why.
- For temporal queries (e.g., "this year"), use `YEAR(order_date) = YEAR(GETDATE())`.
- Always group by all non-aggregated columns when using `GROUP BY`.
"""def generate_query(
self,
question: str,
schema: str,
dialect: str = "mssql"
) -> str:
"""
Generate SQL query from natural language.Args:
question: Natural language question
schema: Database schema
dialect: SQL dialect (postgresql, mysql, sqlserver)Returns:
Generated SQL query
"""
system_prompt = self.create_system_prompt(schema, dialect)
if self.provider == "anthropic":
response = self.client.messages.create(
model=self.model,
max_tokens=1024,
system=system_prompt,
                messages=[
                    {"role": "user", "content": question}
                ]
            )
            query = response.content[0].text.strip()
        elif self.provider == "openai":
response = self.client.chat.completions.create(
model=self.model,
                messages=[
                    {"role": "system", "content": system_prompt},
                    {"role": "user", "content": question}
                ],
temperature=0.1 # Lower temperature for more consistent output
)
            query = response.choices[0].message.content.strip()
elif self.provider == "ollama":
response = self.client.chat.completions.create(
model=self.model,
                messages=[
                    {"role": "system", "content": system_prompt},
                    {"role": "user", "content": question}
                ],
temperature=0.1
)
            query = response.choices[0].message.content.strip()

        # Clean up the query (remove markdown if present)
query = self._clean_query(query)
        return query

    def _clean_query(self, query: str) -> str:
"""Remove markdown formatting and extra whitespace."""
# Remove SQL markdown blocks
if query.startswith("```sql"):
query = query(6:)
if query.startswith("```"):
query = query(3:)
if query.endswith("```"):
query = query(:-3)return query.strip()
Putting it all together
Here is the final runtime script. It connects to the local database, extracts the schema, asks a question, and prints the executable SQL.
# main.py
import os
from dotenv import load_dotenv
from schema_extractor import SchemaExtractor
from simple_query_generator import SimpleSQLQueryGenerator

# Load environment variables
load_dotenv(override=True)
def main():
# 1. Config & Connection
# If running locally
db_params = {
"SERVER": "(localdb)\MSSQLLocalDB",
"DATABASE": "master",
"Trusted_Connection": "yes",
"Encrypt": "no"
}
"""
# If using environment variables (e.g., for production)
db_params = {
'host': os.getenv('DB_HOST'),
'port': os.getenv('DB_PORT'),
'database': os.getenv('DB_NAME'),
'user': os.getenv('DB_USER'),
'password': os.getenv('DB_PASSWORD')
}
"""
# 2. Extract Schema
    my_tables = ["customers", "orders"]
extractor = SchemaExtractor(db_params)
schema_md = extractor.format_schema_for_llm(my_tables)
# 3. Generate Query
generator = SimpleSQLQueryGenerator()
user_question = "Show me the top 5 customers by total order amount."
sql_query = generator.generate_query(user_question, schema_md)
print(f"--- Database Schema ---n{schema_md}")
print(f"User Question: {user_question}n")
print(f"--- Generated SQL ---n{sql_query}")
if __name__ == "__main__":
main()
Professional production tips: fine-tuning your “brain”
Before you hit “Publish” on your generator, there are two golden rules to remember when working with LLMs and SQL:
1. The choice of model matters (but size is not everything)
The quality of the generated SQL depends largely on the model chosen.
- Large models (GPT-4, Claude 3.5): Excel at complex logic and deeply nested subqueries.
- Local models (Llama 3.2, Mistral): Lightning fast and private. Even though they are smaller, they are surprisingly “SQL literate” if your schema is clean.
2. Prompt Engineering > Model Switching
If the generated SQL is not quite correct, don't swap models right away. Most problems can be solved by adjusting the system prompt.
If the LLM makes a mistake, try adding a specific rule to the create_system_prompt method (see the sketch after this list):
- Does it use LIMIT instead of TOP? Add: “Never use LIMIT; always use SELECT TOP N.”
- Is it hallucinating columns? Add: “Only use columns explicitly listed in the provided schema.”
- Is it failing on dates? Add: “Use DATEDIFF or DATEPART for all time comparisons.”
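One low-friction way to experiment with such rules is to layer them on top of the existing prompt rather than editing the class itself; a minimal sketch, where the subclass name and the extra rules are my own illustrative additions:
# prompt_patch.py
from simple_query_generator import SimpleSQLQueryGenerator

# Extra guardrails appended after the base prompt; numbering continues
# from the ten rules defined in create_system_prompt
EXTRA_RULES = """
11. Never use LIMIT; always use SELECT TOP N.
12. Only use columns explicitly listed in the provided schema.
13. Use DATEDIFF or DATEPART for all time comparisons.
"""

class StricterSQLGenerator(SimpleSQLQueryGenerator):
    def create_system_prompt(self, schema: str, dialect: str = "mssql") -> str:
        return super().create_system_prompt(schema, dialect) + EXTRA_RULES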
The Result
By connecting the Secret Sauce (your schema metadata) to the Brain (the LLM logic), you transform a simple prompt like “Show me the top 5 customers by total order amount” into a complete, multi-join T-SQL statement.
C:\sainathudata\projects\simple_sql_query_generator>uv run main.py
--- Database Schema ---
# Database Schema (MSSQL)

## Table: customers
Columns:
- customer_id: int NOT NULL (PRIMARY KEY)
- name: nvarchar NULL
- email: nvarchar NULL
- created_at: datetime2 NULL

## Table: orders
Columns:
- order_id: int NOT NULL (PRIMARY KEY)
- customer_id: int NULL
- order_date: date NULL
- total_amount: decimal NULL

Foreign Keys:
- customer_id → customers.customer_id

User Question: Show me the top 5 customers by total order amount.

--- Generated SQL ---
SELECT TOP 5
    c.customer_id,
    c.name,
    COALESCE(SUM(o.total_amount), 0) AS total_order_amount
FROM
    customers AS c
LEFT JOIN
    orders AS o ON c.customer_id = o.customer_id
GROUP BY
    c.customer_id, c.name
ORDER BY
    total_order_amount DESC;
GitHub: sainathudata/simple_sql_query_generator_llm
https://github.com/sainathudata/simple_sql_query_generator_llm
In the next article, we will look at how to validate the generated SQL and execute it.
Published via Towards AI