Author(s): Shalabh Jain
Originally published in Towards Artificial Intelligence.
There has been a lot of talk about generative artificial intelligence in recent years, and we all assume we understand it. This article is part of a series providing a comprehensive overview of the main components used in a generative AI RAG application.
Let me provide some context for the RAG application.
A RAG application is a Retrieval-Augmented Generation application. It is a type of AI system that combines information retrieval with large language models (LLMs) to provide more accurate, timely, and reliable answers.
It solves a fundamental problem with LLMs: they are trained on static data, have no access to private or proprietary documents, and may hallucinate answers. A RAG application addresses these problems more cost-effectively than fine-tuning or retraining the LLM.
How a RAG application works
A RAG application consists of two separate pipelines:
- Data ingestion
- Retrieval and augmented generation

In this article, we will focus on the available embedding models and related tools, and the different scenarios for their use.
Let me go over some definitions first –
- Embedding Model – An embedding model is a machine learning model that converts data (such as text, images, audio, or code) into numerical vectors (called embeddings) in a multidimensional space where semantic meaning is retained.
- Embeddings – A list of numbers that represents the meaning of a piece of data. Similar meanings place vectors close to each other; different meanings place them far apart.
- Dimensionality of an embedding model – the number of numerical values (features) in the vector representing a piece of data. For example, a 1536-dimensional embedding means that each input is represented by 1536 numbers. More dimensions mean greater capacity to hold meaning, but larger dimensions usually come at higher cost.
- Sparse Vectors vs. Dense Vectors – Sparse vectors are mostly zero-valued, making them efficient for keyword matching (e.g., search) because only the non-zero features need to be stored, while dense vectors are mostly non-zero, capturing richer semantic meaning but requiring more computation. A toy sketch contrasting the two follows this list.
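To make these definitions concrete, here is a toy sketch (the numbers are invented for illustration, not real model output) contrasting dense embeddings with a sparse keyword vector and using cosine similarity as the measure of closeness:

```python
import numpy as np

# Hand-made toy vectors; real embeddings come from a trained model.
# Dense vectors: almost every feature is non-zero and carries partial meaning.
dog   = np.array([0.8, 0.1, 0.6, -0.2])   # "a dog is barking"
puppy = np.array([0.7, 0.2, 0.5, -0.1])   # "a puppy makes noise"
stock = np.array([-0.5, 0.9, -0.3, 0.4])  # "the stock market fell"

# Sparse vector: mostly zeros, e.g. keyword counts over a fixed vocabulary;
# only the non-zero positions need to be stored.
keyword_counts = np.array([0, 2, 0, 0, 0, 1, 0, 0])

def cosine(a, b):
    # Similar meaning -> vectors point in similar directions -> value near 1.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(dog, puppy))  # high: the two sentences mean similar things
print(cosine(dog, stock))  # negative: unrelated meanings
```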
Embedding Use Cases –
- Search (where results are ranked by relevance to the query string; a ranking sketch appears after this list)
- Clustering (where text strings are grouped by similarity)
- Recommendations (where items with associated text strings are recommended)
- Anomaly detection (where outliers with little relatedness are identified)
- Diversity measurement (where similarity distributions are analyzed)
- Classification (where text strings are classified by the most similar label)
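As an illustration of the search use case, here is a minimal sketch that ranks documents by cosine similarity to a query. The document names and vectors are hypothetical stand-ins for embeddings that would normally come from one of the models discussed below and live in a vector database:

```python
import numpy as np

# Hypothetical precomputed document embeddings (placeholders for real model output).
docs = {
    "refund policy":     np.array([0.90, 0.10, 0.05]),
    "shipping times":    np.array([0.20, 0.80, 0.10]),
    "office dress code": np.array([0.10, 0.10, 0.90]),
}

# Hypothetical embedding of the query "how do I get my money back?"
query_vec = np.array([0.85, 0.15, 0.05])

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Search: rank every document by its relevance (cosine similarity) to the query.
ranked = sorted(docs.items(), key=lambda item: cosine(query_vec, item[1]), reverse=True)
for name, vec in ranked:
    print(f"{name}: {cosine(query_vec, vec):.3f}")  # "refund policy" comes out on top
```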
Popular Embedding Models –
1. OpenAI models –
- text-embedding-3-small – Newer model, 1536 default dimensions; dimensionality can be reduced to balance cost when top performance is not required.
- text-embedding-3-large – Newer model, 3072 default dimensions; can also be reduced. Highest semantic accuracy, but cost per token is higher.
- text-embedding-ada-002 – Older model, fixed at 1536 dimensions, scheduled for retirement.
All OpenAI embedding models are API-based and priced per token. Cosine similarity is recommended; because the embeddings are normalized to unit length, it can be computed quickly with a simple dot product and yields the same ranking as Euclidean distance.
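As a minimal sketch, here is how the OpenAI embeddings endpoint can be called from Python with the optional dimensions parameter (this assumes the openai package is installed and an OPENAI_API_KEY is set in the environment; the input text and the 512 value are just placeholders):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

resp = client.embeddings.create(
    model="text-embedding-3-small",
    input=["What is our refund policy?"],  # placeholder text
    dimensions=512,  # optional: shrink from the 1536 default to cut storage cost
)

vector = resp.data[0].embedding  # a list of 512 floats
print(len(vector))
```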
2. Google Cloud Models –
- gemini-embedding-001 – Newer model that consolidates the capabilities of the earlier models; produces dense vectors with up to 3072 output dimensions and supports a maximum input length of 2048 tokens.
- text-embedding-005 – specialized for English and code tasks; supports up to 768 output dimensions and up to 2048 input tokens.
- text-multilingual-embedding-002 – specialized for multilingual tasks; supports up to 768 output dimensions and up to 2048 input tokens.
Using the output_dimensionality parameter, users can control the size of the output embedding vector. Choosing a lower output dimensionality saves storage and increases computational efficiency for downstream applications, with little sacrifice in quality. Google's models are also API-based.
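Here is a minimal sketch of requesting a reduced-dimension embedding through the Vertex AI Python SDK. The project ID, region, input text, and task type are placeholders, and parameter names may differ slightly between SDK versions, so treat this as an outline rather than a definitive call:

```python
import vertexai
from vertexai.language_models import TextEmbeddingInput, TextEmbeddingModel

vertexai.init(project="my-project", location="us-central1")  # placeholder project/region

model = TextEmbeddingModel.from_pretrained("gemini-embedding-001")
inputs = [TextEmbeddingInput(text="What is our refund policy?",  # placeholder text
                             task_type="RETRIEVAL_QUERY")]

# output_dimensionality trades a little quality for smaller, cheaper vectors.
embeddings = model.get_embeddings(inputs, output_dimensionality=768)
print(len(embeddings[0].values))  # 768
```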
3. Amazon Titan models
- amazon.titan-embed-text-v2:0 – Default dimension is 1024, with 512 and 256 also supported. Maximum input is 8192 tokens (a Bedrock invocation sketch follows this list).
- Titan G1 Text Embedding – Default dimensions are 1024, with 384 and 256 also supported. Maximum input is 256 tokens.
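Below is a minimal sketch of invoking Titan Text Embeddings V2 through Amazon Bedrock with boto3 (it assumes AWS credentials and Bedrock model access are already configured; the region and input text are placeholders):

```python
import json
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")  # placeholder region

body = json.dumps({
    "inputText": "What is our refund policy?",  # placeholder text
    "dimensions": 512,   # Titan v2 also accepts 256 and the 1024 default
    "normalize": True,   # unit-length vectors are convenient for cosine similarity
})

resp = bedrock.invoke_model(modelId="amazon.titan-embed-text-v2:0", body=body)
vector = json.loads(resp["body"].read())["embedding"]
print(len(vector))  # 512
```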
4. Sentence-transformer models (aka SBERT).
- all-MiniLM-L6-v2 – a versatile model adapted to many applications. 384 dimensions, maximum 256 tokens
- all-mpnet-base-v2 – a versatile model adapted to many applications. 768 dimensions, maximum 384 tokens. A local-embedding sketch follows this list.
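For a fully local workflow, here is a minimal sketch using the sentence-transformers package (the model weights download on first use; the documents and query are placeholders):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # small, fast, 384-dimensional

docs = ["Our refund window is 30 days.", "Shipping takes 3-5 business days."]
query = "How long do I have to return an item?"

doc_vecs = model.encode(docs, normalize_embeddings=True)    # shape (2, 384)
query_vec = model.encode(query, normalize_embeddings=True)  # shape (384,)

scores = util.cos_sim(query_vec, doc_vecs)  # cosine similarity per document
print(scores)  # the refund sentence should score highest
```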
Usage scenarios –
To build enterprise-grade RAG applications where accuracy and semantic depth are critical, models such as OpenAI's text-embedding-3-large or Google's gemini-embedding-001 are strong choices. These models provide high-dimensional embeddings with excellent semantic fidelity, making them well suited for question answering, knowledge assistants, and compliance or policy use cases. However, their higher cost means they are best used when search quality has a direct impact on business results.
In scenarios where cost-effectiveness and scalability matter more than maximum semantic precision (such as indexing millions of documents), models such as OpenAI's text-embedding-3-small or Amazon's Titan embeddings offer a practical compromise. These models provide adjustable dimensionality, enabling teams to reduce storage and computation costs while maintaining good search performance. They are particularly effective for large-scale document search and recommendation systems.
For teams that are already invested in a specific cloud ecosystem, it is generally recommended to use the native embedding models provided by that platform. On Google Cloud, the Gemini and text-embedding models integrate seamlessly with Vertex AI Vector Search. On AWS, Titan embeddings work well with Bedrock and OpenSearch. This alignment simplifies deployment, monitoring, and scaling while minimizing operational burden.
In regulated environments, offline deployments, or situations requiring full control over data and infrastructure, open-source Sentence-Transformer models such as all-mpnet-base-v2 or all-MiniLM-L6-v2 are excellent options. While these models do not always match the top semantic performance of proprietary APIs, they provide good results at zero cost per token and are ideal for experiments, internal tools, and privacy-sensitive workloads.
Generally speaking, a few considerations should guide the choice of an embedding model: search quality requirements, cost constraints, cloud alignment, and operational control, rather than dimensionality itself. Choosing the right embedding model is a foundational decision that directly determines the effectiveness, reliability, and scalability of a RAG application.
Published via Towards AI