Introducing Embeddings and Vector Databases
Embeddings are numerical representations of data, designed to capture the semantic meaning of words, sentences, or larger blocks of text, making them suitable for use by LLMs.
After you obtain data for a RAG application, you need to embed it so it can be used with prompts and the LLM. Embedding converts text into multi-dimensional numerical vectors in a vector space, a format that makes the data much easier to categorize and compare.
This enables more accurate semantic comparisons between texts: semantically similar words sit closer together in the vector space, making it easier for the model to understand prompts and data, and consequently to generate accurate responses.
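Here's a minimal sketch of how text becomes vectors, using the open-source sentence-transformers library. The model name all-MiniLM-L6-v2 is just one common choice, and the example sentences are illustrative:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "The cat sat on the mat.",
    "A kitten rested on the rug.",
    "The stock market fell sharply today.",
]

# Each sentence becomes a fixed-length numerical vector.
embeddings = model.encode(sentences)
print(embeddings.shape)  # e.g., (3, 384)

# Cosine similarity: semantically similar sentences score higher.
print(util.cos_sim(embeddings[0], embeddings[1]))  # high: both about a cat on a surface
print(util.cos_sim(embeddings[0], embeddings[2]))  # low: unrelated topics
```

Notice that the first two sentences share almost no words, yet their vectors end up close together because their meanings are similar.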
To handle the vectors that embeddings produce, specialized databases known as vector databases store and query vector data. They're optimized for similarity search, scalable, and fast, making semantic text search a breeze.
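As a sketch of how that looks in practice, here's a tiny example using Chroma, one popular open-source vector database. The collection name, documents, and IDs are made up for illustration:

```python
import chromadb

client = chromadb.Client()  # an in-memory instance, fine for experimenting
collection = client.create_collection("articles")

# Chroma embeds the documents with a default embedding model
# and stores the resulting vectors alongside the original text.
collection.add(
    documents=[
        "Embeddings map text to vectors.",
        "Vector databases store and query vectors.",
        "Bananas are rich in potassium.",
    ],
    ids=["doc1", "doc2", "doc3"],
)

# Querying embeds the query text and returns the nearest stored vectors.
results = collection.query(
    query_texts=["How is text stored as numbers?"],
    n_results=2,
)
print(results["documents"])
```

The query never needs to match the stored text word for word; the database finds the documents whose vectors are nearest to the query's vector.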
The use of embeddings and vector databases isn’t restricted to textual data. Some work with images, audio, and other data formats, too.
Depending on your use case, it's important to choose the right kind of embedding. Embeddings can generally be grouped into two types: uni-modal and multi-modal.
Uni-modal and Multi-modal Embeddings
Uni-modal embeddings take a single type of data as input, such as text or images. Text embeddings suit applications like text classification and sentiment analysis, while image embeddings excel at object identification and classification.
Multi-modal embeddings handle a mixture of data types, like text combined with images, tables, and graphs, by mapping them all into a shared vector space.
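For instance, a CLIP-style model embeds both text and images into the same space, so you can compare them directly. Here's a sketch using sentence-transformers; the model name clip-ViT-B-32 is one publicly available option, and the image path dog.jpg is an assumption for illustration:

```python
from PIL import Image
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("clip-ViT-B-32")

# Text and images land in the same vector space,
# so their vectors can be compared directly.
image_embedding = model.encode(Image.open("dog.jpg"))
text_embeddings = model.encode(["a photo of a dog", "a photo of a car"])

# The caption that matches the image scores higher.
print(util.cos_sim(image_embedding, text_embeddings))
```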
Besides the two classifications above, the following types of embeddings are available as well:
- Word Embeddings: These store individual words arranged by semantics. Closely related words sit near each other in the vector space, while unrelated words sit farther apart.
- Sentence Embeddings: Phrases and entire sentences are represented as vectors instead of individual words. An example is the Universal Sentence Encoder (USE) developed by Google, which generates fixed-length sentence embeddings for various natural language processing tasks.
- Document Embeddings: These are vector representations just like the other kinds of embeddings, except that entire documents are embedded instead of just sentences or words.
- Contextualized Embeddings: The first three types fall under static embeddings. Contextualized embeddings represent words based on their context in a sentence. This allows for a more nuanced understanding of word meaning, because the same word can have different vector representations depending on its surrounding words, as the sketch after this list shows.
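To see contextualized embeddings in action, the following sketch uses a BERT model from Hugging Face transformers to embed the word "bank" in two different sentences. The helper function embed_word is hypothetical, written just for this illustration:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embed_word(sentence: str, word: str) -> torch.Tensor:
    # Hypothetical helper: returns the contextual vector for one word.
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]
    index = inputs.tokens().index(word)  # position of the word's token
    return hidden[index]

river = embed_word("I sat by the river bank.", "bank")
money = embed_word("I deposited cash at the bank.", "bank")

# A static embedding would give "bank" one fixed vector; here the
# two vectors differ because the surrounding words differ.
print(torch.cosine_similarity(river, money, dim=0))
```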
Understanding the Importance of Embeddings
Embeddings make it possible to represent many kinds of data effectively for use in an AI system. They provide a compact representation of large amounts of data, arranged so that semantically related items sit close together, and they make that data easy to query.
Embeddings transform text, audio, and video data into numerical representations that enable capabilities like sentiment analysis, text summarization, text generation, speech recognition, music classification, audio generation, object detection, and video generation. Traditional data representations don't offer these advanced features; they're designed for basic storage and retrieval.
Phew! That’s a lot of AI jargon already. In the next section, you’ll learn all about JupyterLab and how it can help you on your AI journey as a developer.