Retrieval-Augmented Generation with LangChain

Nov 12 2024 · Python 3.12, LangChain 0.3.x, JupyterLab 4.2.4

Lesson 02: Working with Embeddings & Vector Databases

Chroma Demo

Exploring Chroma with OpenAI and LangChain

In this demo, you’ll learn how to use Chroma with OpenAI and LangChain. Thanks to LangChain, the interface for working with different vector databases is remarkably consistent. In this section, you’ll focus on Chroma, but remember that you can readily substitute it with another supported database if you prefer.

Getting Started with Chroma

Chroma is an open-source vector database designed with developer productivity in mind. To install the necessary LangChain integration, return to your terminal and execute:

pip install langchain-chroma

Then, in your notebook, import Chroma and create a database instance. At minimum, Chroma needs an embedding function; you can also name the collection and persist it to disk:

from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

embeddings_model = OpenAIEmbeddings()

# Minimal, in-memory instance
db = Chroma(
  embedding_function=embeddings_model,
)

# Named collection, persisted to ./chroma_db
db = Chroma(
  collection_name="speech_collection",
  embedding_function=embeddings_model,
  persist_directory="./chroma_db",
)

Populating Chroma With Data

Next, insert data into your Chroma database. LangChain abstracts away the low-level details, so you’ll work with LangChain document objects to represent your data.

from uuid import uuid4
from langchain_core.documents import Document

document_1 = Document(
  page_content="20 tons of cocoa have been deposited at Warehouse AX749",
  metadata={"source": "messaging_api"},
  id=1,
)

document_2 = Document(
  page_content="The National Geographic Society has discovered a new species "
    "of aquatic animal, off the coast of Miami. They have been exploring at "
    "8000 miles deep in the Pacific Ocean. They believe there's a lot "
    "more to learn from the oceans.",
  metadata={"source": "news"},
  id=2,
)

document_3 = Document(
  page_content="Martin Luther King's speech, I Have a Dream, remains "
    "one of the world's greatest ever. Here's everything he said "
    "in 5 minutes.",
  metadata={"source": "website"},
  id=3,
)

document_4 = Document(
  page_content="For the first time in 1200 years, the Kalahari "
    "desert receives 200ml of rain.",
  metadata={"source": "tweet"},
  id=4,
)

document_5 = Document(
  page_content="New multi-modal learning content about AI is ready "
    "from Kodeco.",
  metadata={"source": "kodeco_rss_feed"},
  id=5,
)

documents = [
  document_1,
  document_2,
  document_3,
  document_4,
  document_5,
]
uuids = [str(uuid4()) for _ in range(len(documents))]

db.add_documents(ids=uuids, documents=documents)

Unleashing the Power of Semantic Search

So far, so good. Now, here comes some of the beauty of working with vector data stores: the search capability. Traditional SQL or NoSQL databases demand you adhere to specific query syntax, but with vector databases, you interact using natural language — just like talking to a person!
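Under the hood, a "natural language" query still becomes math: the query text is embedded into a vector, and the store ranks documents by how close their vectors are to it. Here's a toy sketch of that ranking step using cosine similarity over hand-made 3-dimensional vectors (real embedding models produce hundreds or thousands of dimensions; the data and function names are illustrative, not LangChain APIs):

```python
import math

def cosine_similarity(a, b):
  """Cosine similarity between two equal-length vectors."""
  dot = sum(x * y for x, y in zip(a, b))
  norm_a = math.sqrt(sum(x * x for x in a))
  norm_b = math.sqrt(sum(x * x for x in b))
  return dot / (norm_a * norm_b)

def top_k(query_vec, docs, k=2):
  """Rank (text, vector) pairs by similarity to the query vector."""
  ranked = sorted(docs, key=lambda d: cosine_similarity(query_vec, d[1]), reverse=True)
  return [text for text, _ in ranked[:k]]

# Toy "embeddings" standing in for what OpenAIEmbeddings would produce
docs = [
  ("warehouse update", [0.9, 0.1, 0.0]),
  ("ocean discovery",  [0.1, 0.9, 0.2]),
  ("desert rain",      [0.0, 0.2, 0.9]),
]
query = [0.8, 0.2, 0.1]  # a query vector closest to "warehouse update"
print(top_k(query, docs, k=2))  # → ['warehouse update', 'ocean discovery']
```

This is the essence of what db.similarity_search() does for you, with the embedding and indexing handled automatically.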

results = db.similarity_search(
  "What's the latest on the warehouse?",
)
for res in results:
  print(f"* {res.page_content}")
* 20 tons of cocoa have been deposited at Warehouse AX749
* New multi-modal learning content about AI is ready from Kodeco.
* The National Geographic Society has discovered a new species of 
  aquatic animal, off the coast of Miami. They have been exploring 
  at 8000 miles deep in the Pacific Ocean. They believe there's 
  a lot more to learn from the oceans.
* For the first time in 1200 years, the Kalahari desert receives 200ml of rain.
To narrow the results, pass k to cap how many documents come back, and a metadata filter to restrict the search to particular sources:

results = db.similarity_search(
  "What's the latest on the warehouse?",
  k=2,
  filter={"source": "messaging_api"},
)
for res in results:
  print(f"* {res.page_content}")
* 20 tons of cocoa have been deposited at Warehouse AX749
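Conceptually, that filter argument is an equality match over each document's metadata. A rough sketch of the matching logic (illustrative only, not Chroma's actual implementation):

```python
def matches_filter(metadata, flt):
  """True when every key in the filter equals the document's metadata value."""
  return all(metadata.get(key) == value for key, value in flt.items())

docs_metadata = [
  {"source": "messaging_api"},
  {"source": "news"},
  {"source": "tweet"},
]
matching = [m for m in docs_metadata if matches_filter(m, {"source": "messaging_api"})]
print(matching)  # only the messaging_api document survives
```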

Ranking Results With Similarity Scores

Chroma also offers the similarity_search_with_score() function, which returns each relevant document along with a score. With Chroma, this score is a distance between the query's embedding and the document's, so lower values mean closer matches. You can use these scores to filter out less-relevant results or incorporate them into your application's logic.

results = db.similarity_search_with_score(
  "Where can I find tutorials on AI?",
  k=1,
  filter={"source": "kodeco_rss_feed"}
)
for res, score in results:
  print(f'''
    similarity_score: {score:.6f}
    content: {res.page_content}
    source: {res.metadata['source']}
    ''')
similarity_score: 0.386230
content: New multi-modal learning content about AI is ready from Kodeco.
source: kodeco_rss_feed
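One way to put these scores to work is a simple cutoff: keep only results whose distance falls below a threshold. A sketch, where the function name and threshold value are illustrative rather than part of the LangChain API:

```python
def filter_by_distance(scored_results, max_distance=0.5):
  """Keep (document, score) pairs whose distance is at most the cutoff.
  similarity_search_with_score returns lower scores for closer matches."""
  return [(doc, score) for doc, score in scored_results if score <= max_distance]

# Pairs shaped like Chroma's (content, distance) output
scored = [
  ("New multi-modal learning content about AI is ready from Kodeco.", 0.386230),
  ("For the first time in 1200 years, the Kalahari desert receives 200ml of rain.", 1.214),
]
kept = filter_by_distance(scored)
print([doc for doc, _ in kept])  # only the Kodeco result remains
```

Tuning the cutoff is application-specific: too tight and you drop useful context, too loose and you pass noise to your LLM.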