LangChain
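LangChain is a framework for developing applications powered by large language models.
This article describes how to use VexDB as a LangChain vector store. The integration provides persistent storage for enterprise-grade vector data, supports complex metadata filtering, and relies directly on VexDB's transactional features to keep data consistent.
Prerequisites
Before installing LangChain, make sure that you have installed the database by following Installing VexDB, and that a Python 3 environment (Python ≥ 3.9) is available.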
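The Python requirement can be verified before installing anything. The following is a minimal sketch; the helper name `python_is_supported` is hypothetical and only the 3.9 minimum comes from the prerequisites above.

```python
import sys

# Minimum version stated in the prerequisites.
MIN_PYTHON = (3, 9)


def python_is_supported(version_info=sys.version_info, minimum=MIN_PYTHON):
    """Return True when the (major, minor) version meets the minimum."""
    return tuple(version_info[:2]) >= minimum


if __name__ == "__main__":
    if python_is_supported():
        print("Python version OK")
    else:
        print("Python 3.9 or later is required")
```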
Install the LangChain-VexDB plugin package
pip install langchain_vexdb-{version}-py3-none-any.whl
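Notes
- Dimension consistency: make sure the dimension of VECTOR(n) matches the output dimension of the embedding model.
- Index type: choose among HNSW, IVFFlat, and DiskANN according to data volume and query requirements. Index parameters need to be tuned for the actual workload.
  - HNSW suits large-scale data.
  - IVFFlat suits small-scale data.
  - DiskANN suits large-scale data that requires high-precision queries.
- Distance metric: choose the method that fits the scenario (<-> is L2 distance, <=> is cosine distance).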
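The dimension check and the two distance metrics mentioned above can be sketched in plain Python. This is only an illustration of the arithmetic, not VexDB code: the function names are hypothetical, and cosine distance is assumed to follow the common convention of 1 minus cosine similarity.

```python
import math


def check_dimension(vector, expected_dim):
    """Raise ValueError if the vector length differs from the VECTOR(n) size."""
    if len(vector) != expected_dim:
        raise ValueError(
            f"embedding has {len(vector)} dimensions, column expects {expected_dim}"
        )
    return vector


def l2_distance(a, b):
    """Euclidean (L2) distance, the metric the notes associate with <->."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """1 - cosine similarity (assumed convention for the <=> metric)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


if __name__ == "__main__":
    a = check_dimension([1.0, 0.0, 0.0], expected_dim=3)
    b = check_dimension([2.0, 0.0, 0.0], expected_dim=3)
    print(l2_distance(a, b))
    print(cosine_distance(a, b))
```

The two vectors in the demo point in the same direction but differ in length, so the metrics disagree about how far apart they are. This is why the choice of metric should follow how the embedding model was trained.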
Usage example
- Install the dependencies.

    # Core dependencies
    pip install langchain-core psycopg2-binary sqlalchemy numpy
    # Optional embedding model (this example uses Cohere)
    pip install langchain_cohere
- Initialize the instance.

    from langchain_cohere import CohereEmbeddings
    from langchain_vexdb import VexdbVector
    from langchain_core.documents import Document

    # Connection string for the VexDB instance; adjust the user, password,
    # host, port, and database name to match your deployment.
    connection = "postgresql+psycopg://langchain:langchain@localhost:6024/langchain"
    collection_name = "my_docs"
    embeddings = CohereEmbeddings()

    vectorstore = VexdbVector(
        embeddings=embeddings,
        collection_name=collection_name,
        connection=connection,
        use_jsonb=True,
    )
- Store documents.

    docs = [
        Document(page_content='there are cats in the pond', metadata={"id": 1, "location": "pond", "topic": "animals"}),
        Document(page_content='ducks are also found in the pond', metadata={"id": 2, "location": "pond", "topic": "animals"}),
        Document(page_content='fresh apples are available at the market', metadata={"id": 3, "location": "market", "topic": "food"}),
        Document(page_content='the market also sells fresh oranges', metadata={"id": 4, "location": "market", "topic": "food"}),
        Document(page_content='the new art exhibit is fascinating', metadata={"id": 5, "location": "museum", "topic": "art"}),
        Document(page_content='a sculpture exhibit is also at the museum', metadata={"id": 6, "location": "museum", "topic": "art"}),
        Document(page_content='a new coffee shop opened on Main Street', metadata={"id": 7, "location": "Main Street", "topic": "food"}),
        Document(page_content='the book club meets at the library', metadata={"id": 8, "location": "library", "topic": "reading"}),
        Document(page_content='the library hosts a weekly story time for kids', metadata={"id": 9, "location": "library", "topic": "reading"}),
        Document(page_content='a cooking class for beginners is offered at the community center', metadata={"id": 10, "location": "community center", "topic": "classes"}),
    ]

    # Store the documents, using each document's metadata id as its record id
    vectorstore.add_documents(docs, ids=[doc.metadata['id'] for doc in docs])
- Run a vector search.
Use the similarity_search method to run a vector search. For example, query for documents related to "kitty":

    vectorstore.similarity_search('kitty', k=10)

Query output:

    [Document(page_content='there are cats in the pond', metadata={'id': 1, 'topic': 'animals', 'location': 'pond'}),
     Document(page_content='the book club meets at the library', metadata={'id': 8, 'topic': 'reading', 'location': 'library'}),
     Document(page_content='the library hosts a weekly story time for kids', metadata={'id': 9, 'topic': 'reading', 'location': 'library'}),
     Document(page_content='the new art exhibit is fascinating', metadata={'id': 5, 'topic': 'art', 'location': 'museum'}),
     Document(page_content='ducks are also found in the pond', metadata={'id': 2, 'topic': 'animals', 'location': 'pond'}),
     Document(page_content='the market also sells fresh oranges', metadata={'id': 4, 'topic': 'food', 'location': 'market'}),
     Document(page_content='a cooking class for beginners is offered at the community center', metadata={'id': 10, 'topic': 'classes', 'location': 'community center'}),
     Document(page_content='fresh apples are available at the market', metadata={'id': 3, 'topic': 'food', 'location': 'market'}),
     Document(page_content='a sculpture exhibit is also at the museum', metadata={'id': 6, 'topic': 'art', 'location': 'museum'}),
     Document(page_content='a new coffee shop opened on Main Street', metadata={'id': 7, 'topic': 'food', 'location': 'Main Street'})]
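Conceptually, a similarity search embeds the query, measures its distance to every stored vector, and returns the k nearest documents. The sketch below shows that ranking step on hand-made two-dimensional vectors. The vectors, `top_k`, and `ROWS` are invented for illustration, and the real store would normally use an index rather than a full scan.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def top_k(query_vec, rows, k):
    """Return up to k (text, distance) pairs, nearest first."""
    scored = [(text, cosine_distance(query_vec, vec)) for text, vec in rows]
    scored.sort(key=lambda pair: pair[1])
    return scored[:k]


# Hand-made 2-dimensional "embeddings", purely illustrative.
ROWS = [
    ("there are cats in the pond", [0.9, 0.1]),
    ("fresh apples are available at the market", [0.1, 0.9]),
    ("ducks are also found in the pond", [0.7, 0.3]),
]

if __name__ == "__main__":
    for text, dist in top_k([1.0, 0.0], ROWS, k=2):
        print(f"{dist:.3f}  {text}")
```

When k is at least the number of stored rows, every row comes back, only reordered by distance. This matches the example above, where k=10 returns all ten documents.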
Filter support
The vector store supports a set of filters that can be applied to the metadata fields of documents.
Filter | Meaning/Category |
---|---|
$eq | Equality (==) |
$ne | Inequality (!=) |
$lt | Less than (<) |
$lte | Less than or equal (<=) |
$gt | Greater than (>) |
$gte | Greater than or equal (>=) |
$in | Special Cased (in) |
$nin | Special Cased (not in) |
$between | Special Cased (between) |
$exists | Special Cased (is null) |
$like | Text (like) |
$ilike | Text (case-insensitive like) |
$and | Logical (and) |
$or | Logical (or) |
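To make the operator semantics in the table concrete, here is a minimal in-memory sketch that evaluates a filter against a metadata dictionary. It is an illustration only, not how langchain_vexdb or VexDB evaluates filters. `matches`, `COMPARATORS`, and `_like` are hypothetical names, $between is assumed to be inclusive, a bare value is assumed to mean equality, a missing field is assumed to fail every comparison, and $exists is left out.

```python
import re


def _like(value, pattern, ignore_case=False):
    """SQL LIKE-style match: % matches any run of characters, _ matches one."""
    regex = "".join(
        ".*" if ch == "%" else "." if ch == "_" else re.escape(ch) for ch in pattern
    )
    flags = re.DOTALL | (re.IGNORECASE if ignore_case else 0)
    return re.fullmatch(regex, str(value), flags) is not None


COMPARATORS = {
    "$eq": lambda v, arg: v == arg,
    "$ne": lambda v, arg: v != arg,
    "$lt": lambda v, arg: v < arg,
    "$lte": lambda v, arg: v <= arg,
    "$gt": lambda v, arg: v > arg,
    "$gte": lambda v, arg: v >= arg,
    "$in": lambda v, arg: v in arg,
    "$nin": lambda v, arg: v not in arg,
    "$between": lambda v, arg: arg[0] <= v <= arg[1],  # assumed inclusive
    "$like": lambda v, arg: _like(v, arg),
    "$ilike": lambda v, arg: _like(v, arg, ignore_case=True),
}


def matches(metadata, flt):
    """Return True if the metadata dict satisfies every top-level condition."""
    for key, condition in flt.items():
        if key == "$and":
            ok = all(matches(metadata, sub) for sub in condition)
        elif key == "$or":
            ok = any(matches(metadata, sub) for sub in condition)
        elif key not in metadata:
            ok = False  # assumption: a missing field never matches
        elif isinstance(condition, dict):
            value = metadata[key]
            ok = all(COMPARATORS[op](value, arg) for op, arg in condition.items())
        else:
            ok = metadata[key] == condition  # assumption: bare value means $eq
        if not ok:
            return False
    return True


if __name__ == "__main__":
    doc = {"id": 7, "location": "Main Street", "topic": "food"}
    flt = {"location": {"$ilike": "%street"}, "id": {"$between": [5, 8]}}
    print(matches(doc, flt))
```

Because `matches` requires every top-level key to hold, a dictionary with several fields behaves as a logical AND, which is the behavior this section describes for multi-field filters.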
- Run a query with a filter:

    vectorstore.similarity_search('kitty', k=10, filter={
        'id': {'$in': [1, 5, 2, 9]}
    })

Query output:

    [Document(page_content='there are cats in the pond', metadata={'id': 1, 'topic': 'animals', 'location': 'pond'}),
     Document(page_content='the library hosts a weekly story time for kids', metadata={'id': 9, 'topic': 'reading', 'location': 'library'}),
     Document(page_content='the new art exhibit is fascinating', metadata={'id': 5, 'topic': 'art', 'location': 'museum'}),
     Document(page_content='ducks are also found in the pond', metadata={'id': 2, 'topic': 'animals', 'location': 'pond'})]
If you provide a dictionary with multiple fields but no top-level logical operator, the top level is interpreted as a logical AND filter.

    vectorstore.similarity_search('ducks', k=10, filter={
        'id': {'$in': [1, 5, 2, 9]},
        'location': {'$in': ["pond", "market"]}
    })

Query output:

    [Document(page_content='ducks are also found in the pond', metadata={'id': 2, 'topic': 'animals', 'location': 'pond'}),
     Document(page_content='there are cats in the pond', metadata={'id': 1, 'topic': 'animals', 'location': 'pond'})]
The same conditions written with an explicit $and operator return the same documents:

    vectorstore.similarity_search('ducks', k=10, filter={
        '$and': [
            {'id': {'$in': [1, 5, 2, 9]}},
            {'location': {'$in': ["pond", "market"]}},
        ]
    })

Query output:

    [Document(page_content='ducks are also found in the pond', metadata={'id': 2, 'topic': 'animals', 'location': 'pond'}),
     Document(page_content='there are cats in the pond', metadata={'id': 1, 'topic': 'animals', 'location': 'pond'})]
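The relationship between the implicit and the explicit AND forms can be sketched as a small rewrite. The helper `to_explicit_and` is hypothetical and not part of langchain_vexdb. It only shows how a multi-field dictionary maps onto the $and form shown above.

```python
def to_explicit_and(flt):
    """Rewrite a multi-field filter dict as an explicit $and filter.

    Filters with a single field, or with a top-level operator such as
    $and / $or, are returned unchanged.
    """
    if len(flt) <= 1 or any(key.startswith("$") for key in flt):
        return flt
    return {"$and": [{field: cond} for field, cond in flt.items()]}


if __name__ == "__main__":
    implicit = {
        "id": {"$in": [1, 5, 2, 9]},
        "location": {"$in": ["pond", "market"]},
    }
    print(to_explicit_and(implicit))
```

Writing the $and explicitly can make long filters easier to read, and it is required as soon as AND and OR conditions need to be nested.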
    vectorstore.similarity_search('bird', k=10, filter={
        'location': {"$ne": 'pond'}
    })

Query output:

    [Document(page_content='the book club meets at the library', metadata={'id': 8, 'topic': 'reading', 'location': 'library'}),
     Document(page_content='the new art exhibit is fascinating', metadata={'id': 5, 'topic': 'art', 'location': 'museum'}),
     Document(page_content='the library hosts a weekly story time for kids', metadata={'id': 9, 'topic': 'reading', 'location': 'library'}),
     Document(page_content='a sculpture exhibit is also at the museum', metadata={'id': 6, 'topic': 'art', 'location': 'museum'}),
     Document(page_content='the market also sells fresh oranges', metadata={'id': 4, 'topic': 'food', 'location': 'market'}),
     Document(page_content='a cooking class for beginners is offered at the community center', metadata={'id': 10, 'topic': 'classes', 'location': 'community center'}),
     Document(page_content='a new coffee shop opened on Main Street', metadata={'id': 7, 'topic': 'food', 'location': 'Main Street'}),
     Document(page_content='fresh apples are available at the market', metadata={'id': 3, 'topic': 'food', 'location': 'market'})]
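Usage for retrieval-augmented generation
For guides on how to use this vector store for retrieval-augmented generation (RAG), see the following:
Query examples: using advanced filters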
- Use the $nin filter to find items that are not in a given list:

    results = vectorstore.similarity_search(
        "kitty", k=10, filter={"id": {"$nin": [1, 5, 2]}}
    )
    for doc in results:
        print(f"* {doc.page_content} [{doc.metadata}]")
- Use the $between filter to find items within a given range:

    results = vectorstore.similarity_search(
        "apples", k=10, filter={"id": {"$between": [3, 5]}}
    )
    for doc in results:
        print(f"* {doc.page_content} [{doc.metadata}]")
- Use the $ilike filter for a case-insensitive text match:

    results = vectorstore.similarity_search(
        "book", k=10, filter={"page_content": {"$ilike": "%book%"}}
    )
    for doc in results:
        print(f"* {doc.page_content} [{doc.metadata}]")
- Use the $and filter to combine multiple conditions:

    results = vectorstore.similarity_search(
        "coffee",
        k=10,
        filter={"$and": [
            {"location": {"$eq": "Main Street"}},
            {"topic": {"$eq": "food"}},
        ]},
    )
    for doc in results:
        print(f"* {doc.page_content} [{doc.metadata}]")
- Use the $or filter to match any one of several conditions:

    results = vectorstore.similarity_search(
        "reading",
        k=10,
        filter={"$or": [
            {"location": {"$eq": "library"}},
            {"location": {"$eq": "community center"}},
        ]},
    )
    for doc in results:
        print(f"* {doc.page_content} [{doc.metadata}]")