LangChain

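LangChain is a framework for developing applications powered by large language models.

This article describes how to use VexDB as a LangChain vector store. This approach provides enterprise-grade persistent storage for vector data, supports complex metadata filtering, and relies directly on VexDB's transactional features to keep data consistent.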

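Prerequisites

Before installing LangChain, make sure the database has been installed as described in Installing VexDB, and that a Python 3 environment (Python ≥ 3.9) is deployed.

Install the LangChain-VexDB plugin package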

pip install langchain_vexdb-{version}-py3-none-any.whl

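Notes

Dimension consistency: make sure the dimension of VECTOR(n) matches the output dimension of the embedding model.
Choose a suitable index type: pick HNSW, IVFFlat, or DiskANN according to data volume and query requirements.
- HNSW suits large-scale data.
- IVFFlat suits small-scale data.
- DiskANN suits large-scale data that requires high-precision queries.
Index parameters need to be tuned for the actual workload.
Distance metric: choose the operator that fits the scenario (<-> is L2 distance, <=> is cosine distance).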
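To make the dimension and distance-metric notes concrete, the following is a minimal pure-Python sketch of the two quantities named above. The function names l2_distance and cosine_distance are illustrative only and are not part of VexDB or LangChain; the database computes these values internally.

```python
import math


def l2_distance(a, b):
    """Euclidean (L2) distance, the quantity the notes associate with <->."""
    if len(a) != len(b):
        # Mirrors the dimension-consistency note: both vectors need the same n.
        raise ValueError(f"dimension mismatch: {len(a)} vs {len(b)}")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """Cosine distance (1 - cosine similarity), associated with <=>."""
    if len(a) != len(b):
        raise ValueError(f"dimension mismatch: {len(a)} vs {len(b)}")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_a * norm_b)


query = [1.0, 0.0]
same_direction = [3.0, 0.0]
print(l2_distance(query, same_direction))      # 2.0
print(cosine_distance(query, same_direction))  # 0.0
```

The two vectors point in the same direction but differ in length, so the cosine distance is zero while the L2 distance is not. This is the usual reason to prefer cosine distance when embeddings are not normalized.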

Usage example

Install the dependencies. The package is compatible with langchain-core versions >=0.2.13 and <0.4.0.

# Core dependencies
pip install "langchain-core>=0.2.13,<0.4.0" psycopg2-binary sqlalchemy numpy
# Optional embedding model (this example uses Cohere)
pip install langchain_cohere
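As a quick sanity check after installation, the sketch below reads the installed langchain-core version and compares it with the supported range stated above. The helper names parse_release, is_supported, and installed_core_version are hypothetical, and the parsing is deliberately simple (it only looks at the first three numeric components).

```python
from importlib.metadata import PackageNotFoundError, version


def parse_release(version_string):
    """Turn '0.2.13' into (0, 2, 13); non-numeric suffixes are ignored."""
    parts = []
    for piece in version_string.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits or 0))
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def is_supported(version_string, minimum=(0, 2, 13), upper_bound=(0, 4, 0)):
    """True when minimum <= version < upper_bound."""
    return minimum <= parse_release(version_string) < upper_bound


def installed_core_version():
    """Return the installed langchain-core version, or None if absent."""
    try:
        return version("langchain-core")
    except PackageNotFoundError:
        return None


found = installed_core_version()
if found is None:
    print("langchain-core is not installed")
else:
    status = "supported" if is_supported(found) else "outside the supported range"
    print(found, status)
```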

Initialize an instance.

from langchain_cohere import CohereEmbeddings
from langchain_vexdb.vectorstores import VexdbVector
from langchain_core.documents import Document

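# Connection string for the VexDB instance set up in the prerequisites; adjust the
# user, password, host, port, and database to match your deployment. The "+psycopg"
# suffix names the SQLAlchemy driver and must match a driver that is installed.
connection = "postgresql+psycopg://langchain:langchain@localhost:6024/langchain"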
collection_name = "my_docs"
embeddings = CohereEmbeddings()

vectorstore = VexdbVector(
    embeddings=embeddings,
    collection_name=collection_name,
    connection=connection,
    use_jsonb=True,
)
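When the credentials contain characters such as @ or /, they need to be percent-encoded before they are placed in the connection string. The sketch below shows one way to assemble the URL with the standard library; build_connection_url is a hypothetical helper, and the default driver prefix simply repeats the one used in the example above.

```python
from urllib.parse import quote_plus


def build_connection_url(user, password, host, port, database,
                         driver="postgresql+psycopg"):
    """Assemble a SQLAlchemy-style URL with percent-encoded credentials."""
    return (
        f"{driver}://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )


print(build_connection_url("langchain", "langchain", "localhost", 6024, "langchain"))
# postgresql+psycopg://langchain:langchain@localhost:6024/langchain

print(build_connection_url("langchain", "p@ss/word", "localhost", 6024, "langchain"))
# postgresql+psycopg://langchain:p%40ss%2Fword@localhost:6024/langchain
```

The returned string can be passed as the connection argument in place of a hand-written literal.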

Store documents.

docs = [
    Document(page_content='there are cats in the pond', metadata={"id": 1, "location": "pond", "topic": "animals"}),
    Document(page_content='ducks are also found in the pond', metadata={"id": 2, "location": "pond", "topic": "animals"}),
    Document(page_content='fresh apples are available at the market', metadata={"id": 3, "location": "market", "topic": "food"}),
    Document(page_content='the market also sells fresh oranges', metadata={"id": 4, "location": "market", "topic": "food"}),
    Document(page_content='the new art exhibit is fascinating', metadata={"id": 5, "location": "museum", "topic": "art"}),
    Document(page_content='a sculpture exhibit is also at the museum', metadata={"id": 6, "location": "museum", "topic": "art"}),
    Document(page_content='a new coffee shop opened on Main Street', metadata={"id": 7, "location": "Main Street", "topic": "food"}),
    Document(page_content='the book club meets at the library', metadata={"id": 8, "location": "library", "topic": "reading"}),
    Document(page_content='the library hosts a weekly story time for kids', metadata={"id": 9, "location": "library", "topic": "reading"}),
    Document(page_content='a cooking class for beginners is offered at the community center', metadata={"id": 10, "location": "community center", "topic": "classes"})
]

# Store the documents
vectorstore.add_documents(docs, ids=[doc.metadata['id'] for doc in docs])

Run a vector search.
Use the similarity_search method to run a vector search. For example, query for documents related to "kitty".

vectorstore.similarity_search('kitty', k=10)

# Query result output
[Document(page_content='there are cats in the pond', metadata={'id': 1, 'topic': 'animals', 'location': 'pond'}),
Document(page_content='the book club meets at the library', metadata={'id': 8, 'topic': 'reading', 'location': 'library'}),
Document(page_content='the library hosts a weekly story time for kids', metadata={'id': 9, 'topic': 'reading', 'location': 'library'}),
Document(page_content='the new art exhibit is fascinating', metadata={'id': 5, 'topic': 'art', 'location': 'museum'}),
Document(page_content='ducks are also found in the pond', metadata={'id': 2, 'topic': 'animals', 'location': 'pond'}),
Document(page_content='the market also sells fresh oranges', metadata={'id': 4, 'topic': 'food', 'location': 'market'}),
Document(page_content='a cooking class for beginners is offered at the community center', metadata={'id': 10, 'topic': 'classes', 'location': 'community center'}),
Document(page_content='fresh apples are available at the market', metadata={'id': 3, 'topic': 'food', 'location': 'market'}),
Document(page_content='a sculpture exhibit is also at the museum', metadata={'id': 6, 'topic': 'art', 'location': 'museum'}),
Document(page_content='a new coffee shop opened on Main Street', metadata={'id': 7, 'topic': 'food', 'location': 'Main Street'})]

Filter support

The vector store supports a set of filters that can be applied to the metadata fields of documents.

Filter	Meaning/Category
$eq	Equality (==)
$ne	Inequality (!=)
$lt	Less than (<)
$lte	Less than or equal (<=)
$gt	Greater than (>)
$gte	Greater than or equal (>=)
$in	Special Cased (in)
$nin	Special Cased (not in)
$between	Special Cased (between)
$exists	Special Cased (is null)
$like	Text (like)
$ilike	Text (case-insensitive like)
$and	Logical (and)
$or	Logical (or)
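To illustrate how these operators read, here is a small in-memory sketch that evaluates a filter dictionary against a metadata dictionary. It is not VexDB's implementation (the real filters are translated to SQL). Several behaviors are assumptions made for illustration: $between is treated as inclusive, $exists as a presence test, a bare value as $eq, and a missing field as a non-match.

```python
import re


def like_to_regex(pattern, ignore_case=False):
    """Translate a SQL LIKE pattern (% and _) into a compiled regex."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")
        elif ch == "_":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    flags = re.DOTALL | (re.IGNORECASE if ignore_case else 0)
    return re.compile("".join(parts), flags)


COMPARATORS = {
    "$eq": lambda v, arg: v == arg,
    "$ne": lambda v, arg: v != arg,
    "$lt": lambda v, arg: v < arg,
    "$lte": lambda v, arg: v <= arg,
    "$gt": lambda v, arg: v > arg,
    "$gte": lambda v, arg: v >= arg,
    "$in": lambda v, arg: v in arg,
    "$nin": lambda v, arg: v not in arg,
    "$between": lambda v, arg: arg[0] <= v <= arg[1],  # assumed inclusive
    "$like": lambda v, arg: like_to_regex(arg).fullmatch(v) is not None,
    "$ilike": lambda v, arg: like_to_regex(arg, True).fullmatch(v) is not None,
}


def check(metadata, field, op, arg):
    """Apply one operator to one metadata field."""
    if op == "$exists":
        return (field in metadata) == bool(arg)  # assumed presence test
    if field not in metadata:
        return False  # assumption: a missing field never matches
    return COMPARATORS[op](metadata[field], arg)


def matches(metadata, flt):
    """True when metadata satisfies every top-level entry of the filter."""
    for key, cond in flt.items():
        if key == "$and":
            ok = all(matches(metadata, sub) for sub in cond)
        elif key == "$or":
            ok = any(matches(metadata, sub) for sub in cond)
        elif not isinstance(cond, dict):
            ok = metadata.get(key) == cond  # bare value treated as $eq
        else:
            ok = all(check(metadata, key, op, arg) for op, arg in cond.items())
        if not ok:
            return False
    return True


docs = [
    {"id": 1, "location": "pond", "topic": "animals"},
    {"id": 2, "location": "pond", "topic": "animals"},
    {"id": 5, "location": "museum", "topic": "art"},
    {"id": 7, "location": "Main Street", "topic": "food"},
]
flt = {"id": {"$in": [1, 5, 2, 9]}, "location": {"$in": ["pond", "market"]}}
print([d["id"] for d in docs if matches(d, flt)])  # [1, 2]
```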

Run a query with a filter:

vectorstore.similarity_search('kitty', k=10, filter={
    'id': {'$in': [1, 5, 2, 9]}
})

# Query result output
[Document(page_content='there are cats in the pond', metadata={'id': 1, 'topic': 'animals', 'location': 'pond'}),
Document(page_content='the library hosts a weekly story time for kids', metadata={'id': 9, 'topic': 'reading', 'location': 'library'}),
Document(page_content='the new art exhibit is fascinating', metadata={'id': 5, 'topic': 'art', 'location': 'museum'}),
Document(page_content='ducks are also found in the pond', metadata={'id': 2, 'topic': 'animals', 'location': 'pond'})]

If you provide a dictionary with multiple fields and no operator, the top level is interpreted as a logical AND filter.

vectorstore.similarity_search('ducks', k=10, filter={
    'id': {'$in': [1, 5, 2, 9]},
    'location': {'$in': ["pond", "market"]}
})

# Query result output
[Document(page_content='ducks are also found in the pond', metadata={'id': 2, 'topic': 'animals', 'location': 'pond'}),
Document(page_content='there are cats in the pond', metadata={'id': 1, 'topic': 'animals', 'location': 'pond'})]

vectorstore.similarity_search('ducks', k=10, filter={
    '$and': [
        {'id': {'$in': [1, 5, 2, 9]}},
        {'location': {'$in': ["pond", "market"]}},
    ]
})

# 查询结果输出
[Document(page_content='ducks are also found in the pond', metadata={'id': 2, 'topic': 'animals', 'location': 'pond'}),
Document(page_content='there are cats in the pond', metadata={'id': 1, 'topic': 'animals', 'location': 'pond'})]
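The multi-field filter and the explicit $and filter above return the same documents because the first form is shorthand for the second. The rewrite can be sketched as a plain dictionary transformation; to_explicit_and is a hypothetical helper for illustration and is not part of the langchain_vexdb API.

```python
def to_explicit_and(flt):
    """Rewrite a multi-field filter as an explicit $and of single-field filters."""
    if len(flt) <= 1 or any(key.startswith("$") for key in flt):
        # A single condition or a logical operator is already explicit.
        return flt
    return {"$and": [{field: condition} for field, condition in flt.items()]}


implicit = {
    "id": {"$in": [1, 5, 2, 9]},
    "location": {"$in": ["pond", "market"]},
}
explicit = {
    "$and": [
        {"id": {"$in": [1, 5, 2, 9]}},
        {"location": {"$in": ["pond", "market"]}},
    ]
}
print(to_explicit_and(implicit) == explicit)  # True
```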

vectorstore.similarity_search('bird', k=10, filter={
    'location': { "$ne": 'pond'}
})

# Query result output
[Document(page_content='the book club meets at the library', metadata={'id': 8, 'topic': 'reading', 'location': 'library'}),
Document(page_content='the new art exhibit is fascinating', metadata={'id': 5, 'topic': 'art', 'location': 'museum'}),
Document(page_content='the library hosts a weekly story time for kids', metadata={'id': 9, 'topic': 'reading', 'location': 'library'}),
Document(page_content='a sculpture exhibit is also at the museum', metadata={'id': 6, 'topic': 'art', 'location': 'museum'}),
Document(page_content='the market also sells fresh oranges', metadata={'id': 4, 'topic': 'food', 'location': 'market'}),
Document(page_content='a cooking class for beginners is offered at the community center', metadata={'id': 10, 'topic': 'classes', 'location': 'community center'}),
Document(page_content='a new coffee shop opened on Main Street', metadata={'id': 7, 'topic': 'food', 'location': 'Main Street'}),
Document(page_content='fresh apples are available at the market', metadata={'id': 3, 'topic': 'food', 'location': 'market'})]

Usage for retrieval-augmented generation

For guidance on using this vector store for retrieval-augmented generation (RAG), see the following:

Retrieval
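As a rough picture of where the vector store fits in a RAG flow, the sketch below assembles a prompt from retrieved documents. RetrievedDoc is a hypothetical stand-in for langchain_core.documents.Document so that the example runs without a database, and build_rag_prompt and its prompt wording are illustrative assumptions. In a real application the list would come from a call such as vectorstore.similarity_search(question, k=2), and the prompt would then be sent to a language model.

```python
from dataclasses import dataclass, field


@dataclass
class RetrievedDoc:
    """Hypothetical stand-in for langchain_core.documents.Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def build_rag_prompt(question, docs):
    """Number the retrieved passages and place them ahead of the question."""
    context = "\n".join(
        f"[{i}] {doc.page_content}" for i, doc in enumerate(docs, start=1)
    )
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


# Stand-in for the result of a similarity search against the vector store.
retrieved = [
    RetrievedDoc("there are cats in the pond", {"id": 1}),
    RetrievedDoc("ducks are also found in the pond", {"id": 2}),
]
print(build_rag_prompt("Which animals are in the pond?", retrieved))
```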

Query examples: using advanced filters

Use the $nin filter to find items that are not in a given list:

results = vectorstore.similarity_search(
    "kitty", k=10, filter={"id": {"$nin": [1, 5, 2]}})
for doc in results:
    print(f"* {doc.page_content} [{doc.metadata}]")

Use the $between filter to find items within a given range:

results = vectorstore.similarity_search(
    "apples", k=10, filter={"id": {"$between": [3, 5]}})
for doc in results:
    print(f"* {doc.page_content} [{doc.metadata}]")

Use the $ilike filter for a case-insensitive search:

results = vectorstore.similarity_search(
    "book", k=10, filter={"page_content": {"$ilike": "%book%"}}
)
for doc in results:
    print(f"* {doc.page_content} [{doc.metadata}]")

Use the $and filter to combine multiple conditions:

results = vectorstore.similarity_search(
    "coffee", k=10, filter={"$and": [
        {"location": {"$eq": "Main Street"}},
        {"topic": {"$eq": "food"}}
    ]}
)

for doc in results:
    print(f"* {doc.page_content} [{doc.metadata}]")

Use the $or filter to match any one of several conditions:

results = vectorstore.similarity_search(
    "reading", k=10, filter={"$or": [
        {"location": {"$eq": "library"}},
        {"location": {"$eq": "community center"}}
    ]}
)

for doc in results:
    print(f"* {doc.page_content} [{doc.metadata}]")