API¶

Classes¶

class pymongo_voyageai_multimodal.PyMongoVoyageAI(collection_name: str, database_name: str, s3_bucket_name: str | None = None, mongo_client: MongoClient[dict[str, Any]] | None = None, mongo_connection_string: str | None = None, voyageai_client: Client | None = None, voyageai_api_key: str | None = None, voyagai_model_name: str = 'voyage-multimodal-3', storage_object: ObjectStorage | None = None, index_name: str = 'vector_index', embedding_key: str = 'embedding', relevance_score_fn: str = 'cosine', dimensions: int = 1024, auto_create_index: bool = True, auto_index_timeout: int = 60, **kwargs: Any)¶

MongoDB and VoyageAI integration for multimodal embeddings.

PyMongoVoyageAI performs data operations on text, images, embeddings and arbitrary data. It provides Vector Search based on the similarity of embedding vectors, using the Hierarchical Navigable Small World (HNSW) algorithm.
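To make "similarity of embedding vectors" concrete, here is a minimal sketch of the exact, exhaustive search that an HNSW index approximates. It is not how Atlas or this library is implemented; the helper names `cosine_similarity` and `top_k` and the toy 3-dimensional vectors are assumptions made for illustration.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(
    query: list[float], vectors: dict[str, list[float]], k: int
) -> list[tuple[str, float]]:
    """Score every stored vector against the query and keep the k best."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in vectors.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Toy 3-dimensional "embeddings" (hypothetical data); real vectors have
# `dimensions` entries, 1024 by default.
store = {
    "1": [1.0, 0.0, 0.0],
    "2": [0.0, 1.0, 0.0],
    "3": [0.9, 0.1, 0.0],
}
nearest = top_k([1.0, 0.0, 0.0], store, k=2)
```

An exhaustive scan like this compares the query with every vector, which is what an approximate index such as HNSW avoids by visiting only a subset of candidates.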

Setup:

Set up a MongoDB Atlas cluster. The free M0 tier is enough to get started. Search Indexes are only available on Atlas, the fully managed cloud service, not on self-managed MongoDB. Follow [this guide](https://www.mongodb.com/basics/mongodb-atlas-tutorial).

Create a Collection and a Vector Search Index. The procedure is described [here](https://www.mongodb.com/docs/atlas/atlas-vector-search/create-index/#procedure). You can optionally supply a dimensions argument to programmatically create a Vector Search Index.

Set up your VoyageAI account on dash.voyageai.com. You can either provide the voyageai_api_key to the constructor or create a VoyageAI Client yourself and pass it as voyageai_client.

Set up an S3 bucket for storage. Either provide s3_bucket_name to use the default AWS credentials, or pass an instantiated S3 client to an S3Storage object and provide that object as storage_object. For local testing, you can instead pass a MemoryStorage object.

Instantiate:

import os
from pymongo_voyageai_multimodal import PyMongoVoyageAI

# The constructor accepts the connection string directly via mongo_connection_string.
client = PyMongoVoyageAI(
    mongo_connection_string=os.environ["MONGODB_ATLAS_CONNECTION_STRING"],
    database_name="db_name",
    collection_name="collection_name",
    s3_bucket_name=os.environ["S3_BUCKET_NAME"],
    voyageai_api_key=os.environ["VOYAGEAI_API_KEY"],
)

Add Documents:

from pymongo_voyageai_multimodal import TextDocument, ImageDocument

text = TextDocument(text="foo", metadata={"baz": "bar"})
images = client.url_to_images(
    "https://www.fdrlibrary.org/documents/356632/390886/readingcopy.pdf"
)
documents = [text, images[0], images[1]]
ids = ["1", "2", "3"]
client.add_documents(inputs=documents, ids=ids)

Delete Documents:

client.delete_by_ids(ids=["3"])

Search:

results = client.similarity_search(query="thud", k=1)
for doc in results:
    print(f"* {doc['id']} [{doc['inputs']}]")

Search with filter:

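# post_filter_pipeline takes MongoDB aggregation stages; adjust the field path to your schema.
results = client.similarity_search(
    query="thud", k=1, post_filter_pipeline=[{"$match": {"bar": "baz"}}]
)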
for doc in results:
    print(f"* {doc['id']} [{doc['inputs']}]")

Search with score:

results = client.similarity_search(query="qux", k=1, include_scores=True)

for doc in results:
    print(f"* [SIM={doc['score']:.3f}] {doc['id']} [{doc['inputs']}]")

Async:

# add documents
# await client.aadd_documents(inputs=documents, ids=ids)

# delete documents
# await client.adelete_by_ids(ids=["3"])

# search
# results = await client.asimilarity_search(query="thud", k=1)

# search with score
results = await client.asimilarity_search(query="qux", k=1, include_scores=True)
for doc in results:
    print(f"* [SIM={doc['score']:.3f}] {doc['id']} [{doc['inputs']}]")
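The `await` calls above need a running event loop. The following sketch shows one way to drive several searches concurrently with `asyncio`. `StubClient` is a hypothetical stand-in that only mimics the shape of `asimilarity_search`, so the example runs without Atlas or VoyageAI credentials; with a real client you would pass the `PyMongoVoyageAI` instance instead.

```python
import asyncio
from typing import Any


class StubClient:
    """Hypothetical stand-in that mimics the call shape of asimilarity_search."""

    async def asimilarity_search(
        self, query: str, k: int = 4, **kwargs: Any
    ) -> list[dict[str, Any]]:
        await asyncio.sleep(0)  # yield to the event loop, as a network call would
        return [{"id": f"{query}-{i}", "inputs": []} for i in range(k)]


async def search_many(
    client: StubClient, queries: list[str], k: int
) -> dict[str, list[dict[str, Any]]]:
    """Run one search per query concurrently and map each query to its results."""
    results = await asyncio.gather(
        *(client.asimilarity_search(query, k=k) for query in queries)
    )
    return dict(zip(queries, results))


# asyncio.run creates the event loop, runs the coroutine, and closes the loop.
found = asyncio.run(search_many(StubClient(), ["thud", "qux"], k=2))
```

`asyncio.gather` returns results in the order the coroutines were passed, which is why zipping them back onto `queries` is safe.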

__init__(collection_name: str, database_name: str, s3_bucket_name: str | None = None, mongo_client: MongoClient[dict[str, Any]] | None = None, mongo_connection_string: str | None = None, voyageai_client: Client | None = None, voyageai_api_key: str | None = None, voyagai_model_name: str = 'voyage-multimodal-3', storage_object: ObjectStorage | None = None, index_name: str = 'vector_index', embedding_key: str = 'embedding', relevance_score_fn: str = 'cosine', dimensions: int = 1024, auto_create_index: bool = True, auto_index_timeout: int = 60, **kwargs: Any)¶

Parameters:

collection_name – The name of the MongoDB collection to add the documents to.
database_name – The name of the MongoDB database to use.
s3_bucket_name – The name of the S3 bucket to use for storage.
mongo_client – An instantiated MongoClient to use.
mongo_connection_string – A MongoDB connection string that is used to create a MongoClient. It must be provided if mongo_client is not provided.
voyageai_client – An instantiated VoyageAI client to use.
voyageai_api_key – An API key to use when creating a VoyageAI Client object. It must be provided if voyageai_client is not provided.
voyagai_model_name – The model name to use for VoyageAI embeddings.
storage_object – The ObjectStorage object to use. It can be used to provide an alternate storage backend or an instantiated S3Storage object.
index_name – The Atlas vector search index name to use for the collection.
embedding_key – Field that will contain the embedding for each document.
relevance_score_fn – The similarity score used for the index. Currently supported: ‘euclidean’, ‘cosine’, and ‘dotProduct’.
dimensions – The dimensionality of the VoyageAI model.
auto_create_index – Whether to automatically create the vector search index if needed.
auto_index_timeout – Timeout in seconds to wait for an auto-created index to be ready.
kwargs – Additional keyword args accepted for future use.
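The three values accepted by relevance_score_fn correspond to different ways of comparing vectors. The sketch below computes the raw quantity behind each one in plain Python to show that they can rank the same vectors differently. The `METRICS` mapping and the sample vectors are assumptions for illustration, and the raw values are not necessarily the score values Atlas reports.

```python
import math


def dot_product(a: list[float], b: list[float]) -> float:
    """Sum of element-wise products; grows with vector magnitude."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a: list[float], b: list[float]) -> float:
    """Dot product of the normalized vectors; ignores magnitude."""
    return dot_product(a, b) / (math.hypot(*a) * math.hypot(*b))


def euclidean(a: list[float], b: list[float]) -> float:
    """Straight-line distance; smaller means closer."""
    return math.dist(a, b)


# Keys match the names accepted by relevance_score_fn.
METRICS = {"euclidean": euclidean, "cosine": cosine, "dotProduct": dot_product}

query = [1.0, 0.0]
aligned = [0.5, 0.0]  # same direction as the query, small magnitude
large = [3.0, 3.0]    # 45 degrees away from the query, large magnitude

# For each metric: (value against `aligned`, value against `large`).
raw = {name: (fn(query, aligned), fn(query, large)) for name, fn in METRICS.items()}
```

Here cosine and euclidean both prefer `aligned`, while the dot product prefers `large` because of its magnitude. That difference is why the choice of relevance_score_fn should match how the embedding model's vectors are meant to be compared.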

aadd_documents (async): Add multimodal documents to the vectorstore.

Parameters:

inputs – List of inputs to add to the vectorstore, which are each a list of documents.
ids – Optional list of unique ids that will be used as index in VectorStore.
batch_size – Number of documents to insert at a time. Tuning this may help with performance and sidestep MongoDB limits.
kwargs – Additional keyword args for future expansion.

Returns:

A list of documents with their associated input documents.
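The batch_size parameter implies that inputs are inserted in fixed-size chunks. A minimal sketch of that chunking, assuming a hypothetical helper named `batched` (the library's internal batching may differ):

```python
from collections.abc import Iterator, Sequence
from typing import Any


def batched(items: Sequence[Any], batch_size: int) -> Iterator[Sequence[Any]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start : start + batch_size]


# Five inputs with batch_size=2 produce three batches; the last one is short.
batches = list(batched(["a", "b", "c", "d", "e"], batch_size=2))
```

Smaller batches keep each insert request small; larger batches reduce the number of round trips.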

async aclose() → None¶: Close the client, cleaning up resources.

add_documents: Add multimodal documents to the vectorstore.

Parameters:

inputs – List of inputs to add to the vectorstore, which are each a list of documents.
ids – Optional list of unique ids that will be used as index in VectorStore.
batch_size – Number of documents to insert at a time. Tuning this may help with performance and sidestep MongoDB limits.
kwargs – Additional keyword args for future expansion.

Returns:

A list of documents with their associated input documents.

async adelete_by_ids(ids: list[str | ObjectId], delete_stored_objects: bool = True, **kwargs: Any) → bool¶

Delete documents by ids.

Parameters:

ids – List of ids to delete.
delete_stored_objects – Whether to delete the associated stored objects.
**kwargs – Other keyword arguments passed to delete_many().

Returns:

True if deletion is successful, False otherwise.

Return type:

bool

async adelete_many(filter: Mapping[str, Any], delete_stored_objects: bool = True, **kwargs: Any) → bool¶

Delete documents using a filter.

Parameters:

filter – A MongoDB query filter that selects the documents to delete.
delete_stored_objects – Whether to delete the associated stored objects.
**kwargs – Other keyword arguments passed to the collection’s delete_many method.

Returns:

True if deletion is successful, False otherwise.

Return type:

bool

async aget_by_ids(ids: Sequence[str | ObjectId], extract_images: bool = True) → list[dict[str, Any]]¶

Get a list of documents by id.

Parameters:

ids – List of ids to search for.
extract_images – Whether to extract the stored documents into image documents.

Returns:

A list of matching documents; each document's inputs field is a list of stored documents or image documents.

async aimage_to_storage(document: ImageDocument | Image) → StoredDocument¶

Convert an image to a stored document.

Parameters: document – The input document or image object.
Returns: The stored document object.

async asimilarity_search(query: str, k: int = 4, pre_filter: dict[str, Any] | None = None, post_filter_pipeline: list[dict[str, Any]] | None = None, oversampling_factor: int = 10, include_scores: bool = False, include_embeddings: bool = False, extract_images: bool = False, **kwargs: Any) → list[dict[str, Any]]¶

Return documents most similar to the given query.

Parameters:

query – Input text of semantic query.
k – The number of documents to return. Defaults to 4.
pre_filter – MQL match expression comparing an indexed field.
post_filter_pipeline – (Optional) Pipeline of MongoDB aggregation stages to filter/process results after $vectorSearch.
oversampling_factor – Multiple of k used when generating number of candidates at each step in the HNSW Vector Search.
include_scores – If True, the query score of each result will be included in metadata.
include_embeddings – If True, the embedding vector of each result will be included in metadata.
extract_images – If True, the stored documents will be converted to image documents.
kwargs – Additional keyword arguments for the search.

Returns:

List of the documents most similar to the query, with their scores when include_scores is True; each document's inputs field is a list of stored documents or image documents.
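The parameters above map naturally onto an Atlas `$vectorSearch` aggregation pipeline. The sketch below shows one plausible way to assemble such a pipeline; `build_pipeline` is a hypothetical helper, the `kind` filter field is made up, and the library's actual pipeline may differ. The stage fields used (`index`, `path`, `queryVector`, `numCandidates`, `limit`, `filter`) are those of the Atlas `$vectorSearch` stage.

```python
from __future__ import annotations

from typing import Any


def build_pipeline(
    query_vector: list[float],
    k: int = 4,
    oversampling_factor: int = 10,
    pre_filter: dict[str, Any] | None = None,
    post_filter_pipeline: list[dict[str, Any]] | None = None,
    index_name: str = "vector_index",
    embedding_key: str = "embedding",
    include_scores: bool = False,
) -> list[dict[str, Any]]:
    """Assemble a $vectorSearch stage followed by optional post-processing."""
    search: dict[str, Any] = {
        "index": index_name,
        "path": embedding_key,
        "queryVector": query_vector,
        # Candidates considered by the HNSW search: a multiple of k.
        "numCandidates": k * oversampling_factor,
        "limit": k,
    }
    if pre_filter:
        # Applied inside the vector search; the fields must be indexed.
        search["filter"] = pre_filter
    pipeline: list[dict[str, Any]] = [{"$vectorSearch": search}]
    if include_scores:
        pipeline.append({"$set": {"score": {"$meta": "vectorSearchScore"}}})
    if post_filter_pipeline:
        # Ordinary aggregation stages that run on the search results.
        pipeline.extend(post_filter_pipeline)
    return pipeline


pipeline = build_pipeline(
    [0.1, 0.2],
    k=2,
    pre_filter={"kind": "text"},  # hypothetical indexed field
    post_filter_pipeline=[{"$match": {"score": {"$gte": 0.5}}}],
    include_scores=True,
)
```

The split matters for recall: a pre_filter narrows the candidates before the k nearest are chosen, while post-filter stages can only discard results from the k already returned.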

async astorage_to_image(document: StoredDocument | str) → ImageDocument¶

Convert a stored document to an image document.

Parameters: document – The input document or object name.
Returns: The image document object.

async aurl_to_images(url: str, metadata: dict[str, Any] | None = None, start: int = 0, end: int | None = None, image_column: str | None = None, **kwargs: Any) → list[ImageDocument]¶

Extract images from a url.

Parameters:

url – The url to load the images from.
metadata – A set of metadata to associate with the images.
start – The start frame to use for the images.
end – The end frame to use for the images.
image_column – The name of the column used to store the image data, for parquet files.

Returns:

A list of image document objects.

async await_for_indexing(timeout: int = 60, interval: int = 1) → None¶: Wait for the search index to update to account for newly added embeddings.

close() → None¶: Close the client, cleaning up resources.

delete_by_ids(ids: list[str | ObjectId], delete_stored_objects: bool = True, **kwargs: Any) → bool¶

Delete documents by ids.

Parameters:

ids – List of ids to delete.
delete_stored_objects – Whether to delete the associated stored objects.
**kwargs – Other keyword arguments passed to delete_many().

Returns:

True if deletion is successful, False otherwise.

Return type:

bool

delete_many(filter: Mapping[str, Any], delete_stored_objects: bool = True, **kwargs: Any) → bool¶

Delete documents using a filter.

Parameters:

filter – A MongoDB query filter that selects the documents to delete.
delete_stored_objects – Whether to delete the associated stored objects.
**kwargs – Other keyword arguments passed to the collection’s delete_many method.

Returns:

True if deletion is successful, False otherwise.

Return type:

bool

get_by_ids(ids: Sequence[str | ObjectId], extract_images: bool = True) → list[dict[str, Any]]¶

Get a list of documents by id.

Parameters:

ids – List of ids to search for.
extract_images – Whether to extract the stored documents into image documents.

Returns:

A list of matching documents; each document's inputs field is a list of stored documents or image documents.

image_to_storage(document: ImageDocument | Image) → StoredDocument¶

Convert an image to a stored document.

Parameters: document – The input document or image object.
Returns: The stored document object.

similarity_search(query: str, k: int = 4, pre_filter: dict[str, Any] | None = None, post_filter_pipeline: list[dict[str, Any]] | None = None, oversampling_factor: int = 10, include_scores: bool = False, include_embeddings: bool = False, extract_images: bool = False, **kwargs: Any) → list[dict[str, Any]]¶

Return documents most similar to the given query.

Parameters:

query – Input text of semantic query.
k – The number of documents to return. Defaults to 4.
pre_filter – MQL match expression comparing an indexed field.
post_filter_pipeline – (Optional) Pipeline of MongoDB aggregation stages to filter/process results after $vectorSearch.
oversampling_factor – Multiple of k used when generating number of candidates at each step in the HNSW Vector Search.
include_scores – If True, the query score of each result will be included in metadata.
include_embeddings – If True, the embedding vector of each result will be included in metadata.
extract_images – If True, the stored documents will be converted to image documents.
kwargs – Additional keyword arguments for the search.

Returns:

List of the documents most similar to the query, with their scores when include_scores is True; each document's inputs field is a list of stored documents or image documents.

storage_to_image(document: StoredDocument | str) → ImageDocument¶

Convert a stored document to an image document.

Parameters: document – The input document or object name.
Returns: The image document object.

url_to_images(url: str, metadata: dict[str, Any] | None = None, start: int = 0, end: int | None = None, image_column: str | None = None, **kwargs: Any) → list[ImageDocument]¶

Extract images from a url.

Parameters:

url – The url to load the images from.
metadata – A set of metadata to associate with the images.
start – The start frame to use for the images.
end – The end frame to use for the images.
image_column – The name of the column used to store the image data, for parquet files.

Returns:

A list of image document objects.

wait_for_indexing(timeout: int = 60, interval: int = 1) → None¶: Wait for the search index to update to account for newly added embeddings.
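The timeout and interval parameters suggest a polling loop. The sketch below shows that general pattern with a hypothetical `wait_until` helper and a fake readiness check. Raising `TimeoutError` on expiry is an assumption; the library's behavior when the timeout elapses is not documented here.

```python
import time
from collections.abc import Callable


def wait_until(
    is_ready: Callable[[], bool], timeout: float = 60, interval: float = 1
) -> None:
    """Poll is_ready until it returns True or timeout seconds have elapsed."""
    deadline = time.monotonic() + timeout
    while not is_ready():
        if time.monotonic() >= deadline:
            raise TimeoutError(f"not ready after {timeout} seconds")
        time.sleep(interval)


# Hypothetical readiness check that succeeds on the third poll.
polls = {"count": 0}


def ready_on_third_poll() -> bool:
    polls["count"] += 1
    return polls["count"] >= 3


wait_until(ready_on_third_poll, timeout=5, interval=0.01)
```

Using `time.monotonic` for the deadline keeps the wait unaffected by system clock adjustments.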

class pymongo_voyageai_multimodal.ImageDocument(*, type: DocumentType = DocumentType.image, metadata: dict[str, Any] | None = None, image: Image, name: str | None = None, source_url: str | None = None, page_number: int | None = None)¶: A document object containing image data and associated properties.

class pymongo_voyageai_multimodal.TextDocument(*, type: DocumentType = DocumentType.text, metadata: dict[str, Any] | None = None, text: str)¶: A document object containing text data.

class pymongo_voyageai_multimodal.StoredDocument(*, type: DocumentType = DocumentType.storage, metadata: dict[str, Any] | None = None, root_location: str, object_name: str, name: str | None = None, source_url: str | None = None, page_number: int | None = None)¶: A document object containing stored object data and associated properties.

class pymongo_voyageai_multimodal.S3Storage(bucket_name: str, client: BaseClient | None = None, region_name: str | None = None)¶

An object store using an S3 bucket.

__init__(bucket_name: str, client: BaseClient | None = None, region_name: str | None = None)¶

Create an S3 object store.

Parameters:

bucket_name – The S3 bucket name.
client – An instantiated boto3 S3 client.
region_name – The AWS region name to use when creating a boto3 S3 client.

class pymongo_voyageai_multimodal.MemoryStorage¶

An in-memory object store.

delete_data(object_name: str) → None¶: Delete data from the object store.

load_url(url: str) → BytesIO¶: Load data from a url.

read_data(object_name: str) → BytesIO¶: Read data using the object store.

save_data(data: BytesIO, object_name: str) → None¶: Save data to the object store.

url_prefixes: list[str] | None = ['file://']¶: The url prefixes used by the object store, for reading data from a url.

class pymongo_voyageai_multimodal.ObjectStorage¶

A class used to store binary data.

close()¶: Close the object store.

delete_data(object_name: str) → None¶: Delete data from the object store.

load_url(url: str) → BytesIO¶: Load data from a url.

read_data(object_name: str) → BytesIO¶: Read data from the object store.

root_location: str¶: The default root location to use in the object store.

save_data(data: BytesIO, object_name: str) → None¶: Save data to the object store.

url_prefixes: list[str] | None¶: The url prefixes used by the object store, for reading data from a url.
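To illustrate the shape of an alternate storage backend, here is a standalone sketch that mirrors the method and attribute names listed above using a plain dictionary. `DictStorage` is hypothetical and does not subclass the real ObjectStorage, so it runs without the library installed; a real backend would presumably derive from ObjectStorage and be passed as storage_object.

```python
from __future__ import annotations

from io import BytesIO


class DictStorage:
    """Standalone sketch mirroring the ObjectStorage member names."""

    root_location: str = "memory"  # hypothetical value
    url_prefixes: list[str] | None = None  # this sketch handles no URL schemes

    def __init__(self) -> None:
        self._objects: dict[str, bytes] = {}

    def save_data(self, data: BytesIO, object_name: str) -> None:
        """Store a copy of the buffer's full contents under object_name."""
        self._objects[object_name] = data.getvalue()

    def read_data(self, object_name: str) -> BytesIO:
        """Return a fresh buffer positioned at the start of the stored bytes."""
        return BytesIO(self._objects[object_name])

    def delete_data(self, object_name: str) -> None:
        """Remove the object if present; missing names are ignored."""
        self._objects.pop(object_name, None)

    def load_url(self, url: str) -> BytesIO:
        raise NotImplementedError("this sketch does not fetch URLs")

    def close(self) -> None:
        """Release everything held in memory."""
        self._objects.clear()


store = DictStorage()
store.save_data(BytesIO(b"png-bytes"), "page-1")
payload = store.read_data("page-1").read()
store.delete_data("page-1")
```

Returning a new `BytesIO` from `read_data` means each caller gets its own read position, so one reader cannot disturb another.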

class pymongo_voyageai_multimodal.DocumentType(value)¶: The type of document used by PyMongoVoyageAI.

class pymongo_voyageai_multimodal.Document(*, type: DocumentType, metadata: dict[str, Any] | None = None)¶: A document object used by PyMongoVoyageAI.