| 1 | +# EmbeddingBridge Python Package |
| 2 | + |
| 3 | +A Python interface to the EmbeddingBridge vector database with dataset management capabilities. |
| 4 | + |
| 5 | +## Features |
| 6 | + |
| 7 | +- Python ctypes bindings to the C core |
| 8 | +- Dataset management with S3 integration |
| 9 | +- Vector similarity search |
| 10 | +- Compatibility with Pinecone datasets |
| 11 | +- Simple and intuitive API |
| 12 | + |
| 13 | +## Installation |
| 14 | + |
| 15 | +### From Source |
| 16 | + |
| 17 | +```bash |
| 18 | +pip install -e . |
| 19 | +``` |
| 20 | + |
| 21 | +### Requirements |
| 22 | + |
| 23 | +- Python 3.7+ |
| 24 | +- NumPy |
| 25 | +- pandas |
| 26 | +- pyarrow |
| 27 | +- boto3 (for S3 support) |
| 28 | +- zstandard |
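Before installing from source, it can help to check which of the listed dependencies are already importable. The following is a minimal sketch using only the standard library; the helper name `missing_modules` is hypothetical, and the import names (`numpy`, `pandas`, `pyarrow`, `zstandard`, `boto3`) are assumed to match the packages listed above.

```python
from importlib.util import find_spec

# Import names assumed to correspond to the requirements listed above.
REQUIRED = ["numpy", "pandas", "pyarrow", "zstandard"]
OPTIONAL = ["boto3"]  # Only needed for S3 support


def missing_modules(names):
    """Return the subset of top-level module names that cannot be found."""
    return [name for name in names if find_spec(name) is None]


print("Missing required:", missing_modules(REQUIRED))
print("Missing optional:", missing_modules(OPTIONAL))
```

`find_spec` only locates a top-level module without importing it, so the check is cheap and has no side effects.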
| 29 | + |
| 30 | +## Usage |
| 31 | + |
| 32 | +### Embedding Store |
| 33 | + |
| 34 | +```python |
| 35 | +from embeddingbridge import EmbeddingStore |
| 36 | + |
| 37 | +# Create a new store |
| 38 | +store = EmbeddingStore("path/to/store", dimension=384) |
| 39 | + |
| 40 | +# Add vectors |
| 41 | +store.add_vector( |
| 42 | +    id="doc1", |
| 43 | +    vector=[0.1, 0.2, ...],  # Your vector values |
| 44 | +    metadata={"text": "Example document", "source": "wiki"} |
| 45 | +) |
| 46 | + |
| 47 | +# Search for similar vectors |
| 48 | +results = store.search([0.1, 0.2, ...], top_k=5) |
| 49 | +for result in results: |
| 50 | +    print(f"ID: {result['id']}, Score: {result['score']}") |
| 51 | +    print(f"Metadata: {result['metadata']}") |
| 52 | +``` |
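For intuition about what a `top_k` similarity search returns, here is a minimal, library-independent sketch of brute-force search. It assumes cosine similarity as the metric, which this README does not confirm, and the names `cosine_similarity` and `brute_force_search` are hypothetical. Only the result shape (`id`, `score`, `metadata`) mirrors the example above.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # Treat zero vectors as dissimilar to everything
    return dot / (norm_a * norm_b)


def brute_force_search(records, query_vector, top_k=5):
    """Score every record against the query and keep the best top_k.

    `records` maps an ID to a (vector, metadata) pair. This is an
    illustrative stand-in, not EmbeddingBridge's actual storage layout.
    """
    scored = [
        {"id": rec_id, "score": cosine_similarity(vec, query_vector), "metadata": meta}
        for rec_id, (vec, meta) in records.items()
    ]
    scored.sort(key=lambda r: r["score"], reverse=True)
    return scored[:top_k]


records = {
    "doc1": ([1.0, 0.0, 0.0], {"source": "wiki"}),
    "doc2": ([0.0, 1.0, 0.0], {"source": "blog"}),
    "doc3": ([0.9, 0.1, 0.0], {"source": "wiki"}),
}
results = brute_force_search(records, [1.0, 0.0, 0.0], top_k=2)
for result in results:
    print(f"ID: {result['id']}, Score: {result['score']:.3f}")
```

A linear scan like this is only practical for small collections; it is shown to make the meaning of the scores and the ordering concrete, not to describe how the C core performs its search.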
| 53 | + |
| 54 | +### Dataset Management |
| 55 | + |
| 56 | +```python |
| 57 | +from embeddingbridge import datasets |
| 58 | + |
| 59 | +# List available datasets |
| 60 | +dataset_list = datasets.list_datasets() |
| 61 | +print(dataset_list) |
| 62 | + |
| 63 | +# Load a dataset |
| 64 | +dataset = datasets.load_dataset("my-dataset") |
| 65 | + |
| 66 | +# Get dataset info |
| 67 | +print(f"Dimension: {dataset.dimension}") |
| 68 | +print(f"Documents: {len(dataset)}") |
| 69 | + |
| 70 | +# Search for similar vectors |
| 71 | +query_vector = [0.1, 0.2, ...] # Your query vector |
| 72 | +results = dataset.search(query_vector, top_k=10) |
| 73 | +for doc_id, score in results:  # doc_id avoids shadowing the built-in id() |
| 74 | +    print(f"ID: {doc_id}, Score: {score}") |
| 75 | + |
| 76 | +# Save a dataset to S3 |
| 77 | +dataset.save("s3://my-bucket/datasets/my-dataset") |
| 78 | + |
| 79 | +# Load a dataset from S3 |
| 80 | +s3_dataset = datasets.Dataset.from_path("s3://my-bucket/datasets/my-dataset") |
| 81 | +``` |
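The same path argument can name either a local directory or an S3 location. As a rough sketch of how such a path could be told apart and split into bucket and key, the snippet below uses the standard library's `urlparse`. The helper `split_dataset_path` is hypothetical and is not part of the package; how EmbeddingBridge itself parses paths is not documented here.

```python
from urllib.parse import urlparse


def split_dataset_path(path):
    """Return ("s3", bucket, key) for s3:// URIs, else ("local", None, path)."""
    parsed = urlparse(path)
    if parsed.scheme == "s3":
        # netloc holds the bucket; the remaining path is the object key prefix
        return "s3", parsed.netloc, parsed.path.lstrip("/")
    return "local", None, path


print(split_dataset_path("s3://my-bucket/datasets/my-dataset"))
print(split_dataset_path("data/my-dataset"))
```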
| 82 | + |
| 83 | +## API Reference |
| 84 | + |
| 85 | +### `EmbeddingStore` |
| 86 | + |
| 87 | +- `__init__(path, dimension=None)`: Initialize a new embedding store or open an existing one |
| 88 | +- `add_vector(id, vector, metadata=None)`: Add a vector to the store |
| 89 | +- `search(query_vector, top_k=10)`: Search for similar vectors |
| 90 | +- `get_vector(id)`: Get a vector by ID |
| 91 | +- `delete_vector(id)`: Delete a vector by ID |
| 92 | +- `get_metadata(id)`: Get metadata for a vector |
| 93 | +- `dimension`: Property returning the dimension of vectors |
| 94 | +- `count`: Property returning the number of vectors |
| 95 | + |
| 96 | +### `Dataset` |
| 97 | + |
| 98 | +- `from_path(path)`: Load a dataset from a local path or an S3 URI |
| 99 | +- `save(path, overwrite=False)`: Save the dataset to a local path or an S3 URI |
| 100 | +- `iter_documents(batch_size=100)`: Iterate through documents in batches |
| 101 | +- `search(query_vector, top_k=10)`: Search for similar vectors |
| 102 | +- `dimension`: Property returning the dimension of vectors |
| 103 | +- `documents`: DataFrame containing the documents |
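To illustrate what iterating "in batches" means for `iter_documents(batch_size=100)`, here is a generic batching generator over a plain list. It is a sketch of the pattern only: the name `iter_batches` is hypothetical, and the actual method may yield a different type (for example, DataFrame slices).

```python
def iter_batches(items, batch_size=100):
    """Yield consecutive slices of `items`, each at most `batch_size` long."""
    if batch_size <= 0:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


documents = [f"doc{i}" for i in range(5)]
for batch in iter_batches(documents, batch_size=2):
    print(batch)
```

The final batch may be shorter than `batch_size` when the number of items is not an exact multiple of it.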
| 104 | + |
| 105 | +### Helper Functions |
| 106 | + |
| 107 | +- `list_datasets(as_df=False)`: List available datasets |
| 108 | +- `load_dataset(name)`: Load a dataset by name |
| 109 | + |
| 110 | +## License |
| 111 | + |
| 112 | +This program is free software; you can redistribute it and/or modify |
| 113 | +it under the terms of the GNU General Public License as published by |
| 114 | +the Free Software Foundation; either version 2 of the License, or |
| 115 | +(at your option) any later version. |