How to Optimize Token Usage with ChromaDB (Step by Step)
If you aren’t paying attention to token usage in your vector database queries, you are burning through credits and performance faster than you realize. Here’s how to optimize token usage with ChromaDB so you actually save money and gain speed.
What You’ll Build and Why It Matters
We’re building a minimal but effective pipeline that takes messy real-world documents, stores them in ChromaDB, and queries them with OpenAI embeddings, all while cutting token waste to the bone.
Prerequisites
- Python 3.11+
- `pip install chromadb==0.10.0` (latest as of March 2026; note the PyPI package is `chromadb`, even though the GitHub org is chroma-core)
- OpenAI API key with access to embeddings endpoints
- Basic knowledge of vector databases and large language model APIs
Step 1: Setup Your ChromaDB Environment
First and foremost, you’ve got to get ChromaDB installed and running in a way that plays nicely with your environment. Chroma, developed under the chroma-core organization on GitHub, attracts a ton of attention: over 26,759 stars and 2,140 forks, so it’s far from some niche experiment. That means plenty of community help, but also some inevitable open issues (513 of them as of March 2026).
The project is Apache-2.0 licensed and was last updated on 2026-03-21—so it’s actively maintained, and that’s good because you’re going to lean on it heavily to manage your embeddings wisely.
```shell
# Install ChromaDB if you haven’t yet (the PyPI package name is chromadb)
pip install chromadb==0.10.0
```
Setting up a persistent ChromaDB usually just means running this minimal code snippet:
```python
import chromadb

# PersistentClient stores data on local disk (SQLite-backed);
# a plain Client() is in-memory only in recent versions
client = chromadb.PersistentClient(path="./chroma_data")
collection = client.get_or_create_collection(name="mydocs")
```
Why do this first? Because every later step depends on budgeting tokens against a working collection—so get the ChromaDB dependency sorted before anything else.
Common issues: If you hit errors like `ModuleNotFoundError`, double-check your versions. chromadb releases often change internals, so pinning to a specific version avoids random breakage.
Step 2: Chunking Content for Token Efficiency
Your documents aren’t neat and trimmed—they come with headers, tables, footnotes, and spammy bits. Feeding long raw strings straight to embedding APIs is a surefire way to waste tokens, and every token counts.
Instead, chunk your content smartly: not too big, not too small. A chunk size at or slightly under your embedding API’s token limit avoids a ton of wasted compute and cost.
```python
import tiktoken  # OpenAI’s own tokenizer, used here for exact token counting
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Build the encoder once; recreating it on every length check is slow
encoder = tiktoken.encoding_for_model("text-embedding-3-small")

# Initialize a splitter for ~500-token chunks (safe for OpenAI embeddings)
splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    length_function=lambda text: len(encoder.encode(text)),
)

def chunk_text(text):
    chunks = splitter.split_text(text)
    print(f"Split into {len(chunks)} chunks")
    return chunks

# Example usage
doc_text = open("messy_doc.txt").read()
chunks = chunk_text(doc_text)
```
Why chunking saves tokens: If you pump full documents into the embedding API, you pay for every token while the resulting vector averages over mostly noise—plus, you could exceed token limits or trigger rate limits. Small but meaningful chunks reliably produce focused embeddings for accurate vector search.
Errors to watch for: Overlapping chunks might occasionally look redundant in search results—tweak `chunk_overlap` accordingly. And watch the tokenizer choice: if your tokenizer doesn’t match the model you actually call, your token counts can be off significantly, wrecking your entire token budgeting plan.
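To see how much a mismatch matters, compare two deliberately crude counting heuristics. Neither is a real BPE tokenizer—both functions below are illustrative stand-ins—but the divergence between them mirrors what happens when you count tokens with the wrong model’s encoding:

```python
# Illustrative only: two crude stand-ins for mismatched tokenizers.
# Real encoders diverge similarly, which is why counts must come from
# the exact encoding of the model you will call.

def count_by_words(text: str) -> int:
    # Rough heuristic: one token per whitespace-separated word
    return len(text.split())

def count_by_chars(text: str, chars_per_token: int = 4) -> int:
    # Common rule of thumb: roughly 4 characters per token for English
    return max(1, len(text) // chars_per_token)

sample = "Token budgeting fails silently when the tokenizer does not match the model."
w = count_by_words(sample)
c = count_by_chars(sample)
print(f"word-based estimate: {w}, char-based estimate: {c}")
print(f"relative difference: {abs(w - c) / max(w, c):.0%}")
```

Two plausible-looking heuristics disagree substantially on the same sentence—which is exactly why the budget math only works when the counter and the model share one tokenizer.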
Step 3: Embeddings With ChromaDB – Optimize, Don’t Blindly Embed Everything
This is where many devs blow their token budgets. Feeding every chunk through OpenAI’s embedding endpoint quietly inflates costs.
Instead, pre-filter chunks before embedding: use a cheap text-similarity filter, such as a quick TF-IDF check, and only embed the most relevant 30–40% of chunks. That reduces storage, query time, and token cost.
```python
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

# A cheap approximation to filter meaningful chunks before expensive embeddings
def filter_chunks(chunks, query, keep_ratio=0.4):
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(chunks + [query])
    query_vec = X[-1]
    chunk_vecs = X[:-1]
    # Rows are L2-normalized by default, so the dot product is cosine similarity
    similarities = (chunk_vecs @ query_vec.T).toarray().flatten()
    threshold = np.quantile(similarities, 1 - keep_ratio)
    filtered = [chunk for chunk, sim in zip(chunks, similarities) if sim >= threshold]
    print(f"Filtered from {len(chunks)} to {len(filtered)} chunks before embedding")
    return filtered

chunks_to_embed = filter_chunks(chunks, "What is token usage optimization?")
```
Now, embed only `chunks_to_embed` in ChromaDB:
```python
import hashlib

import chromadb
from openai import OpenAI

client = chromadb.Client()
collection = client.get_or_create_collection("mydocs")
embedding_model = OpenAI()

for chunk in chunks_to_embed:
    embedding = embedding_model.embeddings.create(
        model="text-embedding-3-small",  # always pass the model explicitly
        input=chunk,
    ).data[0].embedding
    collection.add(
        documents=[chunk],
        embeddings=[embedding],
        # Chroma IDs must be strings; a content hash is stable across runs,
        # unlike Python's built-in hash(), which is randomized per process
        ids=[hashlib.sha256(chunk.encode()).hexdigest()[:16]],
    )
```
Why this approach? Because embeddings are the big token hogs: a single 500-token chunk costs 500 input tokens just to vectorize. Do you really want to embed garbage fragments? No. Filter smart and save tokens.
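To see the stakes, here is a back-of-envelope cost estimate. The per-million-token price below is an assumption for illustration only—substitute your provider’s current rate:

```python
# Back-of-envelope embedding cost estimate. The price is an assumed
# illustration; check your provider's current pricing page.
PRICE_PER_MILLION_TOKENS = 0.02  # USD, assumed rate for a small embedding model

def embedding_cost(num_chunks: int, tokens_per_chunk: int,
                   price_per_million: float = PRICE_PER_MILLION_TOKENS) -> float:
    """Estimated USD cost to embed num_chunks chunks of the given size."""
    total_tokens = num_chunks * tokens_per_chunk
    return total_tokens / 1_000_000 * price_per_million

# Embedding all 1,000 chunks vs. only the filtered 40%:
full = embedding_cost(1000, 500)
filtered = embedding_cost(400, 500)
print(f"full: ${full:.4f}, filtered: ${filtered:.4f}, saved: ${full - filtered:.4f}")
```

Per run the numbers look tiny, but the same ratio applies at every scale: filtering to 40% of chunks cuts the embedding bill by 60%, every time the pipeline runs.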
Errors you might hit: API rate limits. ChromaDB’s batch insertion can help mitigate API calls—more on that below.
Step 4: Bulk Batch Inserts to Reduce API Calls
It’s not just about tokens—you eat latency and end up paying more for too many small calls. Bulk batch inserts kill two birds with one stone: better throughput and fewer redundant API hits.
ChromaDB’s collection.add supports multiple documents & embeddings at once. Pile up your filtered chunks in batches of 50 or 100 to save big time.
```python
import hashlib

BATCH_SIZE = 50

def batch(iterable, n=1):
    length = len(iterable)
    for ndx in range(0, length, n):
        yield iterable[ndx:min(ndx + n, length)]

for chunk_batch in batch(chunks_to_embed, BATCH_SIZE):
    # The embeddings endpoint accepts a list input, so one request covers the batch
    response = embedding_model.embeddings.create(
        model="text-embedding-3-small",
        input=chunk_batch,
    )
    embeddings = [item.embedding for item in response.data]
    collection.add(
        documents=chunk_batch,
        embeddings=embeddings,
        ids=[hashlib.sha256(c.encode()).hexdigest()[:16] for c in chunk_batch],
    )
    print(f"Indexed {len(chunk_batch)} chunks in batch")
```
Why batch? One request = one token billing event and one network trip. Don’t be that developer who waits on 100 individual API calls on the clock.
Watch out for:
- Too-large batches—over 100 per call may throttle you.
- Partial failures—wrap your calls in try/except to handle intermittent errors like timeouts or 429s.
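One way to handle those intermittent failures is a small retry wrapper with exponential backoff and jitter. This is a sketch: the `flaky` function is a hypothetical stand-in for an embedding or insert call, so the example runs without an API key, and in real code you would catch your client’s specific error types rather than a bare `Exception`:

```python
import random
import time

def with_retries(fn, max_attempts=5, base_delay=0.5):
    """Call fn(), retrying with exponential backoff plus jitter on failure.

    A sketch for wrapping embedding/insert calls that may hit timeouts or
    429s; narrow the except clause to your client's error types in practice.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            time.sleep(delay)

# Demo with a fake flaky call that fails twice, then succeeds:
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("simulated 429/timeout")
    return "ok"

print(with_retries(flaky, base_delay=0.01))
```

Exponential backoff with jitter is the usual pattern for 429s: each retry waits roughly twice as long, and the random jitter keeps parallel workers from hammering the API in lockstep.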
Step 5: Smart Token Counting Before Queries — Don’t Just Guess
When running queries, you need a handle on the tokens the whole pipeline burns. This is critical especially if your queries themselves are long passages or combined user input plus context.
Instead of eyeballing token counts, rely on tiktoken or an equivalent to count tokens exactly at each step. That way you can truncate or adjust inputs on the fly before spending the query’s tokens.
```python
def count_tokens(text, model_name="gpt-4o-mini"):
    tokenizer = tiktoken.encoding_for_model(model_name)
    tokens = tokenizer.encode(text)
    return len(tokens)

query = "Explain token optimization in ChromaDB as simply as possible."
tokens_used = count_tokens(query)
print(f"Token count for query: {tokens_used}")
```
Why bother? Because OpenAI embeddings and chat completions enforce strict, model-specific token limits (for example, 8,191 input tokens for text-embedding-3-small). Going over a limit triggers truncation, errors, or extra billing you’d rather avoid.
Errors you’ll see: the dreaded `OpenAIError: This model's maximum context length is 4097 tokens`. Handle these by counting tokens before sending requests and trimming user input or context accordingly (detailed in the next section).
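A minimal trimming helper might look like this. The whitespace `encode`/`decode` pair is a stand-in so the sketch runs without dependencies; in practice you would pass the matching tiktoken encoder’s `encode` and `decode`:

```python
def truncate_to_budget(text, max_tokens, encode, decode):
    """Trim text to at most max_tokens before sending it to the API.

    encode/decode must belong to the tokenizer of the model you will call;
    in production that would be tiktoken's encoder.encode / encoder.decode.
    """
    tokens = encode(text)
    if len(tokens) <= max_tokens:
        return text  # already within budget: send unchanged
    return decode(tokens[:max_tokens])

# Stand-in tokenizer for demonstration (NOT how real BPE tokenizers work):
encode = lambda s: s.split()
decode = lambda toks: " ".join(toks)

long_query = "explain token optimization " * 100
trimmed = truncate_to_budget(long_query, 50, encode, decode)
print(len(encode(trimmed)))
```

The key point is that truncation happens in token space, not character space, so the result is guaranteed to fit the budget of the model whose tokenizer you passed in.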
Step 6: Context Trimming and Caching Embeddings
When you’re feeding documents and chat history into your language model, the total tokens used matter immensely. A common rookie mistake is always sending your entire document or entire chat history indiscriminately. You blow past token limits in seconds.
Your best option: trim and cache.
- Cache embeddings for static docs (don’t re-embed same text)
- Trim chat history intelligently, prioritizing recent important inputs
- Use a rolling token window of ~3,000 tokens for context in chat/completions
Here’s a snippet illustrating cache check and trimming:
```python
embedding_cache = {}

def get_embedding(text):
    if text in embedding_cache:
        print("Cache hit for embedding")
        return embedding_cache[text]
    embedding = embedding_model.embeddings.create(
        model="text-embedding-3-small",
        input=text,
    ).data[0].embedding
    embedding_cache[text] = embedding
    return embedding

def trim_context(contexts, max_tokens=3000, model_name="gpt-4o-mini"):
    trimmed = []
    token_count = 0
    for c in reversed(contexts):  # start from the latest message
        tcount = count_tokens(c, model_name)
        if token_count + tcount <= max_tokens:
            trimmed.insert(0, c)
            token_count += tcount
        else:
            break
    print(f"Trimmed context to {len(trimmed)} messages totaling {token_count} tokens")
    return trimmed
```
This cached approach saves you hundreds, maybe thousands, of tokens monthly.
The Gotchas
Here is where the idealized tutorial code breaks down in real life:
| Gotcha | What Happens | How to Avoid |
|---|---|---|
| Token limits mismatch | Your token counts are off because of tokenizer or model mismatch; you either overrun or underrun tokens | Always use tiktoken.encoding_for_model() matching exactly the embedding or completion model you use |
| Duplicate embeddings | Embedding the same chunk multiple times wastes tokens and storage | Implement caching and document checksum hashing |
| Chunk overlap errors | Too high overlap creates redundant vectors; too little overlap loses context in split documents | Experiment with chunk_overlap parameter, sticking to 10-15% of chunk size |
| Concurrency deadlocks | Multiple async calls to ChromaDB or OpenAI can cause race conditions or partial inserts | Use sync code or proper async queueing; batch insertions protect you from this |
| Unreliable token counting on non-English texts | Tokenizers may miscount due to multi-byte chars or unusual scripts | Test with representative languages, adjust chunk sizes accordingly |
Full Code Example: Putting It All Together
This example combines all previous steps in a working pipeline:
```python
import hashlib

import numpy as np
import tiktoken
from chromadb import Client
from langchain.text_splitter import RecursiveCharacterTextSplitter
from openai import OpenAI
from sklearn.feature_extraction.text import TfidfVectorizer

# Constants
MODEL_NAME = "text-embedding-3-small"
EMBEDDING_BATCH_SIZE = 50
CHUNK_SIZE = 500
CHUNK_OVERLAP = 50

# Setup
client = Client()
collection = client.get_or_create_collection(name="mydocs")
openai_client = OpenAI()
tokenizer = tiktoken.encoding_for_model(MODEL_NAME)

def count_tokens(text):
    return len(tokenizer.encode(text))

def chunk_text(text):
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=CHUNK_SIZE,
        chunk_overlap=CHUNK_OVERLAP,
        length_function=count_tokens,
    )
    return splitter.split_text(text)

def filter_chunks(chunks, query, keep_ratio=0.4):
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(chunks + [query])
    query_vec = X[-1]
    chunk_vecs = X[:-1]
    sims = (chunk_vecs @ query_vec.T).toarray().flatten()
    thresh = np.quantile(sims, 1 - keep_ratio)
    return [chunk for chunk, sim in zip(chunks, sims) if sim >= thresh]

def batch(iterable, n=1):
    length = len(iterable)
    for ndx in range(0, length, n):
        yield iterable[ndx:min(ndx + n, length)]

def embed_and_index(chunks):
    for chunk_batch in batch(chunks, EMBEDDING_BATCH_SIZE):
        # One API call per batch: the embeddings endpoint accepts a list input
        response = openai_client.embeddings.create(model=MODEL_NAME, input=chunk_batch)
        embeddings = [item.embedding for item in response.data]
        # Content hashes make stable string IDs (Python's hash() is randomized per run)
        ids = [hashlib.sha256(chunk.encode()).hexdigest()[:16] for chunk in chunk_batch]
        collection.add(documents=chunk_batch, embeddings=embeddings, ids=ids)
        print(f"Batch indexed {len(chunk_batch)} chunks")

# Main pipeline
doc_text = open("messy_doc.txt").read()
print("Chunking document...")
chunks = chunk_text(doc_text)
print(f"Generated {len(chunks)} chunks")

# Simulate a sample query
query = "How is token usage optimized with ChromaDB?"
print("Filtering chunks before embedding...")
chunks_to_embed = filter_chunks(chunks, query)

print(f"Embedding and indexing {len(chunks_to_embed)} chunks...")
embed_and_index(chunks_to_embed)
print("Pipeline complete.")
```
What’s Next
If you managed to implement this successfully, your next move should be to integrate ranked retrieval with OpenAI’s GPT completions to build a retrieval augmented generation (RAG) system that gets answers from your vector database with minimized token bills.
Specifically, focus on combining your ChromaDB queries with a response generator that truncates context smartly, ideally applying query reformulation to reduce token use in prompt engineering.
FAQ
Q: Why not just send the whole document to the embedding API once?
A: Because embedding API token limits usually cap at a few thousand tokens—too small for large documents—and you pay linearly per token, so you waste money and get less precise embeddings that hurt retrieval accuracy.
Q: How do I handle documents that change frequently with ChromaDB?
A: You need to implement delta embeddings—only re-embed new or changed chunks. A naive approach is to hash each chunk’s text and compare against stored IDs to know what to update.
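That naive approach can be sketched with a content hash per chunk. `plan_delta` and the sample chunks below are illustrative; in a real pipeline the stored IDs would come from your collection (for example via `collection.get()`):

```python
import hashlib

def chunk_id(chunk: str) -> str:
    """Stable content-derived ID: identical text always maps to the same ID."""
    return hashlib.sha256(chunk.encode("utf-8")).hexdigest()[:16]

def plan_delta(new_chunks, stored_ids):
    """Return only the chunks whose content hash is not already indexed."""
    return [c for c in new_chunks if chunk_id(c) not in stored_ids]

# Example: one chunk changed between document versions, two are unchanged
old = ["alpha text", "beta text", "gamma text"]
stored = {chunk_id(c) for c in old}
new = ["alpha text", "beta text CHANGED", "gamma text"]
to_embed = plan_delta(new, stored)
print(len(to_embed))  # only the changed chunk needs re-embedding
```

Because the ID is derived from the content itself, unchanged chunks hash to IDs already in the store and are skipped, so re-indexing a lightly edited document costs tokens only for the edits.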
Q: How can I debug token miscounts in my app?
A: Use OpenAI’s official tiktoken library and ensure the model name in encoding_for_model() matches exactly what you send to the API. Cross-check your count by printing token arrays and verifying chunk sizes.
Recommendations for Different Developer Personas
Junior Devs: Start by understanding token limits and chunking. Don’t try to optimize everything at once. Your focus should be on breaking docs into ~500 token chunks and ensuring your API calls work without errors.
Mid-Level Devs: Implement filtering and batching next. Use TF-IDF to pre-select which chunks to embed. Add caching so you don’t re-embed duplicates. At this stage, you’ll save real money and time.
Senior Engineers: Automate token monitoring end to end. Build dashboards that alert you when usage spikes. Explore custom chunkers tuned to your document type. Integrate usage metrics with cost dashboards and experiment with prompt engineering for context trimming.
Data as of March 22, 2026. Sources: https://github.com/chroma-core/chroma, https://community.openai.com/t/issue-chromadb-document-and-token-openai-limitations/317378