Vector Database
The AI Ingredient Scanner uses Qdrant Cloud for semantic ingredient search, enabling fast and accurate lookups even with variations in ingredient naming.
Why Vector Search?
Traditional keyword search fails with ingredient names because of common variations:
- Spelling Variations: "Glycerine" vs "Glycerin" vs "Glycerol"
- Scientific Names: "Sodium Lauryl Sulfate" vs "SLS"
- Aliases: "Vitamin E" vs "Tocopherol"
Vector search matches by meaning, not exact text. This enables fuzzy matching across all these variations.
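To make "matching by meaning" concrete, here is a minimal sketch of cosine similarity, the distance metric this page configures for Qdrant. The three-dimensional vectors are made up for illustration; real embeddings come from the embedding model and have 768 dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors, invented for this example (not real model output).
glycerine = [0.9, 0.1, 0.2]
glycerin = [0.88, 0.12, 0.21]
table_salt = [0.1, 0.9, 0.3]

close = cosine_similarity(glycerine, glycerin)    # nearly parallel vectors
far = cosine_similarity(glycerine, table_salt)    # vectors pointing elsewhere
print(f"glycerine vs glycerin:   {close:.3f}")
print(f"glycerine vs table salt: {far:.3f}")
```

Vectors that point in nearly the same direction score close to 1.0, which is why near-identical names can clear a confidence threshold even when their spelling differs.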
Query: "Glycerine"
           │
           ▼
┌─────────────────────┐
│   Embedding Model   │
│  gemini-embedding   │
└──────────┬──────────┘
           │
           ▼
[0.23, 0.45, 0.12, ...]  ← 768-dim vector
           │
           ▼
┌─────────────────────┐
│    Qdrant Cloud     │
│  Cosine Similarity  │
└──────────┬──────────┘
           │
           ▼
Result: "Glycerin" (confidence: 0.98)
Lookup Architecture
┌──────────────────────────────────────────────────────────────┐
│                         LOOKUP FLOW                          │
│                                                              │
│ ┌─────────────┐   ┌─────────────┐   ┌─────────────────────┐  │
│ │ Ingredient  │ → │  Generate   │ → │    Query Qdrant     │  │
│ │    Name     │   │  Embedding  │   │   (Cosine Search)   │  │
│ └─────────────┘   └─────────────┘   └──────────┬──────────┘  │
│                                                │             │
│                                     ┌──────────▼──────────┐  │
│                                     │  Confidence ≥ 0.7?  │  │
│                                     └──────────┬──────────┘  │
│                                                │             │
│                              ┌─────────────────┴───┐         │
│                          YES │                  NO │         │
│                              ▼                     ▼         │
│                       ┌─────────────┐       ┌─────────────┐  │
│                       │ Return Data │       │Google Search│  │
│                       │  (~100ms)   │       │  (~3 sec)   │  │
│                       └─────────────┘       └──────┬──────┘  │
│                                                    │         │
│                                             ┌──────▼──────┐  │
│                                             │ Save Result │  │
│                                             │  to Qdrant  │  │
│                                             └──────┬──────┘  │
│                                                    │         │
│                                             ┌──────▼──────┐  │
│                                             │ Return Data │  │
│                                             └─────────────┘  │
└──────────────────────────────────────────────────────────────┘
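The flow above can be sketched as a small orchestration function. This is only an illustration under stated assumptions: `lookup_with_fallback` and its three callable parameters are hypothetical names, and the in-memory fakes stand in for Qdrant and Google Search so the control flow can be run without network access.

```python
from typing import Callable, Optional

IngredientRecord = dict  # stand-in for the ingredient payload


def lookup_with_fallback(
    name: str,
    lookup_fn: Callable[[str], Optional[IngredientRecord]],
    search_fn: Callable[[str], Optional[IngredientRecord]],
    upsert_fn: Callable[[IngredientRecord], bool],
) -> Optional[IngredientRecord]:
    """Try the vector store first; on a miss, search, cache, and return."""
    cached = lookup_fn(name)
    if cached is not None:
        return cached  # YES branch: confident match in the store

    fresh = search_fn(name)  # NO branch: slower fallback
    if fresh is None:
        return None
    upsert_fn(fresh)  # save so the next lookup takes the fast path
    return fresh


# In-memory fakes (hypothetical) used to exercise the flow.
store: dict[str, IngredientRecord] = {}
search_calls: list[str] = []


def fake_lookup(name: str) -> Optional[IngredientRecord]:
    return store.get(name.lower())


def fake_search(name: str) -> Optional[IngredientRecord]:
    search_calls.append(name)
    return {"name": name, "source": "google_search"}


def fake_upsert(record: IngredientRecord) -> bool:
    store[record["name"].lower()] = record
    return True


first = lookup_with_fallback("Glycerin", fake_lookup, fake_search, fake_upsert)
second = lookup_with_fallback("Glycerin", fake_lookup, fake_search, fake_upsert)
```

Passing the three steps in as callables keeps the branching logic testable on its own; in the real scanner they would presumably be the lookup, search, and upsert operations described on this page.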
Configuration
Collection Settings
COLLECTION_NAME = "ingredients"
VECTOR_SIZE = 768  # gemini-embedding-001 output dimensions
EMBEDDING_MODEL = "gemini-embedding-001"
CONFIDENCE_THRESHOLD = 0.7
Vector Parameters
from qdrant_client.models import Distance, VectorParams

client.create_collection(
    collection_name=COLLECTION_NAME,
    vectors_config=VectorParams(
        size=VECTOR_SIZE,
        distance=Distance.COSINE,  # Cosine similarity
    ),
)
Data Schema
Payload Structure
Each vector point stores ingredient metadata as a JSON payload:
{
  "name": "Glycerin",
  "purpose": "Humectant, moisturizer",
  "safety_rating": 9,
  "concerns": "No known concerns",
  "recommendation": "SAFE",
  "allergy_risk_flag": "low",
  "allergy_potential": "Rare allergic reactions",
  "origin": "Natural",
  "category": "Both",
  "regulatory_status": "FDA approved, EU compliant",
  "regulatory_bans": "No",
  "aliases": ["Glycerine", "Glycerol", "E422"]
}
TypeScript Interface
interface IngredientData {
  name: string;
  purpose: string;
  safety_rating: number;      // 1-10
  concerns: string;
  recommendation: string;     // SAFE | CAUTION | AVOID
  allergy_risk_flag: string;  // high | low
  allergy_potential: string;
  origin: string;             // Natural | Synthetic | Semi-synthetic
  category: string;           // Food | Cosmetics | Both
  regulatory_status: string;
  regulatory_bans: string;    // Yes | No
  source: string;             // qdrant | google_search
  confidence: number;         // 0.0 - 1.0
}
Operations
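The Python functions in this section annotate their values with `IngredientData`, but this page only defines that type as a TypeScript interface. A Python counterpart might look like the `TypedDict` below. It is a sketch that mirrors the interface field by field; the `Literal` value sets are taken from the interface comments, `int` for `safety_rating` is an assumption, and `validate_ranges` is a hypothetical helper, not part of the project.

```python
from typing import Literal, TypedDict


class IngredientData(TypedDict):
    """Python mirror of the TypeScript interface (assumed, not from the source)."""

    name: str
    purpose: str
    safety_rating: int  # 1-10 (int is an assumption; the interface says number)
    concerns: str
    recommendation: Literal["SAFE", "CAUTION", "AVOID"]
    allergy_risk_flag: Literal["high", "low"]
    allergy_potential: str
    origin: Literal["Natural", "Synthetic", "Semi-synthetic"]
    category: Literal["Food", "Cosmetics", "Both"]
    regulatory_status: str
    regulatory_bans: Literal["Yes", "No"]
    source: Literal["qdrant", "google_search"]
    confidence: float  # 0.0 - 1.0


def validate_ranges(data: IngredientData) -> list[str]:
    """Return a list of range problems; an empty list means none were found."""
    problems: list[str] = []
    if not 1 <= data["safety_rating"] <= 10:
        problems.append("safety_rating must be between 1 and 10")
    if not 0.0 <= data["confidence"] <= 1.0:
        problems.append("confidence must be between 0.0 and 1.0")
    return problems


glycerin = IngredientData(
    name="Glycerin",
    purpose="Humectant, moisturizer",
    safety_rating=9,
    concerns="No known concerns",
    recommendation="SAFE",
    allergy_risk_flag="low",
    allergy_potential="Rare allergic reactions",
    origin="Natural",
    category="Both",
    regulatory_status="FDA approved, EU compliant",
    regulatory_bans="No",
    source="qdrant",
    confidence=0.98,
)
```

A `TypedDict` is a plain `dict` at runtime, so it stays compatible with code that indexes fields by key, such as `ingredient_data["name"]`.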
Lookup Ingredient
def lookup_ingredient(ingredient_name: str) -> IngredientData | None:
    """Look up ingredient in Qdrant vector database."""
    # Generate embedding
    embedding = get_embedding(ingredient_name.lower().strip())
    # Query Qdrant
    results = client.query_points(
        collection_name=COLLECTION_NAME,
        query=embedding,
        limit=1,
    )
    if not results.points:
        return None
    top_result = results.points[0]
    confidence = top_result.score
    if confidence < CONFIDENCE_THRESHOLD:
        return None  # Will trigger Google Search
    return _parse_payload(top_result.payload, confidence)
Upsert Ingredient
When Google Search finds new ingredient data, it is saved to Qdrant for future lookups:
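import hashlib
from qdrant_client.models import PointStruct

def upsert_ingredient(ingredient_data: IngredientData) -> bool:
    """Add or update an ingredient in the database."""
    name = ingredient_data["name"]
    # Normalize the same way lookup_ingredient does so both embed identical text
    normalized = name.lower().strip()
    # Create embedding
    embedding = get_embedding(normalized)
    # Derive a stable point ID. Python's built-in hash() is salted per process
    # for strings, so it would yield a different ID on every run and create
    # duplicate points instead of updating the existing one.
    digest = hashlib.sha256(normalized.encode("utf-8")).digest()
    point_id = int.from_bytes(digest[:8], "big") % (2**63)
    # Create point
    point = PointStruct(
        id=point_id,
        vector=embedding,
        payload={
            "name": name,
            "purpose": ingredient_data["purpose"],
            "safety_rating": ingredient_data["safety_rating"],
            # ... other fields
        },
    )
    client.upsert(
        collection_name=COLLECTION_NAME,
        points=[point],
    )
    return True
Generate Embedding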
from google import genai
from google.genai import types

def get_embedding(text: str) -> list[float]:
    """Get embedding vector using Google AI Studio."""
    # Named genai_client to avoid confusion with the Qdrant client used elsewhere
    genai_client = genai.Client(api_key=settings.google_api_key)
    result = genai_client.models.embed_content(
        model=EMBEDDING_MODEL,
        contents=text,
        config=types.EmbedContentConfig(
            task_type="RETRIEVAL_QUERY",
            output_dimensionality=VECTOR_SIZE,
        ),
    )
    return result.embeddings[0].values
Self-Learning Pipeline
The system automatically improves over time through a self-learning mechanism:
1. First Query (New Ingredient): Ingredient not found in Qdrant → Google Search fallback → Parse results → Save to Qdrant (~2-3 seconds)
2. Future Queries (Same Ingredient): Ingredient found in Qdrant with high confidence → Return cached data (~50-100ms for the Qdrant query itself; the Performance table gives the total including embedding generation)
Result: The knowledge base grows automatically with each unique ingredient lookup.
Performance
| Operation | Typical Latency |
|---|---|
| Embedding generation | 100-200ms |
| Qdrant query | 50-100ms |
| Google Search (fallback) | 2-3 seconds |
| Total (cached) | ~200ms |
| Total (uncached) | ~3 seconds |
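As a rough illustration of what these numbers imply, the average lookup latency depends on how often an ingredient is already cached. The sketch below uses the two totals from the table as defaults; the hit rates are invented for the example, and `expected_latency_ms` is a hypothetical helper.

```python
def expected_latency_ms(
    hit_rate: float,
    cached_ms: float = 200.0,     # "Total (cached)" from the table
    uncached_ms: float = 3000.0,  # "Total (uncached)" from the table
) -> float:
    """Weighted average latency for a given fraction of cached lookups."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0.0 and 1.0")
    return hit_rate * cached_ms + (1.0 - hit_rate) * uncached_ms


# Hypothetical hit rates, chosen only to show the trend.
for rate in (0.0, 0.5, 0.9, 1.0):
    print(f"hit rate {rate:.0%}: ~{expected_latency_ms(rate):.0f} ms")
```

Under these assumptions a 90% hit rate averages roughly 480 ms per lookup, so the average drops quickly as the knowledge base fills in.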
Qdrant Cloud Setup
Step 1: Create Cluster
- Go to Qdrant Cloud Console
- Create a new cluster (free tier available)
- Note your cluster URL and API key
Step 2: Configure Environment
QDRANT_URL=https://your-cluster.qdrant.io
QDRANT_API_KEY=your_api_key_here
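The snippets on this page read these values as `settings.qdrant_url` and `settings.qdrant_api_key`, but the page does not show how `settings` is built. One possible approach, sketched here with the standard library only, is to read the two variables and fail early when either is missing. `QdrantSettings` and `load_qdrant_settings` are hypothetical names.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class QdrantSettings:
    """Hypothetical settings container; attribute names match the snippets."""

    qdrant_url: str
    qdrant_api_key: str


def load_qdrant_settings(env: Optional[Mapping[str, str]] = None) -> QdrantSettings:
    """Read Qdrant settings from the environment, failing fast if any is unset."""
    source = os.environ if env is None else env
    required = ("QDRANT_URL", "QDRANT_API_KEY")
    missing = [key for key in required if not source.get(key)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return QdrantSettings(
        qdrant_url=source["QDRANT_URL"],
        qdrant_api_key=source["QDRANT_API_KEY"],
    )


# Example with an explicit mapping instead of the real process environment.
example = load_qdrant_settings(
    {"QDRANT_URL": "https://your-cluster.qdrant.io", "QDRANT_API_KEY": "your_api_key_here"}
)
```

Failing at startup with the names of the missing variables tends to be easier to diagnose than a connection error later on.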
Step 3: Verify Connection
from qdrant_client import QdrantClient

client = QdrantClient(
    url=settings.qdrant_url,
    api_key=settings.qdrant_api_key,
)
# Check collections
collections = client.get_collections()
print(collections)
Troubleshooting
"Collection not found"
The collection is auto-created on first use:
def ensure_collection_exists(client: QdrantClient) -> None:
    collections = client.get_collections()
    exists = any(c.name == COLLECTION_NAME for c in collections.collections)
    if not exists:
        client.create_collection(
            collection_name=COLLECTION_NAME,
            vectors_config=VectorParams(
                size=VECTOR_SIZE,
                distance=Distance.COSINE,
            ),
        )
Low Match Confidence
If ingredients are not matching well:
- Check ingredient name normalization (lowercase, trimmed)
- Verify embedding model is consistent across operations
- Consider lowering CONFIDENCE_THRESHOLD from 0.7
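For the first check, one option is to route every name through a single helper before embedding, so that lookups and upserts cannot drift apart. The sketch below is an assumption about how that could be done: `normalize_ingredient_name` is a hypothetical function, and the Unicode normalization and whitespace collapsing go beyond the lowercase-and-trim steps listed above.

```python
import re
import unicodedata


def normalize_ingredient_name(raw: str) -> str:
    """Lowercase, trim, and collapse whitespace in an ingredient name."""
    # NFKC folds compatibility characters (e.g. non-breaking spaces) first.
    text = unicodedata.normalize("NFKC", raw)
    # Collapse any run of whitespace (tabs, newlines, repeats) to one space.
    text = re.sub(r"\s+", " ", text)
    return text.strip().lower()


examples = ["  Sodium   Lauryl\tSulfate ", "GLYCERIN\n", "Vitamin E"]
normalized = [normalize_ingredient_name(name) for name in examples]
```

Changing normalization after points have been stored means existing entries were embedded from differently normalized text, so re-embedding them may be needed for consistent matches.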
Connection Timeout
client = QdrantClient(
    url=settings.qdrant_url,
    api_key=settings.qdrant_api_key,
    timeout=30,  # Increase the request timeout (in seconds)
)