Vector Database

The AI Ingredient Scanner uses Qdrant Cloud for semantic ingredient search, enabling fast and accurate lookups even with variations in ingredient naming.


Why Vector Search?

Traditional keyword search fails with ingredient names because of common variations:

  • Spelling variations: "Glycerine" vs "Glycerin" vs "Glycerol"
  • Scientific names: "Sodium Lauryl Sulfate" vs "SLS"
  • Aliases: "Vitamin E" vs "Tocopherol"

Vector search matches by meaning, not exact text. This enables fuzzy matching across all these variations.

Query: "Glycerine"
    │
    ▼
┌─────────────────────┐
│  Embedding Model    │
│  gemini-embedding   │
└──────────┬──────────┘
           │
           ▼
    [0.23, 0.45, 0.12, ...]  ← 768-dim vector
           │
           ▼
┌─────────────────────┐
│   Qdrant Cloud      │
│   Cosine Similarity │
└──────────┬──────────┘
           │
           ▼
    Result: "Glycerin" (confidence: 0.98)
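
To make the similarity step concrete, here is a minimal sketch of cosine similarity in plain Python. The three-dimensional vectors are made-up stand-ins for real 768-dimensional embeddings, and Qdrant computes this score server-side, so the function is only an illustration of the idea.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors (assumed values, not real embeddings).
glycerine = [0.23, 0.45, 0.12]
glycerin = [0.24, 0.44, 0.13]
unrelated = [-0.40, 0.05, 0.90]

# Near-identical spellings should score higher than an unrelated term.
print(cosine_similarity(glycerine, glycerin) > cosine_similarity(glycerine, unrelated))
```

Because cosine similarity depends only on direction, two vectors that point the same way score close to 1.0 regardless of their length.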

Lookup Architecture

┌─────────────────────────────────────────────────────────────────┐
│                      LOOKUP FLOW                                │
│                                                                 │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────────────┐  │
│  │ Ingredient  │ →  │  Generate   │ →  │   Query Qdrant      │  │
│  │   Name      │    │  Embedding  │    │   (Cosine Search)   │  │
│  └─────────────┘    └─────────────┘    └──────────┬──────────┘  │
│                                                   │             │
│                                        ┌──────────▼──────────┐  │
│                                        │  Confidence > 0.7?  │  │
│                                        └──────────┬──────────┘  │
│                                                   │             │
│                           ┌───────────────────────┤             │
│                           │ YES                   │ NO          │
│                           ▼                       ▼             │
│                    ┌─────────────┐         ┌─────────────┐      │
│                    │ Return Data │         │Google Search│      │
│                    │   (~100ms)  │         │  (~3 sec)   │      │
│                    └─────────────┘         └──────┬──────┘      │
│                                                   │             │
│                                            ┌──────▼──────┐      │
│                                            │ Save Result │      │
│                                            │  to Qdrant  │      │
│                                            └──────┬──────┘      │
│                                                   │             │
│                                            ┌──────▼──────┐      │
│                                            │ Return Data │      │
│                                            └─────────────┘      │
└─────────────────────────────────────────────────────────────────┘
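
The flow above can be sketched as a single function with the vector lookup, the search fallback, and the save step passed in as callables. All names here (`resolve_ingredient`, `fake_lookup`, `fake_search`, `fake_save`) are hypothetical, and the in-memory dictionary stands in for Qdrant so the sketch runs without network access. A score equal to the threshold is treated as a hit, which matches the `confidence < CONFIDENCE_THRESHOLD` check in `lookup_ingredient`.

```python
from typing import Callable, Optional

CONFIDENCE_THRESHOLD = 0.7  # same value as in the flow above

# Returns (payload, score) for the best match, or None when nothing is stored.
VectorLookup = Callable[[str], Optional[tuple[dict, float]]]


def resolve_ingredient(
    name: str,
    vector_lookup: VectorLookup,
    web_search: Callable[[str], dict],
    save: Callable[[dict], None],
) -> dict:
    """Return data from the vector store, else from the search fallback."""
    hit = vector_lookup(name)
    if hit is not None:
        payload, score = hit
        if score >= CONFIDENCE_THRESHOLD:
            return {**payload, "source": "qdrant", "confidence": score}

    # No match or low confidence: fall back, then cache the result.
    data = web_search(name)
    save(data)
    return {**data, "source": "google_search"}


# In-memory stand-ins (assumptions for illustration only).
store: dict[str, dict] = {}


def fake_lookup(name: str) -> Optional[tuple[dict, float]]:
    payload = store.get(name.lower())
    return (payload, 1.0) if payload else None


def fake_search(name: str) -> dict:
    return {"name": name, "purpose": "unknown"}


def fake_save(data: dict) -> None:
    store[data["name"].lower()] = data


first = resolve_ingredient("Glycerin", fake_lookup, fake_search, fake_save)
second = resolve_ingredient("Glycerin", fake_lookup, fake_search, fake_save)
print(first["source"], second["source"])  # google_search qdrant
```

Passing the three steps in as arguments keeps the branching logic testable without a live cluster or API key.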

Configuration

Collection Settings

COLLECTION_NAME = "ingredients"
VECTOR_SIZE = 768  # requested via output_dimensionality (gemini-embedding-001 defaults to 3072)
EMBEDDING_MODEL = "gemini-embedding-001"
CONFIDENCE_THRESHOLD = 0.7

Vector Parameters

from qdrant_client.models import Distance, VectorParams

client.create_collection(
    collection_name=COLLECTION_NAME,
    vectors_config=VectorParams(
        size=VECTOR_SIZE,
        distance=Distance.COSINE,  # Cosine similarity
    ),
)

Data Schema

Payload Structure

Each vector point stores ingredient metadata as a JSON payload:

{
  "name": "Glycerin",
  "purpose": "Humectant, moisturizer",
  "safety_rating": 9,
  "concerns": "No known concerns",
  "recommendation": "SAFE",
  "allergy_risk_flag": "low",
  "allergy_potential": "Rare allergic reactions",
  "origin": "Natural",
  "category": "Both",
  "regulatory_status": "FDA approved, EU compliant",
  "regulatory_bans": "No",
  "aliases": ["Glycerine", "Glycerol", "E422"]
}

TypeScript Interface

interface IngredientData {
  name: string;
  purpose: string;
  safety_rating: number;      // 1-10
  concerns: string;
  recommendation: string;     // SAFE | CAUTION | AVOID
  allergy_risk_flag: string;  // high | low
  allergy_potential: string;
  origin: string;             // Natural | Synthetic | Semi-synthetic
  category: string;           // Food | Cosmetics | Both
  regulatory_status: string;
  regulatory_bans: string;    // Yes | No
  source: string;             // qdrant | google_search
  confidence: number;         // 0.0 - 1.0
}

Operations

Lookup Ingredient

def lookup_ingredient(ingredient_name: str) -> IngredientData | None:
    """Look up ingredient in Qdrant vector database."""

    # Generate embedding
    embedding = get_embedding(ingredient_name.lower().strip())

    # Query Qdrant
    results = client.query_points(
        collection_name=COLLECTION_NAME,
        query=embedding,
        limit=1,
    )

    if not results.points:
        return None

    top_result = results.points[0]
    confidence = top_result.score

    if confidence < CONFIDENCE_THRESHOLD:
        return None  # Will trigger Google Search

    return _parse_payload(top_result.payload, confidence)

Upsert Ingredient

When Google Search finds new ingredient data, it is saved to Qdrant for future lookups:

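import uuid

from qdrant_client.models import PointStruct


def upsert_ingredient(ingredient_data: IngredientData) -> bool:
    """Add or update an ingredient in the database."""

    name = ingredient_data["name"]
    # Normalize exactly as lookup_ingredient does so embeddings stay comparable
    normalized = name.lower().strip()

    # Create embedding
    embedding = get_embedding(normalized)

    # Create point. The ID must be stable across processes: Python's built-in
    # hash() is salted per run for strings, so derive a UUID from the name.
    point = PointStruct(
        id=str(uuid.uuid5(uuid.NAMESPACE_DNS, normalized)),
        vector=embedding,
        payload={
            "name": name,
            "purpose": ingredient_data["purpose"],
            "safety_rating": ingredient_data["safety_rating"],
            # ... other fields
        },
    )

    client.upsert(
        collection_name=COLLECTION_NAME,
        points=[point],
    )

    return True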

Generate Embedding

from google import genai
from google.genai import types


def get_embedding(text: str) -> list[float]:
    """Get embedding vector using Google AI Studio."""

    # Named genai_client to avoid confusion with the Qdrant client
    genai_client = genai.Client(api_key=settings.google_api_key)

    result = genai_client.models.embed_content(
        model=EMBEDDING_MODEL,
        contents=text,
        config=types.EmbedContentConfig(
            task_type="RETRIEVAL_QUERY",
            output_dimensionality=VECTOR_SIZE,  # must match the collection's vector size
        ),
    )

    return result.embeddings[0].values

Self-Learning Pipeline

The system improves over time by caching every fallback result in Qdrant:

  1. First Query (New Ingredient): Ingredient not found in Qdrant → Google Search fallback → Parse results → Save to Qdrant (~2-3 seconds)
  2. Future Queries (Same Ingredient): Ingredient found in Qdrant with high confidence → Return cached data (~50-100ms)

Result: Knowledge base grows automatically with each unique ingredient lookup.


Performance

Operation                   Typical Latency
Embedding generation        100-200ms
Qdrant query                50-100ms
Google Search (fallback)    2-3 seconds
Total (cached)              ~200ms
Total (uncached)            ~3 seconds
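
As a rough illustration of what these figures imply, the sketch below estimates average lookup latency for a given share of cached hits. It uses the two totals from the table. The hit rates are hypothetical inputs, not measurements.

```python
CACHED_MS = 200      # "Total (cached)" from the table above
UNCACHED_MS = 3000   # "Total (uncached)" from the table above


def expected_latency_ms(hit_rate: float) -> float:
    """Average lookup latency for a given fraction of cached hits."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * CACHED_MS + (1.0 - hit_rate) * UNCACHED_MS


for rate in (0.0, 0.5, 0.9):  # hypothetical hit rates
    print(f"{rate:.0%} cached -> {expected_latency_ms(rate):.0f} ms")
```

Under these assumptions, average latency falls from about 3 seconds with an empty collection to under half a second once nine in ten lookups are served from Qdrant.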

Qdrant Cloud Setup

Step 1: Create Cluster

  1. Go to Qdrant Cloud Console
  2. Create a new cluster (free tier available)
  3. Note your cluster URL and API key

Step 2: Configure Environment

QDRANT_URL=https://your-cluster.qdrant.io
QDRANT_API_KEY=your_api_key_here
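
The code on this page reads these values from a `settings` object that is not defined here. One way such an object could be built is sketched below. The `Settings` class and `load_settings` function are hypothetical names, and the project's real configuration layer may differ.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    """Hypothetical settings holder; the project's real class may differ."""
    qdrant_url: str
    qdrant_api_key: str


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Read the two Qdrant variables, failing early if either is missing."""
    env = os.environ if env is None else env
    missing = [key for key in ("QDRANT_URL", "QDRANT_API_KEY") if not env.get(key)]
    if missing:
        raise RuntimeError(f"missing environment variables: {', '.join(missing)}")
    return Settings(qdrant_url=env["QDRANT_URL"], qdrant_api_key=env["QDRANT_API_KEY"])


# Placeholder values from the example above, passed explicitly for the demo.
settings = load_settings({
    "QDRANT_URL": "https://your-cluster.qdrant.io",
    "QDRANT_API_KEY": "your_api_key_here",
})
print(settings.qdrant_url)
```

Failing at startup with a clear message is usually easier to debug than a connection error raised later by the client.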

Step 3: Verify Connection

from qdrant_client import QdrantClient

client = QdrantClient(
    url=settings.qdrant_url,
    api_key=settings.qdrant_api_key,
)

# Check collections
collections = client.get_collections()
print(collections)

Troubleshooting

"Collection not found"

The collection is auto-created on first use:

def ensure_collection_exists(client: QdrantClient) -> None:
    collections = client.get_collections()
    exists = any(c.name == COLLECTION_NAME for c in collections.collections)

    if not exists:
        client.create_collection(
            collection_name=COLLECTION_NAME,
            vectors_config=VectorParams(
                size=VECTOR_SIZE,
                distance=Distance.COSINE,
            ),
        )

Low Match Confidence

If ingredients are not matching well:

  • Check ingredient name normalization (lowercase, trimmed)
  • Verify embedding model is consistent across operations
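  • Consider lowering CONFIDENCE_THRESHOLD from 0.7 (a lower threshold returns more stored matches but raises the risk of returning the wrong ingredient)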
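
For the normalization check in the list above, a small helper applied on both the lookup and the upsert path keeps the embedded text identical for the same ingredient. This is a sketch under that assumption. `normalize_ingredient_name` is a hypothetical name, and the whitespace and Unicode handling goes slightly beyond the lowercase-and-trim rule stated above.

```python
import re
import unicodedata


def normalize_ingredient_name(raw: str) -> str:
    """Lowercase, trim, and collapse whitespace in an ingredient name."""
    text = unicodedata.normalize("NFKC", raw)  # fold compatibility forms
    text = re.sub(r"\s+", " ", text)           # collapse runs of whitespace
    return text.strip().lower()


print(normalize_ingredient_name("  Sodium   Lauryl\tSulfate "))  # sodium lauryl sulfate
```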

Connection Timeout

If requests to the cluster time out, raise the client timeout (in seconds):

client = QdrantClient(
    url=settings.qdrant_url,
    api_key=settings.qdrant_api_key,
    timeout=30,  # Increase timeout
)
