# AI & LLM Integration

## 1. Core Concepts

| Concept | Description |
| :--- | :--- |
| **Embeddings** | Converting text into high-dimensional vectors that capture semantic meaning |
| **Vector DB** | Database optimized for similarity search (FAISS, Chroma, Weaviate, Pinecone) |
| **RAG** | Retrieval-Augmented Generation: search docs -> feed context to LLM |
| **Context Window** | The maximum tokens an LLM can "see" at once (input + output combined) |
| **Temperature** | Controls randomness: 0.0 = deterministic, 0.7+ = creative ✅ |
| **Tokens** | BPE (Byte Pair Encoding) chunks. 1000 tokens ~= 750 words ✅ |
| **Hallucination** | LLM generates plausible but factually incorrect information |
| **Fine-tuning** | Retraining a model on domain-specific data (expensive) |
| **Prompt Engineering** | Crafting inputs to reliably guide LLM behaviour |

---

## 2. LLM API Calls - OpenAI Pattern

```python
from openai import OpenAI
import os

client = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))  # ✅ never hardcode

# Chat completion (most common)
response = client.chat.completions.create(
    model="gpt-4o-mini",  # cost-efficient model ✅
    messages=[
        {"role": "system", "content": "You are a helpful data analyst. Reply in JSON."},
        {"role": "user", "content": "Classify this review: Great product!"}
    ],
    temperature=0,   # 0 = deterministic for classification ✅
    max_tokens=100,
    response_format={"type": "json_object"}  # enforce JSON output ✅
    # Note: JSON mode requires the word "JSON" to appear somewhere in the
    # messages, hence the instruction in the system prompt above.
)

# Extract response
text = response.choices[0].message.content
tokens_used = response.usage.total_tokens  # input + output tokens ✅
```

### System vs User Prompt

| Prompt Type | Purpose | Notes |
| :--- | :--- | :--- |
| **System** | Defines LLM role, tone, constraints, boundaries | Harder for users to override |
| **User** | Specific task or data for this request | Can be overridden by adversarial input |
| **Assistant** | Previous LLM responses (for multi-turn) | |

---

## 3. Temperature Guide

| Task | Temperature | Why |
| :--- | :--- | :--- |
| Classification, extraction, logic | `0.0` | Deterministic, consistent ✅ |
| Summarization | `0.3` | Focused but slightly flexible |
| Q&A, chatbot | `0.5` | Balanced |
| Creative writing, brainstorming | `0.7 - 1.0` | Creative, varied ✅ |

---

## 4. Batch Classification (JSON Enforcement)

```python
import json

def classify_batch(items, categories):
    """Batch classify to reduce API calls and cost"""
    # JSON mode guarantees a top-level JSON *object*, so ask for
    # {"labels": [...]} rather than a bare array.
    prompt = f"""Classify each item into one of {categories}.
Return ONLY a JSON object like: {{"labels": ["cat1", "cat2", ...]}}
Items: {items}"""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        response_format={"type": "json_object"}
    )
    return json.loads(response.choices[0].message.content)["labels"]

# Process in batches (cost-saving pattern)
BATCH_SIZE = 10
for i in range(0, len(data), BATCH_SIZE):
    batch = data[i:i+BATCH_SIZE]
    results = classify_batch(batch, ['Positive', 'Negative', 'Neutral'])
```

### JSON Fixer Pattern

```python
def clean_json_response(raw_str):
    """LLMs sometimes add markdown backticks around JSON"""
    clean = raw_str.replace("```json", "").replace("```", "").strip()
    return json.loads(clean)
```
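The fixer above still raises if the cleaned string is not valid JSON, which (as the Common Mistakes table later notes) can crash a whole batch on one bad response. A minimal sketch of a safer wrapper, assuming you prefer a fallback value over an exception (the `safe_parse` name and `default` parameter are illustrative, not from the original notes):

```python
import json

def safe_parse(raw_str, default=None):
    """Strip markdown fences, then parse; return `default` instead of crashing."""
    clean = raw_str.replace("```json", "").replace("```", "").strip()
    try:
        return json.loads(clean)
    except json.JSONDecodeError:
        # Log and return a sentinel so one bad response doesn't kill the batch
        print(f"Invalid JSON from model: {clean[:80]!r}")
        return default
```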
---

## 5. Embeddings & Similarity

```python
# Generate embeddings
response = client.embeddings.create(
    input=texts,
    model="text-embedding-3-small"  # ✅ TDS standard
)
vectors = [item.embedding for item in response.data]  # list of float arrays

# Cosine similarity (find most similar)
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Find the document most similar to the query
query_vec = vectors[0]
similarities = [cosine_similarity(query_vec, v) for v in vectors[1:]]
most_similar_idx = np.argmax(similarities)
```

### Embedding Use Cases

| Task | Approach |
| :--- | :--- |
| Semantic search | Embed query + docs, find closest by cosine similarity |
| Duplicate detection | High similarity (> 0.95) suggests near-duplicate |
| Clustering | Embed texts, apply KMeans to find topic groups |
| RAG retrieval | Embed query, find top-K similar chunks |

---

## 6. Clustering with Embeddings

```python
from sklearn.cluster import KMeans
from openai import OpenAI

# 1. Get embeddings
client = OpenAI()
response = client.embeddings.create(input=texts, model="text-embedding-3-small")
vectors = [item.embedding for item in response.data]

# 2. Cluster
kmeans = KMeans(n_clusters=5, random_state=42)
kmeans.fit(vectors)
labels = kmeans.labels_  # group ID for each text

# 3. Inspect clusters
for cluster_id in range(5):
    cluster_texts = [texts[i] for i, l in enumerate(labels) if l == cluster_id]
    print(f"Cluster {cluster_id}: {cluster_texts[:3]}")
```

---

## 7. Caching Embeddings (Cost Saving)

```python
import pickle
import os

CACHE_FILE = 'embeddings_cache.pkl'

def get_embeddings_cached(texts):
    """Load from cache or compute fresh"""
    if os.path.exists(CACHE_FILE):
        with open(CACHE_FILE, 'rb') as f:
            cache = pickle.load(f)
    else:
        cache = {}

    # Compute only the missing embeddings
    missing = [t for t in texts if t not in cache]
    if missing:
        response = client.embeddings.create(input=missing, model="text-embedding-3-small")
        for text, item in zip(missing, response.data):
            cache[text] = item.embedding
        with open(CACHE_FILE, 'wb') as f:
            pickle.dump(cache, f)  # ✅ save cache

    return [cache[t] for t in texts]

# ✅ Always cache embeddings - avoid re-computing (costs money + time)
```
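The cosine loop in Section 5 is fine for small corpora, but the Core Concepts table lists FAISS as a vector DB, and the RAG workflow below stores vectors in one at its STORE step. A minimal sketch of the same top-K lookup with FAISS, assuming `faiss-cpu` is installed and reusing the `vectors` list and `query_vec` from Section 5 (this snippet is an illustration, not part of the original notes):

```python
import numpy as np
import faiss  # pip install faiss-cpu

# Inner product on L2-normalised vectors == cosine similarity
doc_matrix = np.array(vectors, dtype="float32")   # shape: (n_docs, dim)
faiss.normalize_L2(doc_matrix)                    # normalise in place
index = faiss.IndexFlatIP(doc_matrix.shape[1])
index.add(doc_matrix)

# Embed + normalise the query the same way, then fetch the top 3 neighbours
query = np.array([query_vec], dtype="float32")
faiss.normalize_L2(query)
scores, ids = index.search(query, 3)  # ids[0] = indices of the 3 closest docs
```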
---

## 8. RAG Architecture (Exam Critical)

### The 5-Step RAG Workflow

```
1. CHUNKING:  Split documents into chunks (e.g. ~500 tokens each)
      |
2. VECTORIZE: Embed each chunk using an embedding model
      |
3. STORE:     Save vectors in a Vector DB (FAISS, Chroma, Weaviate)
      |
4. RETRIEVE:  Embed user query -> find top-K similar chunks
      |
5. GENERATE:  Pass retrieved chunks as context -> LLM generates answer
```

### Implementation

```python
# Steps 1 & 2: Chunk and embed documents
def chunk_text(text, chunk_size=500):
    # Word-based chunking as a rough proxy for token counts
    words = text.split()
    return [' '.join(words[i:i+chunk_size]) for i in range(0, len(words), chunk_size)]

chunks = []
for doc in documents:
    chunks.extend(chunk_text(doc))
chunk_vectors = get_embeddings_cached(chunks)

# Step 4: Retrieve top-K similar chunks
def retrieve_chunks(query, top_k=3):
    query_vec = get_embeddings_cached([query])[0]
    similarities = [cosine_similarity(query_vec, v) for v in chunk_vectors]
    top_indices = sorted(range(len(similarities)),
                         key=lambda i: similarities[i], reverse=True)[:top_k]
    return [chunks[i] for i in top_indices]

# Step 5: Generate answer with context
def rag_answer(question):
    relevant_chunks = retrieve_chunks(question)
    context = "\n\n".join(relevant_chunks)
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Answer using only the provided context. If unsure, say so."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"}
        ],
        temperature=0
    )
    return response.choices[0].message.content
```

### RAG Prompt Template

```python
PROMPT = """Use the context below to answer the question.
If you don't know from the context, say "I don't have enough information."
Do not make up information.

Context:
{retrieved_chunks}

Question: {user_query}

Answer:"""
```

---

## 9. Pedagogical Prompting (TDS Teaching Assistant) (Exam Critical)

The TDS course asks how LLMs should be used as teaching assistants.

**Goal:** Foster critical thinking, NOT just provide answers.

| Good TA Prompt | Bad TA Prompt |
| :--- | :--- |
| "What do you already know about this concept?" | "Here is the code:" |
| "What have you tried so far?" | "The answer is X" |
| "Can you identify which part of the code is causing the issue?" | "Your code has a bug on line 5, fix it like this:" |
| "What does the error message tell you?" | Directly fixing the student's code |

```python
TA_SYSTEM_PROMPT = """You are a TDS Teaching Assistant helping students learn data science.
Your role is to GUIDE students to find answers themselves, NOT to give direct solutions.
- Ask reflective questions to probe understanding
- Point out which concept is relevant without solving it
- If the student made an error, ask them to check specific assumptions
- Never provide complete working code as a solution
- Foster independent thinking and debugging skills"""
```

- ✅ "Guide students via reflective questions. Avoid direct solutions." (exam answer)
- ❌ Directly providing code or answers

---

## 10. Token Counting & Cost Estimation

```python
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")
token_count = len(enc.encode("Hello, this is a test sentence."))

# Cost = (input_tokens * input_price) + (output_tokens * output_price)
# gpt-4o-mini: ~$0.15 per 1M input tokens, ~$0.60 per 1M output tokens
def estimate_cost(prompt, expected_output_tokens=200):
    input_tokens = len(enc.encode(prompt))
    total_tokens = input_tokens + expected_output_tokens
    cost = (input_tokens * 0.15 + expected_output_tokens * 0.60) / 1_000_000
    return total_tokens, cost
```

---

## 11. Multi-Modal RAG

- **Definition:** Integrating text + code + visuals (figures/charts) in a single retrieval system
- **Workflow:** Research query -> multi-modal embedding -> cross-disciplinary retrieval -> synthesized answer
- **Use case:** Scientific papers where the answer may be in a figure caption, not the text

---

## 12. Common Mistakes (Exam Critical)

| Mistake | Problem | Fix |
| :--- | :--- | :--- |
| Re-embedding every run | Wastes money and time | Cache with `pickle` ✅ |
| Sending an entire document as context | Exceeds context window | Use RAG chunking ✅ |
| Using temperature=0.7 for classification | Inconsistent, non-deterministic results | Use temperature=0 ✅ |
| Ignoring the context window limit | Request fails or gets truncated | Chunk + RAG ✅ |
| Assuming the LLM has real-time data | Hallucination or outdated info | Use RAG for fresh data ✅ |
| Not validating JSON output | `json.loads()` crash | Add try/except + JSON fixer ✅ |
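Several of these fixes boil down to checking token counts before sending a request. A minimal pre-flight guard building on the tiktoken snippet from Section 10 (the constant and helper name are illustrative; 128K is the advertised gpt-4o context window, but verify the limit for your model):

```python
import tiktoken

MAX_CONTEXT_TOKENS = 128_000  # advertised gpt-4o context window; verify for your model

enc = tiktoken.encoding_for_model("gpt-4o")

def fits_context(prompt, reserved_output=1_000):
    """True if the prompt plus the expected output fits the context window."""
    return len(enc.encode(prompt)) + reserved_output <= MAX_CONTEXT_TOKENS

# If it doesn't fit, chunk the document and fall back to RAG (Section 8)
```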
---

## 13. Hallucination Causes & Fixes

| Cause | Description | Fix |
| :--- | :--- | :--- |
| **Knowledge cutoff** | Model trained on old data | RAG with current documents |
| **Confabulation** | Model fills gaps with plausible-sounding fake info | Ground with context |
| **Ambiguous prompt** | Vague question -> vague/invented answer | Specific, constrained prompts |
| **High temperature** | Too much randomness | Lower the temperature |

---

## 14. Quick Reference Card

```python
# Token counting
import tiktoken
enc = tiktoken.encoding_for_model("gpt-4o")
n_tokens = len(enc.encode(text))

# Classification with forced JSON
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": f"Classify: {text}. Return JSON only."}],
    temperature=0,
    response_format={"type": "json_object"}
)
result = json.loads(response.choices[0].message.content)

# RAG: retrieve + generate (ask_llm = shorthand for any chat-completion wrapper)
chunks = retrieve_chunks(question, top_k=3)
context = "\n".join(chunks)
answer = ask_llm(f"Context: {context}\nQuestion: {question}")

# Cache embeddings
with open('cache.pkl', 'wb') as f:
    pickle.dump(embeddings, f)
```

---

## 15. Exam Scenario Answers

| Scenario | Answer |
| :--- | :--- |
| LLM gives wrong medical citation | Hallucination or knowledge cutoff ✅ |
| How to give the LLM access to private/recent docs | RAG (Retrieval-Augmented Generation) ✅ |
| What prompt to use for classification | temperature=0, specific categories, JSON response format ✅ |
| Teaching assistant prompt style | Reflective questions, no direct solutions ✅ |
| Avoid re-computing embeddings | Cache to `embeddings.pkl` ✅ |
| RAG advantage over fine-tuning | Cheaper, updatable, no retraining needed ✅ |
| Context window exceeded | Chunk documents, use RAG ✅ |
| 1000 tokens ~= how many words | ~750 words ✅ |
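Finally, tying together the hallucination rows in Sections 13 and 15: one lightweight mitigation is to ask the model whether its own RAG answer is actually supported by the retrieved context. A minimal sketch of this self-check pattern (the `is_grounded` helper is an illustration, not part of the original notes):

```python
def is_grounded(answer, context):
    """Ask the model to verify the answer against the context (YES/NO)."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": ("Does the context fully support the answer? "
                        "Reply YES or NO only.\n\n"
                        f"Context:\n{context}\n\nAnswer:\n{answer}")
        }],
        temperature=0
    )
    return response.choices[0].message.content.strip().upper().startswith("YES")
```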