# AI & LLM Integration

## 1. Core Concepts

| Concept | Description |
| :--- | :--- |
| **Embeddings** | Converting text into high-dimensional vectors that capture semantic meaning |
| **Vector DB** | Database optimized for similarity search (FAISS, Chroma, Weaviate, Pinecone) |
| **RAG** | Retrieval-Augmented Generation: search docs -> feed context to LLM |
| **Context Window** | The maximum tokens an LLM can "see" at once (input + output combined) |
| **Temperature** | Controls randomness: 0.0 = deterministic, 0.7+ = creative ✅ |
| **Tokens** | BPE (Byte Pair Encoding) chunks. 1000 tokens ~= 750 words ✅ |
| **Hallucination** | LLM generates plausible but factually incorrect information |
| **Fine-tuning** | Retraining a model on domain-specific data (expensive) |
| **Prompt Engineering** | Crafting inputs to reliably guide LLM behaviour |

---

## 2. LLM API Calls - OpenAI Pattern

```python
from openai import OpenAI
import os

client = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))  # ✅ never hardcode

# Chat completion (most common)
response = client.chat.completions.create(
    model="gpt-4o-mini",  # cost-efficient model ✅
    messages=[
        {"role": "system", "content": "You are a helpful data analyst. Reply in JSON."},
        {"role": "user", "content": "Classify this review: Great product!"}
    ],
    temperature=0,   # 0 = deterministic for classification ✅
    max_tokens=100,
    response_format={"type": "json_object"}  # enforce JSON output ✅
    # Note: JSON mode requires the word "JSON" to appear somewhere in the
    # messages, hence the instruction in the system prompt above.
)

# Extract response
text = response.choices[0].message.content
tokens_used = response.usage.total_tokens  # input + output tokens ✅
```

### System vs User Prompt

| Prompt Type | Purpose | Notes |
| :--- | :--- | :--- |
| **System** | Defines LLM role, tone, constraints, boundaries | Harder for users to override |
| **User** | Specific task or data for this request | Can be overridden by adversarial input |
| **Assistant** | Previous LLM responses (for multi-turn) | |

---

## 3. Temperature Guide

| Task | Temperature | Why |
| :--- | :--- | :--- |
| Classification, extraction, logic | `0.0` | Deterministic, consistent ✅ |
| Summarization | `0.3` | Focused but slightly flexible |
| Q&A, chatbot | `0.5` | Balanced |
| Creative writing, brainstorming | `0.7 - 1.0` | Creative, varied ✅ |

---

## 4. Batch Classification (JSON Enforcement)

```python
import json

def classify_batch(items, categories):
    """Batch classify to reduce API calls and cost"""
    # JSON mode guarantees a top-level JSON *object*, so ask for
    # {"labels": [...]} rather than a bare array.
    prompt = f"""Classify each item into one of {categories}.
Return ONLY a JSON object like: {{"labels": ["cat1", "cat2", ...]}}
Items: {items}"""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        response_format={"type": "json_object"}
    )
    return json.loads(response.choices[0].message.content)["labels"]

# Process in batches (cost-saving pattern)
BATCH_SIZE = 10
for i in range(0, len(data), BATCH_SIZE):
    batch = data[i:i+BATCH_SIZE]
    results = classify_batch(batch, ['Positive', 'Negative', 'Neutral'])
```

### JSON Fixer Pattern

```python
def clean_json_response(raw_str):
    """LLMs sometimes add markdown backticks around JSON"""
    clean = raw_str.replace("```json", "").replace("```", "").strip()
    return json.loads(clean)
```
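The fixer above still raises if the cleaned string is not valid JSON, which (as the Common Mistakes table later notes) can crash a whole batch on one bad response. A minimal sketch of a safer wrapper, assuming you prefer a fallback value over an exception (the `safe_parse` name and `default` parameter are illustrative, not from the original notes):

```python
import json

def safe_parse(raw_str, default=None):
    """Strip markdown fences, then parse; return `default` instead of crashing."""
    clean = raw_str.replace("```json", "").replace("```", "").strip()
    try:
        return json.loads(clean)
    except json.JSONDecodeError:
        # Log and return a sentinel so one bad response doesn't kill the batch
        print(f"Invalid JSON from model: {clean[:80]!r}")
        return default
```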
---

## 5. Embeddings & Similarity

```python
# Generate embeddings
response = client.embeddings.create(
    input=texts,
    model="text-embedding-3-small"  # ✅ TDS standard
)
vectors = [item.embedding for item in response.data]  # list of float arrays

# Cosine similarity (find most similar)
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Find the document most similar to the query
query_vec = vectors[0]
similarities = [cosine_similarity(query_vec, v) for v in vectors[1:]]
most_similar_idx = np.argmax(similarities)
```

### Embedding Use Cases

| Task | Approach |
| :--- | :--- |
| Semantic search | Embed query + docs, find closest by cosine similarity |
| Duplicate detection | High similarity (> 0.95) suggests near-duplicate |
| Clustering | Embed texts, apply KMeans to find topic groups |
| RAG retrieval | Embed query, find top-K similar chunks |

---

## 6. Clustering with Embeddings

```python
from sklearn.cluster import KMeans
from openai import OpenAI

# 1. Get embeddings
client = OpenAI()
response = client.embeddings.create(input=texts, model="text-embedding-3-small")
vectors = [item.embedding for item in response.data]

# 2. Cluster
kmeans = KMeans(n_clusters=5, random_state=42)
kmeans.fit(vectors)
labels = kmeans.labels_  # group ID for each text

# 3. Inspect clusters
for cluster_id in range(5):
    cluster_texts = [texts[i] for i, l in enumerate(labels) if l == cluster_id]
    print(f"Cluster {cluster_id}: {cluster_texts[:3]}")
```

---

## 7. Caching Embeddings (Cost Saving)

```python
import pickle
import os

CACHE_FILE = 'embeddings_cache.pkl'

def get_embeddings_cached(texts):
    """Load from cache or compute fresh"""
    if os.path.exists(CACHE_FILE):
        with open(CACHE_FILE, 'rb') as f:
            cache = pickle.load(f)
    else:
        cache = {}

    # Compute only the missing embeddings
    missing = [t for t in texts if t not in cache]
    if missing:
        response = client.embeddings.create(input=missing, model="text-embedding-3-small")
        for text, item in zip(missing, response.data):
            cache[text] = item.embedding
        with open(CACHE_FILE, 'wb') as f:
            pickle.dump(cache, f)  # ✅ save cache

    return [cache[t] for t in texts]

# ✅ Always cache embeddings - avoid re-computing (costs money + time)
```
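The cosine loop in Section 5 is fine for small corpora, but the Core Concepts table lists FAISS as a vector DB, and the RAG workflow below stores vectors in one at its STORE step. A minimal sketch of the same top-K lookup with FAISS, assuming `faiss-cpu` is installed and reusing the `vectors` list and `query_vec` from Section 5 (this snippet is an illustration, not part of the original notes):

```python
import numpy as np
import faiss  # pip install faiss-cpu

# Inner product on L2-normalised vectors == cosine similarity
doc_matrix = np.array(vectors, dtype="float32")   # shape: (n_docs, dim)
faiss.normalize_L2(doc_matrix)                    # normalise in place
index = faiss.IndexFlatIP(doc_matrix.shape[1])
index.add(doc_matrix)

# Embed + normalise the query the same way, then fetch the top 3 neighbours
query = np.array([query_vec], dtype="float32")
faiss.normalize_L2(query)
scores, ids = index.search(query, 3)  # ids[0] = indices of the 3 closest docs
```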
---

## 8. RAG Architecture (Exam Critical)

### The 5-Step RAG Workflow

```
1. CHUNKING:  Split documents into chunks (e.g. ~500 tokens each)
      |
2. VECTORIZE: Embed each chunk using an embedding model
      |
3. STORE:     Save vectors in a Vector DB (FAISS, Chroma, Weaviate)
      |
4. RETRIEVE:  Embed user query -> find top-K similar chunks
      |
5. GENERATE:  Pass retrieved chunks as context -> LLM generates answer
```

### Implementation

```python
# Steps 1 & 2: Chunk and embed documents
def chunk_text(text, chunk_size=500):
    # Word-based chunking as a rough proxy for token counts
    words = text.split()
    return [' '.join(words[i:i+chunk_size]) for i in range(0, len(words), chunk_size)]

chunks = []
for doc in documents:
    chunks.extend(chunk_text(doc))
chunk_vectors = get_embeddings_cached(chunks)

# Step 4: Retrieve top-K similar chunks
def retrieve_chunks(query, top_k=3):
    query_vec = get_embeddings_cached([query])[0]
    similarities = [cosine_similarity(query_vec, v) for v in chunk_vectors]
    top_indices = sorted(range(len(similarities)),
                         key=lambda i: similarities[i], reverse=True)[:top_k]
    return [chunks[i] for i in top_indices]

# Step 5: Generate answer with context
def rag_answer(question):
    relevant_chunks = retrieve_chunks(question)
    context = "\n\n".join(relevant_chunks)
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Answer using only the provided context. If unsure, say so."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"}
        ],
        temperature=0
    )
    return response.choices[0].message.content
```

### RAG Prompt Template

```python
PROMPT = """Use the context below to answer the question.
If you don't know from the context, say "I don't have enough information."
Do not make up information.

Context:
{retrieved_chunks}

Question: {user_query}

Answer:"""
```

---

## 9. Pedagogical Prompting (TDS Teaching Assistant) (Exam Critical)

The TDS course asks how LLMs should be used as teaching assistants.

**Goal:** Foster critical thinking, NOT just provide answers.

| Good TA Prompt | Bad TA Prompt |
| :--- | :--- |
| "What do you already know about this concept?" | "Here is the code:" |
| "What have you tried so far?" | "The answer is X" |
| "Can you identify which part of the code is causing the issue?" | "Your code has a bug on line 5, fix it like this:" |
| "What does the error message tell you?" | Directly fixing the student's code |

```python
TA_SYSTEM_PROMPT = """You are a TDS Teaching Assistant helping students learn data science.
Your role is to GUIDE students to find answers themselves, NOT to give direct solutions.
- Ask reflective questions to probe understanding
- Point out which concept is relevant without solving it
- If the student made an error, ask them to check specific assumptions
- Never provide complete working code as a solution
- Foster independent thinking and debugging skills"""
```

- ✅ "Guide students via reflective questions. Avoid direct solutions." (exam answer)
- ❌ Directly providing code or answers

---

## 10. Token Counting & Cost Estimation

```python
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")
token_count = len(enc.encode("Hello, this is a test sentence."))

# Cost = (input_tokens * input_price) + (output_tokens * output_price)
# gpt-4o-mini: ~$0.15 per 1M input tokens, ~$0.60 per 1M output tokens
def estimate_cost(prompt, expected_output_tokens=200):
    input_tokens = len(enc.encode(prompt))
    total_tokens = input_tokens + expected_output_tokens
    cost = (input_tokens * 0.15 + expected_output_tokens * 0.60) / 1_000_000
    return total_tokens, cost
```

---

## 11. Multi-Modal RAG

- **Definition:** Integrating text + code + visuals (figures/charts) in a single retrieval system
- **Workflow:** Research query -> multi-modal embedding -> cross-disciplinary retrieval -> synthesized answer
- **Use case:** Scientific papers where the answer may be in a figure caption, not the text

---

## 12. Common Mistakes (Exam Critical)

| Mistake | Problem | Fix |
| :--- | :--- | :--- |
| Re-embedding every run | Wastes money and time | Cache with `pickle` ✅ |
| Sending an entire document as context | Exceeds context window | Use RAG chunking ✅ |
| Using temperature=0.7 for classification | Inconsistent, non-deterministic results | Use temperature=0 ✅ |
| Ignoring the context window limit | Request fails or gets truncated | Chunk + RAG ✅ |
| Assuming the LLM has real-time data | Hallucination or outdated info | Use RAG for fresh data ✅ |
| Not validating JSON output | `json.loads()` crash | Add try/except + JSON fixer ✅ |
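Several of these fixes boil down to checking token counts before sending a request. A minimal pre-flight guard building on the tiktoken snippet from Section 10 (the constant and helper name are illustrative; 128K is the advertised gpt-4o context window, but verify the limit for your model):

```python
import tiktoken

MAX_CONTEXT_TOKENS = 128_000  # advertised gpt-4o context window; verify for your model

enc = tiktoken.encoding_for_model("gpt-4o")

def fits_context(prompt, reserved_output=1_000):
    """True if the prompt plus the expected output fits the context window."""
    return len(enc.encode(prompt)) + reserved_output <= MAX_CONTEXT_TOKENS

# If it doesn't fit, chunk the document and fall back to RAG (Section 8)
```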
---

## 13. Hallucination Causes & Fixes

| Cause | Description | Fix |
| :--- | :--- | :--- |
| **Knowledge cutoff** | Model trained on old data | RAG with current documents |
| **Confabulation** | Model fills gaps with plausible-sounding fake info | Ground with context |
| **Ambiguous prompt** | Vague question -> vague/invented answer | Specific, constrained prompts |
| **High temperature** | Too much randomness | Lower the temperature |

---

## 14. Quick Reference Card

```python
# Token counting
import tiktoken
enc = tiktoken.encoding_for_model("gpt-4o")
n_tokens = len(enc.encode(text))

# Classification with forced JSON
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": f"Classify: {text}. Return JSON only."}],
    temperature=0,
    response_format={"type": "json_object"}
)
result = json.loads(response.choices[0].message.content)

# RAG: retrieve + generate (ask_llm = shorthand for any chat-completion wrapper)
chunks = retrieve_chunks(question, top_k=3)
context = "\n".join(chunks)
answer = ask_llm(f"Context: {context}\nQuestion: {question}")

# Cache embeddings
with open('cache.pkl', 'wb') as f:
    pickle.dump(embeddings, f)
```

---

## 15. Exam Scenario Answers

| Scenario | Answer |
| :--- | :--- |
| LLM gives wrong medical citation | Hallucination or knowledge cutoff ✅ |
| How to give the LLM access to private/recent docs | RAG (Retrieval-Augmented Generation) ✅ |
| What prompt to use for classification | temperature=0, specific categories, JSON response format ✅ |
| Teaching assistant prompt style | Reflective questions, no direct solutions ✅ |
| Avoid re-computing embeddings | Cache to `embeddings.pkl` ✅ |
| RAG advantage over fine-tuning | Cheaper, updatable, no retraining needed ✅ |
| Context window exceeded | Chunk documents, use RAG ✅ |
| 1000 tokens ~= how many words | ~750 words ✅ |
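Finally, tying together the hallucination rows in Sections 13 and 15: one lightweight mitigation is to ask the model whether its own RAG answer is actually supported by the retrieved context. A minimal sketch of this self-check pattern (the `is_grounded` helper is an illustration, not part of the original notes):

```python
def is_grounded(answer, context):
    """Ask the model to verify the answer against the context (YES/NO)."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": ("Does the context fully support the answer? "
                        "Reply YES or NO only.\n\n"
                        f"Context:\n{context}\n\nAnswer:\n{answer}")
        }],
        temperature=0
    )
    return response.choices[0].message.content.strip().upper().startswith("YES")
```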