# GenAI, LLMs & Agentic AI

> Covers GenAI concepts, LLM internals, fine-tuning techniques, prompt engineering, RAG, embeddings, LangChain, LangGraph, and Agentic AI patterns.

---

## 1. LLM Core Concepts

| Term | Definition |
| :--- | :--- |
| **Token** | Smallest unit LLMs process (~4 chars of English). "ChatGPT" = 2 tokens |
| **Context window** | Max tokens in one call (input + output). GPT-4o = 128k |
| **Temperature** | Sampling randomness (0 = near-deterministic, higher = more varied) |
| **Top-p (nucleus)** | Cumulative probability cutoff for token sampling |
| **Top-k** | Sample only from the k most likely tokens |
| **Logprobs** | Log-probabilities of each output token |
| **Hallucination** | Model generates plausible but factually wrong content |
| **Grounding** | Anchoring outputs to verified data (RAG, tools) |

### Model Families (2025-2026)

| Model | Provider | Strength |
| :--- | :--- | :--- |
| GPT-4o / o3 | OpenAI | Best overall, reasoning |
| Claude 3.5 Sonnet / Opus | Anthropic | Long context, coding |
| Gemini 2.0 Flash / Pro | Google | Multimodal, speed |
| Llama 3.3 70B | Meta | Open weights, self-host |
| Mistral Large | Mistral | European, efficient |
| Qwen 2.5 | Alibaba | Multilingual, coding |

---

## 2. Calling LLMs via API

```python
# OpenAI SDK (also works with Azure, Together, Groq)
from openai import OpenAI

client = OpenAI(api_key="sk-...")

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain RAG in 2 sentences."}
    ],
    temperature=0.7,
    max_tokens=200
)
print(response.choices[0].message.content)

# Structured output (JSON mode): the prompt must mention "JSON"
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Extract name and age as JSON from: Alice is 30"}],
    response_format={"type": "json_object"}
)

# Streaming
stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Count to 5"}],
    stream=True
)
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="")
```

---

## 3. Prompt Engineering

### Core Patterns

| Pattern | Template | Use Case |
| :--- | :--- | :--- |
| **Zero-shot** | "Classify: [text] → category" | Simple tasks |
| **Few-shot** | "Q: X → A: Y\nQ: [new]" | Consistent format |
| **Chain-of-Thought** | "Think step by step: ..." | Math, reasoning |
| **Self-Consistency** | Run 5x, take majority (sketch below) | Accuracy boost |
| **Role prompting** | "You are an expert in..." | Domain tasks |
| **ReAct** | Reason + Act interleaved | Agents |

```python
# Few-shot prompt
prompt = """
Classify sentiment as POSITIVE or NEGATIVE.

Text: "I love this product!" → POSITIVE
Text: "Terrible experience." → NEGATIVE
Text: "{user_input}" →
""".format(user_input=user_text)

# Chain-of-Thought
cot_prompt = """
Solve step by step, then give the final answer.
Q: If a train travels 120km in 2h, how long for 300km?
A: Let me think step by step...
"""
```
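The self-consistency row above is easy to prototype: sample several chain-of-thought completions at a non-zero temperature and keep the majority answer. A minimal sketch, assuming the `client` from section 2 and answers short enough to compare as exact strings; the prompt suffix and `n=5` are illustrative choices, not a fixed recipe.

```python
# Self-consistency sketch: sample multiple reasoning paths, keep the majority answer
from collections import Counter

def self_consistent_answer(question, n=5):
    answers = []
    for _ in range(n):
        res = client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "user",
                "content": f"{question}\nThink step by step, then end with 'Answer: <value>'."
            }],
            temperature=0.8,   # diversity across samples is what makes the vote meaningful
        )
        text = res.choices[0].message.content
        answers.append(text.rsplit("Answer:", 1)[-1].strip())   # keep only the final answer
    return Counter(answers).most_common(1)[0][0]
```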
---

## 4. Embeddings & Vector Search

```python
from openai import OpenAI
import numpy as np

client = OpenAI()

# Generate embeddings
def embed(text):
    res = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(res.data[0].embedding)   # 1536-dim vector

# Cosine similarity
def cosine_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

query_vec = embed("machine learning tutorial")
doc_vecs = [embed(doc) for doc in docs]
scores = [cosine_sim(query_vec, d) for d in doc_vecs]
top_match = docs[np.argmax(scores)]

# Vector DBs: Chroma (local)
import chromadb

client_db = chromadb.Client()
col = client_db.create_collection("docs")
col.add(documents=docs, ids=[str(i) for i in range(len(docs))])
results = col.query(query_texts=["my query"], n_results=3)

# Alternatives: Pinecone, Qdrant, Weaviate, pgvector (Postgres)
```

---

## 5. RAG: Retrieval-Augmented Generation

```
RAG Pipeline:

[Documents]  → Chunking → Embedding → Vector Store
                                           ↓
[User Query] → Embed → Similarity Search → Top-k Chunks
                                           ↓
            Augment Prompt: "Context: {chunks}\nQ: {query}"
                                           ↓
                                     LLM → Answer
```

```python
# Minimal RAG from scratch (illustrative; pre-compute document embeddings in practice)
def rag(query, docs, top_k=3):
    # 1. Embed query
    q_vec = embed(query)
    # 2. Retrieve top-k
    scores = [(cosine_sim(q_vec, embed(d)), d) for d in docs]
    context = "\n".join([d for _, d in sorted(scores, reverse=True)[:top_k]])
    # 3. Augment + generate
    prompt = f"Answer based on this context only:\n{context}\n\nQuestion: {query}"
    return client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}]
    ).choices[0].message.content

# Chunking strategies
# Fixed-size:      split every 512 tokens
# Sentence-aware:  split on sentences, overlap ~20%
# Semantic:        cluster similar sentences together
```

---

## 6. Fine-Tuning Techniques

### When to Fine-Tune vs RAG vs Prompting

| Approach | When | Cost |
| :--- | :--- | :--- |
| Prompting | Format, style, simple tasks | Free |
| RAG | Knowledge grounding, fresh data | Low |
| Fine-tuning | Style/domain adaptation, speed | Medium-High |

### Full Fine-Tuning

```
- Updates ALL model weights
- Requires a GPU cluster and large data (10k+ examples)
- Risk: catastrophic forgetting
- Use: base models (Llama, Mistral) + custom dataset
```

### Parameter-Efficient Fine-Tuning (PEFT)

| Method | Idea | Params Trained |
| :--- | :--- | :--- |
| **LoRA** | Inject low-rank matrices into attention | ~0.1–1% |
| **QLoRA** | LoRA + quantized base model (4-bit) | ~0.1% + 4-bit base |
| **Prefix Tuning** | Learn soft prompt prefix tokens | Tiny |
| **Adapter** | Small bottleneck layers between transformer blocks | ~1% |

```python
# QLoRA with Hugging Face (recommended for consumer GPUs)
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import get_peft_model, LoraConfig

# 4-bit quantized base
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype="bfloat16")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B", quantization_config=bnb_config)

# LoRA config
lora_config = LoraConfig(
    r=16,                                  # rank (higher = more params, more expressive)
    lora_alpha=32,                         # scaling factor
    target_modules=["q_proj", "v_proj"],   # which layers to adapt
    lora_dropout=0.05
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()   # → ~0.1% trainable
```
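To actually train the LoRA-wrapped model above, a common route is TRL's `SFTTrainer`. A rough sketch, assuming a recent `trl` release and a local `sft_data.jsonl` file in the chat format shown in the next subsection; the file name and hyperparameters are placeholders, not a tuned recipe.

```python
# SFT sketch with TRL (dataset path and hyperparameters are illustrative placeholders)
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

train_ds = load_dataset("json", data_files="sft_data.jsonl", split="train")

trainer = SFTTrainer(
    model=model,                 # the QLoRA-wrapped model from the block above
    train_dataset=train_ds,
    args=SFTConfig(
        output_dir="llama3-qlora",
        num_train_epochs=1,
        per_device_train_batch_size=2,
        learning_rate=2e-4,
    ),
)
trainer.train()
model.save_pretrained("llama3-qlora-adapter")   # saves only the small LoRA adapter weights
```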
### Training Data Format (SFT)

```json
{"messages": [
  {"role": "system", "content": "You are a SQL expert."},
  {"role": "user", "content": "Write a query to find top 5 customers."},
  {"role": "assistant", "content": "SELECT customer_id, SUM(amount)..."}
]}
```

### RLHF & DPO

```
RLHF (Reinforcement Learning from Human Feedback):
1. Supervised Fine-Tuning (SFT) on demonstrations
2. Train a Reward Model (RM) on human preference pairs
3. PPO to optimize the policy toward the RM: complex, expensive

DPO (Direct Preference Optimization), a simpler alternative:
- Skips the RM entirely
- Trains directly on (chosen, rejected) pairs
- More stable, easier to implement
- Used in: Zephyr, Mixtral Instruct, Llama 3 post-training

Dataset format for DPO:
{"prompt": "...", "chosen": "good response", "rejected": "bad response"}
```

---

## 7. LangChain: LLM Application Framework

```python
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

llm = ChatOpenAI(model="gpt-4o", temperature=0)

# Basic chain (LCEL - LangChain Expression Language)
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant."),
    ("user", "{input}")
])
chain = prompt | llm | StrOutputParser()
result = chain.invoke({"input": "What is RAG?"})

# RAG Chain with retriever
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain.chains import create_retrieval_chain

vectorstore = Chroma.from_documents(docs, OpenAIEmbeddings())
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})

# Conversational chain with memory
from langchain.memory import ConversationBufferMemory
from langchain.chains import ConversationalRetrievalChain

memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)
qa_chain = ConversationalRetrievalChain.from_llm(llm, retriever, memory=memory)
result = qa_chain.invoke({"question": "Explain transformers"})

# Tools
from langchain.tools import tool

@tool
def get_weather(city: str) -> str:
    """Get current weather for a city."""
    return f"Weather in {city}: Sunny, 28°C"

llm_with_tools = llm.bind_tools([get_weather])   # model emits tool calls; your code still executes them
```

---

## 8. LangGraph: Stateful Agentic Workflows

```python
from langgraph.graph import StateGraph, END
from langchain_core.messages import HumanMessage
from typing import TypedDict, Annotated
import operator

# Define state
class AgentState(TypedDict):
    messages: Annotated[list, operator.add]   # new messages are appended, not overwritten
    next_action: str

# Define nodes (functions that transform state)
def call_llm(state: AgentState):
    response = llm.invoke(state["messages"])
    return {"messages": [response]}

def run_tool(state: AgentState):
    # Execute the tool call from the last message (execute_tool is your own dispatch logic)
    tool_result = execute_tool(state["messages"][-1])
    return {"messages": [tool_result]}

def should_continue(state: AgentState):
    last = state["messages"][-1]
    if last.tool_calls:
        return "tool"   # → run_tool node
    return END          # → finish

# Build graph
graph = StateGraph(AgentState)
graph.add_node("llm", call_llm)
graph.add_node("tool", run_tool)
graph.set_entry_point("llm")
graph.add_conditional_edges("llm", should_continue, {"tool": "tool", END: END})
graph.add_edge("tool", "llm")   # after the tool, go back to the LLM

app = graph.compile()
result = app.invoke({"messages": [HumanMessage("What is 25 * 37?")]})
```

### LangGraph Key Concepts

| Concept | Meaning |
| :--- | :--- |
| **State** | Typed dict passed between all nodes |
| **Node** | Function that reads + updates state |
| **Edge** | Fixed transition between nodes |
| **Conditional Edge** | Router function decides the next node |
| **Checkpointer** | Persist state to a DB for resumable workflows (sketch below) |
| **Interrupt** | Pause the graph for human-in-the-loop |
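A minimal sketch of the last two rows, assuming the compiled `graph` above: `MemorySaver` is LangGraph's in-memory checkpointer, and the `thread_id` value is just an illustrative key for one conversation thread.

```python
# Checkpointer + human-in-the-loop sketch (thread_id is an arbitrary illustrative key)
from langgraph.checkpoint.memory import MemorySaver

checkpointer = MemorySaver()
app = graph.compile(
    checkpointer=checkpointer,
    interrupt_before=["tool"],   # pause before the tool node for human approval
)

config = {"configurable": {"thread_id": "user-42"}}
app.invoke({"messages": [HumanMessage("What is 25 * 37?")]}, config)

# Inspect the paused state, then resume the same thread by invoking with no new input
print(app.get_state(config).next)   # → ("tool",)
app.invoke(None, config)
```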
---

## 9. Agentic AI Patterns

### Core Agent Loop (ReAct)

```
Thought      → what do I need to do?
Action       → call tool / search / execute
Observation  → result of action
... repeat until ...
Final Answer → return result
```

### Agentic Patterns

| Pattern | Description | When |
| :--- | :--- | :--- |
| **Tool Use** | LLM calls functions/APIs | Calculators, search, DB |
| **ReAct** | Reason + Act interleaved | General agents |
| **Plan-and-Execute** | Plan all steps first, then execute | Long tasks |
| **Reflection** | Agent critiques its own output | Accuracy-critical |
| **Multi-Agent** | Orchestrator delegates to specialist agents | Complex systems |
| **Human-in-the-Loop** | Pause for human approval on key steps | High-stakes |

### Multi-Agent Architecture

```
Orchestrator Agent
├─ Research Agent (web search, papers)
├─ Code Agent (write + run code)
├─ Critique Agent (check output quality)
└─ Output Agent (format final answer)
```

```python
# LangGraph multi-agent handoff (search_tool / python_repl are placeholder tools)
from langgraph.prebuilt import create_react_agent

research_agent = create_react_agent(llm, tools=[search_tool])
code_agent = create_react_agent(llm, tools=[python_repl])

def route_task(state):
    if "code" in state["task"]:
        return "code_agent"
    return "research_agent"
```

---

## 10. Evaluation & Safety

### LLM Evaluation Metrics

| Metric | Measures | Tool |
| :--- | :--- | :--- |
| **BLEU** | n-gram overlap vs reference | `sacrebleu` |
| **ROUGE** | Recall-oriented overlap | `rouge-score` |
| **BERTScore** | Semantic similarity | `bert-score` |
| **LLM-as-Judge** | Use GPT-4 to score outputs | Custom prompt |
| **Faithfulness** | Is the answer grounded in the context? | RAGAS |
| **Answer Relevancy** | Does the answer address the question? | RAGAS |

```python
# RAGAS (RAG evaluation)
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_recall

results = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy, context_recall]
)
print(results.to_pandas())
```

### Guardrails & Safety

```python
# Guardrails AI
from guardrails import Guard
from guardrails.hub import ToxicLanguage, PII

guard = Guard().use(ToxicLanguage, on_fail="fix").use(PII, on_fail="exception")
result = guard.validate(llm_output)

# LlamaGuard (Meta): classify inputs/outputs for safety
# Constitutional AI (Anthropic): rules-based self-critique
# Prompt injection: sanitize user input; never interpolate raw user text into the system prompt
```

---

## 11. Quick Reference

```
Model choice:
  Speed + cheap:  GPT-4o mini, Gemini Flash, Llama 3.1 8B
  Best quality:   GPT-4o, Claude 3.5 Sonnet, Gemini 2.0 Pro
  Self-hosted:    Llama 3.3 70B, Mistral Large via Ollama

Fine-tuning pick:
  Limited GPU:    QLoRA (4-bit + LoRA)
  Preference:     DPO
  Production:     OpenAI fine-tuning API, Together AI

Frameworks:
  Single-agent:   LangChain + LCEL
  Multi-step:     LangGraph
  Evals:          RAGAS, Promptfoo
  Serving:        vLLM, Ollama, LiteLLM

Embedding models:
  OpenAI: text-embedding-3-small (fast, cheap)
  Local:  nomic-embed-text, bge-m3 (multilingual)
```
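For the serving row above, LiteLLM exposes a single `completion()` call that fronts most providers. A quick sketch; the model strings follow LiteLLM's provider/model naming and should be treated as illustrative rather than a tested configuration.

```python
# LiteLLM sketch: one API for several providers (model names are illustrative)
from litellm import completion

for model in ["gpt-4o-mini", "ollama/llama3.3", "claude-3-5-sonnet-20241022"]:
    res = completion(
        model=model,
        messages=[{"role": "user", "content": "Summarize RAG in one line."}],
    )
    print(model, "→", res.choices[0].message.content)
```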