# GenAI, LLMs & Agentic AI

> Covers GenAI concepts, LLM internals, fine-tuning techniques, prompt engineering, RAG, embeddings, LangChain, LangGraph, and Agentic AI patterns.

---

## 1. LLM Core Concepts

| Term | Definition |
| :--- | :--- |
| **Token** | Smallest unit LLMs process (~4 chars of English). "ChatGPT" = 2 tokens |
| **Context window** | Max tokens in one call (input + output). GPT-4o = 128k |
| **Temperature** | Sampling randomness (0 = near-deterministic, higher = more varied) |
| **Top-p (nucleus)** | Cumulative probability cutoff for token sampling |
| **Top-k** | Sample only from the k most likely tokens |
| **Logprobs** | Log-probabilities of each output token |
| **Hallucination** | Model generates plausible but factually wrong content |
| **Grounding** | Anchoring outputs to verified data (RAG, tools) |

### Model Families (2025-2026)

| Model | Provider | Strength |
| :--- | :--- | :--- |
| GPT-4o / o3 | OpenAI | Best overall, reasoning |
| Claude 3.5 Sonnet / Opus | Anthropic | Long context, coding |
| Gemini 2.0 Flash / Pro | Google | Multimodal, speed |
| Llama 3.3 70B | Meta | Open weights, self-host |
| Mistral Large | Mistral | European, efficient |
| Qwen 2.5 | Alibaba | Multilingual, coding |

---

## 2. Calling LLMs via API

```python
# OpenAI SDK (also works with Azure, Together, Groq)
from openai import OpenAI

client = OpenAI(api_key="sk-...")

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain RAG in 2 sentences."}
    ],
    temperature=0.7,
    max_tokens=200
)
print(response.choices[0].message.content)

# Structured output (JSON mode): the prompt must mention "JSON"
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Extract name and age as JSON from: Alice is 30"}],
    response_format={"type": "json_object"}
)

# Streaming
stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Count to 5"}],
    stream=True
)
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="")
```

---

## 3. Prompt Engineering

### Core Patterns

| Pattern | Template | Use Case |
| :--- | :--- | :--- |
| **Zero-shot** | "Classify: [text] → category" | Simple tasks |
| **Few-shot** | "Q: X → A: Y\nQ: [new]" | Consistent format |
| **Chain-of-Thought** | "Think step by step: ..." | Math, reasoning |
| **Self-Consistency** | Run 5x, take majority (sketch below) | Accuracy boost |
| **Role prompting** | "You are an expert in..." | Domain tasks |
| **ReAct** | Reason + Act interleaved | Agents |

```python
# Few-shot prompt
prompt = """
Classify sentiment as POSITIVE or NEGATIVE.

Text: "I love this product!" → POSITIVE
Text: "Terrible experience." → NEGATIVE
Text: "{user_input}" →
""".format(user_input=user_text)

# Chain-of-Thought
cot_prompt = """
Solve step by step, then give the final answer.
Q: If a train travels 120km in 2h, how long for 300km?
A: Let me think step by step...
"""
```
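The self-consistency row above is easy to prototype: sample several chain-of-thought completions at a non-zero temperature and keep the majority answer. A minimal sketch, assuming the `client` from section 2 and answers short enough to compare as exact strings; the prompt suffix and `n=5` are illustrative choices, not a fixed recipe.

```python
# Self-consistency sketch: sample multiple reasoning paths, keep the majority answer
from collections import Counter

def self_consistent_answer(question, n=5):
    answers = []
    for _ in range(n):
        res = client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "user",
                "content": f"{question}\nThink step by step, then end with 'Answer: <value>'."
            }],
            temperature=0.8,   # diversity across samples is what makes the vote meaningful
        )
        text = res.choices[0].message.content
        answers.append(text.rsplit("Answer:", 1)[-1].strip())   # keep only the final answer
    return Counter(answers).most_common(1)[0][0]
```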
---

## 4. Embeddings & Vector Search

```python
from openai import OpenAI
import numpy as np

client = OpenAI()

# Generate embeddings
def embed(text):
    res = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(res.data[0].embedding)   # 1536-dim vector

# Cosine similarity
def cosine_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

query_vec = embed("machine learning tutorial")
doc_vecs = [embed(doc) for doc in docs]
scores = [cosine_sim(query_vec, d) for d in doc_vecs]
top_match = docs[np.argmax(scores)]

# Vector DBs: Chroma (local)
import chromadb

client_db = chromadb.Client()
col = client_db.create_collection("docs")
col.add(documents=docs, ids=[str(i) for i in range(len(docs))])
results = col.query(query_texts=["my query"], n_results=3)

# Alternatives: Pinecone, Qdrant, Weaviate, pgvector (Postgres)
```

---

## 5. RAG: Retrieval-Augmented Generation

```
RAG Pipeline:

[Documents]  → Chunking → Embedding → Vector Store
                                           ↓
[User Query] → Embed → Similarity Search → Top-k Chunks
                                           ↓
            Augment Prompt: "Context: {chunks}\nQ: {query}"
                                           ↓
                                     LLM → Answer
```

```python
# Minimal RAG from scratch (illustrative; pre-compute document embeddings in practice)
def rag(query, docs, top_k=3):
    # 1. Embed query
    q_vec = embed(query)
    # 2. Retrieve top-k
    scores = [(cosine_sim(q_vec, embed(d)), d) for d in docs]
    context = "\n".join([d for _, d in sorted(scores, reverse=True)[:top_k]])
    # 3. Augment + generate
    prompt = f"Answer based on this context only:\n{context}\n\nQuestion: {query}"
    return client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}]
    ).choices[0].message.content

# Chunking strategies
# Fixed-size:      split every 512 tokens
# Sentence-aware:  split on sentences, overlap ~20%
# Semantic:        cluster similar sentences together
```

---

## 6. Fine-Tuning Techniques

### When to Fine-Tune vs RAG vs Prompting

| Approach | When | Cost |
| :--- | :--- | :--- |
| Prompting | Format, style, simple tasks | Free |
| RAG | Knowledge grounding, fresh data | Low |
| Fine-tuning | Style/domain adaptation, speed | Medium-High |

### Full Fine-Tuning

```
- Updates ALL model weights
- Requires a GPU cluster and large data (10k+ examples)
- Risk: catastrophic forgetting
- Use: base models (Llama, Mistral) + custom dataset
```

### Parameter-Efficient Fine-Tuning (PEFT)

| Method | Idea | Params Trained |
| :--- | :--- | :--- |
| **LoRA** | Inject low-rank matrices into attention | ~0.1–1% |
| **QLoRA** | LoRA + quantized base model (4-bit) | ~0.1% + 4-bit base |
| **Prefix Tuning** | Learn soft prompt prefix tokens | Tiny |
| **Adapter** | Small bottleneck layers between transformer blocks | ~1% |

```python
# QLoRA with Hugging Face (recommended for consumer GPUs)
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import get_peft_model, LoraConfig

# 4-bit quantized base
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype="bfloat16")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B", quantization_config=bnb_config)

# LoRA config
lora_config = LoraConfig(
    r=16,                                  # rank (higher = more params, more expressive)
    lora_alpha=32,                         # scaling factor
    target_modules=["q_proj", "v_proj"],   # which layers to adapt
    lora_dropout=0.05
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()   # → ~0.1% trainable
```
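To actually train the LoRA-wrapped model above, a common route is TRL's `SFTTrainer`. A rough sketch, assuming a recent `trl` release and a local `sft_data.jsonl` file in the chat format shown in the next subsection; the file name and hyperparameters are placeholders, not a tuned recipe.

```python
# SFT sketch with TRL (dataset path and hyperparameters are illustrative placeholders)
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

train_ds = load_dataset("json", data_files="sft_data.jsonl", split="train")

trainer = SFTTrainer(
    model=model,                 # the QLoRA-wrapped model from the block above
    train_dataset=train_ds,
    args=SFTConfig(
        output_dir="llama3-qlora",
        num_train_epochs=1,
        per_device_train_batch_size=2,
        learning_rate=2e-4,
    ),
)
trainer.train()
model.save_pretrained("llama3-qlora-adapter")   # saves only the small LoRA adapter weights
```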
### Training Data Format (SFT)

```json
{"messages": [
  {"role": "system", "content": "You are a SQL expert."},
  {"role": "user", "content": "Write a query to find top 5 customers."},
  {"role": "assistant", "content": "SELECT customer_id, SUM(amount)..."}
]}
```

### RLHF & DPO

```
RLHF (Reinforcement Learning from Human Feedback):
1. Supervised Fine-Tuning (SFT) on demonstrations
2. Train a Reward Model (RM) on human preference pairs
3. PPO to optimize the policy toward the RM: complex, expensive

DPO (Direct Preference Optimization), a simpler alternative:
- Skips the RM entirely
- Trains directly on (chosen, rejected) pairs
- More stable, easier to implement
- Used in: Zephyr, Mixtral Instruct, Llama 3 post-training

Dataset format for DPO:
{"prompt": "...", "chosen": "good response", "rejected": "bad response"}
```

---

## 7. LangChain: LLM Application Framework

```python
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

llm = ChatOpenAI(model="gpt-4o", temperature=0)

# Basic chain (LCEL - LangChain Expression Language)
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant."),
    ("user", "{input}")
])
chain = prompt | llm | StrOutputParser()
result = chain.invoke({"input": "What is RAG?"})

# RAG Chain with retriever
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain.chains import create_retrieval_chain

vectorstore = Chroma.from_documents(docs, OpenAIEmbeddings())
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})

# Conversational chain with memory
from langchain.memory import ConversationBufferMemory
from langchain.chains import ConversationalRetrievalChain

memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)
qa_chain = ConversationalRetrievalChain.from_llm(llm, retriever, memory=memory)
result = qa_chain.invoke({"question": "Explain transformers"})

# Tools
from langchain.tools import tool

@tool
def get_weather(city: str) -> str:
    """Get current weather for a city."""
    return f"Weather in {city}: Sunny, 28°C"

llm_with_tools = llm.bind_tools([get_weather])   # model emits tool calls; your code still executes them
```

---

## 8. LangGraph: Stateful Agentic Workflows

```python
from langgraph.graph import StateGraph, END
from langchain_core.messages import HumanMessage
from typing import TypedDict, Annotated
import operator

# Define state
class AgentState(TypedDict):
    messages: Annotated[list, operator.add]   # new messages are appended, not overwritten
    next_action: str

# Define nodes (functions that transform state)
def call_llm(state: AgentState):
    response = llm.invoke(state["messages"])
    return {"messages": [response]}

def run_tool(state: AgentState):
    # Execute the tool call from the last message (execute_tool is your own dispatch logic)
    tool_result = execute_tool(state["messages"][-1])
    return {"messages": [tool_result]}

def should_continue(state: AgentState):
    last = state["messages"][-1]
    if last.tool_calls:
        return "tool"   # → run_tool node
    return END          # → finish

# Build graph
graph = StateGraph(AgentState)
graph.add_node("llm", call_llm)
graph.add_node("tool", run_tool)
graph.set_entry_point("llm")
graph.add_conditional_edges("llm", should_continue, {"tool": "tool", END: END})
graph.add_edge("tool", "llm")   # after the tool, go back to the LLM

app = graph.compile()
result = app.invoke({"messages": [HumanMessage("What is 25 * 37?")]})
```

### LangGraph Key Concepts

| Concept | Meaning |
| :--- | :--- |
| **State** | Typed dict passed between all nodes |
| **Node** | Function that reads + updates state |
| **Edge** | Fixed transition between nodes |
| **Conditional Edge** | Router function decides the next node |
| **Checkpointer** | Persist state to a DB for resumable workflows (sketch below) |
| **Interrupt** | Pause the graph for human-in-the-loop |
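A minimal sketch of the last two rows, assuming the compiled `graph` above: `MemorySaver` is LangGraph's in-memory checkpointer, and the `thread_id` value is just an illustrative key for one conversation thread.

```python
# Checkpointer + human-in-the-loop sketch (thread_id is an arbitrary illustrative key)
from langgraph.checkpoint.memory import MemorySaver

checkpointer = MemorySaver()
app = graph.compile(
    checkpointer=checkpointer,
    interrupt_before=["tool"],   # pause before the tool node for human approval
)

config = {"configurable": {"thread_id": "user-42"}}
app.invoke({"messages": [HumanMessage("What is 25 * 37?")]}, config)

# Inspect the paused state, then resume the same thread by invoking with no new input
print(app.get_state(config).next)   # → ("tool",)
app.invoke(None, config)
```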
---

## 9. Agentic AI Patterns

### Core Agent Loop (ReAct)

```
Thought      → what do I need to do?
Action       → call tool / search / execute
Observation  → result of action
... repeat until ...
Final Answer → return result
```

### Agentic Patterns

| Pattern | Description | When |
| :--- | :--- | :--- |
| **Tool Use** | LLM calls functions/APIs | Calculators, search, DB |
| **ReAct** | Reason + Act interleaved | General agents |
| **Plan-and-Execute** | Plan all steps first, then execute | Long tasks |
| **Reflection** | Agent critiques its own output | Accuracy-critical |
| **Multi-Agent** | Orchestrator delegates to specialist agents | Complex systems |
| **Human-in-the-Loop** | Pause for human approval on key steps | High-stakes |

### Multi-Agent Architecture

```
Orchestrator Agent
├─ Research Agent (web search, papers)
├─ Code Agent (write + run code)
├─ Critique Agent (check output quality)
└─ Output Agent (format final answer)
```

```python
# LangGraph multi-agent handoff (search_tool / python_repl are placeholder tools)
from langgraph.prebuilt import create_react_agent

research_agent = create_react_agent(llm, tools=[search_tool])
code_agent = create_react_agent(llm, tools=[python_repl])

def route_task(state):
    if "code" in state["task"]:
        return "code_agent"
    return "research_agent"
```

---

## 10. Evaluation & Safety

### LLM Evaluation Metrics

| Metric | Measures | Tool |
| :--- | :--- | :--- |
| **BLEU** | n-gram overlap vs reference | `sacrebleu` |
| **ROUGE** | Recall-oriented overlap | `rouge-score` |
| **BERTScore** | Semantic similarity | `bert-score` |
| **LLM-as-Judge** | Use GPT-4 to score outputs | Custom prompt |
| **Faithfulness** | Is the answer grounded in the context? | RAGAS |
| **Answer Relevancy** | Does the answer address the question? | RAGAS |

```python
# RAGAS (RAG evaluation)
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_recall

results = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy, context_recall]
)
print(results.to_pandas())
```

### Guardrails & Safety

```python
# Guardrails AI
from guardrails import Guard
from guardrails.hub import ToxicLanguage, PII

guard = Guard().use(ToxicLanguage, on_fail="fix").use(PII, on_fail="exception")
result = guard.validate(llm_output)

# LlamaGuard (Meta): classify inputs/outputs for safety
# Constitutional AI (Anthropic): rules-based self-critique
# Prompt injection: sanitize user input; never interpolate raw user text into the system prompt
```

---

## 11. Quick Reference

```
Model choice:
  Speed + cheap:  GPT-4o mini, Gemini Flash, Llama 3.1 8B
  Best quality:   GPT-4o, Claude 3.5 Sonnet, Gemini 2.0 Pro
  Self-hosted:    Llama 3.3 70B, Mistral Large via Ollama

Fine-tuning pick:
  Limited GPU:    QLoRA (4-bit + LoRA)
  Preference:     DPO
  Production:     OpenAI fine-tuning API, Together AI

Frameworks:
  Single-agent:   LangChain + LCEL
  Multi-step:     LangGraph
  Evals:          RAGAS, Promptfoo
  Serving:        vLLM, Ollama, LiteLLM

Embedding models:
  OpenAI: text-embedding-3-small (fast, cheap)
  Local:  nomic-embed-text, bge-m3 (multilingual)
```
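For the serving row above, LiteLLM exposes a single `completion()` call that fronts most providers. A quick sketch; the model strings follow LiteLLM's provider/model naming and should be treated as illustrative rather than a tested configuration.

```python
# LiteLLM sketch: one API for several providers (model names are illustrative)
from litellm import completion

for model in ["gpt-4o-mini", "ollama/llama3.3", "claude-3-5-sonnet-20241022"]:
    res = completion(
        model=model,
        messages=[{"role": "user", "content": "Summarize RAG in one line."}],
    )
    print(model, "→", res.choices[0].message.content)
```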