# Web APIs & Scraping

## 1. HTTP Status Codes (Exam Critical)

### Complete Status Code Reference

| Code | Meaning | Notes |
| :--- | :--- | :--- |
| **200** | OK | Standard success ✅ |
| **201** | Created | After POST (resource created) ✅ |
| **204** | No Content | Success, no response body (DELETE) ✅ |
| **206** | Partial Content | File download resume |
| **301** | Moved Permanently | Update your bookmarks |
| **302** | Found | Temporary redirect |
| **304** | Not Modified | Cached version still valid, empty body ✅ |
| **400** | Bad Request | Malformed syntax or invalid parameters |
| **401** | Unauthorized | Authentication required/failed (no/wrong API key) ✅ |
| **403** | Forbidden | Authenticated but no permission ✅ |
| **404** | Not Found | Resource does not exist |
| **408** | Request Timeout | Server timed out waiting for the request |
| **409** | Conflict | Request conflicts with current resource state (e.g., duplicate) |
| **410** | Gone | Resource permanently deleted |
| **413** | Payload Too Large | Request body exceeds server limit |
| **422** | Unprocessable Entity | Syntactically correct but semantically wrong |
| **429** | Too Many Requests | Rate limit exceeded ✅ |
| **500** | Internal Server Error | Generic server-side error ✅ |
| **502** | Bad Gateway | Invalid response from upstream server |
| **503** | Service Unavailable | Server temporarily down ✅ |
| **504** | Gateway Timeout | Upstream server did not respond in time |

### Range Checks (Exam Critical)

```python
200 <= status < 300  # Success ✅ (correct range check, exam answer)
300 <= status < 400  # Redirect ✅
400 <= status < 500  # Client Error ✅
500 <= status < 600  # Server Error ✅

# ❌ Wrong:
status == 200  # misses 201, 204, etc.
status < 400   # includes redirects
```

### Key Distinctions

| Pair | Difference |
| :--- | :--- |
| 401 vs 403 | 401 = not authenticated (Who are you?); 403 = authenticated but no permission (You can't come in) |
| 401 vs 429 | 401 = wrong/missing key; 429 = valid key but too many requests |
| 404 vs 410 | 404 = not found (may exist elsewhere); 410 = permanently deleted |
| 301 vs 302 | 301 = permanent redirect; 302 = temporary |
| 500 vs 503 | 500 = code error; 503 = server temporarily unavailable |
| 502 vs 504 | 502 = bad response FROM upstream; 504 = upstream TIMED OUT |

---

## 2. HTTP Methods (Exam Critical)

| Method | Safe? | Idempotent? | Use Case |
| :--- | :--- | :--- | :--- |
| **GET** | ✅ | ✅ | Retrieve data, no side effects |
| **HEAD** | ✅ | ✅ | Like GET but no response body |
| **OPTIONS** | ✅ | ✅ | Get allowed methods for a resource |
| **POST** | ❌ | ❌ | Create resource. Multiple calls = multiple resources ✅ |
| **PUT** | ❌ | ✅ | Replace entire resource. Multiple calls = same result ✅ |
| **PATCH** | ❌ | ❌ | Partial update |
| **DELETE** | ❌ | ✅ | Remove. Multiple calls = stays deleted ✅ |

- **Safe** = does NOT modify server state
- **Idempotent** = same result if called multiple times
- ✅ GET is safe (exam answer)
- ✅ PUT is idempotent (exam answer)
- ❌ POST is NOT idempotent (creates a new resource each time) — see the sketch below
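The range checks and the idempotency column combine naturally in retry logic: only idempotent methods are safe to resend blindly. A minimal sketch (the helper names `status_class` and `can_retry` are illustrative, not from any library):

```python
# Classify a status code using the range checks above
def status_class(status: int) -> str:
    if 200 <= status < 300:
        return "success"
    if 300 <= status < 400:
        return "redirect"
    if 400 <= status < 500:
        return "client error"
    if 500 <= status < 600:
        return "server error"
    return "unknown"

# Only idempotent methods can be retried without side effects
IDEMPOTENT = {"GET", "HEAD", "OPTIONS", "PUT", "DELETE"}

def can_retry(method: str) -> bool:
    return method.upper() in IDEMPOTENT

assert status_class(201) == "success"  # status == 200 would miss this
assert not can_retry("POST")           # POST creates a new resource each call
```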
---

## 3. `requests` Library - Complete Reference

```python
import os
import requests

url = 'https://api.example.com/items'  # example endpoint
token = os.getenv('API_TOKEN')         # auth token from the environment

# GET request
r = requests.get(
    'https://api.example.com/data',
    params={'q': 'Delhi', 'appid': os.getenv('API_KEY')},
    headers={'Authorization': f'Bearer {token}', 'User-Agent': 'MyApp/1.0'},
    timeout=30,            # seconds before timeout ✅
    verify=True,           # SSL certificate verification
    auth=('user', 'pass')  # basic auth
)

# POST request - use json= OR data=, not both
r = requests.post(url,
    json={'key': 'value'},  # auto-sets Content-Type: application/json ✅
    headers={'Authorization': f'Bearer {token}'}
)
r = requests.post(url, data={'key': 'value'})  # form-encoded instead

# PUT / PATCH / DELETE
r = requests.put(url, json={'key': 'value'})
r = requests.patch(url, json={'email': 'new@example.com'})
r = requests.delete(url)

# Response object
r.status_code         # 200, 404, etc.
r.ok                  # True if status < 400 (no 4xx/5xx)
r.raise_for_status()  # raises exception for 4xx/5xx ✅
r.text                # response as string
r.json()              # parse JSON -> Python dict ✅
r.content             # response as bytes
r.headers             # response headers dict
r.url                 # final URL (after redirects)
r.history             # list of redirect responses
r.elapsed             # time taken
r.cookies             # response cookies
```

---

## 4. Error Handling

```python
import requests
from requests.exceptions import ConnectionError, HTTPError, RequestException, Timeout

def safe_api_call(url, params):
    try:
        response = requests.get(url, params=params, timeout=30)
        response.raise_for_status()  # ✅ raise for 4xx/5xx
        return response.json()
    except Timeout:
        print("Request timed out")
        return None
    except ConnectionError:
        print("Connection failed - check network")
        return None
    except HTTPError as e:
        code = e.response.status_code
        if code == 401:
            print("Check API key")
        elif code == 403:
            print("No permission to access this resource")
        elif code == 429:
            print("Rate limit exceeded, slow down")
        return None
    except RequestException as e:  # catch-all for any other requests error
        print(f"Request failed: {e}")
        return None
```

---

## 5. Sessions - Reusing Connections

```python
import requests

# token, url1, url2: your credentials/endpoints

# ✅ Use a session for multiple requests to the same API
# More efficient - reuses the TCP connection, shares headers/cookies
session = requests.Session()
session.headers.update({
    'Authorization': f'Bearer {token}',
    'Content-Type': 'application/json',
    'User-Agent': 'TDS-Bot/1.0'
})
r1 = session.get(url1)
r2 = session.get(url2)

# Context manager (auto-closes)
with requests.Session() as session:
    session.headers.update({'Authorization': f'Bearer {token}'})
    r = session.get(url)
```

---

## 6. Rate Limiting & Retry Patterns (Exam Critical)

```python
import time
import requests

url = 'https://api.example.com/weather'  # example endpoint
cities = ['Delhi', 'Mumbai', 'Chennai']
results = []

# Simple sleep between requests ✅ (exam answer)
for city in cities:
    response = requests.get(url, params={'q': city})
    if response.status_code == 200:
        results.append(response.json())
    elif response.status_code == 429:
        wait = int(response.headers.get('Retry-After', 60))
        time.sleep(wait)  # ✅ use Retry-After header if available
    time.sleep(1)  # ✅ 1-2 sec between all requests

# Exponential backoff
def fetch_with_retry(url, params, max_retries=3):
    for attempt in range(max_retries):
        response = requests.get(url, params=params)
        if response.status_code == 200:
            return response.json()
        elif response.status_code == 429:
            wait_time = 2 ** attempt  # 1, 2, 4 seconds
            print(f"Rate limited. Waiting {wait_time}s...")
            time.sleep(wait_time)
        elif response.status_code >= 500:
            time.sleep(2 ** attempt)  # retry server errors
        else:
            raise Exception(f"Unexpected status: {response.status_code}")
    raise Exception("Max retries exceeded")
```

- ✅ Use `time.sleep(1-2)` between requests (exam answer); a session-level alternative is sketched below
- ❌ Make all requests simultaneously -> triggers 429
- ❌ Ignore rate limits -> API blocks your key
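The manual backoff loop above can also be configured once on a `Session`, letting `requests` retry 429/5xx responses automatically. A minimal sketch using urllib3's `Retry` (the `allowed_methods` parameter assumes urllib3 >= 1.26; older versions call it `method_whitelist`):

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

retry = Retry(
    total=3,                                     # at most 3 retries
    backoff_factor=1,                            # exponential sleep between attempts
    status_forcelist=[429, 500, 502, 503, 504],  # retry these statuses
    allowed_methods=["GET", "PUT", "DELETE"],    # idempotent methods only
)

session = requests.Session()
session.mount('https://', HTTPAdapter(max_retries=retry))

r = session.get('https://api.example.com/data', timeout=30)  # retried transparently
```

`Retry` also honors the `Retry-After` header on 429/503 by default, matching the manual pattern above.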
---

## 7. API Security - Storing Keys

```python
import os
from dotenv import load_dotenv

load_dotenv()  # loads .env file

# ✅ Read from an environment variable
api_key = os.getenv('WEATHER_API_KEY')
api_key = os.environ.get('WEATHER_API_KEY', 'default')

# .env file (NEVER commit to git!)
# WEATHER_API_KEY=your_key_here
# ANTHROPIC_API_KEY=sk-ant-...

# .gitignore - always add:
# .env
# .env.*
```

- ❌ Hardcoded in the script: `api_key = "abc123"` (visible in git history!)
- ❌ In code comments or README
- ❌ Printed to console (appears in logs)

---

## 8. JSON Response Parsing - Nested Fields (Exam Critical)

```python
# Typical nested API response
response_json = {
    "name": "Delhi",
    "main": {"temp": 298.15, "humidity": 60},
    "weather": [  # <- this is a LIST
        {"description": "clear sky", "icon": "01d"}
    ],
    "wind": {"speed": 3.5}
}

# ✅ Correct access
city = response_json['name']
temp = response_json['main']['temp']
humidity = response_json['main']['humidity']
desc = response_json['weather'][0]['description']  # [0] is required! ✅
wind_speed = response_json['wind']['speed']

# ❌ Wrong:
# response_json['weather']['description']  # weather is a list, not a dict
# response_json['main']['weather']         # wrong key path

# Safe access (no KeyError)
temp = response_json.get('main', {}).get('temp', 0)

# Convert Kelvin to Celsius
temp_celsius = response_json['main']['temp'] - 273.15
```
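For deeply nested responses, the chained `.get()` pattern can be wrapped in a small helper. A sketch (`deep_get` is a hypothetical utility, not part of any library) that also accepts list indices, so paths like `'weather', 0, 'description'` work:

```python
def deep_get(obj, *path, default=None):
    """Walk dict keys and list indices, returning default on any miss."""
    for step in path:
        if isinstance(obj, dict):
            obj = obj.get(step)
        elif isinstance(obj, list) and isinstance(step, int) and -len(obj) <= step < len(obj):
            obj = obj[step]
        else:
            return default
        if obj is None:
            return default
    return obj

deep_get(response_json, 'weather', 0, 'description')    # 'clear sky'
deep_get(response_json, 'weather', 3, 'description')    # None, no IndexError
deep_get(response_json, 'main', 'pressure', default=0)  # 0, no KeyError
```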
---

## 9. API Pagination

```python
import time
import requests

# Page-number pagination
def get_all_results(base_url, params):
    all_results = []
    page = 1
    while True:
        params['page'] = page
        params['per_page'] = 100
        response = requests.get(base_url, params=params)
        data = response.json()
        results = data.get('results', [])
        if not results:
            break  # no more pages
        all_results.extend(results)
        page += 1
        time.sleep(0.5)  # rate limiting
    return all_results

# Cursor pagination
def get_all_cursor(url, params):
    all_results = []
    cursor = None
    while True:
        if cursor:
            params['cursor'] = cursor
        response = requests.get(url, params=params)
        data = response.json()
        all_results.extend(data['items'])
        cursor = data.get('next_cursor')
        if not cursor:
            break
    return all_results

# Link-header pagination (GitHub style)
def get_all_github(url, headers):
    all_results = []
    while url:
        response = requests.get(url, headers=headers)
        all_results.extend(response.json())
        link = response.headers.get('Link', '')
        url = None
        for part in link.split(','):
            if 'rel="next"' in part:
                url = part.split(';')[0].strip().strip('<>')
    return all_results
```

---

## 10. Chrome DevTools for API Debugging

```
Network Tab:
- See ALL HTTP requests made by the page
- Filter: XHR/Fetch -> API calls specifically ✅
- Headers tab -> request/response headers, status code, authorization header
- Payload tab -> request body (POST data)
- Response tab -> actual JSON response body ✅
- Timing tab -> DNS, connection, TTFB, download time

Application Tab:
- Cookies -> session cookies, auth tokens ✅
- LocalStorage / SessionStorage

Console Tab:
- JavaScript errors
- Run snippets to inspect/test

Throttling:
- Simulate a slow network (3G, offline) to debug timing issues
```

### Debugging Common Problems

```
401 in Network tab -> Headers -> check Authorization header is present and correctly formatted
429 in Network tab -> add time.sleep() between requests
Wrong data -> Network -> Response tab -> see the exact JSON returned
Slow API -> Network -> Timing tab -> identify the bottleneck (TTFB, DNS, etc.)
Session not working -> Application -> Cookies -> check the cookie is being set
```
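Once the Network tab has revealed an endpoint, you can replicate the call in Python by copying the pieces DevTools shows you: the URL and query string from the Headers tab, any auth headers, and the body from the Payload tab (DevTools' "Copy as cURL" gives you all of these at once). A sketch where the endpoint and header values are placeholders for whatever you observe:

```python
import requests

# Values copied from DevTools -> Network -> Headers / Payload
r = requests.get(
    'https://example.com/api/v2/search',   # Request URL
    params={'q': 'delhi', 'limit': '20'},  # query string parameters
    headers={
        'Authorization': 'Bearer <token seen in the request headers>',
        'User-Agent': 'MyApp/1.0',
    },
    timeout=30,
)
print(r.status_code, r.elapsed)  # compare with the status/timing DevTools showed
print(r.json())                  # should match the Response tab
```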

---

## 11. BeautifulSoup - HTML Parsing

```python
import pandas as pd
import requests
from bs4 import BeautifulSoup

url = 'https://example.com'  # example page

# Fetch and parse
response = requests.get(url, headers={'User-Agent': 'MyBot/1.0'})
soup = BeautifulSoup(response.text, 'html.parser')  # ✅ built-in, no install

# Finding elements
soup.find('h1')                     # first <h1>
soup.find('p', class_='intro')      # by CSS class
soup.find(id='content')             # by ID
soup.find('a', href=True)           # any <a> with href
soup.find_all('a')                  # ALL <a> tags ✅
soup.find_all('a', href=True)       # all <a> with href
soup.find_all('li', class_='item')  # all <li class="item">
soup.find_all(['h1', 'h2', 'h3'])   # multiple tag types
soup.select('div.content > p')      # CSS selector ✅
soup.select_one('h1.title')         # first match by CSS

# Extracting data
element = soup.find('a', href=True)
element.get_text(strip=True)  # text content ✅
element.get('href')           # attribute value ✅ (safe, returns None)
element['href']               # attribute (raises KeyError if missing)
element.attrs                 # all attributes as dict

# Common patterns
links = [a.get('href') for a in soup.find_all('a', href=True)]
images = [img.get('src') for img in soup.find_all('img', src=True)]
tables = pd.read_html(response.text)[0]  # ✅ pandas parses HTML tables

# Resolve relative URLs
from urllib.parse import urljoin
base_url = response.url  # base for resolving relative links
absolute_urls = [urljoin(base_url, link) for link in links]
```

---

## 12. Ethical Scraping & `robots.txt`

```python
import time
from urllib.robotparser import RobotFileParser

import requests

url = 'https://example.com/page'  # page you want to scrape

rp = RobotFileParser()
rp.set_url('https://example.com/robots.txt')
rp.read()

if rp.can_fetch('*', url):  # ✅ check before fetching
    delay = rp.crawl_delay('*') or 1
    response = requests.get(url)
    time.sleep(delay)  # ✅ respect crawl-delay
else:
    print(f"Disallowed: {url}")
```

**Ethical checklist:**

- ✅ Parse `robots.txt` before scraping
- ✅ Respect disallowed paths
- ✅ Honor `crawl-delay` between requests
- ✅ Identify your bot in the `User-Agent` header
- ✅ Check the Terms of Service
- ❌ Ignore `robots.txt` even for academic purposes (exam answer)
- ❌ Make rapid-fire requests without delays

---

## 13. Tool Selection Guide

| Scenario | Tool |
| :--- | :--- |
| Simple static HTML page | `requests` + `BeautifulSoup` ✅ |
| JavaScript-rendered page | `Selenium` or `Playwright` (see the sketch below) |
| Large-scale scraping | `Scrapy` |
| REST API calls | `requests` ✅ |
| Async/concurrent scraping | `httpx` or `aiohttp` |
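For the JavaScript-rendered case, `requests` only sees the initial HTML, so you need a real browser engine. A minimal Playwright sketch (assumes `pip install playwright` followed by `playwright install chromium`; the URL and selector are placeholders):

```python
from bs4 import BeautifulSoup
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)  # real browser, no visible window
    page = browser.new_page()
    page.goto('https://example.com/spa-page')
    page.wait_for_selector('div.content')       # wait until JS has rendered
    html = page.content()                       # fully rendered HTML
    browser.close()

soup = BeautifulSoup(html, 'html.parser')       # then parse as usual
print(soup.select_one('div.content').get_text(strip=True))
```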
---

## 14. Quick Reference Card

```python
# Complete API call
import os, requests, time
from dotenv import load_dotenv

load_dotenv()
api_key = os.getenv('API_KEY')  # ✅ never hardcode

response = requests.get(
    'https://api.example.com/data',
    params={'key': api_key, 'q': 'query'},
    headers={'User-Agent': 'MyBot/1.0'},
    timeout=30
)
response.raise_for_status()  # ✅ raises on 4xx/5xx
data = response.json()
description = data['weather'][0]['description']  # ✅ list index [0]

# POST to an LLM API
r = requests.post(
    'https://api.openai.com/v1/chat/completions',
    headers={'Authorization': f'Bearer {os.getenv("OPENAI_KEY")}'},
    json={
        'model': 'gpt-4o-mini',
        'messages': [{'role': 'user', 'content': 'Classify: Great product!'}]
    }
)
result = r.json()['choices'][0]['message']['content']
```

---

## 15. Exam Scenario Answers

| Scenario | Answer |
| :--- | :--- |
| Check success range | `200 <= status < 300` ✅ |
| Authenticated but no permission | 403 Forbidden ✅ |
| Missing/wrong API key | 401 Unauthorized ✅ |
| Too many requests | 429 ✅ |
| Cached resource, no new data | 304 Not Modified ✅ |
| POST idempotent? | No, creates new resource each call ✅ |
| Debug slow API | Network tab -> Timing ✅ |
| Check session cookie | Application tab -> Cookies ✅ |
| Filter API calls in DevTools | XHR/Fetch filter in Network tab ✅ |
| Scraping JavaScript pages | Selenium or Playwright ✅ |
| Safe retry methods | GET, PUT, DELETE (idempotent) ✅ |