# Web APIs & Scraping

## 1. HTTP Status Codes (Exam Critical)

### Complete Status Code Reference

| Code | Meaning | Notes |
| :--- | :--- | :--- |
| **200** | OK | Standard success ✅ |
| **201** | Created | After POST (resource created) ✅ |
| **204** | No Content | Success, no response body (DELETE) ✅ |
| **206** | Partial Content | File download resume |
| **301** | Moved Permanently | Update your bookmarks |
| **302** | Found | Temporary redirect |
| **304** | Not Modified | Cached version still valid, empty body ✅ |
| **400** | Bad Request | Malformed syntax or invalid parameters |
| **401** | Unauthorized | Authentication required/failed (no/wrong API key) ✅ |
| **403** | Forbidden | Authenticated but no permission ✅ |
| **404** | Not Found | Resource does not exist |
| **408** | Request Timeout | Server timed out waiting for the request |
| **409** | Conflict | Request conflicts with current resource state (e.g., duplicate) |
| **410** | Gone | Resource permanently deleted |
| **413** | Payload Too Large | Request body exceeds server limit |
| **422** | Unprocessable Entity | Syntactically correct but semantically wrong |
| **429** | Too Many Requests | Rate limit exceeded ✅ |
| **500** | Internal Server Error | Generic server-side error ✅ |
| **502** | Bad Gateway | Invalid response from upstream server |
| **503** | Service Unavailable | Server temporarily down ✅ |
| **504** | Gateway Timeout | Upstream server did not respond in time |

### Range Checks (Exam Critical)

```python
200 <= status < 300  # Success ✅ (correct range check, exam answer)
300 <= status < 400  # Redirect ✅
400 <= status < 500  # Client Error ✅
500 <= status < 600  # Server Error ✅

# ❌ Wrong:
status == 200  # misses 201, 204, etc.
status < 400   # includes redirects
```

### Key Distinctions

| Pair | Difference |
| :--- | :--- |
| 401 vs 403 | 401 = not authenticated (Who are you?); 403 = authenticated but no permission (You can't come in) |
| 401 vs 429 | 401 = wrong/missing key; 429 = valid key but too many requests |
| 404 vs 410 | 404 = not found (may exist elsewhere); 410 = permanently deleted |
| 301 vs 302 | 301 = permanent redirect; 302 = temporary |
| 500 vs 503 | 500 = code error; 503 = server temporarily unavailable |
| 502 vs 504 | 502 = bad response FROM upstream; 504 = upstream TIMED OUT |

---

## 2. HTTP Methods (Exam Critical)

| Method | Safe? | Idempotent? | Use Case |
| :--- | :--- | :--- | :--- |
| **GET** | ✅ | ✅ | Retrieve data, no side effects |
| **HEAD** | ✅ | ✅ | Like GET but no response body |
| **OPTIONS** | ✅ | ✅ | Get allowed methods for a resource |
| **POST** | ❌ | ❌ | Create resource. Multiple calls = multiple resources ✅ |
| **PUT** | ❌ | ✅ | Replace entire resource. Multiple calls = same result ✅ |
| **PATCH** | ❌ | ❌ | Partial update |
| **DELETE** | ❌ | ✅ | Remove. Multiple calls = stays deleted ✅ |

- **Safe** = does NOT modify server state
- **Idempotent** = same result if called multiple times
- ✅ GET is safe (exam answer)
- ✅ PUT is idempotent (exam answer)
- ❌ POST is NOT idempotent (creates a new resource each time) — see the sketch below
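The range checks and the idempotency column combine naturally in retry logic: only idempotent methods are safe to resend blindly. A minimal sketch (the helper names `status_class` and `can_retry` are illustrative, not from any library):

```python
# Classify a status code using the range checks above
def status_class(status: int) -> str:
    if 200 <= status < 300:
        return "success"
    if 300 <= status < 400:
        return "redirect"
    if 400 <= status < 500:
        return "client error"
    if 500 <= status < 600:
        return "server error"
    return "unknown"

# Only idempotent methods can be retried without side effects
IDEMPOTENT = {"GET", "HEAD", "OPTIONS", "PUT", "DELETE"}

def can_retry(method: str) -> bool:
    return method.upper() in IDEMPOTENT

assert status_class(201) == "success"  # status == 200 would miss this
assert not can_retry("POST")           # POST creates a new resource each call
```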
---

## 3. `requests` Library - Complete Reference

```python
import os
import requests

url = 'https://api.example.com/items'  # example endpoint
token = os.getenv('API_TOKEN')         # auth token from the environment

# GET request
r = requests.get(
    'https://api.example.com/data',
    params={'q': 'Delhi', 'appid': os.getenv('API_KEY')},
    headers={'Authorization': f'Bearer {token}', 'User-Agent': 'MyApp/1.0'},
    timeout=30,            # seconds before timeout ✅
    verify=True,           # SSL certificate verification
    auth=('user', 'pass')  # basic auth
)

# POST request - use json= OR data=, not both
r = requests.post(url,
    json={'key': 'value'},  # auto-sets Content-Type: application/json ✅
    headers={'Authorization': f'Bearer {token}'}
)
r = requests.post(url, data={'key': 'value'})  # form-encoded instead

# PUT / PATCH / DELETE
r = requests.put(url, json={'key': 'value'})
r = requests.patch(url, json={'email': 'new@example.com'})
r = requests.delete(url)

# Response object
r.status_code         # 200, 404, etc.
r.ok                  # True if status < 400 (no 4xx/5xx)
r.raise_for_status()  # raises exception for 4xx/5xx ✅
r.text                # response as string
r.json()              # parse JSON -> Python dict ✅
r.content             # response as bytes
r.headers             # response headers dict
r.url                 # final URL (after redirects)
r.history             # list of redirect responses
r.elapsed             # time taken
r.cookies             # response cookies
```

---

## 4. Error Handling

```python
import requests
from requests.exceptions import ConnectionError, HTTPError, RequestException, Timeout

def safe_api_call(url, params):
    try:
        response = requests.get(url, params=params, timeout=30)
        response.raise_for_status()  # ✅ raise for 4xx/5xx
        return response.json()
    except Timeout:
        print("Request timed out")
        return None
    except ConnectionError:
        print("Connection failed - check network")
        return None
    except HTTPError as e:
        code = e.response.status_code
        if code == 401:
            print("Check API key")
        elif code == 403:
            print("No permission to access this resource")
        elif code == 429:
            print("Rate limit exceeded, slow down")
        return None
    except RequestException as e:  # catch-all for any other requests error
        print(f"Request failed: {e}")
        return None
```

---

## 5. Sessions - Reusing Connections

```python
import requests

# token, url1, url2: your credentials/endpoints

# ✅ Use a session for multiple requests to the same API
# More efficient - reuses the TCP connection, shares headers/cookies
session = requests.Session()
session.headers.update({
    'Authorization': f'Bearer {token}',
    'Content-Type': 'application/json',
    'User-Agent': 'TDS-Bot/1.0'
})
r1 = session.get(url1)
r2 = session.get(url2)

# Context manager (auto-closes)
with requests.Session() as session:
    session.headers.update({'Authorization': f'Bearer {token}'})
    r = session.get(url)
```

---

## 6. Rate Limiting & Retry Patterns (Exam Critical)

```python
import time
import requests

url = 'https://api.example.com/weather'  # example endpoint
cities = ['Delhi', 'Mumbai', 'Chennai']
results = []

# Simple sleep between requests ✅ (exam answer)
for city in cities:
    response = requests.get(url, params={'q': city})
    if response.status_code == 200:
        results.append(response.json())
    elif response.status_code == 429:
        wait = int(response.headers.get('Retry-After', 60))
        time.sleep(wait)  # ✅ use Retry-After header if available
    time.sleep(1)  # ✅ 1-2 sec between all requests

# Exponential backoff
def fetch_with_retry(url, params, max_retries=3):
    for attempt in range(max_retries):
        response = requests.get(url, params=params)
        if response.status_code == 200:
            return response.json()
        elif response.status_code == 429:
            wait_time = 2 ** attempt  # 1, 2, 4 seconds
            print(f"Rate limited. Waiting {wait_time}s...")
            time.sleep(wait_time)
        elif response.status_code >= 500:
            time.sleep(2 ** attempt)  # retry server errors
        else:
            raise Exception(f"Unexpected status: {response.status_code}")
    raise Exception("Max retries exceeded")
```

- ✅ Use `time.sleep(1-2)` between requests (exam answer); a session-level alternative is sketched below
- ❌ Make all requests simultaneously -> triggers 429
- ❌ Ignore rate limits -> API blocks your key
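The manual backoff loop above can also be configured once on a `Session`, letting `requests` retry 429/5xx responses automatically. A minimal sketch using urllib3's `Retry` (the `allowed_methods` parameter assumes urllib3 >= 1.26; older versions call it `method_whitelist`):

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

retry = Retry(
    total=3,                                     # at most 3 retries
    backoff_factor=1,                            # exponential sleep between attempts
    status_forcelist=[429, 500, 502, 503, 504],  # retry these statuses
    allowed_methods=["GET", "PUT", "DELETE"],    # idempotent methods only
)

session = requests.Session()
session.mount('https://', HTTPAdapter(max_retries=retry))

r = session.get('https://api.example.com/data', timeout=30)  # retried transparently
```

`Retry` also honors the `Retry-After` header on 429/503 by default, matching the manual pattern above.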
---

## 7. API Security - Storing Keys

```python
import os
from dotenv import load_dotenv

load_dotenv()  # loads .env file

# ✅ Read from an environment variable
api_key = os.getenv('WEATHER_API_KEY')
api_key = os.environ.get('WEATHER_API_KEY', 'default')

# .env file (NEVER commit to git!)
# WEATHER_API_KEY=your_key_here
# ANTHROPIC_API_KEY=sk-ant-...

# .gitignore - always add:
# .env
# .env.*
```

- ❌ Hardcoded in the script: `api_key = "abc123"` (visible in git history!)
- ❌ In code comments or README
- ❌ Printed to console (appears in logs)

---

## 8. JSON Response Parsing - Nested Fields (Exam Critical)

```python
# Typical nested API response
response_json = {
    "name": "Delhi",
    "main": {"temp": 298.15, "humidity": 60},
    "weather": [  # <- this is a LIST
        {"description": "clear sky", "icon": "01d"}
    ],
    "wind": {"speed": 3.5}
}

# ✅ Correct access
city = response_json['name']
temp = response_json['main']['temp']
humidity = response_json['main']['humidity']
desc = response_json['weather'][0]['description']  # [0] is required! ✅
wind_speed = response_json['wind']['speed']

# ❌ Wrong:
# response_json['weather']['description']  # weather is a list, not a dict
# response_json['main']['weather']         # wrong key path

# Safe access (no KeyError)
temp = response_json.get('main', {}).get('temp', 0)

# Convert Kelvin to Celsius
temp_celsius = response_json['main']['temp'] - 273.15
```
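For deeply nested responses, the chained `.get()` pattern can be wrapped in a small helper. A sketch (`deep_get` is a hypothetical utility, not part of any library) that also accepts list indices, so paths like `'weather', 0, 'description'` work:

```python
def deep_get(obj, *path, default=None):
    """Walk dict keys and list indices, returning default on any miss."""
    for step in path:
        if isinstance(obj, dict):
            obj = obj.get(step)
        elif isinstance(obj, list) and isinstance(step, int) and -len(obj) <= step < len(obj):
            obj = obj[step]
        else:
            return default
        if obj is None:
            return default
    return obj

deep_get(response_json, 'weather', 0, 'description')    # 'clear sky'
deep_get(response_json, 'weather', 3, 'description')    # None, no IndexError
deep_get(response_json, 'main', 'pressure', default=0)  # 0, no KeyError
```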
---

## 9. API Pagination

```python
import time
import requests

# Page-number pagination
def get_all_results(base_url, params):
    all_results = []
    page = 1
    while True:
        params['page'] = page
        params['per_page'] = 100
        response = requests.get(base_url, params=params)
        data = response.json()
        results = data.get('results', [])
        if not results:
            break  # no more pages
        all_results.extend(results)
        page += 1
        time.sleep(0.5)  # rate limiting
    return all_results

# Cursor pagination
def get_all_cursor(url, params):
    all_results = []
    cursor = None
    while True:
        if cursor:
            params['cursor'] = cursor
        response = requests.get(url, params=params)
        data = response.json()
        all_results.extend(data['items'])
        cursor = data.get('next_cursor')
        if not cursor:
            break
    return all_results

# Link-header pagination (GitHub style)
def get_all_github(url, headers):
    all_results = []
    while url:
        response = requests.get(url, headers=headers)
        all_results.extend(response.json())
        link = response.headers.get('Link', '')
        url = None
        for part in link.split(','):
            if 'rel="next"' in part:
                url = part.split(';')[0].strip().strip('<>')
    return all_results
```

---

## 10. Chrome DevTools for API Debugging

```
Network Tab:
- See ALL HTTP requests made by the page
- Filter: XHR/Fetch -> API calls specifically ✅
- Headers tab -> request/response headers, status code, authorization header
- Payload tab -> request body (POST data)
- Response tab -> actual JSON response body ✅
- Timing tab -> DNS, connection, TTFB, download time

Application Tab:
- Cookies -> session cookies, auth tokens ✅
- LocalStorage / SessionStorage

Console Tab:
- JavaScript errors
- Run snippets to inspect/test

Throttling:
- Simulate a slow network (3G, offline) to debug timing issues
```

### Debugging Common Problems

```
401 in Network tab -> Headers -> check Authorization header is present and correctly formatted
429 in Network tab -> add time.sleep() between requests
Wrong data -> Network -> Response tab -> see the exact JSON returned
Slow API -> Network -> Timing tab -> identify the bottleneck (TTFB, DNS, etc.)
Session not working -> Application -> Cookies -> check the cookie is being set
```
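Once the Network tab has revealed an endpoint, you can replicate the call in Python by copying the pieces DevTools shows you: the URL and query string from the Headers tab, any auth headers, and the body from the Payload tab (DevTools' "Copy as cURL" gives you all of these at once). A sketch where the endpoint and header values are placeholders for whatever you observe:

```python
import requests

# Values copied from DevTools -> Network -> Headers / Payload
r = requests.get(
    'https://example.com/api/v2/search',   # Request URL
    params={'q': 'delhi', 'limit': '20'},  # query string parameters
    headers={
        'Authorization': 'Bearer <token seen in the request headers>',
        'User-Agent': 'MyApp/1.0',
    },
    timeout=30,
)
print(r.status_code, r.elapsed)  # compare with the status/timing DevTools showed
print(r.json())                  # should match the Response tab
```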

---

## 11. BeautifulSoup - HTML Parsing

```python
import pandas as pd
import requests
from bs4 import BeautifulSoup

url = 'https://example.com'  # example page

# Fetch and parse
response = requests.get(url, headers={'User-Agent': 'MyBot/1.0'})
soup = BeautifulSoup(response.text, 'html.parser')  # ✅ built-in, no install

# Finding elements
soup.find('h1')                     # first <h1>
soup.find('p', class_='intro')      # by CSS class
soup.find(id='content')             # by ID
soup.find('a', href=True)           # any <a> with href
soup.find_all('a')                  # ALL <a> tags ✅
soup.find_all('a', href=True)       # all <a> with href
soup.find_all('li', class_='item')  # all <li class="item">
soup.find_all(['h1', 'h2', 'h3'])   # multiple tag types
soup.select('div.content > p')      # CSS selector ✅
soup.select_one('h1.title')         # first match by CSS

# Extracting data
element = soup.find('a', href=True)
element.get_text(strip=True)  # text content ✅
element.get('href')           # attribute value ✅ (safe, returns None)
element['href']               # attribute (raises KeyError if missing)
element.attrs                 # all attributes as dict

# Common patterns
links = [a.get('href') for a in soup.find_all('a', href=True)]
images = [img.get('src') for img in soup.find_all('img', src=True)]
tables = pd.read_html(response.text)[0]  # ✅ pandas parses HTML tables

# Resolve relative URLs
from urllib.parse import urljoin
base_url = response.url  # base for resolving relative links
absolute_urls = [urljoin(base_url, link) for link in links]
```

---

## 12. Ethical Scraping & `robots.txt`

```python
import time
from urllib.robotparser import RobotFileParser

import requests

url = 'https://example.com/page'  # page you want to scrape

rp = RobotFileParser()
rp.set_url('https://example.com/robots.txt')
rp.read()

if rp.can_fetch('*', url):  # ✅ check before fetching
    delay = rp.crawl_delay('*') or 1
    response = requests.get(url)
    time.sleep(delay)  # ✅ respect crawl-delay
else:
    print(f"Disallowed: {url}")
```

**Ethical checklist:**

- ✅ Parse `robots.txt` before scraping
- ✅ Respect disallowed paths
- ✅ Honor `crawl-delay` between requests
- ✅ Identify your bot in the `User-Agent` header
- ✅ Check the Terms of Service
- ❌ Ignore `robots.txt` even for academic purposes (exam answer)
- ❌ Make rapid-fire requests without delays

---

## 13. Tool Selection Guide

| Scenario | Tool |
| :--- | :--- |
| Simple static HTML page | `requests` + `BeautifulSoup` ✅ |
| JavaScript-rendered page | `Selenium` or `Playwright` (see the sketch below) |
| Large-scale scraping | `Scrapy` |
| REST API calls | `requests` ✅ |
| Async/concurrent scraping | `httpx` or `aiohttp` |
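For the JavaScript-rendered case, `requests` only sees the initial HTML, so you need a real browser engine. A minimal Playwright sketch (assumes `pip install playwright` followed by `playwright install chromium`; the URL and selector are placeholders):

```python
from bs4 import BeautifulSoup
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)  # real browser, no visible window
    page = browser.new_page()
    page.goto('https://example.com/spa-page')
    page.wait_for_selector('div.content')       # wait until JS has rendered
    html = page.content()                       # fully rendered HTML
    browser.close()

soup = BeautifulSoup(html, 'html.parser')       # then parse as usual
print(soup.select_one('div.content').get_text(strip=True))
```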
---

## 14. Quick Reference Card

```python
# Complete API call
import os, requests, time
from dotenv import load_dotenv

load_dotenv()
api_key = os.getenv('API_KEY')  # ✅ never hardcode

response = requests.get(
    'https://api.example.com/data',
    params={'key': api_key, 'q': 'query'},
    headers={'User-Agent': 'MyBot/1.0'},
    timeout=30
)
response.raise_for_status()  # ✅ raises on 4xx/5xx
data = response.json()
description = data['weather'][0]['description']  # ✅ list index [0]

# POST to an LLM API
r = requests.post(
    'https://api.openai.com/v1/chat/completions',
    headers={'Authorization': f'Bearer {os.getenv("OPENAI_KEY")}'},
    json={
        'model': 'gpt-4o-mini',
        'messages': [{'role': 'user', 'content': 'Classify: Great product!'}]
    }
)
result = r.json()['choices'][0]['message']['content']
```

---

## 15. Exam Scenario Answers

| Scenario | Answer |
| :--- | :--- |
| Check success range | `200 <= status < 300` ✅ |
| Authenticated but no permission | 403 Forbidden ✅ |
| Missing/wrong API key | 401 Unauthorized ✅ |
| Too many requests | 429 ✅ |
| Cached resource, no new data | 304 Not Modified ✅ |
| POST idempotent? | No, creates new resource each call ✅ |
| Debug slow API | Network tab -> Timing ✅ |
| Check session cookie | Application tab -> Cookies ✅ |
| Filter API calls in DevTools | XHR/Fetch filter in Network tab ✅ |
| Scraping JavaScript pages | Selenium or Playwright ✅ |
| Safe retry methods | GET, PUT, DELETE (idempotent) ✅ |