# Web APIs & Scraping

## 1. HTTP Status Codes (Exam Critical)

### Complete Status Code Reference

| Code | Meaning | Notes |
| :--- | :--- | :--- |
| **200** | OK | Standard success ✅ |
| **201** | Created | After POST (resource created) ✅ |
| **204** | No Content | Success, no response body (DELETE) ✅ |
| **206** | Partial Content | File download resume |
| **301** | Moved Permanently | Update your bookmarks |
| **302** | Found | Temporary redirect |
| **304** | Not Modified | Cached version still valid, empty body ✅ |
| **400** | Bad Request | Malformed syntax or invalid parameters |
| **401** | Unauthorized | Authentication required/failed (no/wrong API key) ✅ |
| **403** | Forbidden | Authenticated but no permission ✅ |
| **404** | Not Found | Resource does not exist |
| **408** | Request Timeout | Server timed out waiting for the request |
| **409** | Conflict | Duplicate resource |
| **413** | Payload Too Large | Request body exceeds server limit |
| **422** | Unprocessable Entity | Syntactically correct but semantically wrong |
| **429** | Too Many Requests | Rate limit exceeded ✅ |
| **500** | Internal Server Error | Generic server-side error ✅ |
| **502** | Bad Gateway | Invalid response from upstream server |
| **503** | Service Unavailable | Server temporarily down ✅ |
| **504** | Gateway Timeout | Upstream server did not respond in time |

### Range Checks (Exam Critical)

```python
200 <= status < 300   # Success ✅ (correct range check, exam answer)
300 <= status < 400   # Redirect ✅
400 <= status < 500   # Client Error ✅
500 <= status < 600   # Server Error ✅

# ❌ Wrong:
status == 200         # misses 201, 204 etc.
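
# A tiny classifier built from the range checks above
# (the function name `status_class` is illustrative, not a library API):
def status_class(status):
    for lo, hi, label in [(200, 300, 'success'), (300, 400, 'redirect'),
                          (400, 500, 'client error'), (500, 600, 'server error')]:
        if lo <= status < hi:
            return label
    return 'unknown'

status_class(204)     # 'success' - not just 200!
status_class(429)     # 'client error'

# ❌ Also wrong: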
status < 400          # includes redirects
```

### Key Distinctions

| Pair | Difference |
| :--- | :--- |
| 401 vs 403 | 401 = not authenticated ("Who are you?"); 403 = authenticated but no permission ("You can't come in") |
| 401 vs 429 | 401 = wrong/missing key; 429 = valid key but too many requests |
| 404 vs 410 | 404 = not found (may exist elsewhere); 410 = permanently deleted |
| 301 vs 302 | 301 = permanent redirect; 302 = temporary |
| 500 vs 503 | 500 = code error; 503 = server temporarily unavailable |
| 502 vs 504 | 502 = bad response FROM upstream; 504 = upstream TIMED OUT |

---

## 2. HTTP Methods (Exam Critical)

| Method | Safe? | Idempotent? | Use Case |
| :--- | :--- | :--- | :--- |
| **GET** | ✅ | ✅ | Retrieve data, no side effects |
| **HEAD** | ✅ | ✅ | Like GET but no response body |
| **OPTIONS** | ✅ | ✅ | Get allowed methods for a resource |
| **POST** | ❌ | ❌ | Create resource. Multiple calls = multiple resources ✅ |
| **PUT** | ❌ | ✅ | Replace entire resource. Multiple calls = same result ✅ |
| **PATCH** | ❌ | ❌ | Partial update |
| **DELETE** | ❌ | ✅ | Remove. Multiple calls = stays deleted ✅ |

- **Safe** = does NOT modify server state
- **Idempotent** = same result if called multiple times
- ✅ GET is safe (exam answer)
- ✅ PUT is idempotent (exam answer)
- ❌ POST is NOT idempotent (creates a new resource each time)

---

## 3. `requests` Library - Complete Reference

```python
import requests
import os

# GET request
r = requests.get(
    'https://api.example.com/data',
    params={'q': 'Delhi', 'appid': os.getenv('API_KEY')},
    headers={'Authorization': f'Bearer {token}', 'User-Agent': 'MyApp/1.0'},
    timeout=30,             # seconds before timeout ✅
    verify=True,            # SSL certificate verification
    auth=('user', 'pass'),  # basic auth
)

# POST request (use json OR data, not both - requests ignores json if data is given)
r = requests.post(
    url,
    json={'key': 'value'},  # auto-sets Content-Type: application/json ✅
    headers={'Authorization': f'Bearer {token}'},
)
r = requests.post(url, data={'key': 'value'})  # form-encoded instead

# PUT / PATCH / DELETE
r = requests.put(url, json={'key': 'value'})
r = requests.patch(url, json={'email': 'new@example.com'})
r = requests.delete(url)

# Response object
r.status_code          # 200, 404, etc.
r.ok                   # True if status_code < 400 (i.e., no 4xx/5xx)
r.raise_for_status()   # raises HTTPError for 4xx/5xx ✅
r.text                 # response body as str
r.json()               # parse JSON -> Python dict ✅
r.content              # response body as bytes
r.headers              # response headers (case-insensitive dict)
r.url                  # final URL (after redirects)
r.history              # list of redirect responses
r.elapsed              # time taken (timedelta)
r.cookies              # response cookies
```

---

## 4. Error Handling

```python
import requests
from requests.exceptions import RequestException, ConnectionError, Timeout, HTTPError

def safe_api_call(url, params):
    try:
        response = requests.get(url, params=params, timeout=30)
        response.raise_for_status()  # ✅ raise for 4xx/5xx
        return response.json()
    except Timeout:
        print("Request timed out")
        return None
    except ConnectionError:
        print("Connection failed - check network")
        return None
    except HTTPError as e:
        code = e.response.status_code
        if code == 401:
            print("Check API key")
        elif code == 403:
            print("No permission to access this resource")
        elif code == 429:
            print("Rate limit exceeded, slow down")
        return None
    except RequestException as e:
        print(f"Request failed: {e}")
        return None
```

---

## 5. Sessions - Reusing Connections

```python
import requests

# ✅ Use a session for multiple requests to the same API
# More efficient - reuses the TCP connection, shares headers/cookies
session = requests.Session()
session.headers.update({
    'Authorization': f'Bearer {token}',
    'Content-Type': 'application/json',
    'User-Agent': 'TDS-Bot/1.0',
})
r1 = session.get(url1)
r2 = session.get(url2)

# Context manager (auto-closes)
with requests.Session() as session:
    session.headers.update({'Authorization': f'Bearer {token}'})
    r = session.get(url)
```

---

## 6. Rate Limiting & Retry Patterns (Exam Critical)

```python
import time

# Simple sleep between requests ✅ (exam answer)
for city in cities:
    response = requests.get(url, params={'q': city})
    if response.status_code == 200:
        results.append(response.json())
    elif response.status_code == 429:
        wait = int(response.headers.get('Retry-After', 60))
        time.sleep(wait)  # ✅ use the Retry-After header if available
    time.sleep(1)  # ✅ 1-2 sec between all requests

# Exponential backoff
def fetch_with_retry(url, params, max_retries=3):
    for attempt in range(max_retries):
        response = requests.get(url, params=params)
        if response.status_code == 200:
            return response.json()
        elif response.status_code == 429:
            wait_time = 2 ** attempt  # 1, 2, 4 seconds
            print(f"Rate limited. Waiting {wait_time}s...")
            time.sleep(wait_time)
        elif response.status_code >= 500:
            time.sleep(2 ** attempt)  # retry server errors
        else:
            raise Exception(f"Unexpected status: {response.status_code}")
    raise Exception("Max retries exceeded")
```

- ✅ Sleep 1-2 seconds (`time.sleep`) between requests (exam answer)
- ❌ Making all requests simultaneously -> triggers 429
- ❌ Ignoring rate limits -> the API blocks your key

---

## 7. API Security - Storing Keys

```python
import os
from dotenv import load_dotenv

load_dotenv()  # loads the .env file

# ✅ Read from an environment variable
api_key = os.getenv('WEATHER_API_KEY')
api_key = os.environ.get('WEATHER_API_KEY', 'default')

# .env file (NEVER commit to git!)
# WEATHER_API_KEY=your_key_here
# ANTHROPIC_API_KEY=sk-ant-...

# .gitignore - always add:
# .env
# .env.*
```

- ❌ Hardcoded in the script: `api_key = "abc123"` (visible in git history!)
- ❌ In code comments or the README
- ❌ Printed to the console (appears in logs)

---

## 8. JSON Response Parsing - Nested Fields (Exam Critical)

```python
# Typical nested API response
response_json = {
    "name": "Delhi",
    "main": {"temp": 298.15, "humidity": 60},
    "weather": [  # <- this is a LIST
        {"description": "clear sky", "icon": "01d"}
    ],
    "wind": {"speed": 3.5}
}

# ✅ Correct access
city = response_json['name']
temp = response_json['main']['temp']
humidity = response_json['main']['humidity']
desc = response_json['weather'][0]['description']  # [0] is required! ✅
wind_speed = response_json['wind']['speed']

# ❌ Wrong (would raise, so commented out):
# response_json['weather']['description']  # weather is a list, not a dict
# response_json['main']['weather']         # wrong key path

# Safe access (no KeyError)
temp = response_json.get('main', {}).get('temp', 0)

# Convert Kelvin to Celsius
temp_celsius = response_json['main']['temp'] - 273.15
```

---

## 9. API Pagination

```python
# Page-number pagination
def get_all_results(base_url, params):
    all_results = []
    page = 1
    while True:
        params['page'] = page
        params['per_page'] = 100
        response = requests.get(base_url, params=params)
        data = response.json()
        results = data.get('results', [])
        if not results:
            break  # no more pages
        all_results.extend(results)
        page += 1
        time.sleep(0.5)  # rate limiting
    return all_results

# Cursor pagination
def get_all_cursor(url, params):
    all_results = []
    cursor = None
    while True:
        if cursor:
            params['cursor'] = cursor
        response = requests.get(url, params=params)
        data = response.json()
        all_results.extend(data['items'])
        cursor = data.get('next_cursor')
        if not cursor:
            break
    return all_results

# Link-header pagination (GitHub style)
def get_all_github(url, headers):
    all_results = []
    while url:
        response = requests.get(url, headers=headers)
        all_results.extend(response.json())
        link = response.headers.get('Link', '')
        url = None
        for part in link.split(','):
            if 'rel="next"' in part:
                url = part.split(';')[0].strip().strip('<>')
    return all_results
```

---

## 10. Chrome DevTools for API Debugging

```
Network Tab:
- See ALL HTTP requests made by the page
- Filter: XHR/Fetch -> API calls specifically ✅
- Headers tab -> request/response headers, status code, authorization header
- Payload tab -> request body (POST data)
- Response tab -> actual JSON response body ✅
- Timing tab -> DNS, connection, TTFB, download time

Application Tab:
- Cookies -> session cookies, auth tokens ✅
- LocalStorage / SessionStorage

Console Tab:
- JavaScript errors
- Run snippets to inspect/test

Throttling:
- Simulate a slow network (3G, offline) to debug timing issues
```

### Debugging Common Problems

```
401 in Network tab -> Headers -> check the Authorization header is present and correctly formatted
429 in Network tab -> add time.sleep() between requests
Wrong data -> Network -> Response tab -> see the exact JSON returned
Slow API -> Network -> Timing tab -> identify the bottleneck (TTFB, DNS, etc.)
Session not working -> Application -> Cookies -> check the cookie is being set
```

---

## 11. BeautifulSoup - HTML Parsing

```python
import requests
from bs4 import BeautifulSoup

# Fetch and parse
response = requests.get(url, headers={'User-Agent': 'MyBot/1.0'})
soup = BeautifulSoup(response.text, 'html.parser')  # ✅ built-in, no install

# Finding elements
soup.find('h1')  # first