# Pandas & Data Manipulation

> **Focus**: Data cleaning, efficient aggregation, and correlation analysis.

## 1. Core Concepts

Pandas is a columnar data processing engine: data lives in typed columns, so it is fastest when you operate on whole columns at once. Efficient use means avoiding Python-level loops in favour of vectorized operations.

| Concept | Description |
| :--- | :--- |
| **Series vs DataFrame** | A Series is a 1D column; a DataFrame is a 2D table. |
| **Vectorization** | Applying operations to entire columns at once (C-speed). |
| **Broadcasting** | Matching shapes of different-sized data for operations. |
| **Tidy Data** | Each variable is a column, each observation is a row. |

---

## 2. Key Operations (Cleaning)

| Operation | Command | Use Case |
| :--- | :--- | :--- |
| **Count Missing** | `df.isnull().sum()` | Identify which columns need cleaning. |
| **Remove Missing** | `df.dropna(subset=['col'])` | Target specific columns for cleanup. ✅ |
| **Fill Missing** | `df.fillna(df.median(numeric_only=True))` | Impute missing data (the median is robust to outliers). |
| **Distinct Count** | `df['col'].nunique()` | Count unique categories or IDs. |
| **Aggregation** | `df.groupby('A')['B'].mean()` | Calculate the average of B for each category in A. |

---

## 3. Advanced Aggregation & Inspection

| Method | Syntax | Use Case |
| :--- | :--- | :--- |
| **Value Counts** | `df['col'].value_counts()` | Frequency distribution of categorical data. |
| **Pivot Table** | `df.pivot_table(index='A', columns='B', values='C')` | Reshaping data for matrix analysis. |
| **Correlation** | `df.corr(numeric_only=True)` | Finding linear relationships between numeric columns. |
| **Memory Check** | `df.info(memory_usage='deep')` | Detailed breakdown of RAM consumption. |

---

## 4. Advanced Cleaning Patterns

### Targeted NaN Removal

Don't drop the whole row if only a non-critical column is missing. Use `subset`.

```python
# Only drop rows where 'review_text' is missing (essential for NLP)
df = df.dropna(subset=['review_text'])
```

### Targeted Duplicate Removal

```python
# Keep the LATEST record for each customer
df.sort_values('timestamp').drop_duplicates(subset=['customer_id'], keep='last')
```

### Similarity Analysis (Pivot + Corr)

A standard pattern for recommendation engines or finding similar entities (a fully worked version appears under 🛠️ Reusable Patterns below).

```python
# 1. Create a user-product matrix
matrix = df.pivot_table(index='user_id', columns='product_id', values='rating').fillna(0)

# 2. Compute item-to-item similarity (column-wise correlation)
similarity_matrix = matrix.corr()
```

---

## 5. Common Mistakes & Interview Traps

* **`count()` vs `size()`**: `count()` excludes NaNs; `size()` includes them.
* **`inplace=True`**: Often discouraged in modern Pandas (use assignment, `df = df.op()`, instead): it doesn't reliably save memory and it breaks method chaining.
* **SettingWithCopyWarning**: Occurs when you modify a slice of a DataFrame. Use a single `.loc[mask, 'col'] = val` assignment, or take an explicit `.copy()` of the slice first (see the sketch under 🛠️ Reusable Patterns).
* **Memory Efficiency**: Use `df['col'].astype('category')` for low-cardinality strings to cut memory usage, often dramatically (up to ~90%).
* ❌ **Iterating with `for index, row in df.iterrows()`**: Extremely slow. Use vectorized functions or `.apply()` (timing sketch at the end of this sheet).
* ❌ **Imputing with the mean on skewed data**: The mean is pulled by outliers. Use the **median**.
* ❌ **Not checking data types**: An `object` dtype on numeric data prevents math. Fix: `pd.to_numeric(df['col'])`.

---

## 🛠️ Reusable Patterns

### The "Data Quality Audit"

```python
import pandas as pd

def audit_data(df):
    report = {
        "missing": df.isnull().sum(),         # per-column NaN counts
        "duplicates": df.duplicated().sum(),  # single frame-level count, broadcast to every row
        "types": df.dtypes,
        "unique_counts": df.nunique(),
    }
    return pd.DataFrame(report)
```

### The "High-Speed Frequency Counter"

For very large text lists where Pandas might be overkill:

```python
from collections import Counter

# Get the top 5 most common words in a column;
# dropna/astype guard against NaNs and non-string values
words = " ".join(df['text_col'].dropna().astype(str)).split()
top_words = Counter(words).most_common(5)
```
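### The "Top-N Similar Items" Lookup

The Similarity Analysis pattern from Section 4, fully worked. A minimal, self-contained sketch: the toy ratings frame, the column names (`user_id`, `product_id`, `rating`), and the `top_similar` helper are illustrative assumptions, not a fixed API.

```python
import pandas as pd

# Toy ratings data (illustrative only)
df = pd.DataFrame({
    "user_id":    [1, 1, 2, 2, 3, 3, 3],
    "product_id": ["A", "B", "A", "C", "A", "B", "C"],
    "rating":     [5, 4, 4, 2, 5, 5, 1],
})

# User-product matrix; unrated cells become 0
matrix = df.pivot_table(index="user_id", columns="product_id", values="rating").fillna(0)
similarity = matrix.corr()  # item-to-item Pearson correlation

def top_similar(similarity, product, n=2):
    """Return the n products most correlated with `product`, excluding itself."""
    return similarity[product].drop(product).sort_values(ascending=False).head(n)

print(top_similar(similarity, "A"))
```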
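### The "Memory Shrinker"

A quick way to verify the `category` and downcast savings claimed in Sections 5 and 8. A minimal sketch under stated assumptions: `shrink` is a hypothetical helper, and the 50%-cardinality threshold is an arbitrary heuristic.

```python
import numpy as np
import pandas as pd

def shrink(df):
    """Copy df, converting low-cardinality object columns to 'category'
    and downcasting integer columns to the smallest width that fits."""
    out = df.copy()
    for col in out.columns:
        if out[col].dtype == object and out[col].nunique() < 0.5 * len(out):
            out[col] = out[col].astype("category")
        elif pd.api.types.is_integer_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast="integer")
    return out

df = pd.DataFrame({
    "city": np.random.choice(["NY", "LA", "SF"], size=100_000),  # low cardinality
    "temp": np.random.randint(0, 40, size=100_000),
})

before = df.memory_usage(deep=True).sum()
after = shrink(df).memory_usage(deep=True).sum()
print(f"{before:,} bytes -> {after:,} bytes")
```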
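### The "Safe Slice Assignment"

Makes the SettingWithCopyWarning advice from Section 5 concrete; the `score`/`passed` columns are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"score": [40, 85, 60], "passed": [False, False, False]})

# Risky: chained indexing may write to a temporary copy
# and triggers SettingWithCopyWarning
# df[df["score"] >= 50]["passed"] = True

# Safe: a single .loc call with a boolean mask writes directly to df
df.loc[df["score"] >= 50, "passed"] = True
print(df)
```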
---

## 6. Edge Cases

- **Boolean Masking**: `df[(cond1) & (cond2)]` requires the parentheses, because `&` binds tighter than comparison operators.
- **String Ops on Mixed Types**: `df['col'].str.lower()` fails if there are non-strings. Fix: `df['col'].astype(str).str.lower()`.

---

## 7. Task Mapping

| Task | Approach | Command |
| :--- | :--- | :--- |
| **Count frequency of categories** | Value counts | `df['col'].value_counts()` |
| **Filter by a list of IDs** | Use `.isin()` | `df[df['id'].isin(id_list)]` |
| **Deduplicate by email** | Targeted `drop_duplicates` | `df.drop_duplicates(subset=['email'])` |

---

## 8. Deep Insights (Interview Traps)

- **`count()` vs `len()`**: `count()` excludes `NaN`; `len()` counts every row.
- **Sparse Data**: For large datasets dominated by zeros/NaNs, use a sparse dtype, e.g. `df.astype(pd.SparseDtype('float', 0))`, to save memory.
- **Memory Optimization**: Use `pd.to_numeric(df['col'], downcast='integer')` to force smaller integer widths.
- **Coercing Errors**: `pd.to_datetime(df['date'], errors='coerce')` turns unparseable values into `NaT` instead of crashing (worked sketch below).

---

## 9. Quick Reference Card

```python
# PREVIEW DATA
df.info()       # dtypes and non-null counts
df.describe()   # stats (mean, std, min, max)

# MULTI-AGGREGATION
df.groupby('city')['temp'].agg(['mean', 'max', 'min'])

# STRING SEARCH
df[df['name'].str.contains('Apple', na=False)]

# EXPORT WITHOUT THE INDEX COLUMN
df.to_csv('output.csv', index=False)
```
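---

## 10. Worked Sketch: Coercion Cleanup

Ties the coercion tips from Sections 5 and 8 together. The messy frame and its column names are invented for illustration:

```python
import pandas as pd

raw = pd.DataFrame({
    "amount": ["10", "20.5", "N/A", "30"],
    "date":   ["2024-01-05", "not a date", "2024-02-10", "2024-03-01"],
})

# object -> numeric: unparseable strings become NaN instead of raising
raw["amount"] = pd.to_numeric(raw["amount"], errors="coerce")

# object -> datetime: invalid entries become NaT instead of crashing
raw["date"] = pd.to_datetime(raw["date"], errors="coerce")

print(raw.dtypes)
print(raw.isnull().sum())  # audit what coercion turned into NaN/NaT
```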
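---

## 11. Worked Sketch: Vectorization vs `iterrows()`

The timing demo promised in Section 5. Exact numbers vary by machine; the point is the order-of-magnitude gap.

```python
import time

import numpy as np
import pandas as pd

df = pd.DataFrame({"a": np.random.rand(100_000), "b": np.random.rand(100_000)})

# Slow: a Python-level loop over rows
t0 = time.perf_counter()
total = 0.0
for _, row in df.iterrows():
    total += row["a"] * row["b"]
loop_s = time.perf_counter() - t0

# Fast: one vectorized expression over whole columns
t0 = time.perf_counter()
total_vec = (df["a"] * df["b"]).sum()
vec_s = time.perf_counter() - t0

print(f"loop: {loop_s:.2f}s  vectorized: {vec_s:.4f}s  (same result: {np.isclose(total, total_vec)})")
```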