# Pandas & Data Manipulation

> **Focus**: Data cleaning, efficient aggregation, and correlation analysis.

## 1. Core Concepts

Pandas is a columnar data processing engine: data lives in typed columns, so it is fastest when you operate on whole columns at once. Efficient use means avoiding Python-level loops in favour of vectorized operations.

| Concept | Description |
| :--- | :--- |
| **Series vs DataFrame** | A Series is a 1D column; a DataFrame is a 2D table. |
| **Vectorization** | Applying operations to entire columns at once (C-speed). |
| **Broadcasting** | Matching shapes of different-sized data for operations. |
| **Tidy Data** | Each variable is a column, each observation is a row. |

---

## 2. Key Operations (Cleaning)

| Operation | Command | Use Case |
| :--- | :--- | :--- |
| **Count Missing** | `df.isnull().sum()` | Identify which columns need cleaning. |
| **Remove Missing** | `df.dropna(subset=['col'])` | Target specific columns for cleanup. ✅ |
| **Fill Missing** | `df.fillna(df.median(numeric_only=True))` | Impute missing data (the median is robust to outliers). |
| **Distinct Count** | `df['col'].nunique()` | Count unique categories or IDs. |
| **Aggregation** | `df.groupby('A')['B'].mean()` | Calculate the average of B for each category in A. |

---

## 3. Advanced Aggregation & Inspection

| Method | Syntax | Use Case |
| :--- | :--- | :--- |
| **Value Counts** | `df['col'].value_counts()` | Frequency distribution of categorical data. |
| **Pivot Table** | `df.pivot_table(index='A', columns='B', values='C')` | Reshaping data for matrix analysis. |
| **Correlation** | `df.corr(numeric_only=True)` | Finding linear relationships between numeric columns. |
| **Memory Check** | `df.info(memory_usage='deep')` | Detailed breakdown of RAM consumption. |

---

## 4. Advanced Cleaning Patterns

### Targeted NaN Removal

Don't drop the whole row if only a non-critical column is missing. Use `subset`.

```python
# Only drop rows where 'review_text' is missing (essential for NLP)
df = df.dropna(subset=['review_text'])
```

### Targeted Duplicate Removal

```python
# Keep the LATEST record for each customer
df.sort_values('timestamp').drop_duplicates(subset=['customer_id'], keep='last')
```

### Similarity Analysis (Pivot + Corr)

A standard pattern for recommendation engines or finding similar entities (a fully worked version appears under 🛠️ Reusable Patterns below).

```python
# 1. Create a user-product matrix
matrix = df.pivot_table(index='user_id', columns='product_id', values='rating').fillna(0)

# 2. Compute item-to-item similarity (column-wise correlation)
similarity_matrix = matrix.corr()
```

---

## 5. Common Mistakes & Interview Traps

* **`count()` vs `size()`**: `count()` excludes NaNs; `size()` includes them.
* **`inplace=True`**: Often discouraged in modern Pandas (use assignment, `df = df.op()`, instead): it doesn't reliably save memory and it breaks method chaining.
* **SettingWithCopyWarning**: Occurs when you modify a slice of a DataFrame. Use a single `.loc[mask, 'col'] = val` assignment, or take an explicit `.copy()` of the slice first (see the sketch under 🛠️ Reusable Patterns).
* **Memory Efficiency**: Use `df['col'].astype('category')` for low-cardinality strings to cut memory usage, often dramatically (up to ~90%).
* ❌ **Iterating with `for index, row in df.iterrows()`**: Extremely slow. Use vectorized functions or `.apply()` (timing sketch at the end of this sheet).
* ❌ **Imputing with the mean on skewed data**: The mean is pulled by outliers. Use the **median**.
* ❌ **Not checking data types**: An `object` dtype on numeric data prevents math. Fix: `pd.to_numeric(df['col'])`.

---

## 🛠️ Reusable Patterns

### The "Data Quality Audit"

```python
import pandas as pd

def audit_data(df):
    report = {
        "missing": df.isnull().sum(),         # per-column NaN counts
        "duplicates": df.duplicated().sum(),  # single frame-level count, broadcast to every row
        "types": df.dtypes,
        "unique_counts": df.nunique(),
    }
    return pd.DataFrame(report)
```

### The "High-Speed Frequency Counter"

For very large text lists where Pandas might be overkill:

```python
from collections import Counter

# Get the top 5 most common words in a column;
# dropna/astype guard against NaNs and non-string values
words = " ".join(df['text_col'].dropna().astype(str)).split()
top_words = Counter(words).most_common(5)
```
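### The "Top-N Similar Items" Lookup

The Similarity Analysis pattern from Section 4, fully worked. A minimal, self-contained sketch: the toy ratings frame, the column names (`user_id`, `product_id`, `rating`), and the `top_similar` helper are illustrative assumptions, not a fixed API.

```python
import pandas as pd

# Toy ratings data (illustrative only)
df = pd.DataFrame({
    "user_id":    [1, 1, 2, 2, 3, 3, 3],
    "product_id": ["A", "B", "A", "C", "A", "B", "C"],
    "rating":     [5, 4, 4, 2, 5, 5, 1],
})

# User-product matrix; unrated cells become 0
matrix = df.pivot_table(index="user_id", columns="product_id", values="rating").fillna(0)
similarity = matrix.corr()  # item-to-item Pearson correlation

def top_similar(similarity, product, n=2):
    """Return the n products most correlated with `product`, excluding itself."""
    return similarity[product].drop(product).sort_values(ascending=False).head(n)

print(top_similar(similarity, "A"))
```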
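### The "Memory Shrinker"

A quick way to verify the `category` and downcast savings claimed in Sections 5 and 8. A minimal sketch under stated assumptions: `shrink` is a hypothetical helper, and the 50%-cardinality threshold is an arbitrary heuristic.

```python
import numpy as np
import pandas as pd

def shrink(df):
    """Copy df, converting low-cardinality object columns to 'category'
    and downcasting integer columns to the smallest width that fits."""
    out = df.copy()
    for col in out.columns:
        if out[col].dtype == object and out[col].nunique() < 0.5 * len(out):
            out[col] = out[col].astype("category")
        elif pd.api.types.is_integer_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast="integer")
    return out

df = pd.DataFrame({
    "city": np.random.choice(["NY", "LA", "SF"], size=100_000),  # low cardinality
    "temp": np.random.randint(0, 40, size=100_000),
})

before = df.memory_usage(deep=True).sum()
after = shrink(df).memory_usage(deep=True).sum()
print(f"{before:,} bytes -> {after:,} bytes")
```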
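### The "Safe Slice Assignment"

Makes the SettingWithCopyWarning advice from Section 5 concrete; the `score`/`passed` columns are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"score": [40, 85, 60], "passed": [False, False, False]})

# Risky: chained indexing may write to a temporary copy
# and triggers SettingWithCopyWarning
# df[df["score"] >= 50]["passed"] = True

# Safe: a single .loc call with a boolean mask writes directly to df
df.loc[df["score"] >= 50, "passed"] = True
print(df)
```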
---

## 6. Edge Cases

- **Boolean Masking**: `df[(cond1) & (cond2)]` requires the parentheses, because `&` binds tighter than comparison operators.
- **String Ops on Mixed Types**: `df['col'].str.lower()` fails if there are non-strings. Fix: `df['col'].astype(str).str.lower()`.

---

## 7. Task Mapping

| Task | Approach | Command |
| :--- | :--- | :--- |
| **Count frequency of categories** | Value counts | `df['col'].value_counts()` |
| **Filter by a list of IDs** | Use `.isin()` | `df[df['id'].isin(id_list)]` |
| **Deduplicate by email** | Targeted `drop_duplicates` | `df.drop_duplicates(subset=['email'])` |

---

## 8. Deep Insights (Interview Traps)

- **`count()` vs `len()`**: `count()` excludes `NaN`; `len()` counts every row.
- **Sparse Data**: For large datasets dominated by zeros/NaNs, use a sparse dtype, e.g. `df.astype(pd.SparseDtype('float', 0))`, to save memory.
- **Memory Optimization**: Use `pd.to_numeric(df['col'], downcast='integer')` to force smaller integer widths.
- **Coercing Errors**: `pd.to_datetime(df['date'], errors='coerce')` turns unparseable values into `NaT` instead of crashing (worked sketch below).

---

## 9. Quick Reference Card

```python
# PREVIEW DATA
df.info()       # dtypes and non-null counts
df.describe()   # stats (mean, std, min, max)

# MULTI-AGGREGATION
df.groupby('city')['temp'].agg(['mean', 'max', 'min'])

# STRING SEARCH
df[df['name'].str.contains('Apple', na=False)]

# EXPORT WITHOUT THE INDEX COLUMN
df.to_csv('output.csv', index=False)
```
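---

## 10. Worked Sketch: Coercion Cleanup

Ties the coercion tips from Sections 5 and 8 together. The messy frame and its column names are invented for illustration:

```python
import pandas as pd

raw = pd.DataFrame({
    "amount": ["10", "20.5", "N/A", "30"],
    "date":   ["2024-01-05", "not a date", "2024-02-10", "2024-03-01"],
})

# object -> numeric: unparseable strings become NaN instead of raising
raw["amount"] = pd.to_numeric(raw["amount"], errors="coerce")

# object -> datetime: invalid entries become NaT instead of crashing
raw["date"] = pd.to_datetime(raw["date"], errors="coerce")

print(raw.dtypes)
print(raw.isnull().sum())  # audit what coercion turned into NaN/NaT
```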
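---

## 11. Worked Sketch: Vectorization vs `iterrows()`

The timing demo promised in Section 5. Exact numbers vary by machine; the point is the order-of-magnitude gap.

```python
import time

import numpy as np
import pandas as pd

df = pd.DataFrame({"a": np.random.rand(100_000), "b": np.random.rand(100_000)})

# Slow: a Python-level loop over rows
t0 = time.perf_counter()
total = 0.0
for _, row in df.iterrows():
    total += row["a"] * row["b"]
loop_s = time.perf_counter() - t0

# Fast: one vectorized expression over whole columns
t0 = time.perf_counter()
total_vec = (df["a"] * df["b"]).sum()
vec_s = time.perf_counter() - t0

print(f"loop: {loop_s:.2f}s  vectorized: {vec_s:.4f}s  (same result: {np.isclose(total, total_vec)})")
```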