# Sklearn Guide > **Coverage:** Intuition · Math · Parameters · Methods · Exam Traps > **Format:** Dense reference every model, every parameter --- ## PART 0 NumPy, Pandas & Data Viz Quick Reference ### NumPy Essentials | Operation | Code | Notes | |-----------|------|-------| | Array creation | `np.array([1,2,3])`, `np.zeros((m,n))`, `np.ones`, `np.eye(n)` | eye = identity matrix | | Ranges | `np.arange(start,stop,step)`, `np.linspace(start,stop,n)` | linspace includes endpoint | | Shape ops | `a.reshape(m,n)`, `a.T`, `a.flatten()`, `np.expand_dims(a,axis)` | reshape(-1) = flatten | | Math | `np.dot(A,B)`, `np.matmul`, `A @ B`, `np.linalg.inv`, `np.linalg.norm` | @ is matmul operator | | Stats | `np.mean(a,axis=)`, `np.std`, `np.var`, `np.median`, `np.percentile` | axis=0→col, axis=1→row | | Boolean | `np.where(cond,x,y)`, `np.any`, `np.all`, `np.argmax`, `np.argmin` | argmax returns index | | Random | `np.random.seed(n)`, `np.random.rand(m,n)`, `np.random.randn`, `np.random.randint` | rand=uniform, randn=normal | | Stacking | `np.hstack([a,b])`, `np.vstack`, `np.concatenate(axis=)` | hstack=column-wise | | Unique | `np.unique(y)`, `np.unique(y, return_counts=True)` | for class labels | ### Pandas Quick Reference | Operation | Code | |-----------|------| | Read | `pd.read_csv(path, sep=, header=, index_col=, usecols=, dtype=, na_values=)` | | Info | `df.shape`, `df.dtypes`, `df.info()`, `df.describe()`, `df.head(n)`, `df.tail(n)` | | Select | `df['col']`, `df[['c1','c2']]`, `df.loc[rows,cols]`, `df.iloc[i,j]` | | Filter | `df[df['col']>5]`, `df.query('col > 5')`, `df[df['col'].isin([1,2])]` | | Missing | `df.isnull().sum()`, `df.dropna()`, `df.fillna(val)`, `df.interpolate()` | | Apply | `df['col'].apply(func)`, `df.apply(func, axis=1)`, `df.map(dict)` | | GroupBy | `df.groupby('col').agg({'c2':'mean','c3':'sum'})` | | Merge | `pd.merge(df1,df2,on='key',how='left/right/inner/outer')` | | Concat | `pd.concat([df1,df2], axis=0/1, ignore_index=True)` | | Pivot | `df.pivot_table(values=, index=, columns=, aggfunc=)` | | Sort | `df.sort_values('col', ascending=False)`, `df.sort_index()` | | Rename | `df.rename(columns={'old':'new'})` | | Drop | `df.drop('col',axis=1)`, `df.drop(index=[0,1])` | | Dummies | `pd.get_dummies(df, columns=['cat_col'], drop_first=True)` | --- ## PART 1 Sklearn Preprocessing & Transformers ### 1.1 Data Splitting ```python from sklearn.model_selection import train_test_split X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.2, # float=fraction, int=count random_state=42, # reproducibility seed stratify=y, # preserve class proportions (use for classification!) shuffle=True # default True ) ``` **Exam trap:** Always use `stratify=y` for imbalanced classification problems. ### 1.2 Feature Scaling | Scaler | Formula | Use When | Sensitive to Outliers | |--------|---------|----------|----------------------| | `StandardScaler` | (x−μ)/σ | Gaussian-like, SVM, LR, NN | Yes | | `MinMaxScaler` | (x−min)/(max−min) | Bounded output [0,1], NN | Yes | | `RobustScaler` | (x−median)/IQR | Outliers present | No | | `MaxAbsScaler` | x/max(\|x\|) | Sparse data, [-1,1] | Yes | | `Normalizer` | x/‖x‖ | Per-sample normalization (rows) | | ```python from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler scaler = StandardScaler() X_train_sc = scaler.fit_transform(X_train) # fit+transform on train X_test_sc = scaler.transform(X_test) # ONLY transform on test (no fit!) 
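# Illustrative sanity check (assumes numpy imported as np, as in Part 0):
# fit_transform is just fit() followed by transform(), and inverse_transform undoes the scaling
scaler2 = StandardScaler().fit(X_train)
assert np.allclose(scaler2.transform(X_train), X_train_sc)
X_test_orig = scaler.inverse_transform(X_test_sc)   # back to original units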
# Key attributes:
scaler.mean_              # learned mean per feature
scaler.scale_             # learned std per feature
scaler.var_               # learned variance
scaler.n_features_in_
```

**Exam trap:** NEVER call `fit_transform` on the test set; it causes data leakage!

### 1.3 Encoders

#### LabelEncoder
```python
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y_enc = le.fit_transform(y)        # encodes to 0,1,2...
le.classes_                        # original class names
le.inverse_transform([0,1,2])      # decode back
```

#### OrdinalEncoder
```python
from sklearn.preprocessing import OrdinalEncoder
oe = OrdinalEncoder(
    categories='auto',             # or list of lists per feature
    handle_unknown='use_encoded_value',
    unknown_value=-1
)
X_enc = oe.fit_transform(X[['col']])
oe.categories_                     # list of arrays, one per feature
```

#### OneHotEncoder
```python
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder(
    categories='auto',             # or list of lists
    drop=None,                     # None, 'first', 'if_binary'
    sparse_output=True,            # returns sparse matrix by default
    handle_unknown='error',        # or 'ignore'
    dtype=np.float64
)
X_enc = ohe.fit_transform(X[['col']])
# Key attrs:
ohe.categories_                    # list of arrays → categories per feature
ohe.get_feature_names_out()        # column names like ['col_A','col_B']
ohe.drop_idx_                      # which index was dropped (if drop='first')
```

**Exam calculation:** For OHE with drop=None on a column with 3 values [A,B,C]:
- Input A → [1,0,0]; B → [0,1,0]; C → [0,0,1]
- `categories_` = [array(['A','B','C'])]
- n_output_cols = 3 (or 2 if drop='first')

#### TargetEncoder
```python
from sklearn.preprocessing import TargetEncoder
te = TargetEncoder(target_type='auto', smooth='auto', cv=5)
```

### 1.4 Handling Missing Values
```python
from sklearn.impute import SimpleImputer, KNNImputer

# SimpleImputer
imp = SimpleImputer(
    missing_values=np.nan,
    strategy='mean',               # 'mean','median','most_frequent','constant'
    fill_value=None                # used when strategy='constant'
)

# KNNImputer
kimp = KNNImputer(n_neighbors=5, weights='uniform')  # weights: 'uniform','distance'
```

### 1.5 Feature Selection

#### Filter Methods (no model)
```python
from sklearn.feature_selection import SelectKBest, f_classif, f_regression, chi2, mutual_info_classif
sel = SelectKBest(score_func=f_classif, k=10)   # k='all' to see all scores
sel.fit(X_train, y_train)
sel.scores_               # score per feature
sel.pvalues_              # p-value per feature
sel.get_support()         # boolean mask of selected features
X_new = sel.transform(X_train)
```

| Score Func | Use For |
|------------|---------|
| `f_classif` | Classification, continuous features, ANOVA F-test |
| `f_regression` | Regression, linear correlation F-test |
| `chi2` | Classification, non-negative features (counts) |
| `mutual_info_classif` | Classification, any type, non-linear |
| `mutual_info_regression` | Regression, any type, non-linear |

#### Wrapper Methods (use model)
```python
from sklearn.feature_selection import RFE, RFECV, SequentialFeatureSelector

# RFE: Recursive Feature Elimination
rfe = RFE(
    estimator=LogisticRegression(),
    n_features_to_select=5,
    step=1                         # features removed per iteration
)
rfe.fit(X_train, y_train)
rfe.ranking_                       # 1 = selected
rfe.support_                       # boolean mask

# RFECV: RFE with cross-validation to find best n
rfecv = RFECV(estimator=LogisticRegression(), cv=5, scoring='accuracy')
rfecv.n_features_                  # optimal number found
```

#### Embedded Methods
```python
from sklearn.feature_selection import SelectFromModel
sel = SelectFromModel(
    estimator=RandomForestClassifier(),
    threshold='mean',              # 'mean', 'median', a float, or a scaled string like '1.25*mean'
    max_features=None
)
sel.fit(X_train, y_train)          # fit before accessing estimator_
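# Minimal usage sketch (illustrative; assumes X_train is a pandas DataFrame):
X_sel = sel.transform(X_train)                  # keep only features above the threshold
kept_cols = X_train.columns[sel.get_support()]  # names of the selected features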
sel.estimator_.feature_importances_ ``` ### 1.6 Dimensionality Reduction: PCA **Math:** Find orthogonal directions of maximum variance. SVD: X = UΣVᵀ → principal components = columns of V (eigenvectors of XᵀX) ```python from sklearn.decomposition import PCA pca = PCA( n_components=2, # int=components, float=variance ratio (e.g. 0.95), 'mle' whiten=False, # divide by sqrt(eigenvalue) → unit variance random_state=42, svd_solver='auto' # 'auto','full','arpack','randomized' ) pca.fit(X_scaled) # Key attributes: pca.explained_variance_ratio_ # variance fraction per component pca.explained_variance_ # absolute variance per component pca.components_ # eigenvectors shape (n_comp, n_features) pca.singular_values_ # singular values pca.n_components_ # actual n components used np.cumsum(pca.explained_variance_ratio_) # cumulative variance ``` **Exam trap:** Always scale before PCA! PCA is sensitive to feature magnitudes. --- ## PART 2 Pipelines & Composite Transformers ### 2.1 Pipeline ```python from sklearn.pipeline import Pipeline, make_pipeline pipe = Pipeline(steps=[ ('imputer', SimpleImputer(strategy='mean')), ('scaler', StandardScaler()), ('model', LogisticRegression()) ], verbose=False, memory=None) # Access steps: pipe['scaler'] # by name pipe.named_steps['scaler'] # same pipe.steps[1][1] # by index # Methods (delegates to final estimator): pipe.fit(X_train, y_train) pipe.predict(X_test) pipe.score(X_test, y_test) pipe.fit_transform(X_train) # only if last step is transformer # Pipeline with param grid: # Use __ to access nested params: param_grid = {'model__C': [0.1,1,10], 'scaler__with_mean': [True,False]} ``` ### 2.2 ColumnTransformer ```python from sklearn.compose import ColumnTransformer, make_column_transformer ct = ColumnTransformer(transformers=[ ('num', StandardScaler(), ['age','income']), # numeric cols ('cat', OneHotEncoder(), ['gender','city']), # categorical cols ('pass', 'passthrough', ['id_col']), # unchanged ('drop', 'drop', ['useless_col']), # dropped ], remainder='drop', # or 'passthrough' for unspecified columns verbose_feature_names_out=True, n_jobs=None) ct.fit_transform(X_train) ct.get_feature_names_out() ``` ### 2.3 FeatureUnion (parallel transforms) ```python from sklearn.pipeline import FeatureUnion fu = FeatureUnion([ ('pca', PCA(n_components=3)), ('kbest', SelectKBest(k=5)) ]) ``` ### 2.4 Cross-Validation ```python from sklearn.model_selection import cross_val_score, cross_validate, KFold, StratifiedKFold # Basic CV scores = cross_val_score(model, X, y, cv=5, scoring='accuracy', n_jobs=-1) # Multiple metrics results = cross_validate(model, X, y, cv=5, scoring=['accuracy','f1_macro'], return_train_score=True) # Custom CV splitters: kf = KFold(n_splits=5, shuffle=True, random_state=42) skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42) # preserves class ratio ``` ### 2.5 Hyperparameter Tuning ```python from sklearn.model_selection import GridSearchCV, RandomizedSearchCV # GridSearch: exhaustive gs = GridSearchCV( estimator=pipe, param_grid={'model__C': [0.01,0.1,1,10], 'model__penalty': ['l1','l2']}, cv=5, scoring='f1', refit=True, # refit best model on full training data n_jobs=-1, return_train_score=True, verbose=2 ) gs.fit(X_train, y_train) gs.best_params_ gs.best_score_ gs.best_estimator_ gs.cv_results_ # dict with all results # RandomizedSearch: samples from distributions from scipy.stats import randint, uniform rs = RandomizedSearchCV( estimator=model, param_distributions={'C': uniform(0.01,10), 'max_iter': randint(100,500)}, n_iter=50, # number of 
random samples cv=5, scoring='f1', random_state=42, refit=True ) ``` --- ## PART 3 Regression Models ### 3.1 Linear Regression **Intuition:** Find hyperplane minimizing sum of squared residuals. **Math:** - Model: ŷ = θ₀ + θ₁x₁ + ... + θₙxₙ = Xθ - Cost: MSE = (1/m)Σ(ŷᵢ−yᵢ)² - Closed form (Normal Equation): θ = (XᵀX)⁻¹Xᵀy - OLS assumptions: linearity, homoscedasticity, no multicollinearity, normality of residuals ```python from sklearn.linear_model import LinearRegression lr = LinearRegression( fit_intercept=True, # add bias term copy_X=True, n_jobs=None, positive=False # constrain coefs to be positive ) lr.fit(X_train, y_train) lr.coef_ # shape (n_features,) or (n_targets, n_features) lr.intercept_ # bias term lr.rank_ # rank of X lr.singular_ # singular values of X ``` **Evaluation Metrics:** | Metric | Formula | sklearn | |--------|---------|---------| | MSE | (1/m)Σ(ŷ−y)² | `mean_squared_error(y,ŷ)` | | RMSE | √MSE | `mean_squared_error(y,ŷ, squared=False)` | | MAE | (1/m)Σ\|ŷ−y\| | `mean_absolute_error(y,ŷ)` | | R² | 1−SS_res/SS_tot | `r2_score(y,ŷ)` or `lr.score(X,y)` | | MAPE | (1/m)Σ\|ŷ−y\|/\|y\| | `mean_absolute_percentage_error` | ### 3.2 Gradient Descent (Math) **Batch GD:** θⱼ ← θⱼ − η·(∂J/∂θⱼ) using ALL data per step **SGD:** θⱼ ← θⱼ − η·(∂J/∂θⱼ)ᵢ using ONE sample per step → noisy but fast **Mini-batch GD:** batch of k samples compromise | Type | Convergence | Memory | Noise | |------|-------------|--------|-------| | Batch | Smooth, slow | High | None | | SGD | Noisy, fast | Low | High | | Mini-batch | Moderate | Medium | Medium | **Learning rate schedules:** | Schedule | Formula | Effect | |----------|---------|--------| | Constant | η = η₀ | Fixed step | | Time decay | η = η₀/(1+t·d) | Decreasing | | Optimal | η = 1/(α(t₀+t)) | sklearn default | | Inverse scaling | η = η₀/t^power | Power decay | | Adaptive | per-param | Adam, RMSprop | ### 3.3 SGDRegressor ```python from sklearn.linear_model import SGDRegressor sgd = SGDRegressor( loss='squared_error', # 'squared_error','huber','epsilon_insensitive' penalty='l2', # 'l1','l2','elasticnet',None alpha=0.0001, # regularization strength l1_ratio=0.15, # for elasticnet: mix of l1/l2 fit_intercept=True, max_iter=1000, # passes over data (epochs) tol=1e-3, # stopping tolerance shuffle=True, # shuffle each epoch verbose=0, epsilon=0.1, # for Huber/epsilon_insensitive random_state=None, learning_rate='invscaling', # 'constant','optimal','invscaling','adaptive' eta0=0.01, # initial learning rate power_t=0.25, # for invscaling: η=η₀/t^power_t early_stopping=False, validation_fraction=0.1, n_iter_no_change=5, warm_start=False, # reuse previous fit as init average=False # average SGD weights ) sgd.coef_; sgd.intercept_; sgd.n_iter_; sgd.t_ ``` ### 3.4 Polynomial Regression **Concept:** Transform features → add polynomial terms → apply linear regression on transformed features. ```python from sklearn.preprocessing import PolynomialFeatures poly = PolynomialFeatures( degree=2, # polynomial degree interaction_only=False, # only cross terms if True (no x², x³) include_bias=True # add column of 1s ) X_poly = poly.fit_transform(X) poly.n_output_features_ # total output features poly.get_feature_names_out() # e.g. ['1','x0','x1','x0^2','x0 x1','x1^2'] # For degree=2, 2 features: (2+2)! / (2! 2!) = 6 features (with bias) ``` **Feature count formula:** With n features, degree d: C(n+d, d) features (with bias) ### 3.5 Regularized Models **Intuition:** Add penalty to loss to shrink coefficients → prevents overfitting. 
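A quick sketch of that shrinkage effect (illustrative only; reuses `X_train`/`y_train` from earlier and `Ridge`, covered just below):

```python
import numpy as np
from sklearn.linear_model import Ridge

# As alpha grows, the L2 penalty dominates and the coefficient norm shrinks toward 0
for alpha in [0.01, 1, 100]:
    r = Ridge(alpha=alpha).fit(X_train, y_train)
    print(f"alpha={alpha}: ||coef|| = {np.linalg.norm(r.coef_):.4f}")
```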
#### Ridge (L2 Regularization)
**Loss:** MSE + α·Σθⱼ²
- Shrinks all coefficients toward 0, never exactly 0
- Handles multicollinearity well
- Closed form: θ = (XᵀX + αI)⁻¹Xᵀy

```python
from sklearn.linear_model import Ridge, RidgeCV
ridge = Ridge(
    alpha=1.0,            # regularization strength (λ); larger = more shrinkage
    fit_intercept=True,
    solver='auto',        # 'auto','svd','cholesky','lsqr','sparse_cg','sag','saga','lbfgs'
    max_iter=None,
    tol=1e-4,
    random_state=None
)

# RidgeCV: built-in cross-validation for alpha
rcv = RidgeCV(alphas=[0.1,1.0,10.0], cv=5, scoring='neg_mean_squared_error')
rcv.alpha_                # best alpha found
rcv.coef_; rcv.intercept_
```

#### Lasso (L1 Regularization)
**Loss:** MSE + α·Σ|θⱼ|
- Can shrink coefficients to EXACTLY 0 → feature selection
- Coordinate descent solver
- Not differentiable at 0 (subgradient)

```python
from sklearn.linear_model import Lasso, LassoCV
lasso = Lasso(
    alpha=1.0,
    fit_intercept=True,
    max_iter=1000,
    tol=1e-4,
    warm_start=False,
    positive=False,
    selection='cyclic'    # 'cyclic' or 'random' ('random' can converge faster)
)
lasso.n_iter_             # iterations run
lasso.sparse_coef_        # sparse representation of coef_
```

#### ElasticNet (L1 + L2)
**Loss:** MSE + α·l1_ratio·Σ|θⱼ| + α·(1−l1_ratio)/2·Σθⱼ²

```python
from sklearn.linear_model import ElasticNet, ElasticNetCV
en = ElasticNet(
    alpha=1.0,
    l1_ratio=0.5,         # 0=Ridge, 1=Lasso, (0,1)=ElasticNet
    fit_intercept=True,
    max_iter=1000,
    tol=1e-4,
    warm_start=False,
    selection='cyclic'
)
```

| Model | Penalty | Coefs→0? | Use When |
|-------|---------|----------|----------|
| Ridge | α·Σθ² | Never | Multicollinearity, all features useful |
| Lasso | α·Σ\|θ\| | Yes (sparse) | Feature selection needed |
| ElasticNet | Both | Yes | Many correlated features |

---

## PART 4 Classification Models

### 4.1 Logistic Regression

**Intuition:** Linear model + sigmoid → output probability.

**Math:**
- logit(p) = log(p/(1-p)) = θᵀx (log-odds / linear output)
- σ(z) = 1/(1+e⁻ᶻ) → sigmoid
- P(y=1|x) = σ(θᵀx)
- Loss: Binary Cross-Entropy = −(1/m)Σ[yᵢlog(ŷᵢ) + (1−yᵢ)log(1−ŷᵢ)]
- Gradient: ∇θJ = (1/m)·Xᵀ(σ(Xθ)−y)
- Predict: ŷ=1 if P≥0.5 else 0 (threshold adjustable)

```python
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(
    penalty='l2',          # 'l1','l2','elasticnet',None
    dual=False,            # dual formulation (for l2, n>p use False)
    tol=1e-4,
    C=1.0,                 # INVERSE of regularization: smaller C = stronger reg
    fit_intercept=True,
    intercept_scaling=1,
    class_weight=None,     # None, 'balanced', or dict {0:w0,1:w1}
    random_state=None,
    solver='lbfgs',        # see table below
    max_iter=100,
    multi_class='auto',    # 'auto','ovr','multinomial'
    verbose=0,
    warm_start=False,
    n_jobs=None,
    l1_ratio=None          # for elasticnet
)
lr.coef_                   # shape (1,n_feat) binary or (n_class,n_feat) multiclass
lr.intercept_              # shape (1,) or (n_class,)
lr.classes_                # class labels
lr.n_iter_                 # iterations per class
lr.predict_proba(X)        # shape (n_samples, n_classes)
lr.predict_log_proba(X)    # log probabilities
lr.decision_function(X)    # raw logit scores
```

**Solver compatibility:**

| Solver | Penalty | Multi-class | Notes |
|--------|---------|-------------|-------|
| `lbfgs` | l2, None | ovr, multinomial | Default, good general solver |
| `liblinear` | l1, l2 | ovr only | Good for small datasets |
| `saga` | l1, l2, elasticnet, None | ovr, multinomial | Large datasets |
| `sag` | l2, None | ovr, multinomial | Large datasets |
| `newton-cg` | l2, None | ovr, multinomial | |
| `newton-cholesky` | l2, None | ovr only | |

**Exam trap:** `C` is the INVERSE of regularization strength: C=0.01 regularizes more strongly than C=100!
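A minimal sketch connecting the math above to the estimator's outputs (assumes a fitted binary `lr` and an `X_test` like the earlier splits):

```python
import numpy as np

z = lr.decision_function(X_test)             # raw logits θᵀx + b
p_manual = 1 / (1 + np.exp(-z))              # sigmoid of the logits
p_sklearn = lr.predict_proba(X_test)[:, 1]   # positive-class probability
assert np.allclose(p_manual, p_sklearn)

# default predict() uses a 0.5 threshold; adjust manually if needed
y_pred_03 = (p_sklearn >= 0.3).astype(int)   # lower threshold → more positives
```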
### 4.2 Perceptron **Math:** Output = sign(wᵀx + b); weight update only on misclassified: w ← w + η·yᵢ·xᵢ ```python from sklearn.linear_model import Perceptron per = Perceptron( penalty=None, # 'l1','l2','elasticnet' alpha=0.0001, # regularization (if penalty set) fit_intercept=True, max_iter=1000, tol=1e-3, shuffle=True, verbose=0, eta0=1, # learning rate (constant for Perceptron) n_jobs=None, random_state=0, early_stopping=False, validation_fraction=0.1, n_iter_no_change=5, class_weight=None, warm_start=False ) per.coef_; per.intercept_; per.n_iter_ ``` **Note:** Perceptron = SGDClassifier(loss='perceptron', eta0=1, learning_rate='constant', penalty=None) ### 4.3 SGDClassifier ```python from sklearn.linear_model import SGDClassifier sgd = SGDClassifier( loss='hinge', # see table below penalty='l2', alpha=0.0001, l1_ratio=0.15, fit_intercept=True, max_iter=1000, tol=1e-3, shuffle=True, verbose=0, epsilon=0.1, n_jobs=None, random_state=None, learning_rate='optimal',# 'constant','optimal','invscaling','adaptive' eta0=0.0, # initial LR (required if learning_rate='constant') power_t=0.5, early_stopping=False, validation_fraction=0.1, n_iter_no_change=5, class_weight=None, warm_start=False, average=False # int or True → average weights after n_samples ) # Partial fit (online learning, multi-epoch): sgd.partial_fit(X_batch, y_batch, classes=np.unique(y_train)) # Must pass classes= on FIRST call only (but safe to always pass) ``` **Loss functions:** | Loss | Equivalent Model | Use | |------|-----------------|-----| | `hinge` | Linear SVM | Binary classification | | `modified_huber` | Smooth hinge + probabilistic | Binary, can use predict_proba | | `log_loss` | Logistic Regression | Binary, probabilistic | | `perceptron` | Perceptron | | | `squared_hinge` | Squared SVM | | | `huber` | Robust regression | Regression | | `squared_error` | Linear Regression | Regression | | `epsilon_insensitive` | SVR | Regression | **Computing Log Loss manually:** Loss = −(1/m)Σ[yᵢ·log(pᵢ) + (1−yᵢ)·log(1−pᵢ)] ```python from sklearn.metrics import log_loss log_loss(y_true, y_prob) # y_prob shape (n, n_classes) ``` --- ## PART 5 Classification Metrics ### 5.1 Confusion Matrix ``` Predicted 0 Predicted 1 Actual 0 TN FP Actual 1 FN TP ``` ```python from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay cm = confusion_matrix(y_true, y_pred) # labels= to specify order ConfusionMatrixDisplay(cm, display_labels=['neg','pos']).plot() # Or directly: ConfusionMatrixDisplay.from_predictions(y_true, y_pred) ConfusionMatrixDisplay.from_estimator(model, X_test, y_test) ``` ### 5.2 Classification Metrics Formulas | Metric | Formula | Focus | |--------|---------|-------| | Accuracy | (TP+TN)/(TP+TN+FP+FN) | Overall correct | | Precision | TP/(TP+FP) | Of positives predicted, how many correct | | Recall (Sensitivity) | TP/(TP+FN) | Of actual positives, how many found | | F1 | 2·P·R/(P+R) | Harmonic mean of P and R | | Specificity | TN/(TN+FP) | True negative rate | | FPR | FP/(FP+TN) | False alarm rate | | MCC | (TP·TN−FP·FN)/√(...) 
| Balanced, even for imbalanced | ```python from sklearn.metrics import ( accuracy_score, precision_score, recall_score, f1_score, classification_report, roc_auc_score, average_precision_score, matthews_corrcoef ) # average options for multiclass: # 'binary' (default), 'micro', 'macro', 'weighted', 'samples' # macro: mean of per-class, no weighting # weighted: weighted by support (class count) # micro: globally count TP/FP/FN across all classes precision_score(y, ŷ, average='weighted', zero_division=0) f1_score(y, ŷ, average='macro') classification_report(y, ŷ, target_names=['cls0','cls1']) ``` ### 5.3 ROC-AUC & PR Curves ```python from sklearn.metrics import roc_curve, auc, RocCurveDisplay, precision_recall_curve fpr, tpr, thresholds = roc_curve(y_true, y_score) # y_score = probabilities roc_auc = auc(fpr, tpr) RocCurveDisplay.from_estimator(model, X_test, y_test) RocCurveDisplay.from_predictions(y_true, y_score) # PR Curve (better for imbalanced) prec, rec, thresh = precision_recall_curve(y_true, y_score) ``` **AUC:** 0.5 = random, 1.0 = perfect. Use when classes balanced. **PR AUC:** Better for highly imbalanced datasets. --- ## PART 6 Naive Bayes ### 6.1 Math (Exam Calculation Focus) **Bayes Theorem:** P(y|x) = P(x|y)·P(y) / P(x) **Naive assumption:** Features are conditionally independent given class. P(y|x₁,...,xₙ) ∝ P(y)·∏P(xᵢ|y) **Steps for hand calculation:** 1. Compute prior P(y=c) = count(c)/total 2. For each feature: P(xᵢ|y=c) - Gaussian: use mean and variance per class - Bernoulli: P(xᵢ=1|y=c) with Laplace smoothing - Multinomial: P(xᵢ|y=c) = (count + α)/(total_count + α·n_features) 3. Multiply all: score(c) = P(c)·∏P(xᵢ|c) 4. Predict: argmax_c score(c) **Gaussian NB:** P(xᵢ|y=c) = (1/√(2πσ²))·exp(−(xᵢ−μ)²/2σ²) ### 6.2 Sklearn Implementation ```python from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB, ComplementNB # Gaussian NB (continuous features) gnb = GaussianNB( priors=None, # class priors; if None, estimated from data var_smoothing=1e-9 # variance smoothing to prevent 0-variance ) gnb.fit(X_train, y_train) gnb.class_prior_ # P(y) per class gnb.theta_ # mean per class per feature (n_classes, n_features) gnb.var_ # variance per class per feature gnb.classes_ # Multinomial NB (word counts, text) mnb = MultinomialNB( alpha=1.0, # Laplace smoothing (0=no smoothing) fit_prior=True, # learn class priors; False → uniform class_prior=None # override priors ) mnb.feature_log_prob_ # log P(feature|class) shape (n_classes, n_features) mnb.class_log_prior_ # log P(class) # Bernoulli NB (binary features) bnb = BernoulliNB( alpha=1.0, binarize=0.0, # threshold to binarize; None if already binary fit_prior=True ) # Complement NB (imbalanced text classification) cnb = ComplementNB(alpha=1.0, fit_prior=True, norm=False) ``` **Exam calculation example (Laplace smoothing):** If word "spam" appears 3 times in spam class (100 total words, 1000 unique vocab): P(spam|spam_class) = (3+1)/(100+1·1000) = 4/1100 ≈ 0.00364 --- ## PART 7 K-Nearest Neighbors ### 7.1 Intuition & Math **Algorithm:** For prediction, find k closest training points → vote (classify) or average (regress). **Distance metrics:** - Euclidean: d = √(Σ(xᵢ−yᵢ)²) → p=2 - Manhattan: d = Σ|xᵢ−yᵢ| → p=1 - Minkowski: d = (Σ|xᵢ−yᵢ|ᵖ)^(1/p) → general - Chebyshev: max|xᵢ−yᵢ| → p=∞ **Bias-Variance:** Small k → low bias, high variance (overfit). Large k → high bias, low variance (underfit). 
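A quick NumPy check of the distance formulas above on two illustrative points (not tied to any dataset in this guide):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
z = np.array([4.0, 0.0, 3.0])

euclidean = np.sqrt(np.sum((x - z) ** 2))    # p=2 → √(9+4+0) ≈ 3.606
manhattan = np.sum(np.abs(x - z))            # p=1 → 3+2+0 = 5
chebyshev = np.max(np.abs(x - z))            # p=∞ → 3
```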
### 7.2 Implementation ```python from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor knn = KNeighborsClassifier( n_neighbors=5, # k weights='uniform', # 'uniform'=equal, 'distance'=1/d weighting algorithm='auto', # 'auto','ball_tree','kd_tree','brute' leaf_size=30, # for BallTree/KDTree (memory/speed tradeoff) p=2, # power for Minkowski: 1=Manhattan, 2=Euclidean metric='minkowski', # distance metric (overrides p if not minkowski) metric_params=None, n_jobs=None ) knn.fit(X_train, y_train) knn.predict(X_test) knn.predict_proba(X_test) # fraction of k neighbors per class knn.kneighbors(X, n_neighbors=5) # returns (distances, indices) knn.kneighbors_graph(X) # adjacency graph # Regressor: same params + weights knnr = KNeighborsRegressor(n_neighbors=5, weights='distance') ``` **Exam trap:** KNN requires feature scaling different scales dominate distance! --- ## PART 8 Support Vector Machines ### 8.1 Intuition & Math **Hard SVM:** Find hyperplane that maximizes margin = 2/‖w‖ with no misclassifications. - Decision boundary: wᵀx + b = 0 - Constraints: yᵢ(wᵀxᵢ + b) ≥ 1 - Minimize: ½‖w‖² (maximize margin) **Soft SVM (C-SVM):** Allow misclassifications via slack variables ξᵢ - Minimize: ½‖w‖² + C·Σξᵢ - C large → smaller margin, fewer violations (may overfit) - C small → larger margin, more violations (may underfit) **Support Vectors:** Training points on or inside the margin → fully determine the hyperplane. **Kernel Trick:** Map to high-dim space implicitly using kernel function K(x,z)=φ(x)·φ(z) | Kernel | Formula | Use | |--------|---------|-----| | Linear | xᵀz | Linearly separable | | RBF/Gaussian | exp(−γ‖x−z‖²) | Non-linear, general | | Polynomial | (γxᵀz+r)^d | Polynomial boundaries | | Sigmoid | tanh(γxᵀz+r) | Like NN | **Dual formulation:** Σαᵢ − ½ΣΣαᵢαⱼyᵢyⱼK(xᵢ,xⱼ) → max over α ### 8.2 SVC / SVR Implementation ```python from sklearn.svm import SVC, SVR, LinearSVC, LinearSVR, NuSVC svc = SVC( C=1.0, # regularization; smaller=wider margin kernel='rbf', # 'linear','poly','rbf','sigmoid','precomputed' degree=3, # for poly kernel gamma='scale', # RBF/poly/sigmoid kernel coef # 'scale'=1/(n_feat·var(X)), 'auto'=1/n_feat, or float coef0=0.0, # independent term in poly/sigmoid shrinking=True, # heuristic to speed up probability=False, # enables predict_proba (uses Platt scaling, slower) tol=1e-3, cache_size=200, # kernel cache in MB class_weight=None, # 'balanced' or dict verbose=False, max_iter=-1, # -1=unlimited decision_function_shape='ovr', # 'ovr' or 'ovo' for multiclass break_ties=False, random_state=None ) svc.fit(X_train, y_train) svc.support_ # indices of support vectors svc.support_vectors_ # support vector coordinates svc.n_support_ # count per class svc.dual_coef_ # αᵢyᵢ per support vector svc.coef_ # weights (linear kernel only) svc.intercept_ # bias svc.decision_function(X) # raw margin scores svc.predict_proba(X) # only if probability=True # SVR for regression svr = SVR( kernel='rbf', C=1.0, epsilon=0.1, # epsilon-tube: no penalty inside gamma='scale', degree=3, coef0=0.0, tol=1e-3, cache_size=200, verbose=False, max_iter=-1, shrinking=True ) # LinearSVC (faster for large datasets, only linear kernel) lsvc = LinearSVC( penalty='l2', # 'l1' or 'l2' loss='squared_hinge', # 'hinge' or 'squared_hinge' dual='auto', # prefer dual=True when n_samples < n_features tol=1e-4, C=1.0, multi_class='ovr', # 'ovr' or 'crammer_singer' fit_intercept=True, intercept_scaling=1, class_weight=None, verbose=0, random_state=None, max_iter=1000 ) # LinearSVC has no predict_proba! 
Use CalibratedClassifierCV to get probabilities from sklearn.calibration import CalibratedClassifierCV cal = CalibratedClassifierCV(LinearSVC(), method='sigmoid', cv=5) ``` **gamma effect:** Large γ → tight RBF → may overfit. Small γ → wide RBF → may underfit. --- ## PART 9 Decision Trees ### 9.1 Intuition & Math **Algorithm:** Recursively split on feature that best separates classes/reduces error. **Impurity measures:** | Measure | Formula | Used In | |---------|---------|---------| | Gini | 1−Σpᵢ² | Classification (default) | | Entropy | −Σpᵢ·log₂(pᵢ) | Classification | | Log Loss | same as entropy | Classification | | MSE | Σ(y−ȳ)²/n | Regression (default) | | MAE | Σ\|y−ȳ\|/n | Regression | | Poisson | Σ(y·log(y/ȳ)−y+ȳ) | Count data | **Information Gain:** IG(parent, split) = impurity(parent) − weighted_avg(impurity(children)) **Gini hand calculation example:** Node with 10 samples: 6 class A, 4 class B Gini = 1 − (6/10)² − (4/10)² = 1 − 0.36 − 0.16 = 0.48 **Entropy example:** = −(6/10)·log₂(6/10) − (4/10)·log₂(4/10) = 0.971 bits ### 9.2 Implementation ```python from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor, plot_tree, export_text dt = DecisionTreeClassifier( criterion='gini', # 'gini','entropy','log_loss' splitter='best', # 'best','random' max_depth=None, # None=unlimited (overfit); set to prune min_samples_split=2, # min samples to split a node min_samples_leaf=1, # min samples in a leaf min_weight_fraction_leaf=0.0, max_features=None, # 'sqrt','log2',int,float → features per split random_state=None, max_leaf_nodes=None, # limit total leaves min_impurity_decrease=0.0, # split only if improvement >= this class_weight=None, ccp_alpha=0.0 # cost-complexity pruning parameter ) dt.fit(X_train, y_train) dt.tree_ # internal Tree object dt.feature_importances_ # Gini importance (sum=1) dt.max_features_ # actual n features used dt.n_classes_ dt.n_features_in_ dt.get_depth() # max depth dt.get_n_leaves() # total leaves # Visualization plot_tree(dt, feature_names=X.columns, class_names=['neg','pos'], filled=True, # color by class rounded=True, max_depth=3, # display depth limit fontsize=10, impurity=True, # show impurity at each node proportion=False # show counts not proportions ) print(export_text(dt, feature_names=list(X.columns))) # Cost-complexity pruning path = dt.cost_complexity_pruning_path(X_train, y_train) ccp_alphas = path.ccp_alphas ``` **Exam trap:** `feature_importances_` = total impurity decrease weighted by sample fraction, not split count. --- ## PART 10 Ensemble Methods ### 10.1 Bagging (Bootstrap Aggregating) **Concept:** Train B models on random bootstrap samples → aggregate (vote/average). Reduces variance. 
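Why out-of-bag (OOB) evaluation works: a bootstrap sample drawn with replacement leaves out roughly 1/e ≈ 37% of the training rows, and those held-out rows act as a free validation set. A quick check of that fraction (plain arithmetic, not sklearn API):

```python
import numpy as np

m = 10_000                       # training set size
p_never_drawn = (1 - 1/m) ** m   # P(a given row is never drawn in m draws)
print(p_never_drawn, np.exp(-1)) # both ≈ 0.3679
```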
```python
from sklearn.ensemble import BaggingClassifier, BaggingRegressor
bag = BaggingClassifier(
    estimator=DecisionTreeClassifier(),  # base estimator
    n_estimators=10,            # number of models
    max_samples=1.0,            # fraction/int of samples per model
    max_features=1.0,           # fraction/int of features per model
    bootstrap=True,             # sample with replacement
    bootstrap_features=False,   # sample features with replacement
    oob_score=False,            # use out-of-bag samples for evaluation
    warm_start=False,
    n_jobs=None,
    random_state=None,
    verbose=0
)
bag.fit(X_train, y_train)
bag.oob_score_                  # OOB accuracy (if oob_score=True)
bag.oob_decision_function_      # OOB probabilities
bag.estimators_                 # list of fitted base estimators
bag.estimators_samples_         # bootstrap sample indices
bag.estimators_features_        # selected feature indices
```

### 10.2 Random Forest

**Concept:** Bagging of decision trees + random feature subsets at each split → decorrelated trees.

```python
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
rf = RandomForestClassifier(
    n_estimators=100,           # number of trees
    criterion='gini',           # 'gini','entropy','log_loss'
    max_depth=None,
    min_samples_split=2,
    min_samples_leaf=1,
    max_features='sqrt',        # features per split: 'sqrt'(default clf),'log2',int,float,None
                                # Regressor default: 1.0 (all features)
    max_leaf_nodes=None,
    min_impurity_decrease=0.0,
    bootstrap=True,
    oob_score=False,
    n_jobs=None,
    random_state=None,
    verbose=0,
    warm_start=False,
    class_weight=None,
    ccp_alpha=0.0,
    max_samples=None            # bootstrap sample size (if bootstrap=True)
)
rf.feature_importances_         # mean impurity decrease across trees
rf.estimators_                  # list of individual decision trees
rf.oob_score_
rf.oob_decision_function_
```

**Random Forest vs Bagging:** RF forces a random feature subset AT EACH SPLIT; Bagging selects features once per tree.

### 10.3 ExtraTreesClassifier (Extremely Randomized Trees)

```python
from sklearn.ensemble import ExtraTreesClassifier
# Same params as RF, BUT:
# - uses random thresholds for splits (not best threshold)
# - splitter='random' at each node
# - faster, potentially lower variance but higher bias
et = ExtraTreesClassifier(n_estimators=100, max_features='sqrt', bootstrap=False)
```

### 10.4 Voting Classifier/Regressor

```python
from sklearn.ensemble import VotingClassifier, VotingRegressor
vc = VotingClassifier(
    estimators=[
        ('lr', LogisticRegression()),
        ('svc', SVC(probability=True)),
        ('rf', RandomForestClassifier())
    ],
    voting='soft',              # 'hard'=majority vote, 'soft'=argmax(avg proba)
    weights=None,               # weight per estimator
    n_jobs=None,
    flatten_transform=True,
    verbose=False
)
vc.estimators_                  # list of fitted estimators
vc.named_estimators_            # dict access by name
# Note: soft voting needs predict_proba → SVC needs probability=True
```

**Exam trap:** `voting='soft'` averages predicted probabilities, so every estimator must implement `predict_proba`.

### 10.5 AdaBoost

**Concept:** Sequential: each model focuses on the previous model's errors by increasing the weight of misclassified samples.

**Math:**
1. Initialize weights wᵢ = 1/m
2. Fit weak learner hₜ(x)
3. Error rate: εₜ = Σwᵢ·𝟙[hₜ(xᵢ)≠yᵢ] / Σwᵢ
4. Estimator weight: αₜ = η·log((1−εₜ)/εₜ)
5. Update: wᵢ ← wᵢ·exp(αₜ·𝟙[hₜ(xᵢ)≠yᵢ]), then renormalize so Σwᵢ = 1
6.
Final: H(x) = sign(Σαₜhₜ(x)) ```python from sklearn.ensemble import AdaBoostClassifier, AdaBoostRegressor ada = AdaBoostClassifier( estimator=DecisionTreeClassifier(max_depth=1), # 'stump' default n_estimators=50, learning_rate=1.0, # shrinks each tree's contribution algorithm='SAMME', # 'SAMME' (multi-class discrete) random_state=None ) ada.fit(X_train, y_train) ada.estimators_ # list of fitted trees ada.estimator_weights_ # αₜ per tree ada.estimator_errors_ # εₜ per tree ada.feature_importances_ # weighted sum of DT importances # Staged predictions (iterative): for i, y_pred in enumerate(ada.staged_predict(X_test)): print(f"n={i+1}: acc={accuracy_score(y_test, y_pred):.3f}") ``` ### 10.6 Gradient Boosting **Concept:** Sequential: each tree fits RESIDUALS (negative gradients) of previous model. **Math:** 1. Initialize: F₀(x) = argmin_γ Σ L(yᵢ, γ) [constant prediction] 2. For m=1 to M: - Compute residuals: rᵢₘ = −∂L(yᵢ,F(xᵢ))/∂F(xᵢ) - Fit tree hₘ to residuals - Line search: γₘ = argmin_γ ΣL(yᵢ, Fₘ₋₁(xᵢ)+γhₘ(xᵢ)) - Update: Fₘ(x) = Fₘ₋₁(x) + η·γₘ·hₘ(x) ```python from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor gbc = GradientBoostingClassifier( loss='log_loss', # 'log_loss','exponential' learning_rate=0.1, # η: shrinks each tree contribution n_estimators=100, # number of trees (boosting stages) subsample=1.0, # fraction of samples per tree (stochastic GB if <1) criterion='friedman_mse', # split quality measure min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_depth=3, # shallow trees typical for boosting min_impurity_decrease=0.0, init=None, # initial estimator random_state=None, max_features=None, verbose=0, max_leaf_nodes=None, warm_start=False, validation_fraction=0.1, n_iter_no_change=None, tol=1e-4, ccp_alpha=0.0 ) gbc.feature_importances_ gbc.n_estimators_ # actual estimators used gbc.train_score_ # loss per stage on training data gbc.oob_improvement_ # if subsample < 1 for score in gbc.staged_predict(X_test): ... # predictions after each stage for pred in gbc.staged_predict_proba(X_test): ... ``` ### 10.7 HistGradientBoosting (Fast GB) ```python from sklearn.ensemble import HistGradientBoostingClassifier, HistGradientBoostingRegressor hgbc = HistGradientBoostingClassifier( max_iter=100, # n_estimators learning_rate=0.1, max_depth=None, min_samples_leaf=20, l2_regularization=0, max_bins=255, # histogram bins (speed/accuracy tradeoff) monotonic_cst=None, interaction_cst=None, warm_start=False, early_stopping='auto', scoring='loss', validation_fraction=0.1, n_iter_no_change=10, tol=1e-7, verbose=0, random_state=None, class_weight=None ) # Native support for NaN: no imputation needed! # Much faster on large datasets than GradientBoostingClassifier ``` --- ## PART 11 Neural Networks (MLPClassifier/Regressor) ### 11.1 Intuition & Math **Layers:** Input → [Hidden layers with activation] → Output **Forward pass:** a⁽ˡ⁾ = activation(W⁽ˡ⁾·a⁽ˡ⁻¹⁾ + b⁽ˡ⁾) **Activation functions:** | Name | Formula | Use | |------|---------|-----| | ReLU | max(0,x) | Hidden layers (default, avoids vanishing gradient) | | Sigmoid | 1/(1+e⁻ˣ) | Binary output | | Tanh | (eˣ−e⁻ˣ)/(eˣ+e⁻ˣ) | Hidden layers, zero-centered | | Logistic | same as sigmoid | sklearn name | | Identity | x | Regression output | | Softmax | eˣᵢ/Σeˣⱼ | Multiclass output | **Backpropagation:** Compute ∂L/∂W via chain rule → update with GD/Adam/etc. 
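To make the forward pass concrete, here is a tiny NumPy sketch of one hidden layer plus a sigmoid output (shapes and weights are illustrative, not taken from sklearn):

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))                     # 4 samples, 3 input features
W1, b1 = rng.normal(size=(3, 5)), np.zeros(5)   # hidden layer: 3 → 5 units
W2, b2 = rng.normal(size=(5, 1)), np.zeros(1)   # output layer: 5 → 1 unit

a1 = relu(X @ W1 + b1)            # a⁽¹⁾ = activation(W⁽¹⁾·a⁽⁰⁾ + b⁽¹⁾), row-vector convention
y_prob = sigmoid(a1 @ W2 + b2)    # binary output probability per sample
```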
### 11.2 Implementation

```python
from sklearn.neural_network import MLPClassifier, MLPRegressor
mlp = MLPClassifier(
    hidden_layer_sizes=(100,),  # tuple: one int per hidden layer
                                # (100,50) = 2 hidden layers: 100 and 50 neurons
    activation='relu',          # 'identity','logistic','tanh','relu'
    solver='adam',              # 'lbfgs','sgd','adam'
    alpha=0.0001,               # L2 regularization term
    batch_size='auto',          # 'auto'=min(200, n_samples); or int
    learning_rate='constant',   # 'constant','invscaling','adaptive' (only for SGD)
    learning_rate_init=0.001,   # initial learning rate
    power_t=0.5,                # for invscaling
    max_iter=200,               # max epochs
    shuffle=True,
    random_state=None,
    tol=1e-4,
    verbose=False,
    warm_start=False,
    momentum=0.9,               # for SGD
    nesterovs_momentum=True,    # Nesterov momentum for SGD
    early_stopping=False,
    validation_fraction=0.1,
    beta_1=0.9,                 # Adam: exp decay rate for 1st moment
    beta_2=0.999,               # Adam: exp decay rate for 2nd moment
    epsilon=1e-8,               # Adam: numerical stability
    n_iter_no_change=10,
    max_fun=15000               # for lbfgs: max function evaluations
)
mlp.fit(X_train, y_train)
mlp.coefs_                      # list of weight matrices per layer
mlp.intercepts_                 # list of bias vectors per layer
mlp.n_iter_                     # actual iterations
mlp.loss_                       # final loss value
mlp.loss_curve_                 # loss per iteration (sgd/adam solvers only)
mlp.out_activation_             # output activation: 'logistic','softmax','identity'
mlp.n_layers_                   # total layers including input/output
mlp.n_outputs_
```

**Solver comparison:**

| Solver | Best For | Notes |
|--------|---------|-------|
| `lbfgs` | Small datasets | Quasi-Newton, fast convergence |
| `sgd` | Large datasets | More control (momentum, learning rate) |
| `adam` | Large/noisy | Default, usually best, adaptive LR |

---

## PART 12 Unsupervised Learning

### 12.1 KMeans

**Algorithm:**
1. Initialize k centroids (randomly or k-means++)
2. Assign each point to nearest centroid
3. Recompute centroids as mean of assigned points
4. Repeat 2-3 until convergence

**Inertia (WCSS):** Σ min_μ ‖xᵢ−μ‖²

```python
from sklearn.cluster import KMeans, MiniBatchKMeans
km = KMeans(
    n_clusters=8,
    init='k-means++',           # 'k-means++' or 'random' or ndarray
    n_init='auto',              # n runs with different seeds; 'auto' = 1 for k-means++, 10 for random init
    max_iter=300,
    tol=1e-4,
    verbose=0,
    random_state=None,
    copy_x=True,
    algorithm='lloyd'           # 'lloyd','elkan'
)
km.fit(X)
km.cluster_centers_             # centroid coordinates (k, n_features)
km.labels_                      # cluster label per sample
km.inertia_                     # WCSS (lower = tighter clusters)
km.n_iter_                      # iterations to converge
km.predict(X_new)               # assign new points

# Elbow method:
inertias = [KMeans(n_clusters=k, random_state=42).fit(X).inertia_ for k in range(1,11)]
```

### 12.2 DBSCAN

**Concept:** Density-based; finds clusters of arbitrary shape; marks low-density points as noise.
- **Core point:** has ≥ min_samples within eps radius - **Border point:** within eps of core point, but fewer than min_samples neighbors - **Noise:** not within eps of any core point → label = -1 ```python from sklearn.cluster import DBSCAN db = DBSCAN( eps=0.5, # neighborhood radius min_samples=5, # min points to form core point metric='euclidean', algorithm='auto', # 'auto','ball_tree','kd_tree','brute' leaf_size=30, n_jobs=None ) db.fit(X) db.labels_ # -1 for noise db.core_sample_indices_ # indices of core points db.components_ # copy of core samples ``` ### 12.3 Agglomerative Clustering (Hierarchical) ```python from sklearn.cluster import AgglomerativeClustering agg = AgglomerativeClustering( n_clusters=2, # None if distance_threshold set metric='euclidean', # or 'cosine','manhattan','l1','l2','precomputed' linkage='ward', # 'ward','complete','average','single' distance_threshold=None, # cut dendrogram here if n_clusters=None compute_full_tree='auto', compute_distances=False ) agg.fit(X) agg.labels_ agg.n_clusters_ agg.n_leaves_ agg.n_connected_components_ agg.children_ # merge history (for dendrogram) ``` **Linkage methods:** | Method | Distance Between Clusters | Shape | |--------|--------------------------|-------| | `single` | Closest pair | Elongated, chains | | `complete` | Furthest pair | Compact, spherical | | `average` | Mean pairwise | Between single/complete | | `ward` | Minimizes variance increase | Compact, equal-size | ### 12.4 Cluster Evaluation Metrics ```python from sklearn.metrics import silhouette_score, silhouette_samples, calinski_harabasz_score, davies_bouldin_score # Silhouette: (b-a)/max(a,b); range [-1,1], higher=better sil = silhouette_score(X, labels, metric='euclidean') per_sample = silhouette_samples(X, labels) # Calinski-Harabasz (Variance Ratio): higher = better defined clusters ch = calinski_harabasz_score(X, labels) # Davies-Bouldin: lower = better (avg similarity of each cluster to most similar) db_score = davies_bouldin_score(X, labels) # If true labels known: from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score ari = adjusted_rand_score(y_true, y_pred) # -1 to 1, 1=perfect nmi = normalized_mutual_info_score(y_true, y_pred) # 0 to 1 ``` --- ## PART 13 Class Imbalance Handling ```python # 1. class_weight='balanced' (in model params) # sklearn computes: n_samples / (n_classes * np.bincount(y)) LogisticRegression(class_weight='balanced') SVC(class_weight={0: 1, 1: 5}) # custom weights # 2. Resampling (imbalanced-learn) from imblearn.over_sampling import SMOTE, RandomOverSampler from imblearn.under_sampling import RandomUnderSampler smote = SMOTE(k_neighbors=5, random_state=42) X_res, y_res = smote.fit_resample(X_train, y_train) # 3. Use appropriate metrics # Avoid accuracy for imbalanced → use F1, AUC-ROC, AUC-PR, MCC # 4. 
Threshold adjustment y_prob = model.predict_proba(X_test)[:, 1] threshold = 0.3 # lower threshold → more positives y_pred = (y_prob >= threshold).astype(int) ``` --- ## PART 14 Model Persistence & Utilities ```python # Save/load models import joblib joblib.dump(model, 'model.pkl') model = joblib.load('model.pkl') import pickle with open('model.pkl','wb') as f: pickle.dump(model, f) with open('model.pkl','rb') as f: model = pickle.load(f) # Clone a model (unfitted copy) from sklearn.base import clone model_copy = clone(fitted_model) # Get params model.get_params(deep=True) model.set_params(C=0.1, max_iter=500) # Check estimator type from sklearn.utils.estimator_checks import check_estimator from sklearn.base import is_classifier, is_regressor is_classifier(model) # True/False # Scoring strings (for cv, GridSearch) # 'accuracy','f1','f1_macro','f1_weighted','precision','recall', # 'roc_auc','neg_mean_squared_error','r2','neg_log_loss'... ``` --- ## PART 15 Multiclass Strategies ```python from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier # OvR: n_classes binary classifiers, highest score wins ovr = OneVsRestClassifier(SVC(), n_jobs=-1) # OvO: n*(n-1)/2 binary classifiers, pairwise voting ovo = OneVsOneClassifier(SVC()) # For multiclass output (multilabel): from sklearn.multioutput import MultiOutputClassifier, ClassifierChain mo = MultiOutputClassifier(LogisticRegression(), n_jobs=-1) cc = ClassifierChain(LogisticRegression(), order='random', random_state=42) ``` **When models natively support multiclass:** - Logistic Regression: yes (multinomial/ovr) - Decision Trees: yes - Random Forest: yes - KNN: yes - Naive Bayes: yes - SVC: OvO by default, OvR via decision_function_shape='ovr' - SGDClassifier: OvR by default --- ## PART 16 Key Exam Traps & Quick Facts | Question Type | Key Answer | |--------------|-----------| | What does `C` control in SVM/LR? | INVERSE regularization: C↑ = less reg = tighter fit | | What does `alpha` control in Ridge/Lasso/NB? | Regularization strength: alpha↑ = more shrinkage | | `fit_transform` on test? | NEVER causes data leakage | | `stratify=y` when? | Always in classification splits | | KNN needs scaling? | Yes distance-sensitive | | SVM needs scaling? | Yes very sensitive to scale | | Decision Tree needs scaling? | No | | Naive Bayes needs scaling? | No | | Random Forest needs scaling? | No | | `partial_fit` needs `classes=`? | First call only (but safe always) | | Lasso vs Ridge: which gives sparse? | Lasso (L1) → exact zeros | | Default `max_features` RF clf? | `'sqrt'` | | Default `max_features` RF reg? | `1.0` (all features) | | `voting='soft'` requires? | All models have `predict_proba` | | DBSCAN noise label | -1 | | OHE drop='first' with 3 categories? | 2 output columns | | `coef_` shape for binary LR? | (1, n_features) | | `coef_` shape for multiclass LR? | (n_classes, n_features) | | Pipeline param access syntax? | `step_name__param_name` | | `warm_start=True` effect? | Reuse previous fit; build incrementally | | `oob_score=True` in RF? | Use out-of-bag samples as validation | | AdaBoost default base estimator? | DecisionTreeClassifier(max_depth=1) | | Gradient Boosting fits trees on? | Residuals (negative gradients) | | Silhouette score range? | -1 to 1 (1 = perfect clusters) | | Davies-Bouldin: lower or higher? | Lower is better | | Calinski-Harabasz: lower or higher? 
| Higher is better | --- ## PART 17 Common Patterns & Code Templates ### Train-Eval-Tune Full Template ```python from sklearn.pipeline import Pipeline from sklearn.preprocessing import StandardScaler from sklearn.impute import SimpleImputer from sklearn.compose import ColumnTransformer from sklearn.preprocessing import OneHotEncoder from sklearn.model_selection import GridSearchCV, StratifiedKFold from sklearn.metrics import classification_report num_pipe = Pipeline([('imp', SimpleImputer(strategy='median')), ('sc', StandardScaler())]) cat_pipe = Pipeline([('imp', SimpleImputer(strategy='most_frequent')), ('ohe', OneHotEncoder(handle_unknown='ignore'))]) ct = ColumnTransformer([('num', num_pipe, num_cols), ('cat', cat_pipe, cat_cols)]) pipe = Pipeline([('prep', ct), ('model', RandomForestClassifier(random_state=42))]) cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42) gs = GridSearchCV(pipe, {'model__n_estimators':[50,100], 'model__max_depth':[5,10,None]}, cv=cv, scoring='f1_weighted', n_jobs=-1, refit=True) gs.fit(X_train, y_train) print(gs.best_params_, gs.best_score_) print(classification_report(y_test, gs.predict(X_test))) ``` ### OHE Manual Calculation ```python # For exam: given categories ['cat','dog','fish'] with drop=None: # cat → [1,0,0], dog → [0,1,0], fish → [0,0,1] # With drop='first': cat=dropped → dog → [1,0], fish → [0,1] # n_output_features = n_categories - (1 if drop='first' else 0) ``` ### Naive Bayes Hand Calculation ```python # Given training data: predict class for new sample # 1. Count classes: P(c) = count(c)/total # 2. Count feature|class: for each unique value of each feature # 3. Laplace: P(x|c) = (count(x,c)+1) / (count(c) + n_unique_vals) # 4. Posterior ∝ P(c) * ∏P(xᵢ|c) [work in log space to avoid underflow] # 5. log_posterior = log_prior + Σlog_likelihoods # 6. Predict: argmax(log_posterior) ``` ### SGD Partial Fit Log Loss Pattern (Exam Typical) ```python sgd = SGDClassifier(loss='log_loss', penalty='l2', eta0=0.001, alpha=0, learning_rate='constant', random_state=1729, warm_start=True, shuffle=False) classes = np.unique(y_train) for i in range(1, 6): sgd.partial_fit(X_train, y_train, classes=classes) y_prob = sgd.predict_proba(X_train) print(f"Iter {i}: Log Loss = {log_loss(y_train, y_prob):.4f}") ``` --- *Guide covers: sklearn v1.x · All models, parameters, math, and exam patterns*