# Sklearn Guide > **Coverage:** Intuition · Math · Parameters · Methods · Exam Traps > **Format:** Dense reference every model, every parameter --- ## PART 0 NumPy, Pandas & Data Viz Quick Reference ### NumPy Essentials | Operation | Code | Notes | |-----------|------|-------| | Array creation | `np.array([1,2,3])`, `np.zeros((m,n))`, `np.ones`, `np.eye(n)` | eye = identity matrix | | Ranges | `np.arange(start,stop,step)`, `np.linspace(start,stop,n)` | linspace includes endpoint | | Shape ops | `a.reshape(m,n)`, `a.T`, `a.flatten()`, `np.expand_dims(a,axis)` | reshape(-1) = flatten | | Math | `np.dot(A,B)`, `np.matmul`, `A @ B`, `np.linalg.inv`, `np.linalg.norm` | @ is matmul operator | | Stats | `np.mean(a,axis=)`, `np.std`, `np.var`, `np.median`, `np.percentile` | axis=0→col, axis=1→row | | Boolean | `np.where(cond,x,y)`, `np.any`, `np.all`, `np.argmax`, `np.argmin` | argmax returns index | | Random | `np.random.seed(n)`, `np.random.rand(m,n)`, `np.random.randn`, `np.random.randint` | rand=uniform, randn=normal | | Stacking | `np.hstack([a,b])`, `np.vstack`, `np.concatenate(axis=)` | hstack=column-wise | | Unique | `np.unique(y)`, `np.unique(y, return_counts=True)` | for class labels | ### Pandas Quick Reference | Operation | Code | |-----------|------| | Read | `pd.read_csv(path, sep=, header=, index_col=, usecols=, dtype=, na_values=)` | | Info | `df.shape`, `df.dtypes`, `df.info()`, `df.describe()`, `df.head(n)`, `df.tail(n)` | | Select | `df['col']`, `df[['c1','c2']]`, `df.loc[rows,cols]`, `df.iloc[i,j]` | | Filter | `df[df['col']>5]`, `df.query('col > 5')`, `df[df['col'].isin([1,2])]` | | Missing | `df.isnull().sum()`, `df.dropna()`, `df.fillna(val)`, `df.interpolate()` | | Apply | `df['col'].apply(func)`, `df.apply(func, axis=1)`, `df.map(dict)` | | GroupBy | `df.groupby('col').agg({'c2':'mean','c3':'sum'})` | | Merge | `pd.merge(df1,df2,on='key',how='left/right/inner/outer')` | | Concat | `pd.concat([df1,df2], axis=0/1, ignore_index=True)` | | Pivot | `df.pivot_table(values=, index=, columns=, aggfunc=)` | | Sort | `df.sort_values('col', ascending=False)`, `df.sort_index()` | | Rename | `df.rename(columns={'old':'new'})` | | Drop | `df.drop('col',axis=1)`, `df.drop(index=[0,1])` | | Dummies | `pd.get_dummies(df, columns=['cat_col'], drop_first=True)` | --- ## PART 1 Sklearn Preprocessing & Transformers ### 1.1 Data Splitting ```python from sklearn.model_selection import train_test_split X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.2, # float=fraction, int=count random_state=42, # reproducibility seed stratify=y, # preserve class proportions (use for classification!) shuffle=True # default True ) ``` **Exam trap:** Always use `stratify=y` for imbalanced classification problems. ### 1.2 Feature Scaling | Scaler | Formula | Use When | Sensitive to Outliers | |--------|---------|----------|----------------------| | `StandardScaler` | (x−μ)/σ | Gaussian-like, SVM, LR, NN | Yes | | `MinMaxScaler` | (x−min)/(max−min) | Bounded output [0,1], NN | Yes | | `RobustScaler` | (x−median)/IQR | Outliers present | No | | `MaxAbsScaler` | x/max(\|x\|) | Sparse data, [-1,1] | Yes | | `Normalizer` | x/‖x‖ | Per-sample normalization (rows) | | ```python from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler scaler = StandardScaler() X_train_sc = scaler.fit_transform(X_train) # fit+transform on train X_test_sc = scaler.transform(X_test) # ONLY transform on test (no fit!) 
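# Illustrative sanity check (assumes numpy imported as np, as in Part 0):
# fit_transform is just fit() followed by transform(), and inverse_transform undoes the scaling
scaler2 = StandardScaler().fit(X_train)
assert np.allclose(scaler2.transform(X_train), X_train_sc)
X_test_orig = scaler.inverse_transform(X_test_sc)   # back to original units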
# Key attributes:
scaler.mean_              # learned mean per feature
scaler.scale_             # learned std per feature
scaler.var_               # learned variance
scaler.n_features_in_
```

**Exam trap:** NEVER call `fit_transform` on the test set; it causes data leakage!

### 1.3 Encoders

#### LabelEncoder
```python
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y_enc = le.fit_transform(y)        # encodes to 0,1,2...
le.classes_                        # original class names
le.inverse_transform([0,1,2])      # decode back
```

#### OrdinalEncoder
```python
from sklearn.preprocessing import OrdinalEncoder
oe = OrdinalEncoder(
    categories='auto',             # or list of lists per feature
    handle_unknown='use_encoded_value',
    unknown_value=-1
)
X_enc = oe.fit_transform(X[['col']])
oe.categories_                     # list of arrays, one per feature
```

#### OneHotEncoder
```python
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder(
    categories='auto',             # or list of lists
    drop=None,                     # None, 'first', 'if_binary'
    sparse_output=True,            # returns sparse matrix by default
    handle_unknown='error',        # or 'ignore'
    dtype=np.float64
)
X_enc = ohe.fit_transform(X[['col']])
# Key attrs:
ohe.categories_                    # list of arrays → categories per feature
ohe.get_feature_names_out()        # column names like ['col_A','col_B']
ohe.drop_idx_                      # which index was dropped (if drop='first')
```

**Exam calculation:** For OHE with drop=None on a column with 3 values [A,B,C]:
- Input A → [1,0,0]; B → [0,1,0]; C → [0,0,1]
- `categories_` = [array(['A','B','C'])]
- n_output_cols = 3 (or 2 if drop='first')

#### TargetEncoder
```python
from sklearn.preprocessing import TargetEncoder
te = TargetEncoder(target_type='auto', smooth='auto', cv=5)
```

### 1.4 Handling Missing Values
```python
from sklearn.impute import SimpleImputer, KNNImputer

# SimpleImputer
imp = SimpleImputer(
    missing_values=np.nan,
    strategy='mean',               # 'mean','median','most_frequent','constant'
    fill_value=None                # used when strategy='constant'
)

# KNNImputer
kimp = KNNImputer(n_neighbors=5, weights='uniform')  # weights: 'uniform','distance'
```

### 1.5 Feature Selection

#### Filter Methods (no model)
```python
from sklearn.feature_selection import SelectKBest, f_classif, f_regression, chi2, mutual_info_classif
sel = SelectKBest(score_func=f_classif, k=10)   # k='all' to see all scores
sel.fit(X_train, y_train)
sel.scores_               # score per feature
sel.pvalues_              # p-value per feature
sel.get_support()         # boolean mask of selected features
X_new = sel.transform(X_train)
```

| Score Func | Use For |
|------------|---------|
| `f_classif` | Classification, continuous features, ANOVA F-test |
| `f_regression` | Regression, linear correlation F-test |
| `chi2` | Classification, non-negative features (counts) |
| `mutual_info_classif` | Classification, any type, non-linear |
| `mutual_info_regression` | Regression, any type, non-linear |

#### Wrapper Methods (use model)
```python
from sklearn.feature_selection import RFE, RFECV, SequentialFeatureSelector

# RFE: Recursive Feature Elimination
rfe = RFE(
    estimator=LogisticRegression(),
    n_features_to_select=5,
    step=1                         # features removed per iteration
)
rfe.fit(X_train, y_train)
rfe.ranking_                       # 1 = selected
rfe.support_                       # boolean mask

# RFECV: RFE with cross-validation to find best n
rfecv = RFECV(estimator=LogisticRegression(), cv=5, scoring='accuracy')
rfecv.n_features_                  # optimal number found
```

#### Embedded Methods
```python
from sklearn.feature_selection import SelectFromModel
sel = SelectFromModel(
    estimator=RandomForestClassifier(),
    threshold='mean',              # 'mean', 'median', a float, or a scaled string like '1.25*mean'
    max_features=None
)
sel.fit(X_train, y_train)          # fit before accessing estimator_
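# Minimal usage sketch (illustrative; assumes X_train is a pandas DataFrame):
X_sel = sel.transform(X_train)                  # keep only features above the threshold
kept_cols = X_train.columns[sel.get_support()]  # names of the selected features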
sel.estimator_.feature_importances_ ``` ### 1.6 Dimensionality Reduction: PCA **Math:** Find orthogonal directions of maximum variance. SVD: X = UΣVᵀ → principal components = columns of V (eigenvectors of XᵀX) ```python from sklearn.decomposition import PCA pca = PCA( n_components=2, # int=components, float=variance ratio (e.g. 0.95), 'mle' whiten=False, # divide by sqrt(eigenvalue) → unit variance random_state=42, svd_solver='auto' # 'auto','full','arpack','randomized' ) pca.fit(X_scaled) # Key attributes: pca.explained_variance_ratio_ # variance fraction per component pca.explained_variance_ # absolute variance per component pca.components_ # eigenvectors shape (n_comp, n_features) pca.singular_values_ # singular values pca.n_components_ # actual n components used np.cumsum(pca.explained_variance_ratio_) # cumulative variance ``` **Exam trap:** Always scale before PCA! PCA is sensitive to feature magnitudes. --- ## PART 2 Pipelines & Composite Transformers ### 2.1 Pipeline ```python from sklearn.pipeline import Pipeline, make_pipeline pipe = Pipeline(steps=[ ('imputer', SimpleImputer(strategy='mean')), ('scaler', StandardScaler()), ('model', LogisticRegression()) ], verbose=False, memory=None) # Access steps: pipe['scaler'] # by name pipe.named_steps['scaler'] # same pipe.steps[1][1] # by index # Methods (delegates to final estimator): pipe.fit(X_train, y_train) pipe.predict(X_test) pipe.score(X_test, y_test) pipe.fit_transform(X_train) # only if last step is transformer # Pipeline with param grid: # Use __ to access nested params: param_grid = {'model__C': [0.1,1,10], 'scaler__with_mean': [True,False]} ``` ### 2.2 ColumnTransformer ```python from sklearn.compose import ColumnTransformer, make_column_transformer ct = ColumnTransformer(transformers=[ ('num', StandardScaler(), ['age','income']), # numeric cols ('cat', OneHotEncoder(), ['gender','city']), # categorical cols ('pass', 'passthrough', ['id_col']), # unchanged ('drop', 'drop', ['useless_col']), # dropped ], remainder='drop', # or 'passthrough' for unspecified columns verbose_feature_names_out=True, n_jobs=None) ct.fit_transform(X_train) ct.get_feature_names_out() ``` ### 2.3 FeatureUnion (parallel transforms) ```python from sklearn.pipeline import FeatureUnion fu = FeatureUnion([ ('pca', PCA(n_components=3)), ('kbest', SelectKBest(k=5)) ]) ``` ### 2.4 Cross-Validation ```python from sklearn.model_selection import cross_val_score, cross_validate, KFold, StratifiedKFold # Basic CV scores = cross_val_score(model, X, y, cv=5, scoring='accuracy', n_jobs=-1) # Multiple metrics results = cross_validate(model, X, y, cv=5, scoring=['accuracy','f1_macro'], return_train_score=True) # Custom CV splitters: kf = KFold(n_splits=5, shuffle=True, random_state=42) skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42) # preserves class ratio ``` ### 2.5 Hyperparameter Tuning ```python from sklearn.model_selection import GridSearchCV, RandomizedSearchCV # GridSearch: exhaustive gs = GridSearchCV( estimator=pipe, param_grid={'model__C': [0.01,0.1,1,10], 'model__penalty': ['l1','l2']}, cv=5, scoring='f1', refit=True, # refit best model on full training data n_jobs=-1, return_train_score=True, verbose=2 ) gs.fit(X_train, y_train) gs.best_params_ gs.best_score_ gs.best_estimator_ gs.cv_results_ # dict with all results # RandomizedSearch: samples from distributions from scipy.stats import randint, uniform rs = RandomizedSearchCV( estimator=model, param_distributions={'C': uniform(0.01,10), 'max_iter': randint(100,500)}, n_iter=50, # number of 
random samples cv=5, scoring='f1', random_state=42, refit=True ) ``` --- ## PART 3 Regression Models ### 3.1 Linear Regression **Intuition:** Find hyperplane minimizing sum of squared residuals. **Math:** - Model: ŷ = θ₀ + θ₁x₁ + ... + θₙxₙ = Xθ - Cost: MSE = (1/m)Σ(ŷᵢ−yᵢ)² - Closed form (Normal Equation): θ = (XᵀX)⁻¹Xᵀy - OLS assumptions: linearity, homoscedasticity, no multicollinearity, normality of residuals ```python from sklearn.linear_model import LinearRegression lr = LinearRegression( fit_intercept=True, # add bias term copy_X=True, n_jobs=None, positive=False # constrain coefs to be positive ) lr.fit(X_train, y_train) lr.coef_ # shape (n_features,) or (n_targets, n_features) lr.intercept_ # bias term lr.rank_ # rank of X lr.singular_ # singular values of X ``` **Evaluation Metrics:** | Metric | Formula | sklearn | |--------|---------|---------| | MSE | (1/m)Σ(ŷ−y)² | `mean_squared_error(y,ŷ)` | | RMSE | √MSE | `mean_squared_error(y,ŷ, squared=False)` | | MAE | (1/m)Σ\|ŷ−y\| | `mean_absolute_error(y,ŷ)` | | R² | 1−SS_res/SS_tot | `r2_score(y,ŷ)` or `lr.score(X,y)` | | MAPE | (1/m)Σ\|ŷ−y\|/\|y\| | `mean_absolute_percentage_error` | ### 3.2 Gradient Descent (Math) **Batch GD:** θⱼ ← θⱼ − η·(∂J/∂θⱼ) using ALL data per step **SGD:** θⱼ ← θⱼ − η·(∂J/∂θⱼ)ᵢ using ONE sample per step → noisy but fast **Mini-batch GD:** batch of k samples compromise | Type | Convergence | Memory | Noise | |------|-------------|--------|-------| | Batch | Smooth, slow | High | None | | SGD | Noisy, fast | Low | High | | Mini-batch | Moderate | Medium | Medium | **Learning rate schedules:** | Schedule | Formula | Effect | |----------|---------|--------| | Constant | η = η₀ | Fixed step | | Time decay | η = η₀/(1+t·d) | Decreasing | | Optimal | η = 1/(α(t₀+t)) | sklearn default | | Inverse scaling | η = η₀/t^power | Power decay | | Adaptive | per-param | Adam, RMSprop | ### 3.3 SGDRegressor ```python from sklearn.linear_model import SGDRegressor sgd = SGDRegressor( loss='squared_error', # 'squared_error','huber','epsilon_insensitive' penalty='l2', # 'l1','l2','elasticnet',None alpha=0.0001, # regularization strength l1_ratio=0.15, # for elasticnet: mix of l1/l2 fit_intercept=True, max_iter=1000, # passes over data (epochs) tol=1e-3, # stopping tolerance shuffle=True, # shuffle each epoch verbose=0, epsilon=0.1, # for Huber/epsilon_insensitive random_state=None, learning_rate='invscaling', # 'constant','optimal','invscaling','adaptive' eta0=0.01, # initial learning rate power_t=0.25, # for invscaling: η=η₀/t^power_t early_stopping=False, validation_fraction=0.1, n_iter_no_change=5, warm_start=False, # reuse previous fit as init average=False # average SGD weights ) sgd.coef_; sgd.intercept_; sgd.n_iter_; sgd.t_ ``` ### 3.4 Polynomial Regression **Concept:** Transform features → add polynomial terms → apply linear regression on transformed features. ```python from sklearn.preprocessing import PolynomialFeatures poly = PolynomialFeatures( degree=2, # polynomial degree interaction_only=False, # only cross terms if True (no x², x³) include_bias=True # add column of 1s ) X_poly = poly.fit_transform(X) poly.n_output_features_ # total output features poly.get_feature_names_out() # e.g. ['1','x0','x1','x0^2','x0 x1','x1^2'] # For degree=2, 2 features: (2+2)! / (2! 2!) = 6 features (with bias) ``` **Feature count formula:** With n features, degree d: C(n+d, d) features (with bias) ### 3.5 Regularized Models **Intuition:** Add penalty to loss to shrink coefficients → prevents overfitting. 
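A quick sketch of that shrinkage effect (illustrative only; reuses `X_train`/`y_train` from earlier and `Ridge`, covered just below):

```python
import numpy as np
from sklearn.linear_model import Ridge

# As alpha grows, the L2 penalty dominates and the coefficient norm shrinks toward 0
for alpha in [0.01, 1, 100]:
    r = Ridge(alpha=alpha).fit(X_train, y_train)
    print(f"alpha={alpha}: ||coef|| = {np.linalg.norm(r.coef_):.4f}")
```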
#### Ridge (L2 Regularization)
**Loss:** MSE + α·Σθⱼ²
- Shrinks all coefficients toward 0, never exactly 0
- Handles multicollinearity well
- Closed form: θ = (XᵀX + αI)⁻¹Xᵀy

```python
from sklearn.linear_model import Ridge, RidgeCV
ridge = Ridge(
    alpha=1.0,            # regularization strength (λ); larger = more shrinkage
    fit_intercept=True,
    solver='auto',        # 'auto','svd','cholesky','lsqr','sparse_cg','sag','saga','lbfgs'
    max_iter=None,
    tol=1e-4,
    random_state=None
)

# RidgeCV: built-in cross-validation for alpha
rcv = RidgeCV(alphas=[0.1,1.0,10.0], cv=5, scoring='neg_mean_squared_error')
rcv.alpha_                # best alpha found
rcv.coef_; rcv.intercept_
```

#### Lasso (L1 Regularization)
**Loss:** MSE + α·Σ|θⱼ|
- Can shrink coefficients to EXACTLY 0 → feature selection
- Coordinate descent solver
- Not differentiable at 0 (subgradient)

```python
from sklearn.linear_model import Lasso, LassoCV
lasso = Lasso(
    alpha=1.0,
    fit_intercept=True,
    max_iter=1000,
    tol=1e-4,
    warm_start=False,
    positive=False,
    selection='cyclic'    # 'cyclic' or 'random' ('random' can converge faster)
)
lasso.n_iter_             # iterations run
lasso.sparse_coef_        # sparse representation of coef_
```

#### ElasticNet (L1 + L2)
**Loss:** MSE + α·l1_ratio·Σ|θⱼ| + α·(1−l1_ratio)/2·Σθⱼ²

```python
from sklearn.linear_model import ElasticNet, ElasticNetCV
en = ElasticNet(
    alpha=1.0,
    l1_ratio=0.5,         # 0=Ridge, 1=Lasso, (0,1)=ElasticNet
    fit_intercept=True,
    max_iter=1000,
    tol=1e-4,
    warm_start=False,
    selection='cyclic'
)
```

| Model | Penalty | Coefs→0? | Use When |
|-------|---------|----------|----------|
| Ridge | α·Σθ² | Never | Multicollinearity, all features useful |
| Lasso | α·Σ\|θ\| | Yes (sparse) | Feature selection needed |
| ElasticNet | Both | Yes | Many correlated features |

---

## PART 4 Classification Models

### 4.1 Logistic Regression

**Intuition:** Linear model + sigmoid → output probability.

**Math:**
- logit(p) = log(p/(1-p)) = θᵀx (log-odds / linear output)
- σ(z) = 1/(1+e⁻ᶻ) → sigmoid
- P(y=1|x) = σ(θᵀx)
- Loss: Binary Cross-Entropy = −(1/m)Σ[yᵢlog(ŷᵢ) + (1−yᵢ)log(1−ŷᵢ)]
- Gradient: ∇θJ = (1/m)·Xᵀ(σ(Xθ)−y)
- Predict: ŷ=1 if P≥0.5 else 0 (threshold adjustable)

```python
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(
    penalty='l2',          # 'l1','l2','elasticnet',None
    dual=False,            # dual formulation (for l2, n>p use False)
    tol=1e-4,
    C=1.0,                 # INVERSE of regularization: smaller C = stronger reg
    fit_intercept=True,
    intercept_scaling=1,
    class_weight=None,     # None, 'balanced', or dict {0:w0,1:w1}
    random_state=None,
    solver='lbfgs',        # see table below
    max_iter=100,
    multi_class='auto',    # 'auto','ovr','multinomial'
    verbose=0,
    warm_start=False,
    n_jobs=None,
    l1_ratio=None          # for elasticnet
)
lr.coef_                   # shape (1,n_feat) binary or (n_class,n_feat) multiclass
lr.intercept_              # shape (1,) or (n_class,)
lr.classes_                # class labels
lr.n_iter_                 # iterations per class
lr.predict_proba(X)        # shape (n_samples, n_classes)
lr.predict_log_proba(X)    # log probabilities
lr.decision_function(X)    # raw logit scores
```

**Solver compatibility:**

| Solver | Penalty | Multi-class | Notes |
|--------|---------|-------------|-------|
| `lbfgs` | l2, None | ovr, multinomial | Default, good general solver |
| `liblinear` | l1, l2 | ovr only | Good for small datasets |
| `saga` | l1, l2, elasticnet, None | ovr, multinomial | Large datasets |
| `sag` | l2, None | ovr, multinomial | Large datasets |
| `newton-cg` | l2, None | ovr, multinomial | |
| `newton-cholesky` | l2, None | ovr only | |

**Exam trap:** `C` is the INVERSE of regularization strength: C=0.01 regularizes more strongly than C=100!
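A minimal sketch connecting the math above to the estimator's outputs (assumes a fitted binary `lr` and an `X_test` like the earlier splits):

```python
import numpy as np

z = lr.decision_function(X_test)             # raw logits θᵀx + b
p_manual = 1 / (1 + np.exp(-z))              # sigmoid of the logits
p_sklearn = lr.predict_proba(X_test)[:, 1]   # positive-class probability
assert np.allclose(p_manual, p_sklearn)

# default predict() uses a 0.5 threshold; adjust manually if needed
y_pred_03 = (p_sklearn >= 0.3).astype(int)   # lower threshold → more positives
```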
### 4.2 Perceptron **Math:** Output = sign(wᵀx + b); weight update only on misclassified: w ← w + η·yᵢ·xᵢ ```python from sklearn.linear_model import Perceptron per = Perceptron( penalty=None, # 'l1','l2','elasticnet' alpha=0.0001, # regularization (if penalty set) fit_intercept=True, max_iter=1000, tol=1e-3, shuffle=True, verbose=0, eta0=1, # learning rate (constant for Perceptron) n_jobs=None, random_state=0, early_stopping=False, validation_fraction=0.1, n_iter_no_change=5, class_weight=None, warm_start=False ) per.coef_; per.intercept_; per.n_iter_ ``` **Note:** Perceptron = SGDClassifier(loss='perceptron', eta0=1, learning_rate='constant', penalty=None) ### 4.3 SGDClassifier ```python from sklearn.linear_model import SGDClassifier sgd = SGDClassifier( loss='hinge', # see table below penalty='l2', alpha=0.0001, l1_ratio=0.15, fit_intercept=True, max_iter=1000, tol=1e-3, shuffle=True, verbose=0, epsilon=0.1, n_jobs=None, random_state=None, learning_rate='optimal',# 'constant','optimal','invscaling','adaptive' eta0=0.0, # initial LR (required if learning_rate='constant') power_t=0.5, early_stopping=False, validation_fraction=0.1, n_iter_no_change=5, class_weight=None, warm_start=False, average=False # int or True → average weights after n_samples ) # Partial fit (online learning, multi-epoch): sgd.partial_fit(X_batch, y_batch, classes=np.unique(y_train)) # Must pass classes= on FIRST call only (but safe to always pass) ``` **Loss functions:** | Loss | Equivalent Model | Use | |------|-----------------|-----| | `hinge` | Linear SVM | Binary classification | | `modified_huber` | Smooth hinge + probabilistic | Binary, can use predict_proba | | `log_loss` | Logistic Regression | Binary, probabilistic | | `perceptron` | Perceptron | | | `squared_hinge` | Squared SVM | | | `huber` | Robust regression | Regression | | `squared_error` | Linear Regression | Regression | | `epsilon_insensitive` | SVR | Regression | **Computing Log Loss manually:** Loss = −(1/m)Σ[yᵢ·log(pᵢ) + (1−yᵢ)·log(1−pᵢ)] ```python from sklearn.metrics import log_loss log_loss(y_true, y_prob) # y_prob shape (n, n_classes) ``` --- ## PART 5 Classification Metrics ### 5.1 Confusion Matrix ``` Predicted 0 Predicted 1 Actual 0 TN FP Actual 1 FN TP ``` ```python from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay cm = confusion_matrix(y_true, y_pred) # labels= to specify order ConfusionMatrixDisplay(cm, display_labels=['neg','pos']).plot() # Or directly: ConfusionMatrixDisplay.from_predictions(y_true, y_pred) ConfusionMatrixDisplay.from_estimator(model, X_test, y_test) ``` ### 5.2 Classification Metrics Formulas | Metric | Formula | Focus | |--------|---------|-------| | Accuracy | (TP+TN)/(TP+TN+FP+FN) | Overall correct | | Precision | TP/(TP+FP) | Of positives predicted, how many correct | | Recall (Sensitivity) | TP/(TP+FN) | Of actual positives, how many found | | F1 | 2·P·R/(P+R) | Harmonic mean of P and R | | Specificity | TN/(TN+FP) | True negative rate | | FPR | FP/(FP+TN) | False alarm rate | | MCC | (TP·TN−FP·FN)/√(...) 
| Balanced, even for imbalanced | ```python from sklearn.metrics import ( accuracy_score, precision_score, recall_score, f1_score, classification_report, roc_auc_score, average_precision_score, matthews_corrcoef ) # average options for multiclass: # 'binary' (default), 'micro', 'macro', 'weighted', 'samples' # macro: mean of per-class, no weighting # weighted: weighted by support (class count) # micro: globally count TP/FP/FN across all classes precision_score(y, ŷ, average='weighted', zero_division=0) f1_score(y, ŷ, average='macro') classification_report(y, ŷ, target_names=['cls0','cls1']) ``` ### 5.3 ROC-AUC & PR Curves ```python from sklearn.metrics import roc_curve, auc, RocCurveDisplay, precision_recall_curve fpr, tpr, thresholds = roc_curve(y_true, y_score) # y_score = probabilities roc_auc = auc(fpr, tpr) RocCurveDisplay.from_estimator(model, X_test, y_test) RocCurveDisplay.from_predictions(y_true, y_score) # PR Curve (better for imbalanced) prec, rec, thresh = precision_recall_curve(y_true, y_score) ``` **AUC:** 0.5 = random, 1.0 = perfect. Use when classes balanced. **PR AUC:** Better for highly imbalanced datasets. --- ## PART 6 Naive Bayes ### 6.1 Math (Exam Calculation Focus) **Bayes Theorem:** P(y|x) = P(x|y)·P(y) / P(x) **Naive assumption:** Features are conditionally independent given class. P(y|x₁,...,xₙ) ∝ P(y)·∏P(xᵢ|y) **Steps for hand calculation:** 1. Compute prior P(y=c) = count(c)/total 2. For each feature: P(xᵢ|y=c) - Gaussian: use mean and variance per class - Bernoulli: P(xᵢ=1|y=c) with Laplace smoothing - Multinomial: P(xᵢ|y=c) = (count + α)/(total_count + α·n_features) 3. Multiply all: score(c) = P(c)·∏P(xᵢ|c) 4. Predict: argmax_c score(c) **Gaussian NB:** P(xᵢ|y=c) = (1/√(2πσ²))·exp(−(xᵢ−μ)²/2σ²) ### 6.2 Sklearn Implementation ```python from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB, ComplementNB # Gaussian NB (continuous features) gnb = GaussianNB( priors=None, # class priors; if None, estimated from data var_smoothing=1e-9 # variance smoothing to prevent 0-variance ) gnb.fit(X_train, y_train) gnb.class_prior_ # P(y) per class gnb.theta_ # mean per class per feature (n_classes, n_features) gnb.var_ # variance per class per feature gnb.classes_ # Multinomial NB (word counts, text) mnb = MultinomialNB( alpha=1.0, # Laplace smoothing (0=no smoothing) fit_prior=True, # learn class priors; False → uniform class_prior=None # override priors ) mnb.feature_log_prob_ # log P(feature|class) shape (n_classes, n_features) mnb.class_log_prior_ # log P(class) # Bernoulli NB (binary features) bnb = BernoulliNB( alpha=1.0, binarize=0.0, # threshold to binarize; None if already binary fit_prior=True ) # Complement NB (imbalanced text classification) cnb = ComplementNB(alpha=1.0, fit_prior=True, norm=False) ``` **Exam calculation example (Laplace smoothing):** If word "spam" appears 3 times in spam class (100 total words, 1000 unique vocab): P(spam|spam_class) = (3+1)/(100+1·1000) = 4/1100 ≈ 0.00364 --- ## PART 7 K-Nearest Neighbors ### 7.1 Intuition & Math **Algorithm:** For prediction, find k closest training points → vote (classify) or average (regress). **Distance metrics:** - Euclidean: d = √(Σ(xᵢ−yᵢ)²) → p=2 - Manhattan: d = Σ|xᵢ−yᵢ| → p=1 - Minkowski: d = (Σ|xᵢ−yᵢ|ᵖ)^(1/p) → general - Chebyshev: max|xᵢ−yᵢ| → p=∞ **Bias-Variance:** Small k → low bias, high variance (overfit). Large k → high bias, low variance (underfit). 
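A quick NumPy check of the distance formulas above on two illustrative points (not tied to any dataset in this guide):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
z = np.array([4.0, 0.0, 3.0])

euclidean = np.sqrt(np.sum((x - z) ** 2))    # p=2 → √(9+4+0) ≈ 3.606
manhattan = np.sum(np.abs(x - z))            # p=1 → 3+2+0 = 5
chebyshev = np.max(np.abs(x - z))            # p=∞ → 3
```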
### 7.2 Implementation ```python from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor knn = KNeighborsClassifier( n_neighbors=5, # k weights='uniform', # 'uniform'=equal, 'distance'=1/d weighting algorithm='auto', # 'auto','ball_tree','kd_tree','brute' leaf_size=30, # for BallTree/KDTree (memory/speed tradeoff) p=2, # power for Minkowski: 1=Manhattan, 2=Euclidean metric='minkowski', # distance metric (overrides p if not minkowski) metric_params=None, n_jobs=None ) knn.fit(X_train, y_train) knn.predict(X_test) knn.predict_proba(X_test) # fraction of k neighbors per class knn.kneighbors(X, n_neighbors=5) # returns (distances, indices) knn.kneighbors_graph(X) # adjacency graph # Regressor: same params + weights knnr = KNeighborsRegressor(n_neighbors=5, weights='distance') ``` **Exam trap:** KNN requires feature scaling different scales dominate distance! --- ## PART 8 Support Vector Machines ### 8.1 Intuition & Math **Hard SVM:** Find hyperplane that maximizes margin = 2/‖w‖ with no misclassifications. - Decision boundary: wᵀx + b = 0 - Constraints: yᵢ(wᵀxᵢ + b) ≥ 1 - Minimize: ½‖w‖² (maximize margin) **Soft SVM (C-SVM):** Allow misclassifications via slack variables ξᵢ - Minimize: ½‖w‖² + C·Σξᵢ - C large → smaller margin, fewer violations (may overfit) - C small → larger margin, more violations (may underfit) **Support Vectors:** Training points on or inside the margin → fully determine the hyperplane. **Kernel Trick:** Map to high-dim space implicitly using kernel function K(x,z)=φ(x)·φ(z) | Kernel | Formula | Use | |--------|---------|-----| | Linear | xᵀz | Linearly separable | | RBF/Gaussian | exp(−γ‖x−z‖²) | Non-linear, general | | Polynomial | (γxᵀz+r)^d | Polynomial boundaries | | Sigmoid | tanh(γxᵀz+r) | Like NN | **Dual formulation:** Σαᵢ − ½ΣΣαᵢαⱼyᵢyⱼK(xᵢ,xⱼ) → max over α ### 8.2 SVC / SVR Implementation ```python from sklearn.svm import SVC, SVR, LinearSVC, LinearSVR, NuSVC svc = SVC( C=1.0, # regularization; smaller=wider margin kernel='rbf', # 'linear','poly','rbf','sigmoid','precomputed' degree=3, # for poly kernel gamma='scale', # RBF/poly/sigmoid kernel coef # 'scale'=1/(n_feat·var(X)), 'auto'=1/n_feat, or float coef0=0.0, # independent term in poly/sigmoid shrinking=True, # heuristic to speed up probability=False, # enables predict_proba (uses Platt scaling, slower) tol=1e-3, cache_size=200, # kernel cache in MB class_weight=None, # 'balanced' or dict verbose=False, max_iter=-1, # -1=unlimited decision_function_shape='ovr', # 'ovr' or 'ovo' for multiclass break_ties=False, random_state=None ) svc.fit(X_train, y_train) svc.support_ # indices of support vectors svc.support_vectors_ # support vector coordinates svc.n_support_ # count per class svc.dual_coef_ # αᵢyᵢ per support vector svc.coef_ # weights (linear kernel only) svc.intercept_ # bias svc.decision_function(X) # raw margin scores svc.predict_proba(X) # only if probability=True # SVR for regression svr = SVR( kernel='rbf', C=1.0, epsilon=0.1, # epsilon-tube: no penalty inside gamma='scale', degree=3, coef0=0.0, tol=1e-3, cache_size=200, verbose=False, max_iter=-1, shrinking=True ) # LinearSVC (faster for large datasets, only linear kernel) lsvc = LinearSVC( penalty='l2', # 'l1' or 'l2' loss='squared_hinge', # 'hinge' or 'squared_hinge' dual='auto', # prefer dual=True when n_samples < n_features tol=1e-4, C=1.0, multi_class='ovr', # 'ovr' or 'crammer_singer' fit_intercept=True, intercept_scaling=1, class_weight=None, verbose=0, random_state=None, max_iter=1000 ) # LinearSVC has no predict_proba! 
Use CalibratedClassifierCV to get probabilities from sklearn.calibration import CalibratedClassifierCV cal = CalibratedClassifierCV(LinearSVC(), method='sigmoid', cv=5) ``` **gamma effect:** Large γ → tight RBF → may overfit. Small γ → wide RBF → may underfit. --- ## PART 9 Decision Trees ### 9.1 Intuition & Math **Algorithm:** Recursively split on feature that best separates classes/reduces error. **Impurity measures:** | Measure | Formula | Used In | |---------|---------|---------| | Gini | 1−Σpᵢ² | Classification (default) | | Entropy | −Σpᵢ·log₂(pᵢ) | Classification | | Log Loss | same as entropy | Classification | | MSE | Σ(y−ȳ)²/n | Regression (default) | | MAE | Σ\|y−ȳ\|/n | Regression | | Poisson | Σ(y·log(y/ȳ)−y+ȳ) | Count data | **Information Gain:** IG(parent, split) = impurity(parent) − weighted_avg(impurity(children)) **Gini hand calculation example:** Node with 10 samples: 6 class A, 4 class B Gini = 1 − (6/10)² − (4/10)² = 1 − 0.36 − 0.16 = 0.48 **Entropy example:** = −(6/10)·log₂(6/10) − (4/10)·log₂(4/10) = 0.971 bits ### 9.2 Implementation ```python from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor, plot_tree, export_text dt = DecisionTreeClassifier( criterion='gini', # 'gini','entropy','log_loss' splitter='best', # 'best','random' max_depth=None, # None=unlimited (overfit); set to prune min_samples_split=2, # min samples to split a node min_samples_leaf=1, # min samples in a leaf min_weight_fraction_leaf=0.0, max_features=None, # 'sqrt','log2',int,float → features per split random_state=None, max_leaf_nodes=None, # limit total leaves min_impurity_decrease=0.0, # split only if improvement >= this class_weight=None, ccp_alpha=0.0 # cost-complexity pruning parameter ) dt.fit(X_train, y_train) dt.tree_ # internal Tree object dt.feature_importances_ # Gini importance (sum=1) dt.max_features_ # actual n features used dt.n_classes_ dt.n_features_in_ dt.get_depth() # max depth dt.get_n_leaves() # total leaves # Visualization plot_tree(dt, feature_names=X.columns, class_names=['neg','pos'], filled=True, # color by class rounded=True, max_depth=3, # display depth limit fontsize=10, impurity=True, # show impurity at each node proportion=False # show counts not proportions ) print(export_text(dt, feature_names=list(X.columns))) # Cost-complexity pruning path = dt.cost_complexity_pruning_path(X_train, y_train) ccp_alphas = path.ccp_alphas ``` **Exam trap:** `feature_importances_` = total impurity decrease weighted by sample fraction, not split count. --- ## PART 10 Ensemble Methods ### 10.1 Bagging (Bootstrap Aggregating) **Concept:** Train B models on random bootstrap samples → aggregate (vote/average). Reduces variance. 
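Why out-of-bag (OOB) evaluation works: a bootstrap sample drawn with replacement leaves out roughly 1/e ≈ 37% of the training rows, and those held-out rows act as a free validation set. A quick check of that fraction (plain arithmetic, not sklearn API):

```python
import numpy as np

m = 10_000                       # training set size
p_never_drawn = (1 - 1/m) ** m   # P(a given row is never drawn in m draws)
print(p_never_drawn, np.exp(-1)) # both ≈ 0.3679
```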
```python
from sklearn.ensemble import BaggingClassifier, BaggingRegressor
bag = BaggingClassifier(
    estimator=DecisionTreeClassifier(),  # base estimator
    n_estimators=10,            # number of models
    max_samples=1.0,            # fraction/int of samples per model
    max_features=1.0,           # fraction/int of features per model
    bootstrap=True,             # sample with replacement
    bootstrap_features=False,   # sample features with replacement
    oob_score=False,            # use out-of-bag samples for evaluation
    warm_start=False,
    n_jobs=None,
    random_state=None,
    verbose=0
)
bag.fit(X_train, y_train)
bag.oob_score_                  # OOB accuracy (if oob_score=True)
bag.oob_decision_function_      # OOB probabilities
bag.estimators_                 # list of fitted base estimators
bag.estimators_samples_         # bootstrap sample indices
bag.estimators_features_        # selected feature indices
```

### 10.2 Random Forest

**Concept:** Bagging of decision trees + random feature subsets at each split → decorrelated trees.

```python
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
rf = RandomForestClassifier(
    n_estimators=100,           # number of trees
    criterion='gini',           # 'gini','entropy','log_loss'
    max_depth=None,
    min_samples_split=2,
    min_samples_leaf=1,
    max_features='sqrt',        # features per split: 'sqrt'(default clf),'log2',int,float,None
                                # Regressor default: 1.0 (all features)
    max_leaf_nodes=None,
    min_impurity_decrease=0.0,
    bootstrap=True,
    oob_score=False,
    n_jobs=None,
    random_state=None,
    verbose=0,
    warm_start=False,
    class_weight=None,
    ccp_alpha=0.0,
    max_samples=None            # bootstrap sample size (if bootstrap=True)
)
rf.feature_importances_         # mean impurity decrease across trees
rf.estimators_                  # list of individual decision trees
rf.oob_score_
rf.oob_decision_function_
```

**Random Forest vs Bagging:** RF forces a random feature subset AT EACH SPLIT; Bagging selects features once per tree.

### 10.3 ExtraTreesClassifier (Extremely Randomized Trees)

```python
from sklearn.ensemble import ExtraTreesClassifier
# Same params as RF, BUT:
# - uses random thresholds for splits (not best threshold)
# - splitter='random' at each node
# - faster, potentially lower variance but higher bias
et = ExtraTreesClassifier(n_estimators=100, max_features='sqrt', bootstrap=False)
```

### 10.4 Voting Classifier/Regressor

```python
from sklearn.ensemble import VotingClassifier, VotingRegressor
vc = VotingClassifier(
    estimators=[
        ('lr', LogisticRegression()),
        ('svc', SVC(probability=True)),
        ('rf', RandomForestClassifier())
    ],
    voting='soft',              # 'hard'=majority vote, 'soft'=argmax(avg proba)
    weights=None,               # weight per estimator
    n_jobs=None,
    flatten_transform=True,
    verbose=False
)
vc.estimators_                  # list of fitted estimators
vc.named_estimators_            # dict access by name
# Note: soft voting needs predict_proba → SVC needs probability=True
```

**Exam trap:** `voting='soft'` averages predicted probabilities, so every estimator must implement `predict_proba`.

### 10.5 AdaBoost

**Concept:** Sequential: each model focuses on the previous model's errors by increasing the weight of misclassified samples.

**Math:**
1. Initialize weights wᵢ = 1/m
2. Fit weak learner hₜ(x)
3. Error rate: εₜ = Σwᵢ·𝟙[hₜ(xᵢ)≠yᵢ] / Σwᵢ
4. Estimator weight: αₜ = η·log((1−εₜ)/εₜ)
5. Update: wᵢ ← wᵢ·exp(αₜ·𝟙[hₜ(xᵢ)≠yᵢ]), then renormalize so Σwᵢ = 1
6.
Final: H(x) = sign(Σαₜhₜ(x)) ```python from sklearn.ensemble import AdaBoostClassifier, AdaBoostRegressor ada = AdaBoostClassifier( estimator=DecisionTreeClassifier(max_depth=1), # 'stump' default n_estimators=50, learning_rate=1.0, # shrinks each tree's contribution algorithm='SAMME', # 'SAMME' (multi-class discrete) random_state=None ) ada.fit(X_train, y_train) ada.estimators_ # list of fitted trees ada.estimator_weights_ # αₜ per tree ada.estimator_errors_ # εₜ per tree ada.feature_importances_ # weighted sum of DT importances # Staged predictions (iterative): for i, y_pred in enumerate(ada.staged_predict(X_test)): print(f"n={i+1}: acc={accuracy_score(y_test, y_pred):.3f}") ``` ### 10.6 Gradient Boosting **Concept:** Sequential: each tree fits RESIDUALS (negative gradients) of previous model. **Math:** 1. Initialize: F₀(x) = argmin_γ Σ L(yᵢ, γ) [constant prediction] 2. For m=1 to M: - Compute residuals: rᵢₘ = −∂L(yᵢ,F(xᵢ))/∂F(xᵢ) - Fit tree hₘ to residuals - Line search: γₘ = argmin_γ ΣL(yᵢ, Fₘ₋₁(xᵢ)+γhₘ(xᵢ)) - Update: Fₘ(x) = Fₘ₋₁(x) + η·γₘ·hₘ(x) ```python from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor gbc = GradientBoostingClassifier( loss='log_loss', # 'log_loss','exponential' learning_rate=0.1, # η: shrinks each tree contribution n_estimators=100, # number of trees (boosting stages) subsample=1.0, # fraction of samples per tree (stochastic GB if <1) criterion='friedman_mse', # split quality measure min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_depth=3, # shallow trees typical for boosting min_impurity_decrease=0.0, init=None, # initial estimator random_state=None, max_features=None, verbose=0, max_leaf_nodes=None, warm_start=False, validation_fraction=0.1, n_iter_no_change=None, tol=1e-4, ccp_alpha=0.0 ) gbc.feature_importances_ gbc.n_estimators_ # actual estimators used gbc.train_score_ # loss per stage on training data gbc.oob_improvement_ # if subsample < 1 for score in gbc.staged_predict(X_test): ... # predictions after each stage for pred in gbc.staged_predict_proba(X_test): ... ``` ### 10.7 HistGradientBoosting (Fast GB) ```python from sklearn.ensemble import HistGradientBoostingClassifier, HistGradientBoostingRegressor hgbc = HistGradientBoostingClassifier( max_iter=100, # n_estimators learning_rate=0.1, max_depth=None, min_samples_leaf=20, l2_regularization=0, max_bins=255, # histogram bins (speed/accuracy tradeoff) monotonic_cst=None, interaction_cst=None, warm_start=False, early_stopping='auto', scoring='loss', validation_fraction=0.1, n_iter_no_change=10, tol=1e-7, verbose=0, random_state=None, class_weight=None ) # Native support for NaN: no imputation needed! # Much faster on large datasets than GradientBoostingClassifier ``` --- ## PART 11 Neural Networks (MLPClassifier/Regressor) ### 11.1 Intuition & Math **Layers:** Input → [Hidden layers with activation] → Output **Forward pass:** a⁽ˡ⁾ = activation(W⁽ˡ⁾·a⁽ˡ⁻¹⁾ + b⁽ˡ⁾) **Activation functions:** | Name | Formula | Use | |------|---------|-----| | ReLU | max(0,x) | Hidden layers (default, avoids vanishing gradient) | | Sigmoid | 1/(1+e⁻ˣ) | Binary output | | Tanh | (eˣ−e⁻ˣ)/(eˣ+e⁻ˣ) | Hidden layers, zero-centered | | Logistic | same as sigmoid | sklearn name | | Identity | x | Regression output | | Softmax | eˣᵢ/Σeˣⱼ | Multiclass output | **Backpropagation:** Compute ∂L/∂W via chain rule → update with GD/Adam/etc. 
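To make the forward pass concrete, here is a tiny NumPy sketch of one hidden layer plus a sigmoid output (shapes and weights are illustrative, not taken from sklearn):

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))                     # 4 samples, 3 input features
W1, b1 = rng.normal(size=(3, 5)), np.zeros(5)   # hidden layer: 3 → 5 units
W2, b2 = rng.normal(size=(5, 1)), np.zeros(1)   # output layer: 5 → 1 unit

a1 = relu(X @ W1 + b1)            # a⁽¹⁾ = activation(W⁽¹⁾·a⁽⁰⁾ + b⁽¹⁾), row-vector convention
y_prob = sigmoid(a1 @ W2 + b2)    # binary output probability per sample
```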
### 11.2 Implementation

```python
from sklearn.neural_network import MLPClassifier, MLPRegressor
mlp = MLPClassifier(
    hidden_layer_sizes=(100,),  # tuple: one int per hidden layer
                                # (100,50) = 2 hidden layers: 100 and 50 neurons
    activation='relu',          # 'identity','logistic','tanh','relu'
    solver='adam',              # 'lbfgs','sgd','adam'
    alpha=0.0001,               # L2 regularization term
    batch_size='auto',          # 'auto'=min(200, n_samples); or int
    learning_rate='constant',   # 'constant','invscaling','adaptive' (only for SGD)
    learning_rate_init=0.001,   # initial learning rate
    power_t=0.5,                # for invscaling
    max_iter=200,               # max epochs
    shuffle=True,
    random_state=None,
    tol=1e-4,
    verbose=False,
    warm_start=False,
    momentum=0.9,               # for SGD
    nesterovs_momentum=True,    # Nesterov momentum for SGD
    early_stopping=False,
    validation_fraction=0.1,
    beta_1=0.9,                 # Adam: exp decay rate for 1st moment
    beta_2=0.999,               # Adam: exp decay rate for 2nd moment
    epsilon=1e-8,               # Adam: numerical stability
    n_iter_no_change=10,
    max_fun=15000               # for lbfgs: max function evaluations
)
mlp.fit(X_train, y_train)
mlp.coefs_                      # list of weight matrices per layer
mlp.intercepts_                 # list of bias vectors per layer
mlp.n_iter_                     # actual iterations
mlp.loss_                       # final loss value
mlp.loss_curve_                 # loss per iteration (sgd/adam solvers only)
mlp.out_activation_             # output activation: 'logistic','softmax','identity'
mlp.n_layers_                   # total layers including input/output
mlp.n_outputs_
```

**Solver comparison:**

| Solver | Best For | Notes |
|--------|---------|-------|
| `lbfgs` | Small datasets | Quasi-Newton, fast convergence |
| `sgd` | Large datasets | More control (momentum, learning rate) |
| `adam` | Large/noisy | Default, usually best, adaptive LR |

---

## PART 12 Unsupervised Learning

### 12.1 KMeans

**Algorithm:**
1. Initialize k centroids (randomly or k-means++)
2. Assign each point to nearest centroid
3. Recompute centroids as mean of assigned points
4. Repeat 2-3 until convergence

**Inertia (WCSS):** Σ min_μ ‖xᵢ−μ‖²

```python
from sklearn.cluster import KMeans, MiniBatchKMeans
km = KMeans(
    n_clusters=8,
    init='k-means++',           # 'k-means++' or 'random' or ndarray
    n_init='auto',              # n runs with different seeds; 'auto' = 1 for k-means++, 10 for random init
    max_iter=300,
    tol=1e-4,
    verbose=0,
    random_state=None,
    copy_x=True,
    algorithm='lloyd'           # 'lloyd','elkan'
)
km.fit(X)
km.cluster_centers_             # centroid coordinates (k, n_features)
km.labels_                      # cluster label per sample
km.inertia_                     # WCSS (lower = tighter clusters)
km.n_iter_                      # iterations to converge
km.predict(X_new)               # assign new points

# Elbow method:
inertias = [KMeans(n_clusters=k, random_state=42).fit(X).inertia_ for k in range(1,11)]
```

### 12.2 DBSCAN

**Concept:** Density-based; finds clusters of arbitrary shape; marks low-density points as noise.
- **Core point:** has ≥ min_samples within eps radius - **Border point:** within eps of core point, but fewer than min_samples neighbors - **Noise:** not within eps of any core point → label = -1 ```python from sklearn.cluster import DBSCAN db = DBSCAN( eps=0.5, # neighborhood radius min_samples=5, # min points to form core point metric='euclidean', algorithm='auto', # 'auto','ball_tree','kd_tree','brute' leaf_size=30, n_jobs=None ) db.fit(X) db.labels_ # -1 for noise db.core_sample_indices_ # indices of core points db.components_ # copy of core samples ``` ### 12.3 Agglomerative Clustering (Hierarchical) ```python from sklearn.cluster import AgglomerativeClustering agg = AgglomerativeClustering( n_clusters=2, # None if distance_threshold set metric='euclidean', # or 'cosine','manhattan','l1','l2','precomputed' linkage='ward', # 'ward','complete','average','single' distance_threshold=None, # cut dendrogram here if n_clusters=None compute_full_tree='auto', compute_distances=False ) agg.fit(X) agg.labels_ agg.n_clusters_ agg.n_leaves_ agg.n_connected_components_ agg.children_ # merge history (for dendrogram) ``` **Linkage methods:** | Method | Distance Between Clusters | Shape | |--------|--------------------------|-------| | `single` | Closest pair | Elongated, chains | | `complete` | Furthest pair | Compact, spherical | | `average` | Mean pairwise | Between single/complete | | `ward` | Minimizes variance increase | Compact, equal-size | ### 12.4 Cluster Evaluation Metrics ```python from sklearn.metrics import silhouette_score, silhouette_samples, calinski_harabasz_score, davies_bouldin_score # Silhouette: (b-a)/max(a,b); range [-1,1], higher=better sil = silhouette_score(X, labels, metric='euclidean') per_sample = silhouette_samples(X, labels) # Calinski-Harabasz (Variance Ratio): higher = better defined clusters ch = calinski_harabasz_score(X, labels) # Davies-Bouldin: lower = better (avg similarity of each cluster to most similar) db_score = davies_bouldin_score(X, labels) # If true labels known: from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score ari = adjusted_rand_score(y_true, y_pred) # -1 to 1, 1=perfect nmi = normalized_mutual_info_score(y_true, y_pred) # 0 to 1 ``` --- ## PART 13 Class Imbalance Handling ```python # 1. class_weight='balanced' (in model params) # sklearn computes: n_samples / (n_classes * np.bincount(y)) LogisticRegression(class_weight='balanced') SVC(class_weight={0: 1, 1: 5}) # custom weights # 2. Resampling (imbalanced-learn) from imblearn.over_sampling import SMOTE, RandomOverSampler from imblearn.under_sampling import RandomUnderSampler smote = SMOTE(k_neighbors=5, random_state=42) X_res, y_res = smote.fit_resample(X_train, y_train) # 3. Use appropriate metrics # Avoid accuracy for imbalanced → use F1, AUC-ROC, AUC-PR, MCC # 4. 
Threshold adjustment y_prob = model.predict_proba(X_test)[:, 1] threshold = 0.3 # lower threshold → more positives y_pred = (y_prob >= threshold).astype(int) ``` --- ## PART 14 Model Persistence & Utilities ```python # Save/load models import joblib joblib.dump(model, 'model.pkl') model = joblib.load('model.pkl') import pickle with open('model.pkl','wb') as f: pickle.dump(model, f) with open('model.pkl','rb') as f: model = pickle.load(f) # Clone a model (unfitted copy) from sklearn.base import clone model_copy = clone(fitted_model) # Get params model.get_params(deep=True) model.set_params(C=0.1, max_iter=500) # Check estimator type from sklearn.utils.estimator_checks import check_estimator from sklearn.base import is_classifier, is_regressor is_classifier(model) # True/False # Scoring strings (for cv, GridSearch) # 'accuracy','f1','f1_macro','f1_weighted','precision','recall', # 'roc_auc','neg_mean_squared_error','r2','neg_log_loss'... ``` --- ## PART 15 Multiclass Strategies ```python from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier # OvR: n_classes binary classifiers, highest score wins ovr = OneVsRestClassifier(SVC(), n_jobs=-1) # OvO: n*(n-1)/2 binary classifiers, pairwise voting ovo = OneVsOneClassifier(SVC()) # For multiclass output (multilabel): from sklearn.multioutput import MultiOutputClassifier, ClassifierChain mo = MultiOutputClassifier(LogisticRegression(), n_jobs=-1) cc = ClassifierChain(LogisticRegression(), order='random', random_state=42) ``` **When models natively support multiclass:** - Logistic Regression: yes (multinomial/ovr) - Decision Trees: yes - Random Forest: yes - KNN: yes - Naive Bayes: yes - SVC: OvO by default, OvR via decision_function_shape='ovr' - SGDClassifier: OvR by default --- ## PART 16 Key Exam Traps & Quick Facts | Question Type | Key Answer | |--------------|-----------| | What does `C` control in SVM/LR? | INVERSE regularization: C↑ = less reg = tighter fit | | What does `alpha` control in Ridge/Lasso/NB? | Regularization strength: alpha↑ = more shrinkage | | `fit_transform` on test? | NEVER causes data leakage | | `stratify=y` when? | Always in classification splits | | KNN needs scaling? | Yes distance-sensitive | | SVM needs scaling? | Yes very sensitive to scale | | Decision Tree needs scaling? | No | | Naive Bayes needs scaling? | No | | Random Forest needs scaling? | No | | `partial_fit` needs `classes=`? | First call only (but safe always) | | Lasso vs Ridge: which gives sparse? | Lasso (L1) → exact zeros | | Default `max_features` RF clf? | `'sqrt'` | | Default `max_features` RF reg? | `1.0` (all features) | | `voting='soft'` requires? | All models have `predict_proba` | | DBSCAN noise label | -1 | | OHE drop='first' with 3 categories? | 2 output columns | | `coef_` shape for binary LR? | (1, n_features) | | `coef_` shape for multiclass LR? | (n_classes, n_features) | | Pipeline param access syntax? | `step_name__param_name` | | `warm_start=True` effect? | Reuse previous fit; build incrementally | | `oob_score=True` in RF? | Use out-of-bag samples as validation | | AdaBoost default base estimator? | DecisionTreeClassifier(max_depth=1) | | Gradient Boosting fits trees on? | Residuals (negative gradients) | | Silhouette score range? | -1 to 1 (1 = perfect clusters) | | Davies-Bouldin: lower or higher? | Lower is better | | Calinski-Harabasz: lower or higher? 
| Higher is better | --- ## PART 17 Common Patterns & Code Templates ### Train-Eval-Tune Full Template ```python from sklearn.pipeline import Pipeline from sklearn.preprocessing import StandardScaler from sklearn.impute import SimpleImputer from sklearn.compose import ColumnTransformer from sklearn.preprocessing import OneHotEncoder from sklearn.model_selection import GridSearchCV, StratifiedKFold from sklearn.metrics import classification_report num_pipe = Pipeline([('imp', SimpleImputer(strategy='median')), ('sc', StandardScaler())]) cat_pipe = Pipeline([('imp', SimpleImputer(strategy='most_frequent')), ('ohe', OneHotEncoder(handle_unknown='ignore'))]) ct = ColumnTransformer([('num', num_pipe, num_cols), ('cat', cat_pipe, cat_cols)]) pipe = Pipeline([('prep', ct), ('model', RandomForestClassifier(random_state=42))]) cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42) gs = GridSearchCV(pipe, {'model__n_estimators':[50,100], 'model__max_depth':[5,10,None]}, cv=cv, scoring='f1_weighted', n_jobs=-1, refit=True) gs.fit(X_train, y_train) print(gs.best_params_, gs.best_score_) print(classification_report(y_test, gs.predict(X_test))) ``` ### OHE Manual Calculation ```python # For exam: given categories ['cat','dog','fish'] with drop=None: # cat → [1,0,0], dog → [0,1,0], fish → [0,0,1] # With drop='first': cat=dropped → dog → [1,0], fish → [0,1] # n_output_features = n_categories - (1 if drop='first' else 0) ``` ### Naive Bayes Hand Calculation ```python # Given training data: predict class for new sample # 1. Count classes: P(c) = count(c)/total # 2. Count feature|class: for each unique value of each feature # 3. Laplace: P(x|c) = (count(x,c)+1) / (count(c) + n_unique_vals) # 4. Posterior ∝ P(c) * ∏P(xᵢ|c) [work in log space to avoid underflow] # 5. log_posterior = log_prior + Σlog_likelihoods # 6. Predict: argmax(log_posterior) ``` ### SGD Partial Fit Log Loss Pattern (Exam Typical) ```python sgd = SGDClassifier(loss='log_loss', penalty='l2', eta0=0.001, alpha=0, learning_rate='constant', random_state=1729, warm_start=True, shuffle=False) classes = np.unique(y_train) for i in range(1, 6): sgd.partial_fit(X_train, y_train, classes=classes) y_prob = sgd.predict_proba(X_train) print(f"Iter {i}: Log Loss = {log_loss(y_train, y_prob):.4f}") ``` --- *Guide covers: sklearn v1.x · All models, parameters, math, and exam patterns*