🤖 Machine Learning Algorithms Quick Reference
Regression: Linear Regression, Ridge, Lasso, XGBoost
Classification: Logistic Regression, Decision Trees, Random Forest, SVM, KNN, Naive Bayes
Clustering: K-Means, Hierarchical, DBSCAN, GMM
Dimensionality Reduction: PCA, t-SNE, Autoencoders
Regression Algorithms
Linear Regression
Type: Supervised Regression
Ridge Regression (L2 Regularization)
Type: Supervised Regression
Lasso Regression (L1 Regularization)
Type: Supervised Regression
Polynomial Regression
Type: Supervised Regression
Classification Algorithms
Logistic Regression
Type: Supervised Classification
Decision Trees
Type: Supervised Classification
Split criteria:
- Gini Impurity: 1 - Σpᵢ² (classification)
- Entropy: -Σpᵢlog₂(pᵢ) (information gain)
- MSE: Mean squared error (regression)
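The split criteria above can be computed directly from a node's class proportions. A minimal sketch (helper names are illustrative, not from any particular library):

```python
import math

def gini_impurity(proportions):
    """Gini impurity: 1 - sum(p_i^2). 0 = pure node, 0.5 = even binary split."""
    return 1.0 - sum(p ** 2 for p in proportions)

def entropy(proportions):
    """Shannon entropy: -sum(p_i * log2(p_i)). 0 = pure node, 1 = even binary split."""
    return -sum(p * math.log2(p) for p in proportions if p > 0)

print(gini_impurity([0.5, 0.5]))  # 0.5 (maximally impure binary node)
print(entropy([0.5, 0.5]))        # 1.0
print(gini_impurity([1.0, 0.0]))  # 0.0 (pure node)
```

A tree learner picks the split that most reduces the chosen impurity, weighted by child-node sizes.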
Random Forest
Type: Ensemble Classification
Support Vector Machine (SVM)
Type: Supervised Classification
Kernels:
- Linear: K(x, y) = x·y (linearly separable data)
- Polynomial: K(x, y) = (x·y + c)ᵈ (non-linear)
- RBF (Gaussian): K(x, y) = exp(-γ||x-y||²) (most common)
- Sigmoid: K(x, y) = tanh(αx·y + c)
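The kernels above are just scalar functions of two vectors; a from-scratch sketch (default hyperparameter values here are arbitrary, chosen for illustration):

```python
import math

def linear_kernel(x, y):
    return sum(a * b for a, b in zip(x, y))

def polynomial_kernel(x, y, c=1.0, d=2):
    return (linear_kernel(x, y) + c) ** d

def rbf_kernel(x, y, gamma=0.5):
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

def sigmoid_kernel(x, y, alpha=0.01, c=0.0):
    return math.tanh(alpha * linear_kernel(x, y) + c)

x, y = [1.0, 2.0], [3.0, 4.0]
print(linear_kernel(x, y))  # 11.0
print(rbf_kernel(x, x))     # 1.0 (identical points are maximally similar)
```

An RBF value near 1 means the points are close; it decays toward 0 as they separate, which is why γ effectively controls the kernel's "reach".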
Gradient Boosting (XGBoost, LightGBM, CatBoost)
Type: Ensemble Classification
- XGBoost: Most popular, handles missing values, L1/L2 regularization
- LightGBM: Fastest, leaf-wise growth, best for large datasets
- CatBoost: Best for categorical features, robust to overfitting
K-Nearest Neighbors (KNN)
Type: Supervised Classification
Distance metrics:
- Euclidean: √Σ(xᵢ - yᵢ)² (most common)
- Manhattan: Σ|xᵢ - yᵢ|
- Minkowski: (Σ|xᵢ - yᵢ|ᵖ)^(1/p)
- Cosine: 1 - (x·y)/(||x|| ||y||)
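The four distance metrics above, written out in plain Python (a library KNN would accept these as pluggable metrics):

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def minkowski(x, y, p=3):
    # p=1 recovers Manhattan, p=2 recovers Euclidean
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

def cosine_distance(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return 1 - dot / norm

print(euclidean((0, 0), (3, 4)))  # 5.0
print(manhattan((0, 0), (3, 4)))  # 7
```

Cosine distance ignores magnitude, so two parallel vectors of different lengths score 0.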
Naive Bayes
Type: Supervised Classification
- Gaussian NB: Continuous features (assumes normal distribution)
- Multinomial NB: Discrete counts (text, word frequencies)
- Bernoulli NB: Binary features (presence/absence)
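A from-scratch Gaussian NB for a single continuous feature, to make the "prior × Gaussian likelihood" idea concrete (function names are illustrative; a real project would use a library implementation, and this sketch assumes nonzero per-class variance):

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y):
    """Estimate per-class prior, mean, and variance for a 1-D feature."""
    by_class = defaultdict(list)
    for xi, yi in zip(X, y):
        by_class[yi].append(xi)
    stats = {}
    for cls, vals in by_class.items():
        mean = sum(vals) / len(vals)
        var = sum((v - mean) ** 2 for v in vals) / len(vals)
        stats[cls] = (len(vals) / len(X), mean, var)
    return stats

def predict_gaussian_nb(stats, x):
    """Pick the class maximizing log(prior) + log Gaussian likelihood."""
    def log_posterior(cls):
        prior, mean, var = stats[cls]
        return (math.log(prior)
                - 0.5 * math.log(2 * math.pi * var)
                - (x - mean) ** 2 / (2 * var))
    return max(stats, key=log_posterior)

# Two well-separated classes on one feature
X = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
y = ["a", "a", "a", "b", "b", "b"]
model = fit_gaussian_nb(X, y)
print(predict_gaussian_nb(model, 1.1))  # "a"
print(predict_gaussian_nb(model, 4.9))  # "b"
```

Working in log space avoids underflow when many features are multiplied together.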
Neural Networks
Feedforward Neural Network (MLP)
Type: Supervised Deep Learning
Activation functions:
- ReLU: max(0, x) - Most common for hidden layers
- Sigmoid: 1/(1+e⁻ˣ) - Binary classification output
- Softmax: eˣⁱ/Σeˣʲ - Multi-class classification output
- Tanh: (eˣ-e⁻ˣ)/(eˣ+e⁻ˣ) - Alternative to sigmoid
- Leaky ReLU: max(0.01x, x) - Fixes dying ReLU problem
Regularization techniques:
- Dropout: Randomly disable neurons during training
- L1/L2: Weight penalty in loss function
- Batch Normalization: Normalize layer inputs
- Early Stopping: Stop when validation loss stops improving
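The activation formulas listed above, written out directly (scalar versions; frameworks apply them elementwise over tensors):

```python
import math

def relu(x):
    return max(0.0, x)

def leaky_relu(x, alpha=0.01):
    # Small negative slope keeps gradients alive for x < 0
    return x if x > 0 else alpha * x

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softmax(xs):
    # Subtract the max before exponentiating for numerical stability
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

print(relu(-2.0))    # 0.0 (the "dying ReLU" region)
print(sigmoid(0.0))  # 0.5
print(sum(softmax([1.0, 2.0, 3.0])))  # ≈ 1.0 (a probability distribution)
```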
Convolutional Neural Network (CNN)
Type: Supervised Deep Learning
Layer types:
- Convolutional: Feature extraction with filters
- Pooling: Downsampling (Max/Average pooling)
- Fully Connected: Classification layer
Recurrent Neural Network (RNN/LSTM/GRU)
Type: Supervised Deep Learning
- Simple RNN: Suffers from the vanishing gradient problem on long sequences
- LSTM: Long Short-Term Memory (solves vanishing gradient)
- GRU: Gated Recurrent Unit (faster than LSTM)
Clustering Algorithms
K-Means Clustering
Type: Unsupervised Clustering
Algorithm:
- Initialize k random centroids
- Assign each point to nearest centroid
- Update centroids to mean of assigned points
- Repeat until convergence
Objective: WCSS = Σₖ Σ_{x∈Cₖ} ||x - μₖ||²
Choosing k:
- Silhouette Score: -1 to 1 (higher is better)
- Gap Statistic: Compare WCSS to random data
- Davies-Bouldin Index: Lower is better
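The assign/update loop and the WCSS objective can be sketched in a few lines. For determinism this sketch seeds centroids from the first k points; a real implementation would use random or k-means++ initialization:

```python
def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iters=100):
    """Lloyd's algorithm: alternate assignment and centroid updates."""
    centroids = [list(p) for p in points[:k]]
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance
        new_assign = [min(range(k), key=lambda j: sq_dist(p, centroids[j]))
                      for p in points]
        if new_assign == assign:
            break  # converged: assignments stopped changing
        assign = new_assign
        # Update step: move each centroid to the mean of its assigned points
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]
    return assign, centroids

def wcss(points, assign, centroids):
    """Within-cluster sum of squares (the k-means objective)."""
    return sum(sq_dist(p, centroids[a]) for p, a in zip(points, assign))

pts = [(0, 0), (0, 1), (10, 10), (10, 11)]
labels, cents = kmeans(pts, k=2)
print(labels)  # [0, 0, 1, 1]: the two obvious groups
```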
Hierarchical Clustering
Type: Unsupervised Clustering
- Agglomerative (bottom-up): Start with individual points, merge clusters
- Divisive (top-down): Start with one cluster, split recursively
Linkage criteria:
- Single: Minimum distance between clusters
- Complete: Maximum distance between clusters
- Average: Average distance between all pairs
- Ward: Minimize variance (most common)
DBSCAN (Density-Based Spatial Clustering)
Type: Unsupervised Clustering
Parameters:
- ε (epsilon): Maximum distance for neighborhood
- MinPts: Minimum points to form dense region
Point types:
- Core: Has ≥ MinPts within ε
- Border: Within ε of core point but has < MinPts
- Noise: Neither core nor border (outlier)
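The core/border/noise definitions above translate directly into code. This sketch does the point classification only; full DBSCAN would additionally expand clusters outward from core points:

```python
def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def classify_points(points, eps, min_pts):
    """Label each point 'core', 'border', or 'noise'."""
    n = len(points)
    # ε-neighborhoods include the point itself, per the standard formulation
    neigh = [[j for j in range(n) if sq_dist(points[i], points[j]) <= eps ** 2]
             for i in range(n)]
    labels = []
    for i in range(n):
        if len(neigh[i]) >= min_pts:
            labels.append("core")            # dense enough on its own
        elif any(len(neigh[j]) >= min_pts for j in neigh[i]):
            labels.append("border")          # within ε of some core point
        else:
            labels.append("noise")           # outlier
    return labels

pts = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 2.0), (8, 8)]
print(classify_points(pts, eps=1.5, min_pts=4))
# ['core', 'core', 'core', 'core', 'border', 'noise']
```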
Gaussian Mixture Model (GMM)
Type: Unsupervised Clustering
EM algorithm:
- E-step: Calculate probability of each point belonging to each cluster
- M-step: Update Gaussian parameters (mean, covariance)
Dimensionality Reduction
Principal Component Analysis (PCA)
Type: Unsupervised Dimensionality Reduction
Steps:
- Standardize data (mean=0, std=1)
- Compute covariance matrix
- Calculate eigenvectors and eigenvalues
- Select top K eigenvectors (principal components)
- Transform data to new feature space
Choosing the number of components:
- Retain components explaining ~95% of variance
- Use scree plot (elbow method)
- Cross-validation with downstream task
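The PCA steps above, run end to end with numpy (assumes numpy is available; the synthetic data and seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
# Correlated 2-D data: the second feature is the first plus small noise
x1 = rng.normal(size=200)
X = np.column_stack([x1, x1 + 0.1 * rng.normal(size=200)])

# 1. Standardize (mean=0, std=1)
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
# 2. Covariance matrix
cov = np.cov(Xs, rowvar=False)
# 3. Eigendecomposition (eigh: for symmetric matrices, ascending eigenvalues)
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
# 4. Explained variance ratio per component
explained = eigvals / eigvals.sum()
# 5. Project onto the top component
Z = Xs @ eigvecs[:, :1]
print(explained[0])  # close to 1.0: one direction carries almost all variance
```

Because the two features are nearly collinear, a single component retains well over 95% of the variance, so the 2-D data compresses to 1-D with little loss.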
t-SNE (t-Distributed Stochastic Neighbor Embedding)
Type: Unsupervised Visualization
Key hyperparameters:
- Perplexity: 5-50 (typical), balance between local/global structure
- Learning rate: 10-1000
- Iterations: 1000-5000
UMAP (Uniform Manifold Approximation and Projection)
Type: Unsupervised Dimensionality Reduction
Advantages over t-SNE:
- Faster (often 10-100x)
- Preserves global structure better
- Can transform new data
- Scales to larger datasets
Autoencoders
Type: Unsupervised Deep Learning
Architecture:
- Encoder: Compress input to latent representation
- Bottleneck: Low-dimensional latent space
- Decoder: Reconstruct original input
Variants:
- Variational (VAE): Generative model; learns a probability distribution over the latent space
- Denoising: Trained to reconstruct from corrupted input
- Sparse: Encourages sparse activations
Reinforcement Learning
Core RL Concepts
- Agent: Decision maker
- Environment: World agent interacts with
- State (s): Current situation
- Action (a): Possible moves
- Reward (r): Feedback signal
- Policy (π): Strategy for selecting actions
Q-Learning
Type: Reinforcement Learning
Update rule: Q(s, a) ← Q(s, a) + α[r + γ·maxₐ′ Q(s′, a′) − Q(s, a)]
- α: Learning rate
- γ: Discount factor (future reward importance)
- r: Immediate reward
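One tabular Q-learning update is a single line of arithmetic; a minimal sketch with illustrative default values for α and γ:

```python
def q_update(q, reward, max_next_q, alpha=0.1, gamma=0.9):
    """Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    return q + alpha * (reward + gamma * max_next_q - q)

# Terminal transition (no future value): Q moves alpha of the way toward r
print(q_update(0.0, 1.0, 0.0, alpha=0.5))  # 0.5

# Repeated updates converge toward the target r + gamma * max_next_q
q = 0.0
for _ in range(100):
    q = q_update(q, 1.0, 0.0, alpha=0.5)
print(round(q, 6))  # 1.0
```

γ controls how strongly future rewards count: γ=0 is purely myopic, γ near 1 values long-term return.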
Deep Q-Network (DQN)
Type: Reinforcement Deep Learning
Key techniques:
- Experience Replay: Store and sample past experiences
- Target Network: Separate network for stable targets
- Frame Stacking: Use multiple frames to capture motion
Policy Gradient Methods (REINFORCE, A3C, PPO)
Type: Reinforcement Learning
- PPO (Proximal Policy Optimization): Most popular, stable training
- A3C (Asynchronous Advantage Actor-Critic): Parallel training
- SAC (Soft Actor-Critic): Maximum entropy RL
Classification Metrics
Accuracy
Use When: Balanced classes
Avoid When: Imbalanced datasets
Precision
Use When: False positives are costly (spam detection)
Interpretation: "Of predicted positives, how many are correct?"
Recall (Sensitivity)
Use When: False negatives are costly (cancer detection)
Interpretation: "Of actual positives, how many did we find?"
F1-Score
Use When: Need balance between precision and recall
Best For: Imbalanced datasets
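Precision, recall, F1, and accuracy all derive from the four confusion-matrix counts; a from-scratch sketch (counts in the example are made up for illustration):

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute precision, recall, F1, and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp)           # of predicted positives, how many correct
    recall = tp / (tp + fn)              # of actual positives, how many found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f1, accuracy

# Imbalanced example: 90 actual positives among 1000 samples
p, r, f1, acc = classification_metrics(tp=80, fp=20, fn=10, tn=890)
print(p)    # 0.8
print(acc)  # 0.97
```

Note how accuracy (0.97) looks far better than F1 here, which is exactly why F1 is preferred on imbalanced data.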
ROC-AUC (Area Under the ROC Curve: TPR vs. FPR)
Use When: Evaluating ranking quality
Range: 0.5 (random) to 1.0 (perfect)
Log Loss
Use When: Probability calibration matters
Best For: Multi-class problems
Regression Metrics
Mean Absolute Error (MAE)
Use When: Outliers present, want interpretable error
Units: Same as target variable
Mean Squared Error (MSE)
Use When: Want to penalize large errors heavily
Note: Sensitive to outliers
Root Mean Squared Error (RMSE)
Use When: Need interpretable units like MAE
Advantage: Same units as target
R² Score (Coefficient of Determination)
Range: -∞ to 1 (1 = perfect fit)
Use When: Comparing models, understanding variance explained
Mean Absolute Percentage Error (MAPE)
Use When: Need scale-independent metric
Warning: Undefined when yᵢ = 0
Adjusted R²
Use When: Comparing models with different feature counts
Advantage: Penalizes unnecessary features
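The regression metrics above can be computed in a few lines (a minimal sketch; function and key names are illustrative):

```python
import math

def regression_metrics(y_true, y_pred):
    """MAE, MSE, RMSE, and R2 from paired true/predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e ** 2 for e in errors) / n
    rmse = math.sqrt(mse)  # back in the target's units
    mean_y = sum(y_true) / n
    ss_res = sum(e ** 2 for e in errors)
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    r2 = 1 - ss_res / ss_tot  # fraction of variance explained
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2}

def adjusted_r2(r2, n, k):
    """Penalize R2 for the number of features k (n = sample count)."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

m = regression_metrics([3.0, 5.0, 7.0, 9.0], [2.5, 5.0, 7.5, 9.0])
print(m["MAE"])  # 0.25
print(m["MSE"])  # 0.125
```

Because MSE squares the errors, the two 0.5-unit misses dominate it relative to MAE, which is the "penalize large errors" behavior noted above.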
Clustering Metrics
Silhouette Score
Range: -1 to 1 (higher is better)
Use When: Evaluating cluster quality
Davies-Bouldin Index
Range: 0 to ∞ (lower is better)
Use When: Comparing different K values
Calinski-Harabasz Index
Range: 0 to ∞ (higher is better)
Best For: Dense, well-separated clusters
Common Problems & Solutions
| Problem | Symptoms | Solutions |
|---|---|---|
| Overfitting | High training accuracy, low test accuracy; large gap between train/val loss | • Increase training data • Reduce model complexity • Apply regularization (L1/L2, dropout) • Early stopping • Cross-validation |
| Underfitting | Low training and test accuracy; high bias | • Increase model complexity • Add more features • Reduce regularization • Train longer • Try non-linear models |
| Class Imbalance | High accuracy but poor minority class performance | • SMOTE or oversampling • Class weights • Stratified sampling • Use F1-score instead of accuracy • Ensemble methods |
| Data Leakage | Unrealistically high performance; test accuracy > train accuracy | • Separate test set before any processing • Fit preprocessing on train only • Check for future information in features • Time-based splits for time series |
| Vanishing Gradients | Deep networks stop learning; weights don't update | • Use ReLU instead of sigmoid/tanh • Batch normalization • Residual connections (ResNet) • Gradient clipping • Better initialization |
| Exploding Gradients | NaN or Inf losses; unstable training | • Gradient clipping • Lower learning rate • Batch normalization • Weight regularization |
| Feature Scaling Issues | Slow convergence; poor performance with distance-based algorithms | • StandardScaler (mean=0, std=1) • MinMaxScaler (0-1 range) • RobustScaler (median-based, resistant to outliers) • Required for: SVM, KNN, Neural Networks, PCA |
| Curse of Dimensionality | Too many features; poor generalization; distance metrics break down | • Feature selection (Lasso, feature importance) • Dimensionality reduction (PCA, UMAP) • Regularization • More training data |
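The two most common scaling fixes from the table, implemented from scratch (mirroring the behavior of scikit-learn's StandardScaler and MinMaxScaler; these sketches fit and transform one list, whereas the real scalers fit on train data only):

```python
def standardize(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / std for v in values]

def min_max_scale(values):
    """Rescale linearly so min maps to 0 and max maps to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

data = [10.0, 20.0, 30.0, 40.0]
print(min_max_scale(data))  # endpoints land exactly on 0.0 and 1.0
```

To avoid the data-leakage problem from the same table, compute the mean/std (or min/max) on the training split and reuse those statistics on the test split.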
Algorithm Comparison
| Algorithm | Training Speed | Prediction Speed | Interpretability | Handles Non-linearity | Scales to Big Data |
|---|---|---|---|---|---|
| Linear Regression | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ❌ | ⭐⭐⭐⭐⭐ |
| Logistic Regression | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ❌ | ⭐⭐⭐⭐⭐ |
| Decision Trees | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ✅ | ⭐⭐⭐ |
| Random Forest | ⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐ | ✅ | ⭐⭐⭐⭐ |
| XGBoost | ⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐ | ✅ | ⭐⭐⭐⭐⭐ |
| SVM | ⭐⭐ | ⭐⭐⭐ | ⭐⭐ | ✅ (with kernels) | ⭐⭐ |
| KNN | ⭐⭐⭐⭐⭐ (no training) | ⭐⭐ | ⭐⭐⭐⭐ | ✅ | ⭐ |
| Neural Networks | ⭐ | ⭐⭐⭐⭐ | ⭐ | ✅ | ⭐⭐⭐⭐⭐ |
| K-Means | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ❌ | ⭐⭐⭐⭐ |
| DBSCAN | ⭐⭐⭐ | N/A | ⭐⭐⭐ | ✅ | ⭐⭐ |