SKY — Data Science Roadmap

01

Foundations

Mathematics · Statistics · Programming · Tools

Topic Checklists

📐

Linear Algebra

Scalars, vectors, matrices, tensors [core]
Matrix addition, subtraction, multiplication [core]
Transpose, inverse, identity matrix [core]
Dot product and cross product [concept]
Linear transformations [concept]
Eigenvalues and eigenvectors [core]
Singular Value Decomposition (SVD) [advanced]
Norms (L1, L2) [core]
Implement matrix ops with NumPy [tool]

🔨 Confidence Project — Image Compression with SVD

1. Load a grayscale image as a NumPy matrix [beginner]
2. Apply SVD decomposition using numpy.linalg.svd
3. Reconstruct the image using only the top-k singular values
4. Plot compression ratio vs image quality side by side
5. Write a 1-page summary of what the singular values represent visually
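
The steps above can be sketched roughly as follows. This is a minimal sketch, not a reference solution: it assumes NumPy is installed, and it uses a synthetic 64×64 array as a stand-in for a real grayscale image. The helper names `compress_svd` and `storage_ratio` are hypothetical.

```python
import numpy as np

def compress_svd(image: np.ndarray, k: int) -> np.ndarray:
    """Return the rank-k approximation of a 2-D grayscale image."""
    # full_matrices=False gives the compact SVD: U is (m, r), s is (r,), Vt is (r, n)
    U, s, Vt = np.linalg.svd(image, full_matrices=False)
    # NumPy returns singular values in descending order, so the first k are the largest
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

def storage_ratio(shape, k: int) -> float:
    """Numbers stored by the rank-k factors divided by numbers in the original."""
    m, n = shape
    return k * (m + n + 1) / (m * n)

# Synthetic stand-in for a real image: a smooth gradient plus a little noise
rng = np.random.default_rng(0)
ramp = np.linspace(0.0, 1.0, 64)
image = np.outer(ramp, ramp) + 0.01 * rng.standard_normal((64, 64))

for k in (1, 5, 20):
    approx = compress_svd(image, k)
    rel_err = np.linalg.norm(image - approx) / np.linalg.norm(image)
    print(f"k={k:2d}  storage={storage_ratio(image.shape, k):.2f}  relative error={rel_err:.4f}")
```

For a real image, the array would come from an image loader of your choice; the reconstruction error should shrink as k grows while the storage ratio rises.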

📚 Resources

3Blue1Brown: Essence of Linear Algebra · Gilbert Strang MIT OCW · Khan Academy Linear Algebra

∫

Calculus

Functions, limits, continuity [core]
Derivatives and differentiation rules [core]
Partial derivatives [core]
Chain rule [core]
Gradient, Jacobian, Hessian [advanced]
Gradient descent intuition [concept]
Integrals (basic, definite) [core]

🔨 Confidence Project — Gradient Descent Visualizer

1. Pick a 2D function (e.g. x² + y²) and compute its gradient manually [beginner]
2. Implement gradient descent from scratch in Python (no ML libraries)
3. Animate the descent path on a contour plot using Matplotlib
4. Experiment with different learning rates and observe overshooting versus convergence
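
A minimal sketch of steps 1, 2, and 4, using only the standard library. It assumes the example function f(x, y) = x² + y², whose gradient is (2x, 2y). The starting point and the three learning rates are arbitrary illustrative choices, and plotting is left out.

```python
def grad(x, y):
    # Gradient of f(x, y) = x**2 + y**2, computed by hand
    return 2.0 * x, 2.0 * y

def gradient_descent(start, learning_rate, steps):
    """Return the full path of (x, y) points visited, including the start."""
    x, y = start
    path = [(x, y)]
    for _ in range(steps):
        gx, gy = grad(x, y)
        x -= learning_rate * gx
        y -= learning_rate * gy
        path.append((x, y))
    return path

# For this function each step multiplies x and y by (1 - 2 * learning_rate):
#   0.1 shrinks steadily, 0.9 overshoots but still shrinks, 1.1 diverges
for lr in (0.1, 0.9, 1.1):
    x, y = gradient_descent((3.0, -2.0), lr, 25)[-1]
    print(f"lr={lr}: final point = ({x:.4g}, {y:.4g})")
```

The returned path is what you would draw on top of the contour plot in step 3.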

📚 Resources

3Blue1Brown: Calculus series · Khan Academy Calculus

📊

Probability & Statistics

Mean, median, mode, variance, standard deviation [core]
Probability rules, Bayes' theorem [core]
Probability distributions (Normal, Binomial, Poisson) [core]
Central Limit Theorem [concept]
Hypothesis testing (t-test, chi-square, ANOVA) [core]
p-values, confidence intervals [core]
Correlation vs causation [concept]
A/B testing design [core]
Bayesian vs Frequentist approaches [advanced]
Use scipy.stats for tests [tool]

🔨 Confidence Project — A/B Test Simulator

1. Simulate a website button color A/B test with fake conversion data [beginner]
2. Run a two-sample t-test and chi-square test using scipy.stats
3. Visualize distributions of both groups with confidence intervals
4. Write a plain-English business conclusion from the p-value
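
To make the chi-square step concrete, here is a standard-library sketch of the Pearson chi-square test on a 2×2 conversion table. It is a stand-in for `scipy.stats.chi2_contingency`, not a replacement: it applies no continuity correction, whereas SciPy applies Yates' correction to 2×2 tables by default, so the two agree only when SciPy is called with `correction=False`. The conversion rates and sample sizes are made-up illustration values, and the helper names are hypothetical.

```python
import math
import random

def simulate_conversions(n_visitors, true_rate, rng):
    """Count how many of n_visitors convert, given a true conversion rate."""
    return sum(rng.random() < true_rate for _ in range(n_visitors))

def chi_square_2x2(conv_a, n_a, conv_b, n_b):
    """Pearson chi-square statistic and p-value for two conversion counts.

    Assumes at least one conversion and one non-conversion overall,
    otherwise an expected count is zero and the statistic is undefined.
    """
    observed = [[conv_a, n_a - conv_a], [conv_b, n_b - conv_b]]
    row_totals = [n_a, n_b]
    col_totals = [conv_a + conv_b, (n_a - conv_a) + (n_b - conv_b)]
    total = n_a + n_b
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed[i][j] - expected) ** 2 / expected
    # With 1 degree of freedom, P(chi2 > x) = erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

rng = random.Random(7)
conv_a = simulate_conversions(1000, 0.10, rng)  # control button
conv_b = simulate_conversions(1000, 0.13, rng)  # variant button
stat, p = chi_square_2x2(conv_a, 1000, conv_b, 1000)
print(f"A: {conv_a}/1000  B: {conv_b}/1000  chi2={stat:.3f}  p={p:.4f}")
```

A small p-value says the observed gap would be unlikely if both buttons converted at the same rate; it does not by itself say the gap is large enough to matter for the business.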

📚 Resources

StatQuest (YouTube) · Think Stats (book) · Khan Academy Stats

🐍

Python Programming

Variables, data types, operators [core]
Control flow: if/else, loops [core]
Functions, lambda, *args, **kwargs [core]
List, dict, set comprehensions [core]
OOP: classes, inheritance, methods [concept]
File I/O, error handling, context managers [core]
NumPy: arrays, broadcasting, operations [tool]
Pandas: DataFrames, Series, indexing [tool]
Jupyter Notebooks workflow [tool]
Virtual environments, pip, conda [tool]

🔨 Confidence Project — Personal Finance Tracker

1. Create a CSV of 100 fake monthly expenses across categories [beginner]
2. Load with Pandas, clean data types, handle missing values
3. Group by category and month, compute totals and trends
4. Plot monthly spending trends and category breakdowns with Matplotlib
5. Export summary report to a new CSV file

📚 Resources

Python.org docs · Automate the Boring Stuff · Python for Data Analysis

🗄️

SQL & Databases

SELECT, WHERE, ORDER BY, LIMIT [core]
JOINs: INNER, LEFT, RIGHT, FULL [core]
GROUP BY, HAVING, aggregate functions [core]
Subqueries and CTEs (WITH clause) [core]
Window functions (ROW_NUMBER, RANK, LAG) [advanced]
CASE statements [core]
Indexing and query optimization basics [concept]
Connect Python to SQL (SQLAlchemy, sqlite3) [tool]

🔨 Confidence Project — E-Commerce Sales Analysis

1. Download Northwind or Chinook sample database (SQLite) [beginner]
2. Answer 10 business questions with SQL (top products, customer LTV, etc.)
3. Use CTEs and window functions to compute monthly revenue growth
4. Connect to Python with sqlite3 and visualize results with Pandas + Seaborn
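
Steps 3 and 4 might look roughly like the sketch below. It uses a tiny in-memory table named `orders` with made-up rows; this is a hypothetical schema, not the Northwind or Chinook one, so column and table names will differ in the real databases. Window functions need SQLite 3.25 or newer, which recent Python builds bundle.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (order_date TEXT, amount REAL);
INSERT INTO orders VALUES
  ('2024-01-05', 100.0), ('2024-01-20', 50.0),
  ('2024-02-03', 180.0),
  ('2024-03-11', 90.0),  ('2024-03-25', 45.0);
""")

query = """
WITH monthly AS (                       -- CTE: one row per month
    SELECT strftime('%Y-%m', order_date) AS month,
           SUM(amount)                  AS revenue
    FROM orders
    GROUP BY month
)
SELECT month,
       revenue,
       -- LAG looks one row back in month order; the first month has no predecessor
       ROUND(100.0 * (revenue - LAG(revenue) OVER (ORDER BY month))
             / LAG(revenue) OVER (ORDER BY month), 1) AS growth_pct
FROM monthly
ORDER BY month;
"""

rows = conn.execute(query).fetchall()
for month, revenue, growth_pct in rows:
    print(month, revenue, growth_pct)
```

The first month's growth is NULL (None in Python) because there is no earlier month to compare against.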

📚 Resources

SQLZoo · Mode Analytics SQL · LeetCode SQL

🔧

Git & Dev Tools

git init, clone, add, commit, push [core]
Branches: create, merge, rebase [core]
Pull requests and code reviews [core]
.gitignore and best practices [tool]
GitHub: repos, issues, README [tool]
Command line basics (bash) [core]

🔨 Confidence Project — Portfolio GitHub Setup

1. Create a GitHub account and configure SSH keys [beginner]
2. Push your Python finance tracker project with a proper README
3. Practice branching: create a feature branch, make changes, merge via PR
4. Add a requirements.txt and .gitignore to your repo

🏆 Stage 1 Capstone Projects

Build these to prove your foundations are solid

Beginner 🎲

Monte Carlo Pi Estimator

Use random sampling to estimate π. Visualize how accuracy improves with more samples. Demonstrates probability, Python, and NumPy together.

NumPy · Matplotlib · Probability

✦ Covers: Probability · Python · Visualization
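
One possible starting point for the estimator, sketched with the standard library's `random` module rather than NumPy (a deliberate simplification; the capstone itself calls for NumPy and a plot). Points are drawn uniformly in the unit square, and the fraction landing inside the quarter circle of radius 1 approximates π/4. The function name and seed are arbitrary.

```python
import math
import random

def estimate_pi(n_samples, seed=0):
    """Estimate pi from the share of random points inside the quarter circle."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(n_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    # Quarter-circle area / unit-square area = pi / 4
    return 4.0 * inside / n_samples

for n in (100, 10_000, 1_000_000):
    est = estimate_pi(n)
    print(f"n={n:>9,}  estimate={est:.5f}  abs error={abs(est - math.pi):.5f}")
```

The error shrinks roughly in proportion to 1/√n, which is the pattern the accuracy plot should reveal.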

Beginner 📈

Stock Price Statistics Dashboard

Download 1 year of stock data using yfinance. Compute rolling mean, std deviation, and detect anomalies. Plot interactive charts.

Pandas · yfinance · Statistics

✦ Covers: Stats · Pandas · Time Series basics

Intermediate 🧮

Linear Regression from Scratch

Implement OLS linear regression using only NumPy matrix operations (no sklearn). Verify that your coefficients match sklearn's output to within floating-point tolerance.

NumPy · Linear Algebra · Calculus

✦ Covers: Linear Algebra · Calculus · Python
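
A minimal sketch of the idea, assuming NumPy is installed. It solves the normal equations (XᵀX)β = Xᵀy directly and cross-checks against `np.linalg.lstsq`, which is usually the more numerically robust route. The synthetic data, true coefficients (2, 3, -1), and function names are illustrative assumptions.

```python
import numpy as np

def add_intercept(X):
    # Prepend a column of ones so the first coefficient is the intercept
    return np.column_stack([np.ones(len(X)), X])

def fit_ols_normal_equations(X, y):
    """Solve (X^T X) beta = X^T y without forming an explicit inverse."""
    Xd = add_intercept(X)
    return np.linalg.solve(Xd.T @ Xd, Xd.T @ y)

def fit_ols_lstsq(X, y):
    """Same fit via a least-squares solver, better behaved for ill-conditioned X."""
    coef, *_ = np.linalg.lstsq(add_intercept(X), y, rcond=None)
    return coef

# Synthetic data: y = 2 + 3*x1 - 1*x2 + small noise
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))
y = 2.0 + 3.0 * X[:, 0] - 1.0 * X[:, 1] + 0.1 * rng.normal(size=200)

beta = fit_ols_normal_equations(X, y)
print("intercept and slopes:", np.round(beta, 3))
```

Comparing against sklearn would follow the same pattern: fit both, then check agreement with `np.allclose` rather than exact equality.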

02

Core Data Skills

Wrangling · EDA · Visualization · Feature Engineering

Topic Checklists

🔄

Data Wrangling

Load CSV, JSON, Excel, Parquet files [core]
Detect and handle missing values (dropna, fillna, impute) [core]
Detect and remove duplicates [core]
Outlier detection and treatment [core]
Data type conversions [core]
String cleaning and regex [tool]
Merging, joining, concatenating DataFrames [core]
Pivot tables and reshaping [core]
GroupBy operations and aggregation [core]
apply, map, applymap functions (applymap is deprecated since pandas 2.1 in favor of DataFrame.map) [tool]

🔨 Confidence Project — Messy World Happiness Data Cleaner

1. Download the World Happiness Report CSV (intentionally messy version) [beginner]
2. Identify all data quality issues: missing values, wrong types, duplicates
3. Write a data cleaning pipeline using Pandas and document every decision
4. Compare before/after statistics to show that cleaning improved data quality

🔍

Exploratory Data Analysis

Univariate analysis (distributions, histograms) [core]
Bivariate analysis (scatterplots, correlation) [core]
Correlation matrix and heatmaps [core]
Box plots, violin plots [core]
Identify patterns, anomalies, trends [concept]
Summary statistics: describe(), info() [tool]
Pair plots (seaborn pairplot) [tool]

🔨 Confidence Project — Titanic Survivor EDA Report

1. Load Titanic dataset from Kaggle or seaborn datasets [beginner]
2. Perform full EDA: age distribution, survival by gender/class, fare outliers
3. Generate a 5-insight summary with one visualization per insight
4. Write a 300-word narrative conclusion as if presenting to a business team

🎨

Data Visualization

Matplotlib: figures, axes, subplots [tool]
Seaborn: statistical plots [tool]
Plotly: interactive charts [tool]
Chart selection (when to use what chart) [concept]
Color theory and accessibility in charts [concept]
Tableau or Power BI basics [tool]
Dashboard design principles [concept]

🔨 Confidence Project — COVID-19 Interactive Dashboard

1. Download Our World in Data COVID CSV dataset [intermediate]
2. Build interactive line charts for cases/deaths across 5 countries using Plotly
3. Add a choropleth world map showing vaccination rates
4. Wrap in a Streamlit app with country filter dropdowns

⚙️

Feature Engineering

Encoding: One-Hot, Label, Target encoding [core]
Scaling: StandardScaler, MinMaxScaler, RobustScaler [core]
Log/power transformations [core]
Binning/discretization [core]
Date/time feature extraction [core]
Feature interactions and polynomial features [advanced]
Feature selection: Filter, Wrapper, Embedded methods [advanced]
Handle class imbalance (SMOTE, oversampling) [advanced]
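
To show what the scaling items compute, here is a standard-library sketch of the arithmetic behind standardization and min-max scaling on a single column. It illustrates the formulas only and is not sklearn's implementation. The function names are hypothetical, and both functions assume the column is not constant (otherwise they divide by zero).

```python
import statistics

def standardize(values):
    """z = (x - mean) / std, using the population std (ddof=0) as StandardScaler does."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

def min_max(values):
    """Rescale linearly so the smallest value maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

column = [2, 4, 4, 4, 5, 5, 7, 9]   # mean 5, population std 2
print(standardize(column))          # z-scores centered on 0
print(min_max(column))              # values in [0, 1]
```

In a real pipeline the mean, std, min, and max should be learned from the training split only and then reused on validation and test data.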

🔨 Confidence Project — House Price Feature Engineering

1. Load Ames Housing dataset from Kaggle [intermediate]
2. Engineer 10+ new features (age of house, total rooms, price/sqft ratio)
3. Compare a simple model with vs without your engineered features
4. Plot feature importances: which features matter most?

🏆 Stage 2 Capstone Projects

Full data pipeline projects to add to your portfolio

Beginner 🏡

Airbnb NYC Data Story

Download the NYC Airbnb open dataset. Clean it, explore pricing patterns by neighborhood, create a beautiful 5-chart story and share on GitHub.

Pandas · Seaborn · EDA

✦ Covers: Wrangling · EDA · Visualization

Intermediate 🏦

Bank Churn Preprocessing Pipeline

Take a raw bank customer churn dataset and build a full sklearn preprocessing pipeline with feature engineering, encoding, and scaling, so the output is ready for modelling.

sklearn · Pipeline · Features

✦ Covers: Feature Engineering · Wrangling · sklearn

Intermediate 🌍

Global Development Dashboard

Merge World Bank GDP, education, and health datasets. Build a Plotly Dash or Streamlit dashboard with interactive scatter plots and a year slider.

Plotly · Streamlit · Merging

✦ Covers: Visualization · Wrangling · EDA

03

Machine Learning

Supervised · Unsupervised · Evaluation · Scikit-learn

Topic Checklists

🎯

Supervised Learning

Linear Regression (theory + implementation) [core]
Ridge, Lasso, ElasticNet regularization [core]
Logistic Regression (binary + multiclass) [core]
Decision Trees (splitting, pruning) [core]
Random Forest (bagging, feature importance) [core]
Gradient Boosting (GBM, XGBoost, LightGBM) [core]
Support Vector Machines (SVM) [advanced]
K-Nearest Neighbors (KNN) [core]
Naive Bayes classifiers [core]

🔨 Confidence Project — Credit Card Fraud Detector

1. Download the Kaggle Credit Card Fraud dataset (imbalanced) [intermediate]
2. Train Logistic Regression, Random Forest, and XGBoost models
3. Handle class imbalance with SMOTE, evaluate using Precision-Recall AUC
4. Explain which model you'd deploy and why in a short business summary

📚 Resources

Scikit-learn docs · Hands-On ML (Géron) · StatQuest ML playlist

🌀

Unsupervised Learning

K-Means clustering (elbow method, inertia) [core]
Hierarchical clustering (dendrograms) [core]
DBSCAN (density-based, noise handling) [advanced]
PCA: theory, variance explained, scree plot [core]
t-SNE and UMAP for visualization [advanced]
Anomaly detection (Isolation Forest, LOF) [advanced]

🔨 Confidence Project — Customer Segmentation for Retail

1. Use the Online Retail dataset from the UCI Machine Learning Repository [intermediate]
2. Compute RFM (Recency, Frequency, Monetary) features per customer
3. Apply K-Means, find optimal K with elbow + silhouette score
4. Visualize clusters with a PCA 2D scatter plot and name each segment
5. Propose a marketing strategy for each customer segment

📏

Model Evaluation

Train/validation/test split [core]
Cross-validation (k-fold, stratified) [core]
Bias-variance tradeoff [concept]
Overfitting & underfitting (learning curves) [core]
Regression metrics: MAE, MSE, RMSE, R² [core]
Classification metrics: Accuracy, Precision, Recall, F1 [core]
Confusion matrix [core]
ROC curve and AUC-ROC [core]
Hyperparameter tuning (GridSearch, Optuna) [core]
Model interpretability (SHAP, LIME) [advanced]
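
As a worked example of the classification metrics and confusion matrix items, the sketch below computes them by hand for a binary problem using only the standard library. The labels are made up for illustration, and returning 0.0 when a denominator is zero is one common convention, assumed here.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    return tp, fp, fn, tn

def classification_metrics(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0   # of predicted positives, how many are right
    recall = tp / (tp + fn) if tp + fn else 0.0      # of actual positives, how many were found
    f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]
print(confusion_counts(y_true, y_pred))
print(classification_metrics(y_true, y_pred))
```

Here accuracy looks decent while recall shows that half of the positives were missed, which is why accuracy alone can mislead on imbalanced data.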

🔨 Confidence Project — Model Horse Race

1. Pick any classification dataset (Heart Disease, Churn, Spam) [intermediate]
2. Train 5 different models with the same train/test split
3. Plot ROC curves for all models on the same graph
4. Run Optuna hyperparameter tuning on the best model
5. Use SHAP to explain top 5 feature contributions on the winning model

🔩

ML Pipelines

sklearn Pipeline object [tool]
ColumnTransformer for mixed data [tool]
Custom Transformers (BaseEstimator) [advanced]
Save/load models with joblib, pickle [tool]
Experiment tracking basics [concept]

🔨 Confidence Project — Production-Ready Titanic Classifier

1. Build a full sklearn Pipeline with preprocessing + RandomForest [intermediate]
2. Use ColumnTransformer to handle numeric and categorical columns separately
3. Save the entire pipeline as a .pkl file with joblib
4. Load it and predict on 5 new manually-crafted passenger examples

🏆 Stage 3 Capstone Projects

Real-world ML projects that belong in every DS portfolio

Intermediate 🏠

House Price Prediction

Kaggle's classic competition. Build end-to-end regression with feature engineering, XGBoost, and hyperparameter tuning. Aim for top 20% on the leaderboard.

Regression · XGBoost · Kaggle

✦ Covers: Full supervised ML pipeline

Intermediate 📧

Spam Email Classifier

Train a Naive Bayes + Logistic Regression spam detector on the Enron email dataset. Compare models and deploy as a simple web app with Streamlit.

NLP basics · Classification · Streamlit

✦ Covers: Classification · Text basics · Deployment

Advanced 🏥

Hospital Readmission Risk Model

Use the Diabetes 130-US hospitals dataset. Build a model to predict 30-day readmission. Present findings in a mock hospital board presentation format.

Healthcare · SHAP · Imbalanced

✦ Covers: Full pipeline · Ethics · Explainability

04

Advanced AI

Deep Learning · Vision · NLP · Time Series

Topic Checklists

🧠

Deep Learning Basics

Perceptrons & Neural Networks [core]
Activation functions (ReLU, Sigmoid, Softmax) [core]
Forward and backward propagation [concept]
Optimizers (Adam, SGD, RMSprop) [core]
Loss functions (MSE, Cross-Entropy) [core]
PyTorch or TensorFlow basics [tool]
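
The three activation functions named above can be written in a few lines of plain Python. This is a scalar, standard-library sketch for intuition only; frameworks such as PyTorch or TensorFlow provide vectorized versions. The branch in `sigmoid` and the max-subtraction in `softmax` are the usual tricks for avoiding overflow.

```python
import math

def relu(x):
    """Pass positive inputs through unchanged, clamp negatives to zero."""
    return max(0.0, x)

def sigmoid(x):
    """Squash a real number into (0, 1), written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def softmax(logits):
    """Turn a list of scores into probabilities that sum to 1."""
    m = max(logits)                      # subtracting the max leaves the result unchanged
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

print(relu(-2.0), relu(3.0))
print(sigmoid(0.0))
print(softmax([2.0, 1.0, 0.1]))
```

Sigmoid is the two-class special case of softmax, which is why binary classifiers typically end in a sigmoid and multiclass ones in a softmax.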

👁️

Computer Vision (CNNs)

Convolutional Neural Networks (CNNs) principles [core]
Image Augmentation & Preprocessing [core]
Transfer Learning (ResNet, VGG, MobileNet) [advanced]
Object Detection basics (YOLO, Faster R-CNN) [advanced]
Keras or Torchvision practical application [tool]

💬

Natural Language Processing

Text cleaning & Tokenization [core]
TF-IDF and CountVectorizer [core]
Word Embeddings (Word2Vec, GloVe) [concept]
RNNs, LSTMs, and GRUs [core]
Transformers and Attention basics [advanced]
Hugging Face Transformers library [tool]
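
For the TF-IDF item, a from-scratch sketch can make the weighting concrete. It assumes the textbook form, term frequency times log(N / document frequency), with whitespace tokenization. sklearn's `TfidfVectorizer` uses a smoothed IDF and L2 normalization by default, so its numbers will differ from these. The three toy documents are made up.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

docs = ["the cat sat", "the dog sat", "the cat ran"]
for doc, w in zip(docs, tf_idf(docs)):
    print(doc, "->", {t: round(v, 3) for t, v in w.items()})
```

A word that appears in every document ("the") gets weight zero, while a word unique to one document ("dog") gets the highest weight there.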

📈

Time Series Analysis

Trend, Seasonality, Noise decomposition [core]
Stationarity and Dickey-Fuller test [core]
ARIMA and SARIMA models [advanced]
Prophet (Facebook) [tool]
LSTM for forecasting [advanced]

05

Gen AI & LLMs

Prompting · Embeddings · Vector DBs · RAG

Topic Checklists

✍️

Prompt Engineering

Zero-shot, Few-shot prompting [core]
Chain-of-Thought (CoT) prompting [core]
Prompt Optimization [advanced]
OpenAI API / Anthropic API [tool]

📉

Embeddings & Vector DBs

Understanding Dense vs Sparse Vectors [core]
Cosine Similarity & Distance Metrics [core]
Vector DBs: Pinecone, Chroma, FAISS [tool]
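
Cosine similarity, and the nearest-neighbor lookup that vector databases perform at scale, can be sketched in plain Python. The three-dimensional "embeddings" below are invented toy numbers, not output from any real model, and `top_k` is a hypothetical brute-force helper; systems like FAISS use approximate indexes instead of scanning every vector.

```python
import math

def cosine_similarity(a, b):
    """cos(theta) = (a . b) / (|a| * |b|); assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query, vectors, k):
    """Brute-force search: score every stored vector, return the k best (name, score) pairs."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Toy vectors invented for illustration
vectors = {
    "cat":    [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.2, 0.1],
    "car":    [0.1, 0.9, 0.3],
}
query = [1.0, 0.0, 0.0]
for name, score in top_k(query, vectors, 2):
    print(f"{name}: {score:.3f}")
```

Cosine similarity ignores vector length and compares direction only, so a vector and any positive multiple of it score 1.0.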

🔍

Retrieval-Augmented Generation (RAG)

RAG Architecture Overview [core]
Document Chunking Strategies [core]
LangChain / LlamaIndex integration [tool]
Evaluation Metrics for RAG [advanced]
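
One of the simplest chunking strategies is a fixed-size window with overlap, sketched below in plain Python. It splits on whitespace and counts words, which is an assumption made for brevity; real pipelines often count tokens or split on sentence and section boundaries. The function name and default sizes are hypothetical.

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into word windows of chunk_size, each sharing `overlap` words with the previous one."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks

sample = " ".join(f"w{i}" for i in range(10))
for chunk in chunk_text(sample, chunk_size=4, overlap=1):
    print(chunk)
```

The overlap means a sentence that straddles a chunk boundary still appears intact in at least one chunk, at the cost of storing some words twice.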

🛠️

Fine-Tuning

PEFT (Parameter-Efficient Fine-Tuning) [advanced]
LoRA & QLoRA methodologies [advanced]
Preparing instruction-tuning datasets [core]
Hugging Face AutoTrain [tool]

06

MLOps & Deployment

Serving · Tracking · Docker · Cloud

Topic Checklists

🚀

Model Deployment

Flask / FastAPI for ML serving [tool]
Streamlit / Gradio for web apps [tool]
Docker containerization basics [core]
REST API endpoints [core]

📦

Experiment Tracking

MLflow basics (tracking, models, registry) [tool]
Weights & Biases (W&B) [tool]
Versioning data & models (DVC) [advanced]

☁️

Cloud Platforms

AWS (SageMaker, S3, EC2) basics [tool]
GCP (Vertex AI, BigQuery) basics [tool]
Serverless deployments (AWS Lambda) [advanced]

07

Career & Portfolio

Projects · Branding · Job Search

Topic Checklists

📁

Portfolio Building

Create a personal website / GitHub Pages [core]
Host at least 3 end-to-end projects [core]
Write excellent READMEs (context, data, metrics) [core]
Deploy at least one project live [project]

🗣️

Soft Skills

Data Storytelling & Presentation [core]
Translating business problems to ML [core]
Explaining complex models to non-technical folks [core]
💼

Job Search & Interviewing

Optimize LinkedIn profile & Resume [core]
SQL interview practice (LeetCode, HackerRank) [core]
Python algorithmic interview basics [core]
Prepare for ML system design interviews [advanced]
Behavioral interviews (STAR method) [core]