SKY Data Scientist
// complete learning tracker + projects

Data Science
Roadmap

Master every topic · Build real projects · Gain confidence

01

Foundations

Mathematics · Statistics · Programming · Tools

📐

Linear Algebra

  • Scalars, vectors, matrices, tensors [core]
  • Matrix addition, subtraction, multiplication [core]
  • Transpose, inverse, identity matrix [core]
  • Dot product and cross product [concept]
  • Linear transformations [concept]
  • Eigenvalues and eigenvectors [core]
  • Singular Value Decomposition (SVD) [advanced]
  • Norms (L1, L2) [core]
  • Implement matrix ops with NumPy [tool]
🔨 Confidence Project — Image Compression with SVD
1. Load a grayscale image as a NumPy matrix [beginner]
2. Apply SVD using numpy.linalg.svd
3. Reconstruct the image using only the top-k singular values
4. Plot compression ratio vs image quality side-by-side
5. Write a 1-page summary of what the singular values represent visually
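The steps above can be sketched roughly as follows. This is a minimal illustration, not a finished project: the 64x64 synthetic array stands in for a real grayscale photo, and the helper names `compress_with_svd` and `compression_ratio` are made up for this example.

```python
import numpy as np


def compress_with_svd(image, k):
    """Return the rank-k approximation of a 2D array from its top-k singular values."""
    U, s, Vt = np.linalg.svd(image, full_matrices=False)
    # Scale the first k columns of U by the singular values, then project back.
    return (U[:, :k] * s[:k]) @ Vt[:k, :]


def compression_ratio(shape, k):
    """Numbers stored by the rank-k factors divided by numbers in the original matrix."""
    m, n = shape
    return k * (m + n + 1) / (m * n)


# Synthetic 64x64 "image" (smooth gradient plus noise); an assumption for this sketch.
rng = np.random.default_rng(0)
ramp = np.linspace(0.0, 1.0, 64)
image = np.outer(ramp, ramp) + 0.05 * rng.standard_normal((64, 64))

for k in (1, 5, 20, 64):
    approx = compress_with_svd(image, k)
    rel_error = np.linalg.norm(image - approx) / np.linalg.norm(image)
    print(f"k={k:2d}  ratio={compression_ratio(image.shape, k):.2f}  relative error={rel_error:.4f}")
```

Keeping all 64 singular values reproduces the input up to floating-point error, while small k trades accuracy for storage; plotting `rel_error` against `compression_ratio` gives the comparison asked for in step 4.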
📚 Resources
3Blue1Brown: Essence of LA · Gilbert Strang MIT OCW · Khan Academy LA

Calculus

  • Functions, limits, continuity [core]
  • Derivatives and differentiation rules [core]
  • Partial derivatives [core]
  • Chain rule [core]
  • Gradient, Jacobian, Hessian [advanced]
  • Gradient descent intuition [concept]
  • Integrals (basic, definite) [core]
🔨 Confidence Project — Gradient Descent Visualizer
1. Pick a 2D function (e.g. x² + y²) and compute its gradient manually [beginner]
2. Implement gradient descent from scratch in Python (no ML libraries)
3. Animate the descent path on a contour plot using Matplotlib
4. Experiment with different learning rates and observe overshooting vs. convergence
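A minimal sketch of steps 1, 2, and 4, assuming the example function f(x, y) = x² + y² from step 1. The starting point, step count, and learning rates are arbitrary choices for illustration; the animation from step 3 is left out.

```python
def gradient(x, y):
    """Gradient of f(x, y) = x**2 + y**2, derived by hand: (2x, 2y)."""
    return 2.0 * x, 2.0 * y


def gradient_descent(start, learning_rate, steps):
    """Return the list of (x, y) points visited, including the starting point."""
    x, y = start
    path = [(x, y)]
    for _ in range(steps):
        gx, gy = gradient(x, y)
        # Move against the gradient, scaled by the learning rate.
        x, y = x - learning_rate * gx, y - learning_rate * gy
        path.append((x, y))
    return path


for lr in (0.1, 0.9, 1.1):
    x, y = gradient_descent((3.0, -2.0), lr, 25)[-1]
    print(f"lr={lr}: final point=({x:.4g}, {y:.4g})")
```

For this particular function each update multiplies the point by (1 - 2 * learning_rate), so rates below 0.5 shrink steadily, rates between 0.5 and 1.0 overshoot and oscillate while still converging, and rates above 1.0 diverge. The returned `path` is what a contour-plot animation would draw.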
📚 Resources
3Blue1Brown: Calculus series · Khan Academy Calculus
📊

Probability & Statistics

  • Mean, median, mode, variance, std dev [core]
  • Probability rules, Bayes' theorem [core]
  • Probability distributions (Normal, Binomial, Poisson) [core]
  • Central Limit Theorem [concept]
  • Hypothesis testing (t-test, chi-square, ANOVA) [core]
  • p-values, confidence intervals [core]
  • Correlation vs causation [concept]
  • A/B testing design [core]
  • Bayesian vs Frequentist approaches [advanced]
  • Use scipy.stats for tests [tool]
🔨 Confidence Project — A/B Test Simulator
1. Simulate a website button color A/B test with fake conversion data [beginner]
2. Run a two-sample t-test and chi-square test using scipy.stats
3. Visualize distributions of both groups with confidence intervals
4. Write a plain-English business conclusion from the p-value
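To see what the chi-square test in step 2 computes, here is a standard-library sketch that simulates two variants and works out the Pearson statistic by hand. The conversion rates (10% and 12%), sample size, and function names are assumptions for illustration. In the project itself, `scipy.stats.chi2_contingency` on the same 2x2 table (with `correction=False`) should give the same statistic and p-value.

```python
import math
import random


def simulate_conversions(n_visitors, true_rate, rng):
    """Count conversions among n_visitors, each converting independently with true_rate."""
    return sum(rng.random() < true_rate for _ in range(n_visitors))


def chi_square_2x2(conv_a, n_a, conv_b, n_b):
    """Pearson chi-square test (no continuity correction) on a 2x2 conversion table."""
    observed = [[conv_a, n_a - conv_a], [conv_b, n_b - conv_b]]
    total = n_a + n_b
    row_totals = [n_a, n_b]
    col_totals = [conv_a + conv_b, total - conv_a - conv_b]
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed[i][j] - expected) ** 2 / expected
    # With 1 degree of freedom, P(chi2 > stat) equals erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


rng = random.Random(42)
n = 5000
conv_a = simulate_conversions(n, 0.10, rng)  # variant A, assumed 10% true rate
conv_b = simulate_conversions(n, 0.12, rng)  # variant B, assumed 12% true rate
stat, p = chi_square_2x2(conv_a, n, conv_b, n)
print(f"A: {conv_a}/{n}  B: {conv_b}/{n}  chi2={stat:.3f}  p={p:.4f}")
```

The sketch assumes every expected cell count is positive; a table with zero conversions overall would need a guard.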
📚 Resources
StatQuest (YouTube) · Think Stats (book) · Khan Academy Stats
🐍

Python Programming

  • Variables, data types, operators [core]
  • Control flow: if/else, loops [core]
  • Functions, lambda, *args, **kwargs [core]
  • List, dict, set comprehensions [core]
  • OOP: classes, inheritance, methods [concept]
  • File I/O, error handling, context managers [core]
  • NumPy: arrays, broadcasting, operations [tool]
  • Pandas: DataFrames, Series, indexing [tool]
  • Jupyter Notebooks workflow [tool]
  • Virtual environments, pip, conda [tool]
🔨 Confidence Project — Personal Finance Tracker
1. Create a CSV of 100 fake monthly expenses across categories [beginner]
2. Load with Pandas, clean data types, handle missing values
3. Group by category and month, compute totals and trends
4. Plot monthly spending trends and category breakdowns with Matplotlib
5. Export summary report to a new CSV file
📚 Resources
Python.org docs · Automate the Boring Stuff · Python for Data Analysis
🗄️

SQL & Databases

  • SELECT, WHERE, ORDER BY, LIMIT [core]
  • JOINs: INNER, LEFT, RIGHT, FULL [core]
  • GROUP BY, HAVING, aggregate functions [core]
  • Subqueries and CTEs (WITH clause) [core]
  • Window functions (ROW_NUMBER, RANK, LAG) [advanced]
  • CASE statements [core]
  • Indexing and query optimization basics [concept]
  • Connect Python to SQL (SQLAlchemy, sqlite3) [tool]
🔨 Confidence Project — E-Commerce Sales Analysis
1. Download the Northwind or Chinook sample database (SQLite) [beginner]
2. Answer 10 business questions with SQL (top products, customer LTV, etc.)
3. Use CTEs and window functions to compute monthly revenue growth
4. Connect to Python with sqlite3 and visualize results with Pandas + Seaborn
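As a rough illustration of steps 3 and 4, the sketch below computes month-over-month revenue growth with a CTE and `LAG`, run through `sqlite3`. The `orders` table and its rows are invented for this example and do not match the Northwind or Chinook schemas. Window functions need SQLite 3.25 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical toy table; the real sample databases use different schemas.
conn.execute("CREATE TABLE orders (order_date TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [
        ("2024-01-05", 100.0),
        ("2024-01-20", 50.0),
        ("2024-02-11", 180.0),
        ("2024-03-02", 90.0),
        ("2024-03-15", 90.0),
    ],
)

QUERY = """
WITH monthly AS (
    SELECT strftime('%Y-%m', order_date) AS month,
           SUM(amount) AS revenue
    FROM orders
    GROUP BY month
)
SELECT month,
       revenue,
       ROUND(100.0 * (revenue - LAG(revenue) OVER (ORDER BY month))
             / LAG(revenue) OVER (ORDER BY month), 1) AS growth_pct
FROM monthly
ORDER BY month
"""

rows = conn.execute(QUERY).fetchall()
for month, revenue, growth_pct in rows:
    # growth_pct is None for the first month because LAG has no previous row.
    print(month, revenue, growth_pct)
```

The same query string can be passed to `pandas.read_sql_query` with the open connection to get a DataFrame for plotting.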
📚 Resources
SQLZoo · Mode Analytics SQL · LeetCode SQL
🔧

Git & Dev Tools

  • git init, clone, add, commit, push [core]
  • Branches: create, merge, rebase [core]
  • Pull requests and code reviews [core]
  • .gitignore and best practices [tool]
  • GitHub: repos, issues, README [tool]
  • Command line basics (bash) [core]
🔨 Confidence Project — Portfolio GitHub Setup
1. Create a GitHub account and configure SSH keys [beginner]
2. Push your Python finance tracker project with a proper README
3. Practice branching: create a feature branch, make changes, merge via PR
4. Add a requirements.txt and .gitignore to your repo

🏆 Stage 1 Capstone Projects

Build these to prove your foundations are solid
Beginner 🎲
Monte Carlo Pi Estimator
Use random sampling to estimate π. Visualize how accuracy improves with more samples. Demonstrates probability, Python, and NumPy together.
NumPy · Matplotlib · Probability
✦ Covers: Probability · Python · Visualization
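One possible starting point for the estimator, written with only the standard library so the sampling logic is easy to follow. The function name and sample sizes are illustrative; a NumPy version would vectorize the loop.

```python
import math
import random


def estimate_pi(n_samples, seed=0):
    """Estimate pi from the fraction of random points in the unit square
    that fall inside the quarter circle of radius 1."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(n_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    # Quarter-circle area is pi/4, so scale the hit fraction by 4.
    return 4.0 * inside / n_samples


for n in (100, 10_000, 200_000):
    estimate = estimate_pi(n)
    print(f"n={n:>7}  estimate={estimate:.5f}  abs error={abs(estimate - math.pi):.5f}")
```

The error shrinks roughly in proportion to 1/sqrt(n), which is the trend the accuracy plot should show.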
Beginner 📈
Stock Price Statistics Dashboard
Download 1 year of stock data using yfinance. Compute rolling mean, std deviation, and detect anomalies. Plot interactive charts.
Pandas · yfinance · Statistics
✦ Covers: Stats · Pandas · Time Series basics
Intermediate 🧮
Linear Regression from Scratch
Implement OLS linear regression using only NumPy matrix operations, without sklearn. Verify that your coefficients match sklearn's output to within floating-point tolerance.
NumPy · Linear Algebra · Calculus
✦ Covers: Linear Algebra · Calculus · Python
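A minimal sketch of the core computation, assuming the normal-equations form of OLS. The synthetic data and the name `fit_ols` are illustrative. The check here uses `np.linalg.lstsq` as the reference so the example needs only NumPy; the capstone would compare against sklearn's `LinearRegression` in the same way.

```python
import numpy as np


def fit_ols(X, y):
    """Solve the normal equations (X^T X) beta = X^T y with an intercept column prepended.

    Returns beta, where beta[0] is the intercept and beta[1:] are the slopes.
    """
    X_design = np.column_stack([np.ones(len(X)), X])
    return np.linalg.solve(X_design.T @ X_design, X_design.T @ y)


# Synthetic data with known coefficients (intercept 1.5, slopes 2.0 and -3.0).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 1.5 + 2.0 * X[:, 0] - 3.0 * X[:, 1] + 0.1 * rng.normal(size=200)

beta = fit_ols(X, y)
reference, *_ = np.linalg.lstsq(np.column_stack([np.ones(len(X)), X]), y, rcond=None)

print("normal equations:", beta)
print("matches lstsq within tolerance:", np.allclose(beta, reference))
```

Comparing with `np.allclose` rather than `==` matters because two solvers rarely agree to the last bit.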
02

Core Data Skills

Wrangling · EDA · Visualization · Feature Engineering

🔄

Data Wrangling

  • Load CSV, JSON, Excel, Parquet files [core]
  • Detect and handle missing values (dropna, fillna, impute) [core]
  • Detect and remove duplicates [core]
  • Outlier detection and treatment [core]
  • Data type conversions [core]
  • String cleaning and regex [tool]
  • Merging, joining, concatenating DataFrames [core]
  • Pivot tables and reshaping [core]
  • GroupBy operations and aggregation [core]
  • Apply, map, applymap functions (applymap is deprecated in favor of DataFrame.map since pandas 2.1) [tool]
🔨 Confidence Project — Messy World Happiness Data Cleaner
1. Download the World Happiness Report CSV (intentionally messy version) [beginner]
2. Identify all data quality issues: missing values, wrong types, duplicates
3. Write a data cleaning pipeline using Pandas and document every decision
4. Compare before/after stats to show that cleaning improved data quality
🔍

Exploratory Data Analysis

  • Univariate analysis (distributions, histograms) [core]
  • Bivariate analysis (scatterplots, correlation) [core]
  • Correlation matrix and heatmaps [core]
  • Box plots, violin plots [core]
  • Identify patterns, anomalies, trends [concept]
  • Summary statistics: describe(), info() [tool]
  • Pair plots (seaborn pairplot) [tool]
🔨 Confidence Project — Titanic Survivor EDA Report
1. Load the Titanic dataset from Kaggle or seaborn datasets [beginner]
2. Perform full EDA: age distribution, survival by gender/class, fare outliers
3. Generate a 5-insight summary with one visualization per insight
4. Write a 300-word narrative conclusion as if presenting to a business team
🎨

Data Visualization

  • Matplotlib: figures, axes, subplots [tool]
  • Seaborn: statistical plots [tool]
  • Plotly: interactive charts [tool]
  • Chart selection (when to use what chart) [concept]
  • Color theory and accessibility in charts [concept]
  • Tableau or Power BI basics [tool]
  • Dashboard design principles [concept]
🔨 Confidence Project — COVID-19 Interactive Dashboard
1. Download the Our World in Data COVID CSV dataset [intermediate]
2. Build interactive line charts for cases/deaths across 5 countries using Plotly
3. Add a choropleth world map showing vaccination rates
4. Wrap in a Streamlit app with country filter dropdowns
⚙️

Feature Engineering

  • Encoding: One-Hot, Label, Target encoding [core]
  • Scaling: StandardScaler, MinMaxScaler, RobustScaler [core]
  • Log/power transformations [core]
  • Binning/discretization [core]
  • Date/time feature extraction [core]
  • Feature interactions and polynomial features [advanced]
  • Feature selection: Filter, Wrapper, Embedded methods [advanced]
  • Handle class imbalance (SMOTE, oversampling) [advanced]
🔨 Confidence Project — House Price Feature Engineering
1. Load the Ames Housing dataset from Kaggle [intermediate]
2. Engineer 10+ new features (age of house, total rooms, total living area); avoid ratios built from the sale price, such as price/sqft, because they leak the target
3. Compare a simple model with vs without your engineered features
4. Plot feature importances to see which features matter most

🏆 Stage 2 Capstone Projects

Full data pipeline projects to add to your portfolio
Beginner 🏡
Airbnb NYC Data Story
Download the NYC Airbnb open dataset. Clean it, explore pricing patterns by neighborhood, create a beautiful 5-chart story and share on GitHub.
Pandas · Seaborn · EDA
✦ Covers: Wrangling · EDA · Visualization
Intermediate 🏦
Bank Churn Preprocessing Pipeline
Take a raw bank customer churn dataset and build a full sklearn preprocessing pipeline with feature engineering, encoding, and scaling, so that the output is ready for modelling.
sklearn · Pipeline · Features
✦ Covers: Feature Engineering · Wrangling · sklearn
Intermediate 🌍
Global Development Dashboard
Merge World Bank GDP, education, and health datasets. Build a Plotly Dash or Streamlit dashboard with interactive scatter plots and a year slider.
Plotly · Streamlit · Merging
✦ Covers: Visualization · Wrangling · EDA
03

Machine Learning

Supervised · Unsupervised · Evaluation · Scikit-learn

🎯

Supervised Learning

  • Linear Regression (theory + implementation) [core]
  • Ridge, Lasso, ElasticNet regularization [core]
  • Logistic Regression (binary + multiclass) [core]
  • Decision Trees (splitting, pruning) [core]
  • Random Forest (bagging, feature importance) [core]
  • Gradient Boosting (GBM, XGBoost, LightGBM) [core]
  • Support Vector Machines (SVM) [advanced]
  • K-Nearest Neighbors (KNN) [core]
  • Naive Bayes classifiers [core]
🔨 Confidence Project — Credit Card Fraud Detector
1. Download the Kaggle Credit Card Fraud dataset (imbalanced) [intermediate]
2. Train Logistic Regression, Random Forest, and XGBoost models
3. Handle class imbalance with SMOTE (applied to the training split only), evaluate using Precision-Recall AUC
4. Explain which model you'd deploy and why in a short business summary
📚 Resources
Scikit-learn docs · Hands-On ML (Géron) · StatQuest ML playlist
🌀

Unsupervised Learning

  • K-Means clustering (elbow method, inertia) [core]
  • Hierarchical clustering (dendrograms) [core]
  • DBSCAN (density-based, noise handling) [advanced]
  • PCA: theory, variance explained, scree plot [core]
  • t-SNE and UMAP for visualization [advanced]
  • Anomaly detection (Isolation Forest, LOF) [advanced]
🔨 Confidence Project — Customer Segmentation for Retail
1. Use the Online Retail dataset from UCI ML Repo [intermediate]
2. Compute RFM (Recency, Frequency, Monetary) features per customer
3. Apply K-Means, find optimal K with elbow + silhouette score
4. Visualize clusters with PCA 2D scatter plot, name each segment
5. Propose a marketing strategy for each customer segment
📏

Model Evaluation

  • Train/validation/test split [core]
  • Cross-validation (k-fold, stratified) [core]
  • Bias-variance tradeoff [concept]
  • Overfitting & underfitting (learning curves) [core]
  • Regression metrics: MAE, MSE, RMSE, R² [core]
  • Classification metrics: Accuracy, Precision, Recall, F1 [core]
  • Confusion matrix [core]
  • ROC curve and AUC-ROC [core]
  • Hyperparameter tuning (GridSearch, Optuna) [core]
  • Model interpretability (SHAP, LIME) [advanced]
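To make the classification metrics in the list above concrete, here is a small hand-rolled sketch that derives precision, recall, and F1 from confusion-matrix counts for a binary problem. The labels are made up; in practice `sklearn.metrics` provides equivalent functions.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 is the positive class."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 1:
            fp += 1
        elif truth == 1 and pred == 0:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def precision_recall_f1(y_true, y_pred):
    """Precision = tp/(tp+fp), recall = tp/(tp+fn), F1 = their harmonic mean."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Illustrative labels: 3 true positives, 1 false positive, 1 false negative, 3 true negatives.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print("tp, fp, fn, tn:", confusion_counts(y_true, y_pred))
print("precision, recall, f1:", precision_recall_f1(y_true, y_pred))
```

Returning 0.0 when a denominator is zero is one convention among several; it is an assumption of this sketch.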
🔨 Confidence Project — Model Horse Race
1. Pick any classification dataset (Heart Disease, Churn, Spam) [intermediate]
2. Train 5 different models with the same train/test split
3. Plot ROC curves for all models on the same graph
4. Run Optuna hyperparameter tuning on the best model
5. Use SHAP to explain top 5 feature contributions on the winning model
🔩

ML Pipelines

  • sklearn Pipeline object [tool]
  • ColumnTransformer for mixed data [tool]
  • Custom Transformers (BaseEstimator) [advanced]
  • Save/load models with joblib, pickle [tool]
  • Experiment tracking basics [concept]
🔨 Confidence Project — Production-Ready Titanic Classifier
1. Build a full sklearn Pipeline with preprocessing + RandomForest [intermediate]
2. Use ColumnTransformer to handle numeric and categorical columns separately
3. Save the entire pipeline as a .pkl file with joblib
4. Load it and predict on 5 new manually-crafted passenger examples
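The four steps above might look roughly like this. It is a sketch under stated assumptions: the eight-row DataFrame is a hand-made stand-in for the Titanic data, the column names and hyperparameters are illustrative, and the file is written to a temporary directory.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny hand-made stand-in for the Titanic data (assumed column names).
train = pd.DataFrame({
    "age": [22.0, 38.0, None, 35.0, 54.0, 2.0, 27.0, 14.0],
    "fare": [7.25, 71.28, 7.92, 53.10, 51.86, 21.08, 11.13, 30.07],
    "sex": ["male", "female", "female", "female", "male", "male", "female", "female"],
    "pclass": ["3", "1", "3", "1", "1", "3", "3", "2"],
    "survived": [0, 1, 1, 1, 0, 0, 1, 1],
})
numeric_cols = ["age", "fare"]
categorical_cols = ["sex", "pclass"]

# Numeric columns: fill missing values, then scale. Categorical columns: one-hot encode.
preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])
model = Pipeline([
    ("preprocess", preprocess),
    ("forest", RandomForestClassifier(n_estimators=50, random_state=0)),
])
model.fit(train[numeric_cols + categorical_cols], train["survived"])

# Persist the whole pipeline (preprocessing + model) and load it back.
path = os.path.join(tempfile.mkdtemp(), "titanic_pipeline.pkl")
joblib.dump(model, path)
restored = joblib.load(path)

new_passengers = pd.DataFrame({
    "age": [30.0, None],
    "fare": [8.05, 80.0],
    "sex": ["male", "female"],
    "pclass": ["3", "1"],
})
print(restored.predict(new_passengers))
```

Saving the full pipeline rather than the bare model means new rows go through the same imputation, scaling, and encoding that were fitted on the training data.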

🏆 Stage 3 Capstone Projects

Real-world ML projects that belong in every DS portfolio
Intermediate 🏠
House Price Prediction
Kaggle's classic competition. Build end-to-end regression with feature engineering, XGBoost, and hyperparameter tuning. Aim for top 20% on the leaderboard.
Regression · XGBoost · Kaggle
✦ Covers: Full supervised ML pipeline
Intermediate 📧
Spam Email Classifier
Train a Naive Bayes + Logistic Regression spam detector on the Enron email dataset. Compare models and deploy as a simple web app with Streamlit.
NLP basics · Classification · Streamlit
✦ Covers: Classification · Text basics · Deployment
Advanced 🏥
Hospital Readmission Risk Model
Use the Diabetes 130-US hospitals dataset. Build a model to predict 30-day readmission. Present findings in a mock hospital board presentation format.
Healthcare · SHAP · Imbalanced
✦ Covers: Full pipeline · Ethics · Explainability
04

Advanced AI

Deep Learning · Vision · NLP · Time Series

🧠

Deep Learning Basics

  • Perceptrons & Neural Networks [core]
  • Activation functions (ReLU, Sigmoid, Softmax) [core]
  • Forward and backward propagation [concept]
  • Optimizers (Adam, SGD, RMSprop) [core]
  • Loss functions (MSE, Cross-Entropy) [core]
  • PyTorch or TensorFlow basics [tool]
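The activation and loss functions named in the list above are short enough to write out by hand before reaching for a framework. The following is a plain-Python sketch for single examples; PyTorch and TensorFlow ship batched, differentiable versions.

```python
import math


def relu(z):
    """ReLU: pass positive inputs through, clamp negatives to zero."""
    return max(0.0, z)


def sigmoid(z):
    """Sigmoid: squash a real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def softmax(logits):
    """Numerically stable softmax: subtract the max logit before exponentiating."""
    largest = max(logits)
    exps = [math.exp(z - largest) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, true_index):
    """Cross-entropy loss for one example: negative log-probability of the true class."""
    return -math.log(probs[true_index])


logits = [2.0, 1.0, 0.1]  # illustrative raw scores for three classes
probs = softmax(logits)
print("probabilities:", [round(p, 3) for p in probs])
print("loss if class 0 is correct:", round(cross_entropy(probs, 0), 3))
print("loss if class 2 is correct:", round(cross_entropy(probs, 2), 3))
```

The loss is smaller when the true class already has the highest probability, which is the signal backpropagation uses to adjust the weights.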
👁️

Computer Vision (CNNs)

  • Convolutional Neural Networks (CNNs) principles [core]
  • Image Augmentation & Preprocessing [core]
  • Transfer Learning (ResNet, VGG, MobileNet) [advanced]
  • Object Detection basics (YOLO, Faster R-CNN) [advanced]
  • Keras or Torchvision practical application [tool]
💬

Natural Language Processing

  • Text cleaning & Tokenization [core]
  • TF-IDF and CountVectorizer [core]
  • Word Embeddings (Word2Vec, GloVe) [concept]
  • RNNs, LSTMs, and GRUs [core]
  • Transformers and Attention basics [advanced]
  • Hugging Face Transformers library [tool]
📈

Time Series Analysis

  • Trend, Seasonality, Noise decomposition [core]
  • Stationarity and Dickey-Fuller test [core]
  • ARIMA and SARIMA models [advanced]
  • Prophet (Facebook) [tool]
  • LSTM for forecasting [advanced]
05

Gen AI & LLMs

Prompting · Embeddings · Vector DBs · RAG

✍️

Prompt Engineering

  • Zero-shot, Few-shot prompting [core]
  • Chain-of-Thought (CoT) prompting [core]
  • Prompt Optimization [advanced]
  • OpenAI API / Anthropic API [tool]
📉

Embeddings & Vector DBs

  • Understanding Dense vs Sparse Vectors [core]
  • Cosine Similarity & Distance Metrics [core]
  • Vector DBs: Pinecone, Chroma, FAISS [tool]
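Cosine similarity, listed above, is the comparison most vector stores use for nearest-neighbor search. The sketch below shows the formula and a brute-force lookup over three made-up 3-dimensional "embeddings"; real embeddings have hundreds of dimensions and come from a model, and the names here are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their Euclidean norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def nearest(query, vectors):
    """Return (name, similarity) for the stored vector most similar to the query."""
    scored = ((name, cosine_similarity(query, vec)) for name, vec in vectors.items())
    return max(scored, key=lambda pair: pair[1])


# Toy vectors invented for illustration only.
store = {
    "cats": [0.9, 0.1, 0.0],
    "dogs": [0.8, 0.3, 0.1],
    "taxes": [0.0, 0.2, 0.9],
}

print(nearest([1.0, 0.0, 0.0], store))
print(nearest([0.0, 0.1, 1.0], store))
```

Libraries such as FAISS replace the brute-force `max` with approximate indexes so the search scales to millions of vectors.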
🔍

Retrieval-Augmented Generation (RAG)

  • RAG Architecture Overview [core]
  • Document Chunking Strategies [core]
  • LangChain / LlamaIndex integration [tool]
  • Evaluation Metrics for RAG [advanced]
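One of the simplest chunking strategies from the list above is a fixed-size window with overlap. The sketch below splits on whitespace-separated words; the function name and parameters are illustrative, and real pipelines often count tokens or split on sentence and section boundaries instead.

```python
def chunk_text(text, chunk_size, overlap):
    """Split text into chunks of up to chunk_size words, where each chunk
    repeats the last `overlap` words of the previous one."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


document = " ".join(f"w{i}" for i in range(10))  # placeholder text: "w0 w1 ... w9"
for chunk in chunk_text(document, chunk_size=4, overlap=1):
    print(chunk)
```

The overlap keeps a sentence that straddles a boundary retrievable from at least one chunk, at the cost of storing some words twice.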
🛠️

Fine-Tuning

  • PEFT (Parameter-Efficient Fine-Tuning) [advanced]
  • LoRA & QLoRA methodologies [advanced]
  • Preparing Instruction Datasets [core]
  • Hugging Face AutoTrain [tool]
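The idea behind LoRA from the list above can be shown with a few lines of NumPy: the pretrained weight W stays frozen and only a low-rank update B·A, scaled by alpha/r, is trained. The layer size, rank, and alpha below are arbitrary illustrative values, and this sketch covers the forward pass only, with no training loop.

```python
import numpy as np

d_out, d_in, r, alpha = 64, 64, 4, 8  # illustrative sizes, not recommendations

rng = np.random.default_rng(0)
W = rng.normal(size=(d_out, d_in))          # frozen pretrained weight
A = rng.normal(scale=0.01, size=(r, d_in))  # trainable, small random init
B = np.zeros((d_out, r))                    # trainable, zero init so the update starts at zero


def lora_forward(x, W, A, B, alpha, r):
    """Compute x W^T + (alpha / r) * x A^T B^T for row-vector inputs x."""
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T


x = rng.normal(size=(2, d_in))
full_params = W.size
lora_params = A.size + B.size

print("output shape:", lora_forward(x, W, A, B, alpha, r).shape)
print(f"trainable parameters: {lora_params} with LoRA vs {full_params} for full fine-tuning")
```

Because B starts at zero, the adapted layer initially behaves exactly like the pretrained one. QLoRA applies the same update on top of a quantized copy of W.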
06

MLOps & Deployment

Serving · Tracking · Docker · Cloud

🚀

Model Deployment

  • Flask / FastAPI for ML serving [tool]
  • Streamlit / Gradio for web apps [tool]
  • Docker containerization basics [core]
  • REST API endpoints [core]
📦

Experiment Tracking

  • MLflow basics (tracking, models, registry) [tool]
  • Weights & Biases (W&B) [tool]
  • Versioning data & models (DVC) [advanced]
☁️

Cloud Platforms

  • AWS (SageMaker, S3, EC2) basics [tool]
  • GCP (Vertex AI, BigQuery) basics [tool]
  • Serverless deployments (AWS Lambda) [advanced]
07

Career & Portfolio

Projects · Branding · Job Search

📁

Portfolio Building

  • Create a personal website / GitHub Pages [core]
  • Host at least 3 end-to-end projects [core]
  • Write excellent READMEs (context, data, metrics) [core]
  • Deploy at least one project live [project]
🗣️

Soft Skills

  • Data Storytelling & Presentation [core]
  • Translating business problems to ML [core]
  • Explaining complex models to non-technical folks [core]
💼

Job Search & Interviewing

  • Optimize LinkedIn profile & Resume [core]
  • SQL interview practice (LeetCode, HackerRank) [core]
  • Python algorithmic interview basics [core]
  • Prepare for ML system design interviews [advanced]
  • Behavioral interviews (STAR method) [core]