Handle class imbalance (SMOTE, oversampling) · advanced
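A minimal sketch of the simpler of the two techniques named above, random oversampling, using only the standard library. The helper `random_oversample` and the toy data are hypothetical, for illustration only. This is not SMOTE: SMOTE synthesizes new minority points by interpolating between neighbors (available as `SMOTE` in the imbalanced-learn package), whereas this sketch only duplicates existing rows.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate minority-class rows at random until every class
    has as many rows as the largest class."""
    rng = random.Random(seed)

    # Group the rows by their class label.
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)

    target = max(len(group) for group in by_class.values())

    out_rows, out_labels = [], []
    for label, group in by_class.items():
        # Sample with replacement to fill the gap up to the target size.
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for row in group + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


# Toy data (made up): 8 negatives and 2 positives.
X = [[i] for i in range(10)]
y = [0] * 8 + [1] * 2

X_bal, y_bal = random_oversample(X, y)
balanced_counts = Counter(y_bal)
```

Resample only the training split; oversampling before the train/test split lets duplicated rows leak into the test set and inflates the score.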
🔨 Confidence Project — House Price Feature Engineering
1. Load the Ames Housing dataset from Kaggle · intermediate
2. Engineer 10+ new features (age of house, total rooms, price/sqft ratio)
3. Compare a simple model with vs without your engineered features
4. Plot feature importances: which features matter most?
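Step 2 above might start like the sketch below, using pandas. The column names (`YrSold`, `YearBuilt`, `GrLivArea`, `TotalBsmtSF`, `FullBath`, `HalfBath`, `TotRmsAbvGrd`, `SalePrice`) are assumed to match the Kaggle copy of the dataset, and the two rows are made-up values. Note that a price per square foot ratio is computed from the target itself, so this sketch keeps it for exploration only and drops it from the model inputs.

```python
import pandas as pd


def add_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with a few engineered columns added."""
    out = df.copy()
    out["HouseAge"] = out["YrSold"] - out["YearBuilt"]
    out["TotalSF"] = out["GrLivArea"] + out["TotalBsmtSF"]
    out["TotalBath"] = out["FullBath"] + 0.5 * out["HalfBath"]
    out["SFPerRoom"] = out["GrLivArea"] / out["TotRmsAbvGrd"]
    return out


# Two made-up rows standing in for the real CSV.
toy = pd.DataFrame({
    "YrSold": [2010, 2008],
    "YearBuilt": [2000, 1978],
    "GrLivArea": [1500, 2000],
    "TotalBsmtSF": [500, 1000],
    "FullBath": [2, 1],
    "HalfBath": [1, 0],
    "TotRmsAbvGrd": [6, 8],
    "SalePrice": [200000, 250000],
})

engineered = add_features(toy)

# Exploration only: this ratio uses the target, so it must not be a model input.
engineered["PricePerSF"] = engineered["SalePrice"] / engineered["GrLivArea"]

# Features that are safe to feed to a model that predicts SalePrice.
model_inputs = engineered.drop(columns=["SalePrice", "PricePerSF"])
```

For step 3, the same model can be fitted once on the original columns and once on `model_inputs`, with both scored by cross-validation.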
🏆 Stage 2 Capstone Projects
Full data pipeline projects to add to your portfolio
Beginner 🏡
Airbnb NYC Data Story
Download the NYC Airbnb open dataset, clean it, and explore pricing patterns by neighborhood. Then build a five-chart data story and share it on GitHub.
Pandas · Seaborn · EDA
✦ Covers: Wrangling · EDA · Visualization
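One possible first cut at "pricing patterns by neighborhood" is a grouped median, sketched here with pandas. The column names `neighbourhood_group` and `price` are assumptions about the dataset's schema, and the five rows are made-up stand-ins for the real file.

```python
import pandas as pd

# Made-up stand-in rows; the real project would use pd.read_csv on the download.
listings = pd.DataFrame({
    "neighbourhood_group": ["Manhattan", "Manhattan", "Brooklyn", "Brooklyn", "Queens"],
    "price": [200, 0, 120, 80, 60],
})

# Cleaning step: a nightly price of 0 is treated as a data error and dropped.
cleaned = listings[listings["price"] > 0]

# Median is less sensitive than the mean to a few very expensive listings.
median_price = (
    cleaned.groupby("neighbourhood_group")["price"]
    .median()
    .sort_values(ascending=False)
)
```

The resulting Series can be passed straight to a bar chart, for example with Seaborn's `barplot`.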
Intermediate 🏦
Bank Churn Preprocessing Pipeline
Take a raw bank customer churn dataset and build a full sklearn preprocessing pipeline with feature engineering, encoding, and scaling, so that the output is ready for modelling.
1. Build a full sklearn Pipeline with preprocessing + RandomForest · intermediate
2. Use ColumnTransformer to handle numeric and categorical columns separately
3. Save the entire pipeline as a .pkl file with joblib
4. Load it and predict on 5 new manually-crafted passenger examples
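The four steps above can be sketched as follows. The column names (`age`, `fare`, `sex`, `survived`), the training rows, and the five new passengers are all hypothetical, chosen only to keep the example self-contained; the scikit-learn and joblib calls are the standard ones.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_cols = ["age", "fare"]
categorical_cols = ["sex"]

# Made-up training rows; None becomes NaN and is handled by the imputer.
train = pd.DataFrame({
    "age": [22, 38, 26, 35, None, 54, 4, 27],
    "fare": [7.0, 71.0, 8.0, 53.0, 8.5, 52.0, 21.0, 11.0],
    "sex": ["male", "female", "female", "female", "male", "male", "male", "female"],
    "survived": [0, 1, 1, 1, 0, 0, 0, 1],
})

# Step 2: numeric and categorical columns get separate preprocessing.
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
preprocess = ColumnTransformer([
    ("num", numeric_steps, numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

# Step 1: preprocessing and the model live in one Pipeline object.
model = Pipeline([
    ("preprocess", preprocess),
    ("forest", RandomForestClassifier(n_estimators=50, random_state=0)),
])
model.fit(train[numeric_cols + categorical_cols], train["survived"])

# Step 3: persist the whole fitted pipeline, not just the forest.
path = os.path.join(tempfile.mkdtemp(), "pipeline.pkl")
joblib.dump(model, path)

# Step 4: reload and predict on five hand-written examples.
restored = joblib.load(path)
new_passengers = pd.DataFrame({
    "age": [30, 5, 60, 41, 19],
    "fare": [10.0, 25.0, 80.0, 15.0, 7.5],
    "sex": ["female", "male", "male", "female", "male"],
})
predictions = restored.predict(new_passengers)
```

Saving the full pipeline means the new examples go through exactly the same imputation, scaling, and encoding as the training data, with no preprocessing code to duplicate at prediction time.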
🏆 Stage 3 Capstone Projects
Real-world ML projects that belong in every DS portfolio
Intermediate 🏠
House Price Prediction
Kaggle's classic competition. Build end-to-end regression with feature engineering, XGBoost, and hyperparameter tuning. Aim for top 20% on the leaderboard.
Regression · XGBoost · Kaggle
✦ Covers: Full supervised ML pipeline
Intermediate 📧
Spam Email Classifier
Train Naive Bayes and Logistic Regression spam detectors on the Enron email dataset. Compare the two models and deploy one as a simple web app with Streamlit.
NLP basics · Classification · Streamlit
✦ Covers: Classification · Text basics · Deployment
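A rough sketch of the model comparison in the spam project, assuming TF-IDF features from scikit-learn. The eight messages are invented placeholders, not Enron data, and far too few to say anything about real accuracy; the point is only the shape of the code.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented placeholder messages; 1 = spam, 0 = not spam.
texts = [
    "win a free prize today",
    "free money claim your prize now",
    "cheap loans win big cash",
    "claim your free cash reward",
    "meeting moved to friday afternoon",
    "please review the attached report",
    "lunch tomorrow with the project team",
    "notes from the budget meeting",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Each model gets its own vectorizer so the pipelines are independent.
models = {
    "naive_bayes": make_pipeline(TfidfVectorizer(), MultinomialNB()),
    "log_reg": make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000)),
}

for pipeline in models.values():
    pipeline.fit(texts, labels)

# Score one unseen message with both models.
sample = ["win a free prize now"]
predictions = {name: int(p.predict(sample)[0]) for name, p in models.items()}
spam_probability = {name: float(p.predict_proba(sample)[0][1]) for name, p in models.items()}
```

On the real dataset, hold out a test split and compare precision and recall rather than accuracy alone, since spam and legitimate mail are rarely balanced.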
Advanced 🏥
Hospital Readmission Risk Model
Use the Diabetes 130-US hospitals dataset. Build a model to predict 30-day readmission. Present findings in a mock hospital board presentation format.