Start a professional Python project.
Clone repo to local
Install Extensions
Set up virtual environment
Git Commit
Git pull origin main
Run Tests:

```shell
uv sync --extra dev --extra docs --upgrade
uv cache clean
git add .
uvx ruff check --fix
uvx pre-commit autoupdate
uv run pre-commit run --all-files
git add .
uv run pytest
```
Once the data has been cleaned, the next step is to prepare it for machine learning by splitting it into a training set and a test set.

The dataset is divided into a training set (80%) and a test set (20%). This ensures that model performance is measured on unseen data.
2_train_test_split.py

```python
from pathlib import Path

import pandas as pd
from sklearn.model_selection import train_test_split

# Load the cleaned dataset
df = pd.read_csv("automobile_price_data3_cleaned.csv")

# Separate features and target
X = df.drop(columns=["price"])
y = df["price"]

# 80/20 train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```
| File | Purpose |
| --- | --- |
| train_data.csv | Training dataset (80%) |
| test_data.csv | Test dataset (20%) |
This example demonstrates how to load, inspect, clean, and prepare the Howell.csv dataset for a machine learning project.
The dataset contains information on individuals’ height, weight, age, and gender.
The first step loads the dataset safely and performs basic checks for missing values and duplicates.
1_data_cleaning.py

Purpose:

Loads Howell.csv from the same directory.
Saves a cleaned copy (Howell_cleaned.csv) for later steps.

Key Code Snippet:
```python
from pathlib import Path
import sys

import pandas as pd

# Locate Howell.csv next to this script
data_path: Path = Path(__file__).parent / "Howell.csv"

if not data_path.exists():
    sys.exit(f"❌ ERROR: Could not find Howell.csv at {data_path}")

# The file is semicolon-delimited
howell_df = pd.read_csv(data_path, sep=";")

print("✅ Howell dataset loaded successfully!")
print(howell_df.head())
print("\nMissing values:\n", howell_df.isnull().sum())
print("\nDuplicate rows:", howell_df.duplicated().sum())
print("\nSummary statistics:\n", howell_df.describe())

# Save a cleaned copy for later steps
cleaned_path = Path(__file__).parent / "Howell_cleaned.csv"
howell_df.to_csv(cleaned_path, index=False)
print(f"\n💾 Cleaned file saved to: {cleaned_path}")
```

💾 Step 1.5: Howell_cleaned.csv

After running the cleaning script, a new file called Howell_cleaned.csv will be created in the same directory:
```
C:\Repos\applied-ml-foster\notebooks\example02\Howell_cleaned.csv
```

This cleaned dataset will be used for training and testing in later steps.
This lab uses the cleaned Howell dataset (Howell_cleaned.csv) to create visualizations, explore patterns, and add features for analysis.
Using the cleaned dataset ensures that the visualizations are meaningful and consistent with your workflow.
```python
from pathlib import Path

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

sns.set(style="whitegrid")
```

2️⃣ Load the Cleaned Howell Data

```python
data_path = Path(r"C:\Repos\applied-ml-foster\notebooks\example02\Howell_cleaned.csv")

howell_df = pd.read_csv(data_path, sep=";")  # change sep="," if needed

howell_df.info()
howell_df.head()
```

If the file was saved with commas instead of semicolons, change `sep=";"` to `sep=","`.
Height Distribution
Weight Distribution
Height vs Weight by Gender
Create and Visualize BMI
Create BMI Categories
Visualize BMI Category by Gender
Age vs Height (Adults Only)
Sometimes we want to restrict the data used in plots without removing any rows. This can be done with masking, which tells the plotting function which values to include.

In this example, we focus on adults only (age ≥ 18) and create separate masks for males and females (male = 1, female = 0).
```python
import matplotlib.pyplot as plt
import seaborn as sns

sns.set(style="whitegrid")

# Keep only adults (age >= 18)
howell_adult = howell_df[howell_df["age"] >= 18]

# Boolean masks for each gender
male_mask = howell_adult["male"] == 1
female_mask = howell_adult["male"] == 0

# Adult height distribution by gender
plt.figure(figsize=(8, 5))
sns.histplot(x=howell_adult["height"][male_mask], kde=True, color="blue", label="Male", alpha=0.6)
sns.histplot(x=howell_adult["height"][female_mask], kde=True, color="pink", label="Female", alpha=0.6)
plt.title("Adult Height Distribution by Gender")
plt.xlabel("Height (cm)")
plt.ylabel("Count")
plt.legend()
plt.show()

# Adult weight distribution by gender
plt.figure(figsize=(8, 5))
sns.histplot(x=howell_adult["weight"][male_mask], kde=True, color="blue", label="Male", alpha=0.6)
sns.histplot(x=howell_adult["weight"][female_mask], kde=True, color="pink", label="Female", alpha=0.6)
plt.title("Adult Weight Distribution by Gender")
plt.xlabel("Weight (kg)")
plt.ylabel("Count")
plt.legend()
plt.show()
```
Lab: Stratified Train/Test Split with Howell Dataset
This lab demonstrates how to prepare the Howell cleaned dataset for machine learning, including:
Calculating BMI and adding categorical features
Filtering adult individuals
Performing a stratified train/test split
Saving the resulting datasets to CSV
Setup
Imports

```python
from pathlib import Path

import pandas as pd
from sklearn.model_selection import train_test_split
```
Load the Cleaned Howell Data
```python
data_path = Path(r"C:\Repos\applied-ml-foster\notebooks\example02\Howell_cleaned.csv")

howell_df = pd.read_csv(data_path, sep=",")  # this CSV uses commas

howell_df.info()
howell_df.head()
```
```python
# BMI = weight (kg) / height (m)^2
howell_df["BMI"] = howell_df["weight"] / (howell_df["height"] / 100) ** 2

def bmi_category(bmi):
    if bmi < 18.5:
        return "Underweight"
    elif bmi < 25:
        return "Normal"
    elif bmi < 30:
        return "Overweight"
    else:
        return "Obese"

howell_df["bmi_category"] = howell_df["BMI"].apply(bmi_category)

# Save the updated dataset
howell_df.to_csv(r"C:\Repos\applied-ml-foster\notebooks\example02\Howell_cleaned.csv", index=False)
```
Note: In this dataset, only Underweight and Normal categories exist.
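To confirm which categories are present (and how often), `value_counts` on the new column is enough. A minimal self-contained sketch, using a few toy BMI values in place of the real `howell_df["BMI"]` column:

```python
import pandas as pd

def bmi_category(bmi):
    """Standard BMI cut-offs, matching the function above."""
    if bmi < 18.5:
        return "Underweight"
    elif bmi < 25:
        return "Normal"
    elif bmi < 30:
        return "Overweight"
    else:
        return "Obese"

# Toy BMI values standing in for howell_df["BMI"]
bmi = pd.Series([16.2, 17.8, 21.0, 22.5, 24.9])
counts = bmi.apply(bmi_category).value_counts()
print(counts)  # Underweight: 2, Normal: 3
```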
```python
# Keep adults only
howell_adult = howell_df[howell_df["age"] >= 18]

# Define features and target (bmi_category is the stratification target)
X = howell_adult.drop(columns=["bmi_category"])
y = howell_adult["bmi_category"]

# Keep only categories with at least 2 samples; stratified splitting
# requires every class to appear in both the train and test sets
valid_categories = y.value_counts()[lambda x: x >= 2].index
mask = y.isin(valid_categories)
X = X[mask]
y = y[mask]
```
Stratified Train/Test Split

```python
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
```
Combine Features and Target & Save CSVs

```python
train_data = X_train.copy()
train_data["bmi_category"] = y_train

test_data = X_test.copy()
test_data["bmi_category"] = y_test

train_data.to_csv(r"C:\Repos\applied-ml-foster\notebooks\example02\train_data.csv", index=False)
test_data.to_csv(r"C:\Repos\applied-ml-foster\notebooks\example02\test_data.csv", index=False)
```
✅ Notes
Stratified splitting preserves the proportion of BMI categories in both train and test sets.
Only categories with ≥ 2 samples are included to prevent errors.
The final CSV files are ready for downstream modeling tasks.
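To see the effect of `stratify` directly, here is a minimal self-contained sketch using synthetic labels (not the Howell data) with a 70/30 class mix:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 70 "Normal" and 30 "Underweight" labels
y = pd.Series(["Normal"] * 70 + ["Underweight"] * 30)
X = pd.DataFrame({"feature": range(100)})

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Both splits preserve the 70/30 mix
print(y_train.value_counts(normalize=True).round(2))
print(y_test.value_counts(normalize=True).round(2))
```

Without `stratify=y`, a small test set can easily end up with a noticeably different class mix than the full dataset.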
This project aims to predict the median house values in California’s districts using the California Housing dataset. The dataset comprises various features such as median income, average number of rooms, and geographic coordinates. The goal is to build a machine learning model that can accurately estimate house prices based on these features.
Project Workflow
Load the dataset using fetch_california_housing from sklearn.datasets.
Inspect the first few rows to understand the structure of the data.
Check for missing values and handle them appropriately.
Generate summary statistics to grasp the distribution and central tendencies of the features.
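The inspection steps above can be sketched as follows. To keep the example self-contained it uses a tiny hand-built frame with the same column names (`MedInc`, `AveRooms`, `MedHouseVal`) rather than calling `fetch_california_housing`, which downloads the data on first use; the values here are illustrative, not real districts:

```python
import pandas as pd

# Tiny stand-in for the frame returned by
# fetch_california_housing(as_frame=True).frame
df = pd.DataFrame({
    "MedInc":      [8.33, 5.64, 3.12, 2.08],
    "AveRooms":    [6.98, 6.24, 4.93, 5.10],
    "MedHouseVal": [4.53, 3.58, 1.73, 1.55],
})

print(df.head())          # structure of the data
print(df.isnull().sum())  # missing values per column
print(df.describe())      # summary statistics
```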
Histograms: Display the distribution of each numeric feature to understand their spread.
Boxenplots: Identify outliers and visualize the distribution of each feature.
Pairplots: Examine relationships between pairs of features and the target variable.
Select features: Choose relevant features such as ‘MedInc’ (Median Income) and ‘AveRooms’ (Average Rooms).
Define the target variable: Set ‘MedHouseVal’ (Median House Value) as the target.
Prepare the feature matrix (X) and target vector (y) for model training.
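The feature-selection step above can be sketched in a few lines; the small hand-built frame stands in for the real dataset:

```python
import pandas as pd

# Small stand-in for the California Housing frame
df = pd.DataFrame({
    "MedInc":      [8.33, 5.64, 3.12, 2.08],
    "AveRooms":    [6.98, 6.24, 4.93, 5.10],
    "MedHouseVal": [4.53, 3.58, 1.73, 1.55],
})

# Feature matrix (selected columns) and target vector
X = df[["MedInc", "AveRooms"]]
y = df["MedHouseVal"]

print(X.shape)  # (4, 2)
print(y.shape)  # (4,)
```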
Split the data: Divide the dataset into training and testing sets (e.g., 80% train, 20% test).
Initialize the model: Create an instance of LinearRegression.
Train the model: Fit the model on the training data.
Make predictions: Use the trained model to predict house prices on the test set.
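A runnable sketch of the split/train/predict steps, using synthetic data with a roughly linear income-to-value relationship in place of the real dataset:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in: house value roughly linear in income
rng = np.random.default_rng(42)
X = pd.DataFrame({
    "MedInc": rng.uniform(0.5, 15.0, 200),
    "AveRooms": rng.uniform(3.0, 9.0, 200),
})
y = 0.4 * X["MedInc"] + 0.05 * X["AveRooms"] + rng.normal(0, 0.2, 200)

# Split the data: 80% train, 20% test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Initialize and train the model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions on the unseen test set
y_pred = model.predict(X_test)
print(y_pred.shape)  # (40,)
```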
R² (Coefficient of Determination): Measure how well the model explains the variance in the target variable.
MAE (Mean Absolute Error): Calculate the average of the absolute errors between predicted and actual values.
RMSE (Root Mean Squared Error): Compute the square root of the average of squared errors, giving more weight to larger errors.
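The three metrics can be computed with scikit-learn as shown below; the prediction values are toy numbers for illustration, and RMSE is taken as the square root of `mean_squared_error` so the snippet works across sklearn versions:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Toy actual vs. predicted house values
y_true = np.array([3.0, 2.5, 4.0, 1.5])
y_pred = np.array([2.8, 2.7, 3.6, 1.7])

r2 = r2_score(y_true, y_pred)                     # variance explained
mae = mean_absolute_error(y_true, y_pred)         # average absolute error
rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # penalizes large errors

print(f"R²:   {r2:.3f}")
print(f"MAE:  {mae:.3f}")
print(f"RMSE: {rmse:.3f}")
```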