DataThink Development

Case Study 5: I can clean your data

Background

Scientific American argues that humans have been getting taller over the years. As the data scientists that we are becoming, we would like to find data that supports this claim. Our challenge is to show how male heights have changed across the centuries.

This project is not as severe as the two quotes below suggest, but it will give you a taste of pulling data from various sources and file formats together into “tidy” data for visualization and analysis. You will not need to search for data, as all the files are listed here.

  1. “Classroom data are like teddy bears and real data are like a grizzly bear with salmon blood dripping out its mouth.” - Jenny Bryan
  2. “Up to 80% of data analysis is spent on the process of cleaning and preparing data” - Hadley Wickham

Tasks

import pandas as pd
import polars as pl
from pathlib import Path

def read_dta_polars(file_path: str | Path) -> pl.DataFrame:
    """Read a Stata .dta file and return it as a Polars DataFrame."""
    file_path = Path(file_path)

    # Fail early with a clear message if the path is wrong
    if not file_path.exists():
        raise FileNotFoundError(f"File not found: {file_path}")

    try:
        # Load into pandas first (pandas supports .dta)
        pdf = pd.read_stata(file_path)

        # Convert to Polars DataFrame
        return pl.from_pandas(pdf)

    except ValueError as e:
        # Chain with "from e" so the original traceback is preserved
        raise ValueError(f"Error reading .dta file: {e}") from e
    except Exception as e:
        raise RuntimeError(f"Unexpected error: {e}") from e