class: center, middle, font30 # Introduction to Polars (and Plotly) in Python J. Hathaway - Data Science Program Chair at BYU-I --- class: font30 # Disclaimers I am focusing on modern tools for data science in Python _Polars over Pandas and Plotly over Matplotlib._ These modern tools reflect the best of - declarative programming (task focused programming) - clean grammar (language abstraction that allows intuitive but complex actions) - industry respect (these tools are very popular for their quality and have rapid growth) ??? - Highlight that Pandas is still the industry standard but that the grammar and speed of polars should compete with Pandas in the long run. - Plotly is probably close to the industry standard. However, Matplotlib is the default plotting environment for many data science packages. --- class: font40 # Agenda Exemplify the data science process - Extract, Transform, Load, Analyze 1. Introduction and Set-up (10 minutes) 2. Polars for data munging (40 minutes) 3. Break (5 minutes) 4. Plotly for data visualization (35 minutes) --- class: font30 # J. Hathaway Data Scientists with ~20 years of industry experience and 8 years in Academia. Undergraduate degree in Economics (University of Utah) and a graduate degree in Statistics (BYU). ![:scale 50%](https://raw.githubusercontent.com/hathawayj/ghana_datascience/master/img/family_2020.png) --- class: font40 # Checking our installation 1. [Python Installed](https://www.python.org/downloads/) 2. [VS Code Installed](https://code.visualstudio.com/download) 3. [Python VS Code Extension Installed](https://marketplace.visualstudio.com/items?itemName=ms-python.python) 4. Python packages installed. ```bash pip install polars plotly statsmodels pyarrow ``` ```bash pip3 install polars plotly statsmodels pyarrow ``` --- class: font20 # Introduction to Polars and Data Munging (speed) > Polars is a lightning fast DataFrame library/in-memory query engine. Its embarrassingly parallel execution, cache efficient algorithms and expressive API makes it perfect for efficient data wrangling, data pipelines, snappy APIs and so much more. Polars is about as fast as it gets, see the results in the [H2O.ai benchmark](https://h2oai.github.io/db-benchmark/). > > [Polars Website](https://www.pola.rs/) ![:scale 60%](https://raw.githubusercontent.com/pola-rs/polars-static/master/logos/polars_github_logo_rect_dark_name.svg) --- ## What is Polars? > The goal of `Polars` is to provide a lightning fast `DataFrame` library that: > - Utilizes all available cores on your machine. > - Optimizes queries to reduce unneeded work/memory allocations. > - Handles datasets much larger than your available RAM. > - Has an API that is consistent and predictable. > - Has a strict schema (data-types should be known before running the query). > > Polars is written in Rust which gives it C/C++ performance and allows it to fully control performance critical parts in the query engine. > > As such `Polars` goes to great lengths to: > > - Reduce redundant copies. > - Traverse memory cache efficiently. > - Minimize contention in parallelism. > - Process data in chunks. > - Reuse memory allocations. > > [Polars Documentation](https://pola-rs.github.io/polars-book/user-guide/) Their documenation compares Polars to some other Python data science tools --- # Introduction to Polars and Data Munging (declarative API) Polars functions (Contexts & Expressions) are human readable and they align with standard SQL methods as well as the very popular big data package [PySpark](https://www.databricks.com/glossary/pyspark). ## Contexts - __Selection:__ `df.select([..])`, `df.with_columns([..])` - __Filtering:__ `df.filter()` - __Groupby/Aggregation:__ `df.groupby(..).agg([..])` ## Expressions ```python new_df = df.select( pl.col("names").n_unique().alias("unique"), pl.approx_unique("names").alias("unique_approx"), (pl.col("nrs").sum() + 5).alias("nrs + 5"), pl.col("integers").cast(pl.Float32).alias("integers_as_floats"), pl.col("animal").str.n_chars().alias("letter_count") ) ``` --- class: font30 # Polars programming Now let's practice using Polars with our installation of Python 1. Read data, melt data, and save as `.parquet` (01_read.py) 2. Munge data for visualization (02_munge.py) ![:scale 75%](https://raw.githubusercontent.com/pola-rs/polars-static/master/logos/polars_github_logo_rect_dark_name.svg) --- class: font20 # Introduction to Parquet files and Arrow When users ask for data or when many institutions share data the files are usually some type of text file (`.txt`, `.csv`, `.tsv`) or an Excel file. The most ubiquitous format is `.csv`. However, these formats limit effective data handling. We want a format that stores small, reads fast, and maintains variable types. The [Apache Arrow project](https://arrow.apache.org/) facilitates the goal with the `.parquet` and `.feather` formats and their respective packages for leveraging those formats. __Arrow is primarily an in-memory format, whereas Parquet is a storage format.__ The following table from the [openbridge blog](https://blog.openbridge.com/how-to-be-a-hero-with-powerful-parquet-google-and-amazon-f2ae0f35ee04) provides a strong example of the benefits of these new formats. ![:scale 95%](https://raw.githubusercontent.com/hathawayj/ghana_datascience/master/img/csv_parquet.png) --- class: font20 # Introduction to Data Visualization __The world is full of [visual communication tools](https://datavizcatalogue.com/)__ > Our eyes are drawn to [colors and patterns](https://www.tableau.com/learn/whitepapers/tableau-visual-guidebook). We can quickly identify red from blue, and squares from circles. Our culture is visual, including everything from art and advertisements to TV and movies. Data visualization is another form of visual art that grabs our interest and keeps our eyes on the message. > [Tableau Reference](https://www.tableau.com/learn/articles/data-visualization) .left-column[ ### Advantages of data visualization: - Easily sharing information. - Interactively explore opportunities. - Visualize patterns and relationships. ] .right-column[ ### Disadvantages: - Biased or inaccurate information. - Correlation doesn’t always mean causation. - Core messages can get lost in translation. ] --- class: font20 # Introduction to Plotly and Lets-Plot .left-column[ ### Plotly: > Plotly Express is a built-in part of the plotly library, which makes interactive, publication-quality graphs. ```bash pip3 install plotly #mac pip install plotly #windows ``` ```python import plotly.express as px ``` ] .right-column[ ### Lets-Plot: > Lets-Plot is a multiplatform plotting library based on the Grammar of Graphics. We provide ggplot2-like plotting API for Python ```bash pip3 install lets-plot #mac pip install lets-plot #windows ``` ```python from lets_plot import * LetsPlot.setup_html() ``` ] --- class: font20 # Introduction to __Plotly__ for Data Visualization The Plotly Python package leverages the plotly.js JavaScript library to enables Python users to create beautiful interactive web-based visualizations. Plotly.js is built on top of d3.js and stack.gl, Plotly.js is a high-level, declarative charting library. plotly.js ships with over 40 chart types, including 3D charts, statistical graphs, and SVG maps. ![:scale 50%](https://raw.githubusercontent.com/hathawayj/ghana_datascience/master/img/plotly_charts.png) --- class: font30 # Plotly programming Now let's practice using [Plotly](https://plotly.com/python/plotly-express/) with our installation of Python 1. Plotly practice (03_begin_plotly.py) 2. Data visualization with our munged data (03_explore.py) ```python import plotly.express as px df = px.data.iris() # iris is a pandas DataFrame fig = px.scatter(df, x="sepal_width", y="sepal_length") fig.show() ``` ```python import plotly.express as px df = px.data.gapminder().query("continent == 'Oceania'") fig = px.line(df, x='year', y='lifeExp', color='country', markers=True) fig.show() ``` --- class: font30 # Lets-Plot programming Now let's practice using [Lets-Plot](https://lets-plot.org/) with our installation of Python 2. Data visualization with our munged data (03_explore.py) ```python import numpy as np from lets_plot import * LetsPlot.setup_html() np.random.seed(12) data = dict( cond=np.repeat(['A', 'B'], 200), rating=np.concatenate((np.random.normal(0, 1, 200), np.random.normal(1, 1.5, 200))) ) ggplot(data, aes(x='rating', fill='cond')) + \ ggsize(700, 300) + \ geom_density(color='dark_green', alpha=.7) + \ scale_fill_brewer(type='seq') + \ theme(panel_grid_major_x='blank') ``` --- class: font40 # Next Steps 1. Program your day (incorporate programming into your daily work) 2. Display your talent -- [Github.com](https://github.com/hathawayj) 3. Offer your services -- Find small projects to do for friends and contacts 4. Apply for jobs -- [LinkedIn](https://www.linkedin.com/jobs/search/?currentJobId=3636093741&geoId=104353902&keywords=data%20science&location=Greater%20Accra%20Region%2C%20Ghana&originalSubdomain=gh&refresh=true)