Python Visualizations - Altair - 1 (Histogram)

Altair is a relatively recent Python package, built on top of Vega-Lite, for creating interactive visualizations. I had so much fun playing with its features that I would like to share my experiments with it in this series of posts.

First of all, install it as shown below:

Anaconda:
conda install altair --channel conda-forge

To use the Jupyter notebook renderer, you must also install the vega package and the associated Jupyter extension:
conda install -c conda-forge vega_datasets notebook vega

Pip:
pip install -U altair
pip install -U vega_datasets notebook vega

As I go along, I will compare Altair's visuals and code with those of Matplotlib wherever applicable, to highlight the features and ease of use of the Altair package.

Step 1: Import required packages

import pandas as pd
import altair as alt
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
alt.renderers.enable('notebook')

Step 2: Read the data.
Dataset name: Hours to Pay Mortgage (Source: https://data.world/makeovermonday/2018w47)

data = pd.read_excel(r'Hours to Pay Mortgage.xlsx', sheet_name=r'Sheet1')

Here is the description of the dataset, as printed by print(data.info()):

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 97 entries, 0 to 96
Data columns (total 9 columns):
City                                97 non-null object
State                               97 non-null object
Median Home Listing Price           97 non-null int64
30-year Fixed Mortgage Rate         97 non-null float64
Monthly Mortgage Payment            97 non-null int64
Median Household Income             97 non-null int64
Hours per Month to Afford a Home    97 non-null float64
Number of Periods                   97 non-null int64
Present Value                       97 non-null int64
dtypes: float64(2), int64(5), object(2)
memory usage: 6.9+ KB
None

Step 3: Simple Histogram

alt.Chart(data).mark_bar(color='gold').encode(
    alt.X('Hours per Month to Afford a Home', bin=True, axis=alt.Axis(title='Hours per Month to Afford a Home (in bins)')),
    alt.Y('count(*):Q', axis=alt.Axis(title='Number of Cities')),
)


The syntax is very simple. You start by passing the data frame to Chart, pick a mark (here mark_bar), and then add an encoding for each visual property. For the histogram, we can set the color for the entire visual up front in mark_bar, since it doesn't depend on the data and is chosen arbitrarily.

In the X-axis definition, setting the bin property to True turns the bar chart into a histogram. The column to plot and the axis title are specified in the same call.

The Y-axis definition is more interesting. We just want to plot the number of rows in each bin, so the aggregation specified is 'count' over all rows, written count(*). To be more explicit, we append Q to indicate that the result is a quantitative variable. (The other options include N for nominal and O for ordinal.) The ability to specify an aggregation is a powerful feature: we can plot data at any aggregated level without having to transform the data frame first.

You can see that horizontal grid lines appear by default and that the bars are slightly separated, which makes the visual more appealing and readable.

Now, compare this with the default histogram that can be created with Matplotlib:

plt.hist(data['Hours per Month to Afford a Home'], color='lightpink')
plt.xlabel('Hours per Month to Afford a Home (in bins)')
plt.ylabel('Count')


The Matplotlib code for a basic histogram is comparably simple, but look at the difference in how the two are rendered by default. A stark difference indeed.
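
For a sense of what it takes to narrow that gap by hand, here is a rough sketch that adds horizontal grid lines and a small gap between bars in Matplotlib. The values are made up (randomly generated stand-ins for the real column), and the number of bins and bar width are arbitrary choices:

```python
import matplotlib.pyplot as plt
import numpy as np

# Made-up stand-in values; with the real dataset, use
# data['Hours per Month to Afford a Home'] instead.
rng = np.random.default_rng(0)
hours = rng.normal(loc=50, scale=15, size=97)

fig, ax = plt.subplots()
# rwidth < 1 shrinks each bar relative to its bin, leaving a small gap.
counts, bin_edges, _ = ax.hist(hours, bins=10, color='gold', rwidth=0.95)
ax.grid(axis='y')        # horizontal grid lines only
ax.set_axisbelow(True)   # draw the grid behind the bars
ax.set_xlabel('Hours per Month to Afford a Home (in bins)')
ax.set_ylabel('Number of Cities')
```

These are a few extra lines for styling that Altair applies without being asked.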
