In this post, I would like to demonstrate how to create a choropleth map (filled map) visual with Altair. This one needs a lot of data preparation.
First, we need the geocode data, which uses codes to represent geographies. So, we need to get the state codes into our dataset.
Step 1: Read and prep the dataset
Dataset source: Hours to Pay Mortgage
import numpy as np
import pandas as pd
hours = pd.read_excel(r'Hours to Pay Mortgage.xlsx', sheet_name=r'Sheet1')
#Get State Numeric Code from Reference dataset
dfs=pd.read_html('https://www.census.gov/geo/reference/ansi_statetables.html', header=0)[0]
A little clean-up of the state codes: remove nulls and convert the codes to strings
dfs0 = dfs[np.isfinite(dfs['FIPS State Numeric Code'])] # Remove nulls
dfs1 = dfs0.copy()
dfs1['FIPS State Numeric Code'] = dfs0['FIPS State Numeric Code'].astype('str') # Convert numeric code to string
dfs1 = dfs1.replace({r'\.0$': ''}, regex=True) # Remove the trailing '.0' left over from the float conversion
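The same clean-up can also be done without a regex. The following is a minimal sketch on a made-up three-row frame (the rows and the names ref and clean are illustrative assumptions, not the Census table): it drops the rows with missing codes and casts the floats to integers before converting to strings, so no trailing '.0' is produced in the first place.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the reference table; the last row is made up for illustration.
ref = pd.DataFrame({
    'Name': ['Alabama', 'Massachusetts', 'Not a state'],
    'FIPS State Numeric Code': [1.0, 25.0, np.nan],
})

# Drop rows without a code, then convert float -> int -> str.
clean = ref.dropna(subset=['FIPS State Numeric Code']).copy()
clean['FIPS State Numeric Code'] = (
    clean['FIPS State Numeric Code'].astype(int).astype(str)
)

print(clean['FIPS State Numeric Code'].tolist())
```

Casting to int first only works once the nulls are gone, which is why the dropna step comes first.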
Now, let's merge the state codes into our original dataset.
#Trim the State Name in both the Reference and Main datasets and then merge on Name to get the Numeric Code into the Main dataset
dfs1['Name']=dfs1['Name'].str.strip()
hours['State']=hours['State'].str.strip()
data11=pd.merge(hours,dfs1, left_on=['State'], right_on=['Name'], how='left') # Left join keeps every city row
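Because this is a left join on names, any state whose spelling differs between the two tables silently ends up without a code. One way to check for that, sketched here on hypothetical miniature frames (hours_demo, codes_demo and the city rows are assumptions for illustration), is the indicator option of pd.merge:

```python
import pandas as pd

# Hypothetical miniature versions of the main and reference datasets.
hours_demo = pd.DataFrame({
    'City': ['Boston', 'Fresno', 'Springfield'],
    'State': ['Massachusetts ', 'California', 'Atlantis'],
})
codes_demo = pd.DataFrame({
    'Name': ['Massachusetts', ' California'],
    'FIPS State Numeric Code': ['25', '6'],
})

# Trim whitespace on both sides, as in the post.
hours_demo['State'] = hours_demo['State'].str.strip()
codes_demo['Name'] = codes_demo['Name'].str.strip()

# indicator=True adds a '_merge' column that says where each row was found.
merged = pd.merge(hours_demo, codes_demo, left_on='State', right_on='Name',
                  how='left', indicator=True)

# Rows marked 'left_only' found no match in the reference table.
unmatched = merged.loc[merged['_merge'] == 'left_only', 'State'].tolist()
print(unmatched)
```

An empty list here would mean every row in the main dataset received a state code.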
We build the choropleth map by plotting the states first and then coloring them using the data from our dataset. This requires us to aggregate our dataset, which is at the city level, to the state level.
data12 = data11.groupby(['State','FIPS State Numeric Code'])['Hours per Month to Afford a Home'].mean().reset_index(name='Avg Hours per Month to Afford a Home')
Step 2: Plotting the map
Let's get the state-level geodata and then look up our main dataframe on the state numeric codes to get the average hours and the state name, which we need for the visualization.
import altair as alt
from vega_datasets import data
We use the data object from vega_datasets to get the URL of the US map data, and alt.topo_feature to pick the states feature out of it.
states = alt.topo_feature(data.us_10m.url,'states')
alt.Chart(states).mark_geoshape().encode(
color='Avg Hours per Month to Afford a Home:Q', tooltip = ['Avg Hours per Month to Afford a Home:Q' ,'State:N']
).transform_lookup(
lookup='id',
from_=alt.LookupData(data12, 'FIPS State Numeric Code', ['Avg Hours per Month to Afford a Home', 'State'])
).project(
type='albersUsa'
).properties(
width=500,
height=400
)
After the initial encodings of the chart, we chain further methods. This way, we can add transformations, set the projection and the chart properties, and so on.
Some states aren't plotted as there is no data for them in the dataset.
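To see which states are left blank, one option is to compare the codes in the reference table with the codes that made it into the aggregated data. This is a minimal sketch on toy frames; reference and aggregated stand in for dfs1 and data12, and their rows are illustrative assumptions.

```python
import pandas as pd

# Toy stand-ins for the reference table and the aggregated dataset.
reference = pd.DataFrame({
    'Name': ['Alabama', 'California', 'Massachusetts'],
    'FIPS State Numeric Code': ['1', '6', '25'],
})
aggregated = pd.DataFrame({
    'State': ['California', 'Massachusetts'],
    'FIPS State Numeric Code': ['6', '25'],
})

# A state is drawn only if its code appears in the aggregated data.
has_data = reference['FIPS State Numeric Code'].isin(
    aggregated['FIPS State Numeric Code']
)
missing = reference.loc[~has_data, 'Name'].tolist()
print(missing)
```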
Step 3: Double Choropleth
Let's plot two maps side by side.
We create two datasets, with Hours per Month to Afford a Home aggregated in two different ways, max and min, representing the city with the highest value and the city with the lowest value in each state, respectively.
data13 = data11.groupby(['State','FIPS State Numeric Code'])['Hours per Month to Afford a Home'].max().reset_index(name='Max Hours per Month to Afford a Home')
data14 = data11.groupby(['State','FIPS State Numeric Code'])['Hours per Month to Afford a Home'].min().reset_index(name='Min Hours per Month to Afford a Home')
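As an alternative to two separate groupby calls, both aggregates can be computed in one pass with named aggregation. The sketch below uses a made-up four-row frame (cities and its hour values are assumptions for illustration, not figures from the dataset) and produces the same column names as data13 and data14 in a single dataframe.

```python
import pandas as pd

col = 'Hours per Month to Afford a Home'

# Made-up city-level rows; the hour values are illustrative only.
cities = pd.DataFrame({
    'State': ['California', 'California', 'Massachusetts', 'Massachusetts'],
    'FIPS State Numeric Code': ['6', '6', '25', '25'],
    col: [40.0, 120.0, 90.0, 110.0],
})

# Named aggregation: each keyword becomes an output column.
# The dict is unpacked because the column names contain spaces.
summary = (
    cities.groupby(['State', 'FIPS State Numeric Code'])[col]
    .agg(**{'Max ' + col: 'max', 'Min ' + col: 'min'})
    .reset_index()
)

print(summary)
```

Keeping two separate dataframes, as the post does, is equally valid; a combined frame is mainly convenient when the same lookup should feed both charts.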
The approach is to build each map separately and then combine them later.
import altair as alt
from vega_datasets import data
states = alt.topo_feature(data.us_10m.url,'states')
Chart 1 - Representing the distribution of the largest values per state
chartMax= alt.Chart(states).mark_geoshape().encode(
#color='Max Hours per Month to Afford a Home:Q',
color=alt.Color('Max Hours per Month to Afford a Home:Q', legend=alt.Legend(orient='left', title='Max Hours')),
tooltip = ['Max Hours per Month to Afford a Home:Q' ,'State:N']
).transform_lookup(
lookup='id',
from_=alt.LookupData(data13, 'FIPS State Numeric Code', ['Max Hours per Month to Afford a Home', 'State'])
).project(
type='albersUsa'
).properties(
width=500,
height=400
)
Chart 2 - Representing the distribution of the smallest values per state
chartMin= alt.Chart(states).mark_geoshape().encode(
#color='Min Hours per Month to Afford a Home:Q',
color=alt.Color('Min Hours per Month to Afford a Home:Q', legend=alt.Legend(orient='left', title='Min Hours')),
tooltip = ['Min Hours per Month to Afford a Home:Q' ,'State:N']
).transform_lookup(
lookup='id',
from_=alt.LookupData(data14, 'FIPS State Numeric Code', ['Min Hours per Month to Afford a Home', 'State'])
).project(
type='albersUsa'
).properties(
width=500,
height=400
)
Now we combine them side by side. Most importantly, we want the color scales to be independent; otherwise, both maps would share one scale and the resulting visual wouldn't be that useful.
alt.hconcat(chartMin, chartMax).resolve_legend(
color="independent",
size="independent"
).resolve_scale(color="independent")
It is interesting to note that while homeowners in Massachusetts spend a very high number of hours per month to pay their mortgage, the lowest number of hours in that state is also much higher than in all other states, so the spread for Massachusetts is small. In contrast, while people in California spend the highest number of hours per month to pay their mortgage, the lowest number of hours in that state is well below the median, so the spread there is considerable. The reason should become clear in future posts, when we explore this dataset further.
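The spread described here can also be put into numbers as the difference between the per-state max and min. A minimal sketch, assuming a city-level frame shaped like data11; the two states and their values below are hypothetical and only show the calculation:

```python
import pandas as pd

col = 'Hours per Month to Afford a Home'

# Hypothetical city-level values for two placeholder states.
cities = pd.DataFrame({
    'State': ['State A', 'State A', 'State B', 'State B'],
    col: [40.0, 120.0, 90.0, 110.0],
})

# Per-state min and max, then the spread between them.
spread = cities.groupby('State')[col].agg(['min', 'max'])
spread['spread'] = spread['max'] - spread['min']

print(spread)
```

A state with a high max and a small spread corresponds to the Massachusetts pattern described above, while a high max with a large spread corresponds to the California pattern.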