
Practical Data Visualization with Python (Homework - Participant)

Homework Overview

Thanks for checking out the hands-on reinforcement exercises for this seminar. The goal of this homework is to give you a handful of questions that call for visualization and that you might conceivably face on the job. There is no single "right" answer to the questions below, but some answers are more right than others. For example, if you were asked to visualize the trends in LTV over the course of a year, would plotting average LTV over time be a better visualization than building twelve violin plots of LTV, one for each month? Not necessarily. But would both of those be better than a single box-and-whisker plot of LTV for all originations in that year? Absolutely. It all depends on the context of the question and the information you intend to convey with your visualization.

When in doubt, ask yourself: am I clearly and powerfully communicating the relevant information with this visualization?
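To make the LTV example above concrete, here is a minimal sketch on synthetic data. The values, the column names `month` and `ltv`, and the upward drift are all invented for illustration and are not taken from the FNMA file. It contrasts one pooled summary (roughly what a single box-and-whisker plot conveys) with one average per month (what a line chart over time conveys).

```python
import numpy as np
import pandas as pd

# Synthetic data: 12 months of made-up LTV values with a slow upward drift.
rng = np.random.default_rng(seed=0)
months = np.repeat(np.arange(1, 13), 200)
ltv = 70 + 0.5 * months + rng.normal(loc=0, scale=5, size=months.size)
toy_df = pd.DataFrame({"month": months, "ltv": ltv})

# One pooled summary for the whole year: the time dimension is gone.
pooled_summary = toy_df["ltv"].describe()

# One average per month: the trend over the year stays visible.
monthly_mean = toy_df.groupby("month")["ltv"].mean()

print(pooled_summary[["25%", "50%", "75%"]])
print(monthly_mean.round(1))
```

The pooled quartiles describe the spread well but cannot show that LTV rose during the year; the monthly series can, at the cost of hiding the within-month spread.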

With each of these questions below, you will be asked to do two things:
1. Construct a visualization to answer the question.
  - You'll be pre-allotted one code cell in the notebook for this, but feel free to use as many as you'd like. As was shown in the lecture materials, a good visualization almost always requires iteration. Feel free to keep the remnants of your iterative creative process in your notebooks; just ensure your final viz. for each question is clearly marked.
2. Briefly explain (in no more than a paragraph) why you chose to visualize the data as you did.
  - You'll be pre-allotted one markdown cell in the notebook for this, directly following the code cell. If you are struggling to think of what to write, fall back on the lecture materials, particularly Section 1: Why We Visualize. Imagine that each of your visualizations is going to be presented to your team at a Process Confirm / Code Review; your paragraph should read like the explanation you would give in that context, detailing why your choices made for the most effective viz. Be sure to focus on how your visualization answers the question at hand, the crux of which is in bold, although the entire question provides relevant information as to what is expected.

We'll be using the same data we've been dealing with throughout the seminar: January and December 2017 FNMA originations. Remember, if you don't understand what some of the variables mean, all the information you need is in data_prep_nb.ipynb, including links to relevant glossaries and data dictionaries.

Note: For all questions below, you are free to use whatever Python visualization package you want. That said, some questions require a specific type of visualization (for example, if you know that you need an interactive visualization, don't start with a package that you know cannot build interactive visualizations).

Good luck!

Setup

In [1]:

# basic packages
import numpy as np
import pandas as pd
import datetime

In [2]:

# store the datetime of the most recent running of this notebook as a form of a log
most_recent_run_datetime = datetime.datetime.now().strftime("%Y-%m-%d %H:%M")
f"This notebook was last executed on {most_recent_run_datetime}"

Out[2]:

'This notebook was last executed on 2019-09-08 20:42'

In [3]:

# pulling in our main data; for more info on the data, see the "data_prep_nb.ipynb" file
main_df = pd.read_csv(filepath_or_buffer='../data/jan_and_dec_17_acqs.csv')

# taking a peek at our data
main_df.head()

Out[3]:

	loan_id	orig_chn	seller_name	orig_rt	orig_amt	orig_trm	orig_dte	frst_dte	oltv	ocltv	...	occ_stat	state	zip_3	mi_pct	product_type	cscore_c	mi_type	relocation_flg	cscore_min	orig_val
0	100020736692	B	CALIBER HOME LOANS, INC.	4.875	492000	360	12/2017	02/2018	75	75	...	I	CA	920	NaN	FRM	NaN	NaN	N	757.0	656000.000000
1	100036136334	R	OTHER	2.750	190000	180	12/2017	01/2018	67	67	...	P	MD	206	NaN	FRM	798.0	NaN	N	797.0	283582.089552
2	100043912941	R	OTHER	4.125	68000	360	12/2017	02/2018	66	66	...	P	OH	432	NaN	FRM	NaN	NaN	N	804.0	103030.303030
3	100057175226	R	OTHER	4.990	71000	360	12/2017	02/2018	95	95	...	P	NC	278	30.0	FRM	NaN	1.0	N	696.0	74736.842105
4	100060715643	R	OTHER	4.500	180000	360	12/2017	02/2018	75	75	...	I	WA	983	NaN	FRM	NaN	NaN	N	726.0	240000.000000

5 rows × 27 columns

Question 1

A business partner of yours came to you to ask how occupancy status relates to risk. They were wondering which occupancy status appears riskier in our data: principal homes (i.e. someone's primary residence), second homes, or investor-owned homes? There are obviously many ways of measuring risk. Here it's safe to assume your business partner means credit risk, so some variables you may want to consider are the borrower's credit score, DTI, or LTV. You can use one or more of these variables in your analysis, or something else altogether if you see fit; just ensure that in the end you arrive at a single visualization to share with your business partner.
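Before committing to a chart type, it can help to tabulate the candidate risk variables by occupancy status. The sketch below is one possible starting point, not a model answer. It uses a small invented stand-in for `main_df`; the column names `occ_stat`, `cscore_min`, and `oltv` come from the data preview above, while the code "S" for second homes is an assumption (only "P" and "I" appear in the preview).

```python
import pandas as pd

# Toy stand-in for main_df; the rows are invented for illustration only.
# Assumption: "S" marks second homes (only "P" and "I" show up in the preview).
toy_df = pd.DataFrame({
    "occ_stat": ["P", "P", "P", "S", "S", "S", "I", "I", "I"],
    "cscore_min": [797.0, 804.0, 696.0, 760.0, None, 745.0, 757.0, 726.0, 770.0],
    "oltv": [67, 66, 95, 80, 75, 70, 75, 75, 60],
})

# Median and non-missing count of each risk variable, per occupancy status.
# The count matters: a group with few loans deserves a visible caveat.
risk_summary = (
    toy_df
    .groupby("occ_stat")[["cscore_min", "oltv"]]
    .agg(["median", "count"])
)
print(risk_summary)
```

A table like this shows whether the groups differ enough to be worth plotting and whether any group is too small or too sparse to support a strong claim.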

In [4]:

# code for visualization goes here

Explanation for why you chose this particular visualization goes here...

Question 2

Imagine that a recent news event broke that had to do with mortgage insurance (MI), and even though we don't yet know exactly how that news will impact Fannie Mae's business, you've been asked to produce a visualization that communicates to what extent our December 2017 acquisitions were covered by MI.
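"To what extent" can be measured in more than one way, for example by share of loans or by share of dollars. The sketch below computes both on an invented stand-in for `main_df`. It rests on two assumptions that should be checked against the data dictionary: that filtering `orig_dte` on "12/2017" isolates the December 2017 acquisitions, and that a missing `mi_pct` means the loan carries no mortgage insurance.

```python
import pandas as pd

# Toy stand-in for main_df; the rows are invented for illustration only.
toy_df = pd.DataFrame({
    "orig_dte": ["12/2017", "12/2017", "12/2017", "12/2017", "01/2017"],
    "orig_amt": [492000, 190000, 68000, 71000, 180000],
    "mi_pct": [None, None, None, 30.0, 25.0],
})

# Assumption: orig_dte == "12/2017" identifies the December 2017 loans.
dec_df = toy_df[toy_df["orig_dte"] == "12/2017"]

# Assumption: a missing mi_pct means "no mortgage insurance".
has_mi = dec_df["mi_pct"].notna()

# Coverage by loan count versus coverage by original balance.
share_of_loans_with_mi = has_mi.mean()
share_of_balance_with_mi = (
    dec_df.loc[has_mi, "orig_amt"].sum() / dec_df["orig_amt"].sum()
)

print(f"Loans with MI:   {share_of_loans_with_mi:.1%}")
print(f"Balance with MI: {share_of_balance_with_mi:.1%}")
```

The two shares can differ noticeably, so whichever one the final visualization uses should be stated in its title or axis label.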

In [5]:

# code for visualization goes here

Explanation for why you chose this particular visualization goes here...

Question 3

One of your business partners is trying to learn more about the areas of the country where we are providing the highest value loans in terms of origination amount. You've also been told that an interactive map of the United States would be optimal here, and they'd like you to add whatever data you might think are relevant to the tooltip.
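A map needs one row per geographic unit, so a first step is usually to aggregate the loans. The sketch below assumes state is the right granularity (the preview also has `zip_3`) and uses an invented stand-in for `main_df`; the output column names `avg_orig_amt`, `loan_count`, and `avg_oltv` are hypothetical. The extra columns are candidates for the tooltip.

```python
import pandas as pd

# Toy stand-in for main_df; the rows are invented for illustration only.
toy_df = pd.DataFrame({
    "state": ["CA", "CA", "MD", "OH", "NC", "WA"],
    "orig_amt": [492000, 350000, 190000, 68000, 71000, 180000],
    "oltv": [75, 80, 67, 66, 95, 75],
})

# One row per state: the value to color by, plus extra fields for a tooltip.
state_summary = (
    toy_df
    .groupby("state")
    .agg(
        avg_orig_amt=("orig_amt", "mean"),
        loan_count=("orig_amt", "count"),
        avg_oltv=("oltv", "mean"),
    )
    .reset_index()
    .sort_values("avg_orig_amt", ascending=False)
)
print(state_summary)
```

A table of two-letter state codes with one numeric column to color by is the shape that interactive mapping libraries typically expect for a US choropleth. Whether to color by the mean, the median, or the total origination amount is a design choice worth defending in your explanation.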

In [6]:

# code for visualization goes here

Explanation for why you chose this particular visualization goes here...

Question 4

You've received a very open-ended question from an account manager hoping to learn more about how the seller with whom they work most closely compares to all sellers. Pick any seller (aside from "Other") and any two variables in our data (e.g. origination amount and origination value, but don't use that combo), and put together a visualization that communicates whether or not that seller is unique in any way as it pertains to the two variables you selected. The answer can be yes, no, or maybe; just justify your answer with your visualization.
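One way to set up this comparison is to tag each loan as belonging to the chosen seller or not, then summarize both variables per group. The sketch below is only an illustration on invented rows: the seller name is taken from the data preview, the choice of `orig_rt` and `cscore_min` is arbitrary, and the `group` column and its labels are hypothetical. It compares the seller against all other sellers, which is one reading of "all sellers".

```python
import numpy as np
import pandas as pd

# Toy stand-in for main_df; the rows are invented for illustration only.
toy_df = pd.DataFrame({
    "seller_name": [
        "CALIBER HOME LOANS, INC.", "OTHER", "OTHER",
        "OTHER", "CALIBER HOME LOANS, INC.", "OTHER",
    ],
    "orig_rt": [4.875, 2.750, 4.125, 4.990, 4.500, 4.500],
    "cscore_min": [757.0, 797.0, 804.0, 696.0, 740.0, 726.0],
})

focus_seller = "CALIBER HOME LOANS, INC."

# Tag each loan so the same grouping can drive both a table and a plot.
labeled_df = toy_df.assign(
    group=np.where(
        toy_df["seller_name"] == focus_seller,
        "focus seller",
        "all other sellers",
    )
)

# Median of each chosen variable, per group.
comparison = labeled_df.groupby("group")[["orig_rt", "cscore_min"]].median()
print(comparison)
```

The `group` column can then serve as the color or facet variable in whichever plotting package you choose, so the chart and the summary table stay consistent.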

In [7]:

# code for visualization goes here

Explanation for why you chose this particular visualization goes here...