- [9/28] Fix the example usage of rows.
- [9/30] Watch getting started video
- [9/30] rows() and loans() are generators that yield results (fixed incorrect text describing them as returning lists)
- [10/7] clarified that Node should be a child class of SimplePredictor
- [10/11] added debug hint for testBias test
In this project, you'll learn about zip files, modules, object oriented programming, and trees.
You'll create a new tree.py file from scratch; that's the only file
you need to hand in (though it is probably useful to also create a
notebook to use your module for debugging purposes).
You will end up implementing the following classes and functions.
class ZippedCSVReader
class Loan
class Bank
def get_bank_names
class SimplePredictor
class Node
def build_tree
def bias_test
You can watch this for hints on how to get started: https://mediaspace.wisc.edu/media/t/1_uzdufbmd
Sadly, there is a long history of lending discrimination based on race in the United States. In some cases, lenders have literally drawn red lines on a map around certain neighbourhoods where they would not offer loans, based on the racial demographics of those neighbourhoods (read more about redlining here: https://en.wikipedia.org/wiki/Redlining). If you're interested in how redlining can still be seen today, here is an article discussing similar behaviors seen in the insurance industry: https://www.propublica.org/article/minority-neighborhoods-higher-car-insurance-premiums-white-areas-same-risk
In 1975, Congress passed the Home Mortgage Disclosure Act (HMDA) to bring more transparency to this injustice (https://en.wikipedia.org/wiki/Home_Mortgage_Disclosure_Act). The idea is that banks must report details about loan applications and which loans they decided to approve. In this project, we'll be analyzing HMDA data from Wisconsin, Illinois, and Louisiana: https://www.consumerfinance.gov/data-research/hmda/historic-data/.
As data scientists, a real concern we must consider is whether our models show bias. If we train our models to mimic human behavior, will they pick up on human bias? If we don't base our models on sufficient data, will they overgeneralize? In this project, we'll be providing several files describing decision trees. Decision trees are a kind of model that can output things like approve/deny on a row-by-row basis. Your job will be to write Python code to load and run the decision trees. At least one of them is racially biased, and you'll be asked to write a function that exposes this.
Most of you will be more familiar with writing a Jupyter notebook for a project. Writing a module is a bit different: you're creating a module with a collection of functions/classes that other people could use in their projects.
Using those functions/classes isn't part of this project, but you'll
need to do that anyway in order to troubleshoot your code. We
recommend creating a debugging.ipynb notebook (for your own
purposes, not to turn in) that imports your module and uses some of
the classes/functions. You might do something like this:
from tree import *
%load_ext autoreload
%autoreload 2
This imports everything from your tree.py file so that you can write
snippets of code that use it. The % lines automatically reload your
module if you change it (well, it usually works -- you'll need to
occasionally run "Kernel Restart & Run All" when it doesn't).
Be sure to also run tester.py regularly to estimate your grade
(prior to TA deductions).
We strongly recommend against copying code back and forth between the notebook and .py file throughout the development process. It's a habit that will slow you down in the long run (it's better to get comfortable writing code directly in your .py).
For this portion of the project, you may collaborate with your group members in any way (even looking at working code). You may also seek help from 320 staff (mentors, TAs, instructor). You may not seek or receive help from other 320 students (outside your group) or anybody outside the course.
We're providing loans.zip and mini.zip, along with trees.zip. The
ZippedCSVReader class will help your other code access the data. Here
are a couple of examples of how the class is instantiated:
tree_reader = ZippedCSVReader("trees.zip")
data_reader = ZippedCSVReader("mini.zip")
After the above call, it should be possible to see a list of files via a paths attribute, like this:
print(data_reader.paths) # in alphabetical order!
For this, you can refer to Lab 3.
Your ZippedCSVReader will have one generator method to help people
access the data inside a zip file: rows. It accepts an argument
specifying the name of a file inside the zip. rows works on .csv
files; it yields dicts corresponding to each row (hint: look into how
csv.DictReader works). Furthermore, if no file name is passed to
rows, then it will read all files ending with ".csv" contained
inside the zip, yielding dicts corresponding to the
records in all the CSV files (the rows of CSV files that are
alphabetically earlier will be yielded earlier).
Example usage:
dict_list = list(data_reader.rows("wi.csv"))
print(dict_list[0])
print()
dict_generator = data_reader.rows()
print(sum(1 for _ in dict_generator))
Expected output:
{'as_of_year': '2017', 'respondent_id': '33-0975529', 'agency_name': 'Department of Housing and Urban Development', 'agency_abbr': 'HUD', 'agency_code': '7', 'loan_type_name': 'VA-guaranteed', 'loan_type': '3', 'property_type_name': 'One-to-four family dwelling (other than manufactured housing)', 'property_type': '1', 'loan_purpose_name': 'Refinancing', 'loan_purpose': '3', 'owner_occupancy_name': 'Owner-occupied as a principal dwelling', 'owner_occupancy': '1', 'loan_amount_000s': '165', 'preapproval_name': 'Not applicable', 'preapproval': '3', 'action_taken_name': 'Loan originated', 'action_taken': '1', 'state_name': 'Wisconsin', 'state_abbr': 'WI', 'state_code': '55', 'county_name': 'Outagamie County', 'county_code': '87.0', 'applicant_ethnicity_name': 'Not Hispanic or Latino', 'applicant_ethnicity': '2', 'co_applicant_ethnicity_name': 'Not Hispanic or Latino', 'co_applicant_ethnicity': '2', 'applicant_race_name_1': 'White', 'applicant_race_1': '5', 'applicant_race_name_2': '', 'applicant_race_2': '', 'applicant_race_name_3': '', 'applicant_race_3': '', 'co_applicant_race_name_1': 'White', 'co_applicant_race_1': '5', 'co_applicant_race_name_2': '', 'co_applicant_race_2': '', 'applicant_sex_name': 'Male', 'applicant_sex': '1', 'co_applicant_sex_name': 'Female', 'co_applicant_sex': '2', 'applicant_income_000s': '57', 'purchaser_type_name': 'Life insurance company, credit union, mortgage bank, or finance company', 'purchaser_type': '7', 'denial_reason_name_1': '', 'denial_reason_1': '', 'denial_reason_name_2': '', 'denial_reason_2': '', 'population': '5765.0', 'minority_population': '24', 'hud_median_family_income': '74700'}
60
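One way the reader described above could be structured is sketched below. It is only an illustration under assumptions: that the CSV files are UTF-8 encoded, and that reopening the zip on each rows call is acceptable. The attribute name zip_path and the variable names are hypothetical; only paths and rows come from the project description.

```python
import csv
import io
import zipfile


class ZippedCSVReader:
    def __init__(self, zip_path):
        self.zip_path = zip_path  # hypothetical attribute name
        with zipfile.ZipFile(zip_path) as zf:
            # namelist() returns names in archive order, so sort explicitly
            self.paths = sorted(zf.namelist())

    def rows(self, name=None):
        # With no name given, visit every .csv file in alphabetical order
        if name is not None:
            names = [name]
        else:
            names = [p for p in self.paths if p.endswith(".csv")]
        with zipfile.ZipFile(self.zip_path) as zf:
            for csv_name in names:
                with zf.open(csv_name) as raw:
                    # zf.open gives bytes; csv.DictReader needs text
                    text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
                    for row in csv.DictReader(text):
                        yield dict(row)
```

Because rows uses yield, nothing is read until the caller iterates, which is why the example usage wraps the call in list(...) or a sum(...) over the generator.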
The Loan class will provide a convenient way to represent information about loans. It will have the following methods:
class Loan:
    def __init__(self, amount, purpose, race, sex, income, decision):
        pass # TODO
    def __repr__(self):
        pass # TODO
    def __getitem__(self, lookup):
        pass # TODO
It can be instantiated like this:
loan = Loan(40, "Home improvement", "Asian", "Male", 120, "approve")
repr(loan) should return something like this:
"Loan(40, 'Home improvement', 'Asian', 'Male', 120, 'approve')"
In this example, if you implement __getitem__ properly, loan["amount"] should give 40, loan["purpose"]
should give "Home improvement", and so on.
loan[????] should work for ANY value in the brackets. If the value
in the brackets does NOT match one of the parameter names in the
constructor, the behavior will be different. It will return 0 or 1,
depending on whether any argument passed to those parameters matches
the value in brackets. For example, loan["Refinance"] will be 0,
and loan["Asian"] will be 1.
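The two lookup behaviors can be sketched roughly as follows. This is one possible approach, not the required one; the FIELDS tuple and the values helper are hypothetical names introduced for illustration, and it assumes each constructor argument is stored as an attribute of the same name.

```python
class Loan:
    # Constructor parameter names, in order (hypothetical helper constant)
    FIELDS = ("amount", "purpose", "race", "sex", "income", "decision")

    def __init__(self, amount, purpose, race, sex, income, decision):
        self.amount = amount
        self.purpose = purpose
        self.race = race
        self.sex = sex
        self.income = income
        self.decision = decision

    def values(self):
        # Hypothetical helper: the stored values in constructor order
        return [getattr(self, name) for name in self.FIELDS]

    def __repr__(self):
        return "Loan(" + ", ".join(repr(v) for v in self.values()) + ")"

    def __getitem__(self, lookup):
        # Case 1: the key names a constructor parameter
        if lookup in self.FIELDS:
            return getattr(self, lookup)
        # Case 2: any other key; 1 if it equals some stored value, else 0
        return int(lookup in self.values())


loan = Loan(40, "Home improvement", "Asian", "Male", 120, "approve")
print(loan["amount"], loan["Asian"], loan["Refinance"])
```

Checking membership in a tuple and a list (rather than a dict or set) means the lookup value never has to be hashable, which fits the requirement that any value in the brackets should work.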
The Bank class ties together ZippedCSVReader and Loan.
Instances can be instantiated like this:
b = Bank(name, reader)
name is a string and reader is an instance of your
ZippedCSVReader class. A loans method can be used like this:
b = Bank("NCUA", data_reader)
for loan in b.loans():
    print(loan) # loan is of type Loan
Expected output:
Loan(94, 'Refinancing', 'Information not provided by applicant in mail, Internet, or telephone application', 'Information not provided by applicant in mail, Internet, or telephone application', 71, 'deny')
Loan(55, 'Home purchase', 'White', 'Male', 41, 'deny')
Loan(20, 'Refinancing', 'Black or African American', 'Female', 41, 'approve')
Loan(22, 'Refinancing', 'White', 'Male', 36, 'approve')
Loan(175, 'Refinancing', 'White', 'Male', 70, 'approve')
Loan(191, 'Home purchase', 'Information not provided by applicant in mail, Internet, or telephone application', 'Information not provided by applicant in mail, Internet, or telephone application', 68, 'approve')
Loan(82, 'Refinancing', 'White', 'Male', 40, 'deny')
Bank is doing two things here: (1) converting dict rows to Loan
objects, and (2) filtering to rows where agency_abbr is "NCUA".
loans is a generator function, so Loan objects are yielded. If
None is passed for the bank name, loans() should yield Loan
objects for all rows in the zip file.
Relevant fields when reading from the CSV: agency_abbr, loan_amount_000s, loan_purpose_name, applicant_race_name_1, applicant_sex_name, applicant_income_000s, action_taken. When converting, amount and income should be converted to ints. Missing values ("")
should be replaced with 0. An action_taken value of 1 means the decision is "approve"; otherwise the decision is "deny".
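The conversion rules above can be illustrated with a small helper that turns one CSV row into the six constructor arguments for Loan. The function name loan_args_from_row and the inner to_int are hypothetical; the sketch assumes action_taken arrives as the string "1", as it does in the csv.DictReader output shown earlier.

```python
def loan_args_from_row(row):
    """Map one CSV dict row to (amount, purpose, race, sex, income, decision)."""

    def to_int(text):
        # Missing values are empty strings in the CSV; treat them as 0
        return int(text) if text != "" else 0

    # csv.DictReader yields strings, so compare against "1" rather than 1
    decision = "approve" if row["action_taken"] == "1" else "deny"
    return (
        to_int(row["loan_amount_000s"]),
        row["loan_purpose_name"],
        row["applicant_race_name_1"],
        row["applicant_sex_name"],
        to_int(row["applicant_income_000s"]),
        decision,
    )
```

A loans generator could then filter on row["agency_abbr"] (skipping the filter when the bank name is None) and yield Loan(*loan_args_from_row(row)) for each remaining row.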
To figure out what bank names (like "HUD") are in the dataset, you
should have a function (not a method!) in tree.py that works like
this:
reader = ZippedCSVReader('loans.zip')
names = get_bank_names(reader) # should be sorted alphabetically
print(names)
Expected output:
['CFPB', 'FDIC', 'FRS', 'HUD', 'NCUA', 'OCC']
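A minimal sketch of such a function is shown here, assuming the bank name is the agency_abbr column and that reader.rows() with no argument yields every row. FakeReader is a hypothetical stand-in used only so the snippet runs without the zip files.

```python
def get_bank_names(reader):
    # A set removes duplicates; sorted() returns an alphabetical list
    return sorted({row["agency_abbr"] for row in reader.rows()})


class FakeReader:
    """Hypothetical stand-in for ZippedCSVReader, for demonstration only."""

    def rows(self, name=None):
        for abbr in ["HUD", "NCUA", "HUD", "CFPB"]:
            yield {"agency_abbr": abbr}


print(get_bank_names(FakeReader()))
```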
Instances of SimplePredictor can be used to decide whether to
approve a loan. You can start from the following:
class SimplePredictor():
    def __init__(self):
        pass
    def predict(self, loan):
        pass
    def get_approved(self):
        pass
    def get_denied(self):
        pass
Assuming spred is a SimplePredictor object, spred.predict(loan)
will return True if the loan should be accepted, and False otherwise.
spred.get_approved() will return how many applicants have been
approved so far, and spred.get_denied() will return how many have been denied.
The policy of SimplePredictor is simple: approve all loans where the purpose is "Refinancing" and deny all others.
For example, a SimplePredictor object can be used like this:
spred = SimplePredictor()
my_loans = [Loan(175, 'Refinancing', 'White', 'Male', 70, 'approve'),
            Loan(145, 'Home purchase', 'White', 'Female', 37, 'deny'),
            Loan(200, 'Home purchase', 'White', 'Male', 95, 'approve'),
            Loan(414, 'Home purchase', 'White', 'Female', 300, 'approve'),
            Loan(22, 'Refinancing', 'White', 'Female', 36, '1')]
for loan in my_loans:
    print(loan, 'predict:', spred.predict(loan))
    print('approved:', spred.get_approved(), 'denied', spred.get_denied())
Expected output:
Loan(175, 'Refinancing', 'White', 'Male', 70, 'approve') predict: True
approved: 1 denied 0
Loan(145, 'Home purchase', 'White', 'Female', 37, 'deny') predict: False
approved: 1 denied 1
Loan(200, 'Home purchase', 'White', 'Male', 95, 'approve') predict: False
approved: 1 denied 2
Loan(414, 'Home purchase', 'White', 'Female', 300, 'approve') predict: False
approved: 1 denied 3
Loan(22, 'Refinancing', 'White', 'Female', 36, '1') predict: True
approved: 2 denied 3
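The counting behavior in the output above could be implemented along these lines. This is a sketch, not the reference solution: the attribute names approved and denied are assumptions, and the demo passes plain dicts in place of Loan objects since both support loan["purpose"].

```python
class SimplePredictor:
    def __init__(self):
        # Running totals, updated on every predict() call (assumed names)
        self.approved = 0
        self.denied = 0

    def predict(self, loan):
        # Policy: approve only when the purpose is "Refinancing"
        decision = loan["purpose"] == "Refinancing"
        if decision:
            self.approved += 1
        else:
            self.denied += 1
        return decision

    def get_approved(self):
        return self.approved

    def get_denied(self):
        return self.denied


spred = SimplePredictor()
for purpose in ["Refinancing", "Home purchase", "Refinancing"]:
    spred.predict({"purpose": purpose})  # dict stands in for a Loan
print(spred.get_approved(), spred.get_denied())
```

Keeping the counters in the base class means a subclass only has to update them inside its own predict to inherit working get_approved and get_denied methods.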
Decision Trees are trees that can be used to make predictions (or decisions). Consider the following picture:
How can we use the tree to decide whether to approve or deny a loan?
Let's say somebody is applying for a 190 (thousand dollar) loan
(amount=190) and makes 45 (thousand dollars) per year
(income=45). At the root, we see "field": "amount" and "threshold": 200. Since amount <= 200, we take the left branch.
Next, we see "field": "income" and "threshold": 35 in the left
child node. Since income > 35, we take the right branch. In the right
child node, we see "field": "class" and "threshold": 1, which
means the predicted class is 1. In these trees, class 1 means
"approve" and class 0 means "deny". This particular loan
application is therefore approved.
In terms of code, a DT (decision tree) has some similarities to a BST (binary search tree). In both cases, branches are recursively taken to the right and left based on thresholds. With a BST, we're working with a single value, so the comparisons at all nodes are against that one value. With a DT, we're working with a row of data, and each node tells us not only the threshold, but which field of the row should be considered.
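The walk described above can be traced in a few lines of code. The nested-dict layout and the function name decide are purely illustrative (the project itself uses Node objects); the thresholds are the ones from the worked example and from simple.csv.

```python
# The example tree as nested dicts; leaves have field "class"
SIMPLE_TREE = {
    "field": "amount", "threshold": 200,
    "left": {"field": "income", "threshold": 35,
             "left": {"field": "class", "threshold": 0},
             "right": {"field": "class", "threshold": 1}},
    "right": {"field": "income", "threshold": 70,
              "left": {"field": "class", "threshold": 0},
              "right": {"field": "class", "threshold": 1}},
}


def decide(node, row):
    # Leaf: class 1 means approve, class 0 means deny
    if node["field"] == "class":
        return node["threshold"] == 1
    # Interior node: compare the named field of the row to the threshold
    if row[node["field"]] <= node["threshold"]:
        return decide(node["left"], row)
    return decide(node["right"], row)


print(decide(SIMPLE_TREE, {"amount": 190, "income": 45}))  # True (approved)
```

Unlike a BST lookup, the value being compared changes from node to node, because each node picks its own field out of the row.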
Create a Node class, starting with the following (Node should
inherit get_approved and get_denied from SimplePredictor):
class Node(????):
    def __init__(self, field, threshold, left, right):
        # TODO: call parent constructor
        # TODO: create attributes with same names/values as the parameters
    def dump(self, indent=0):
        if self.field == "class":
            line = "class=" + str(self.threshold)
        else:
            line = self.field + " <= " + str(self.threshold)
        print(" "*indent+line)
        if ????:
            self.left.dump(indent+1)
        if self.right != None:
            ????
Test your code. You should be able to create a 3-node tree like this:
leaf1 = Node(field="class", threshold=0, left=None, right=None)
leaf2 = Node(field="class", threshold=1, left=None, right=None)
root = Node(field="income", threshold=50, left=leaf1, right=leaf2)
root.dump()
You should see something like this:
income <= 50
 class=0
 class=1
You won't normally build trees by writing a line of code for each
Node, as in the above example.
trees.zip contains several trees, represented as CSV files. For
example, simple.csv looks like this:
field,threshold,left,right
amount,200,1,2
income,35,3,4
income,70,5,6
class,0,-1,-1
class,1,-1,-1
class,0,-1,-1
class,1,-1,-1
The root node corresponds to the first row after the header:
amount,200,1,2. Notice that its left and right children are at row
indexes 1 and 2; this works out to income,35,3,4 and income,70,5,6
respectively.
simple.csv can be read to a list of dicts like this:
node_rows = list(tree_reader.rows("simple.csv"))
node_rows
Output:
[{'field': 'amount', 'threshold': '200', 'left': '1', 'right': '2'},
{'field': 'income', 'threshold': '35', 'left': '3', 'right': '4'},
{'field': 'income', 'threshold': '70', 'left': '5', 'right': '6'},
{'field': 'class', 'threshold': '0', 'left': '-1', 'right': '-1'},
{'field': 'class', 'threshold': '1', 'left': '-1', 'right': '-1'},
{'field': 'class', 'threshold': '0', 'left': '-1', 'right': '-1'},
{'field': 'class', 'threshold': '1', 'left': '-1', 'right': '-1'}]
Your job is to write a build_tree function in your module that takes
such a list of dicts (as a rows parameter) and the index of a root
row (as a root_idx parameter) and constructs a tree of Node objects:
def build_tree(rows, root_idx=0):
    # TODO: recursively call build_tree to create child Nodes (if any)
    # before constructing+returning the node corresponding to the row
    # at index root_idx in rows
    return Node(????)
From your debug notebook, you could call it like this:
root = build_tree(node_rows)
root.dump()
Output:
amount <= 200
 income <= 35
  class=0
  class=1
 income <= 70
  class=0
  class=1
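The recursion can be sketched as follows, under a few assumptions: an index of -1 means "no child", thresholds are converted with float (the project may expect a different numeric type), and the Node class here is a bare stand-in without the SimplePredictor parent or dump method.

```python
class Node:
    """Bare stand-in for the project's Node class, for illustration only."""

    def __init__(self, field, threshold, left, right):
        self.field = field
        self.threshold = threshold
        self.left = left
        self.right = right


def build_tree(rows, root_idx=0):
    row = rows[root_idx]
    left_idx = int(row["left"])
    right_idx = int(row["right"])
    # Build the children first; -1 marks a missing child (assumption)
    left = build_tree(rows, left_idx) if left_idx != -1 else None
    right = build_tree(rows, right_idx) if right_idx != -1 else None
    # CSV values are strings, so convert the threshold to a number
    return Node(row["field"], float(row["threshold"]), left, right)


node_rows = [
    {"field": "amount", "threshold": "200", "left": "1", "right": "2"},
    {"field": "income", "threshold": "35", "left": "3", "right": "4"},
    {"field": "income", "threshold": "70", "left": "5", "right": "6"},
    {"field": "class", "threshold": "0", "left": "-1", "right": "-1"},
    {"field": "class", "threshold": "1", "left": "-1", "right": "-1"},
    {"field": "class", "threshold": "0", "left": "-1", "right": "-1"},
    {"field": "class", "threshold": "1", "left": "-1", "right": "-1"},
]
root = build_tree(node_rows)
print(root.field, root.left.field, root.right.threshold)
```

Each call handles exactly one row and delegates the subtrees to recursive calls, so the left and right values in the CSV act as pointers into the same rows list.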
Add one more recursive method named node_count to Node that
counts the number of nodes in the tree. The following should return
61, for example:
tree_reader = ZippedCSVReader("trees.zip")
dt = build_tree(list(tree_reader.rows("good.csv")))
dt.node_count()
You have to do the remainder of this project on your own. Do not discuss with anybody except 320 staff (mentors, TAs, instructor).
Add a recursive predict method that takes a Loan object and
traverses the nodes of the decision tree to determine whether or not
to approve the loan. predict should return True or False, and
should work like this:
loan = Loan(40, "Home improvement", "Asian", "Male", 120, "approve")
root = build_tree(list(tree_reader.rows("simple.csv")))
root.predict(loan)
The above returns True, but to manually test your code (before running
tester.py), try changing the amount and income values to
trigger decisions based on each of the leaf nodes in the decision
tree.
Here's one possible way to measure racial/gender bias in a predictor: for a given set of loan applications, how often would the outcome (approve/deny) have been different if the applicant had been of a different race or sex, but otherwise identical in every respect?
Complete the following function to answer this question:
def bias_test(bank, predictor, field, value_override):
    pass
- use bank to iterate over loans with loans
- for each loan, feed it directly to predictor, and store the result
- modify the loan: according to field, change the race or sex of the applicant to value_override (note that the parameter field can only be "sex" or "race")
- feed the modified loan to the predictor again, and compare the new result to the previous result
- at the end, return the percentage of cases, expressed as a fraction between 0 and 1, where the predictor gave a different result after the race or sex was changed
Here's an example:
b = Bank(None, ZippedCSVReader("loans.zip"))
dt = build_tree(list(ZippedCSVReader("trees.zip").rows("race_biased.csv")))
bias_percent = bias_test(b, dt, "race", "Black or African American")
print(bias_percent)
Here, the result should be 0.4112. The decision tree is exhibiting
major bias with respect to Black or African American applicants, with
race being a deciding factor 41% of the time.
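The steps listed for bias_test might translate into code roughly like this. It is a sketch under stated assumptions: that each loan stores its fields as attributes named "race" and "sex" (so setattr works), that a shallow copy is enough to leave the original loan untouched, and that an empty bank should give 0.0.

```python
import copy


def bias_test(bank, predictor, field, value_override):
    total = 0
    changed = 0
    for loan in bank.loans():
        before = predictor.predict(loan)
        # Copy so the original loan is not altered (shallow copy assumed OK)
        modified = copy.copy(loan)
        # Assumes the loan has an attribute named like field ("race"/"sex")
        setattr(modified, field, value_override)
        after = predictor.predict(modified)
        total += 1
        if before != after:
            changed += 1
    # Fraction of loans whose outcome flipped; 0.0 for an empty bank
    return changed / total if total else 0.0
```

Note that the predictor is called twice per loan, so any approved/denied counters it keeps will reflect both the original and the modified loans.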
If you get the wrong number from the tester, it can be difficult to
determine which loan classification(s) happened differently, leading
to your result. The JSON files in the testBias directory show the
loans before and after modification along with the correct prediction
results before/after when the testBias test runs. If you're getting
a slightly wrong number, consider adding debug prints/output to
compare your predictions with these on a loan-by-loan basis.
When we build models to mimic human behavior, we need to be careful that our models don't also become biased. In this project, we tested a number of models for one kind of bias (racial). The HMDA data set is quite extensive. Take a moment to think about what other biases you might want to check for before using decision trees (or other models) to make loan decisions for real people. For ideas, here are some of the columns in the HMDA dataset:
as_of_year, respondent_id, agency_name, agency_abbr, agency_code,
loan_type_name, loan_type, property_type_name, property_type,
loan_purpose_name, loan_purpose, owner_occupancy_name,
owner_occupancy, loan_amount_000s, preapproval_name, preapproval,
action_taken_name, action_taken, state_name, state_abbr, state_code,
county_name, county_code, applicant_ethnicity_name,
applicant_ethnicity, co_applicant_ethnicity_name,
co_applicant_ethnicity, applicant_race_name_1, applicant_race_1,
applicant_race_name_2, applicant_race_2, applicant_race_name_3,
applicant_race_3, co_applicant_race_name_1, co_applicant_race_1,
co_applicant_race_name_2, co_applicant_race_2, applicant_sex_name,
applicant_sex, co_applicant_sex_name, co_applicant_sex,
applicant_income_000s, purchaser_type_name, purchaser_type,
denial_reason_name_1, denial_reason_1, denial_reason_name_2,
denial_reason_2, population, minority_population,
hud_median_family_income
Is there other information that can/should be collected in the HMDA data to allow other kinds of testing for bias that are not currently possible?
