Fire up GraphLab Create

import graphlab
import graphlab.aggregate as agg

Load some house sales data

sales = graphlab.SFrame('home_data.gl/')
sales
id date price bedrooms bathrooms sqft_living sqft_lot floors waterfront
7129300520 2014-10-13 00:00:00+00:00 221900 3 1 1180 5650 1 0
6414100192 2014-12-09 00:00:00+00:00 538000 3 2.25 2570 7242 2 0
5631500400 2015-02-25 00:00:00+00:00 180000 2 1 770 10000 1 0
2487200875 2014-12-09 00:00:00+00:00 604000 4 3 1960 5000 1 0
1954400510 2015-02-18 00:00:00+00:00 510000 3 2 1680 8080 1 0
7237550310 2014-05-12 00:00:00+00:00 1225000 4 4.5 5420 101930 1 0
1321400060 2014-06-27 00:00:00+00:00 257500 3 2.25 1715 6819 2 0
2008000270 2015-01-15 00:00:00+00:00 291850 3 1.5 1060 9711 1 0
2414600126 2015-04-15 00:00:00+00:00 229500 3 1 1780 7470 1 0
3793500160 2015-03-12 00:00:00+00:00 323000 3 2.5 1890 6560 2 0
view condition grade sqft_above sqft_basement yr_built yr_renovated zipcode lat
0 3 7 1180 0 1955 0 98178 47.51123398
0 3 7 2170 400 1951 1991 98125 47.72102274
0 3 6 770 0 1933 0 98028 47.73792661
0 5 7 1050 910 1965 0 98136 47.52082
0 3 8 1680 0 1987 0 98074 47.61681228
0 3 11 3890 1530 2001 0 98053 47.65611835
0 3 7 1715 0 1995 0 98003 47.30972002
0 3 7 1060 0 1963 0 98198 47.40949984
0 3 7 1050 730 1960 0 98146 47.51229381
0 3 7 1890 0 2003 0 98038 47.36840673
long sqft_living15 sqft_lot15
-122.25677536 1340.0 5650.0
-122.3188624 1690.0 7639.0
-122.23319601 2720.0 8062.0
-122.39318505 1360.0 5000.0
-122.04490059 1800.0 7503.0
-122.00528655 4760.0 101930.0
-122.32704857 2238.0 6819.0
-122.31457273 1650.0 9711.0
-122.33659507 1780.0 8113.0
-122.0308176 2390.0 7570.0
[21613 rows x 21 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

Exploring data for housing sales

graphlab.canvas.set_target('ipynb')
sales.show(view = "Scatter Plot", x="sqft_living", y="price")

Create a simple regression model of sqft_living to price

train_data,test_data = sales.random_split(.8, seed=0)
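For readers without GraphLab Create, the idea behind a seeded 80/20 split can be sketched in plain Python. The helper name `split_rows` is hypothetical, and the sketch assumes each row is assigned independently with probability 0.8. That assumption is consistent with the 17384 training examples reported further down rather than exactly 80% of 21613, but it is not a description of GraphLab's internals.

```python
import random


def split_rows(rows, fraction, seed=0):
    """Hypothetical stand-in for a seeded random split (not the GraphLab API).

    Each row lands in the first part with probability `fraction`,
    so the resulting sizes are only approximately fraction / (1 - fraction).
    """
    rng = random.Random(seed)  # private generator: the same seed gives the same split
    first, second = [], []
    for row in rows:
        if rng.random() < fraction:
            first.append(row)
        else:
            second.append(row)
    return first, second


rows = list(range(1000))  # toy row ids, not the real sales data
train_rows, test_rows = split_rows(rows, 0.8, seed=0)
```

Fixing the seed is what makes the later evaluation numbers reproducible: a different seed gives a different test set and therefore a different RMSE.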

Build the regression model

sqft_model = graphlab.linear_regression.create(train_data, target='price', features=['sqft_living'])
PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.
Linear regression:
--------------------------------------------------------
Number of examples          : 16508
Number of features          : 1
Number of unpacked features : 1
Number of coefficients    : 2
Starting Newton Method
--------------------------------------------------------
+-----------+----------+--------------+--------------------+----------------------+---------------+-----------------+
| Iteration | Passes   | Elapsed Time | Training-max_error | Validation-max_error | Training-rmse | Validation-rmse |
+-----------+----------+--------------+--------------------+----------------------+---------------+-----------------+
| 1         | 2        | 1.005147     | 4350281.148463     | 2246638.464824       | 263020.285748 | 261497.543973   |
+-----------+----------+--------------+--------------------+----------------------+---------------+-----------------+
SUCCESS: Optimal solution found.

Evaluate the simple model

print test_data['price'].mean()
543054.042563
print sqft_model.evaluate(test_data)
{'max_error': 4144088.343371106, 'rmse': 255184.85316374677}
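The two numbers returned by `evaluate` are the largest absolute residual and the root mean squared error. A minimal sketch of both metrics, computed by hand on made-up prices (the function name `evaluate_predictions` and the toy values are illustrative only):

```python
import math


def evaluate_predictions(actual, predicted):
    """Return max absolute error and RMSE for paired actual/predicted values."""
    residuals = [a - p for a, p in zip(actual, predicted)]
    max_error = max(abs(r) for r in residuals)
    rmse = math.sqrt(sum(r * r for r in residuals) / float(len(residuals)))
    return {'max_error': max_error, 'rmse': rmse}


# Toy values, not taken from the data set.
toy_actual = [200000.0, 350000.0, 500000.0]
toy_predicted = [210000.0, 340000.0, 530000.0]
metrics = evaluate_predictions(toy_actual, toy_predicted)
```

Because RMSE is in the same units as the target, it can be read directly against the mean test price printed above.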

Let's show what our predictions look like

import matplotlib.pyplot as plt
%matplotlib inline
plt.plot(test_data['sqft_living'], test_data['price'], '.',
        test_data['sqft_living'],sqft_model.predict(test_data), '-')
[<matplotlib.lines.Line2D at 0x1f0e6198>,
 <matplotlib.lines.Line2D at 0x1f0e6240>]

[Figure: test-set price versus sqft_living (points) with the sqft_model predictions (line)]

sqft_model.get('coefficients')
name index value stderr
(intercept) None -46636.101539 5054.00343049
sqft_living None 281.855182828 2.22196715615
[2 rows x 4 columns]

Explore other features in the data

my_features = ['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'zipcode']
sales[my_features].show()
sales.show(view="BoxWhisker Plot", x="zipcode", y="price")

Build a regression model with more features

my_features_model = graphlab.linear_regression.create(train_data, target='price', features=my_features, validation_set=None)
Linear regression:
--------------------------------------------------------
Number of examples          : 17384
Number of features          : 6
Number of unpacked features : 6
Number of coefficients    : 115
Starting Newton Method
--------------------------------------------------------
+-----------+----------+--------------+--------------------+---------------+
| Iteration | Passes   | Elapsed Time | Training-max_error | Training-rmse |
+-----------+----------+--------------+--------------------+---------------+
| 1         | 2        | 0.024566     | 3763208.270523     | 181908.848367 |
+-----------+----------+--------------+--------------------+---------------+
SUCCESS: Optimal solution found.

print sqft_model.evaluate(test_data)
print my_features_model.evaluate(test_data)
{'max_error': 4144088.343371106, 'rmse': 255184.85316374677}
{'max_error': 3486584.509381705, 'rmse': 179542.4333126903}

Apply learned models to predict prices of 3 houses

House 1

house1 = sales[sales['id'] == '5309101200']
house1
id date price bedrooms bathrooms sqft_living sqft_lot floors waterfront
5309101200 2014-06-05 00:00:00+00:00 620000 4 2.25 2400 5350 1.5 0
view condition grade sqft_above sqft_basement yr_built yr_renovated zipcode lat
0 4 7 1460 940 1929 0 98117 47.67632376
long sqft_living15 sqft_lot15
-122.37010126 1250.0 4880.0
[? rows x 21 columns]
Note: Only the head of the SFrame is printed. This SFrame is lazily evaluated.
You can use sf.materialize() to force materialization.

house1['price']
dtype: int
Rows: ?
[620000L, ... ]
print sqft_model.predict(house1)
[629816.3372479538]
print my_features_model.predict(house1)
[721918.9333272863]
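The `sqft_model` prediction for House 1 can be checked by hand from the coefficients printed earlier, since a one-feature linear model is just intercept plus slope times `sqft_living`. The constant and function names below are illustrative; the numeric values are copied from the `sqft_model.get('coefficients')` output.

```python
# Coefficients as printed by sqft_model.get('coefficients') above.
INTERCEPT = -46636.101539
SLOPE_SQFT_LIVING = 281.855182828


def predict_price(sqft_living):
    """Price predicted by the one-feature linear model."""
    return INTERCEPT + SLOPE_SQFT_LIVING * sqft_living


house1_prediction = predict_price(2400)  # House 1 has sqft_living = 2400
```

The result agrees with the value printed by `sqft_model.predict(house1)` up to the rounding of the displayed coefficients.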

House 2

house2 = sales[sales['id']=='1925069082']
house2
id date price bedrooms bathrooms sqft_living sqft_lot floors waterfront
1925069082 2015-05-11 00:00:00+00:00 2200000 5 4.25 4640 22703 2 1
view condition grade sqft_above sqft_basement yr_built yr_renovated zipcode lat
4 5 8 2860 1780 1952 0 98052 47.63925783
long sqft_living15 sqft_lot15
-122.09722322 3140.0 14200.0
[? rows x 21 columns]
Note: Only the head of the SFrame is printed. This SFrame is lazily evaluated.
You can use sf.materialize() to force materialization.
print sqft_model.predict(house2)
[1261171.9467824404]
print my_features_model.predict(house2)
[1446472.4690774973]

House 3

bill_gates_house =  {'bedrooms':[8], 
              'bathrooms':[25], 
              'sqft_living':[50000], 
              'sqft_lot':[225000],
              'floors':[4], 
              'zipcode':['98039'], 
              'condition':[10], 
              'grade':[10],
              'waterfront':[1],
              'view':[4],
              'sqft_above':[37500],
              'sqft_basement':[12500],
              'yr_built':[1994],
              'yr_renovated':[2010],
              'lat':[47.627606],
              'long':[-122.242054],
              'sqft_living15':[5000],
              'sqft_lot15':[40000]}

print my_features_model.predict(graphlab.SFrame(bill_gates_house))
[13749825.525719076]

Assignment

avgpriceByZip = sales.groupby(key_columns='zipcode', operations={'avgPrice' : agg.MEAN('price')}).sort('avgPrice',ascending = False)
avgpriceByZip
zipcode avgPrice
98039 2160606.6
98004 1355927.09779
98040 1194230.00355
98112 1095499.36803
98102 901258.238095
98109 879623.623853
98105 862825.231441
98006 859684.763052
98119 849448.01087
98005 810164.880952
[70 rows x 2 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.
avgpriceByZip['avgPrice'].max()
2160606.6
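# Filter on the avgPrice column itself; comparing the string 'avgPrice' to a number
# yields False, which only indexes row 0 and works here because the frame is sorted.
heightAvgZip = avgpriceByZip[avgpriceByZip['avgPrice'] == avgpriceByZip['avgPrice'].max()]['zipcode'][0]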
heightAvgZip
'98039'
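The group-by step above (mean price per zipcode, sorted in descending order, then take the top row) can be sketched without GraphLab as follows. The helper `mean_price_by_zip` is hypothetical and the records are a small illustrative sample, not the full data set.

```python
from collections import defaultdict


def mean_price_by_zip(records):
    """Return (zipcode, mean price) pairs sorted by mean price, highest first."""
    totals = defaultdict(lambda: [0.0, 0])  # zipcode -> [sum of prices, count]
    for zipcode, price in records:
        totals[zipcode][0] += price
        totals[zipcode][1] += 1
    means = [(z, s / n) for z, (s, n) in totals.items()]
    return sorted(means, key=lambda pair: pair[1], reverse=True)


# Illustrative sample only; the 98004 prices are made up.
toy_sales = [
    ('98039', 2950000), ('98039', 1905000),
    ('98004', 1200000), ('98004', 1400000),
    ('98178', 221900),
]
ranking = mean_price_by_zip(toy_sales)
top_zip, top_mean = ranking[0]
```

Sorting first and reading the first row gives the same answer as filtering on the maximum, as long as there are no ties at the top.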
sales.show(view="BoxWhisker Plot", x="zipcode", y="price")
house_heighest_avg = sales[sales['zipcode'] == heightAvgZip]
house_heighest_avg
id date price bedrooms bathrooms sqft_living sqft_lot floors waterfront
3625049014 2014-08-29 00:00:00+00:00 2950000 4 3.5 4860 23885 2 0
2540700110 2015-02-12 00:00:00+00:00 1905000 4 3.5 4210 18564 2 0
3262300940 2014-11-07 00:00:00+00:00 875000 3 1 1220 8119 1 0
3262300940 2015-02-10 00:00:00+00:00 940000 3 1 1220 8119 1 0
6447300265 2014-10-14 00:00:00+00:00 4000000 4 5.5 7080 16573 2 0
2470100110 2014-08-04 00:00:00+00:00 5570000 5 5.75 9200 35069 2 0
2210500019 2015-03-24 00:00:00+00:00 937500 3 1 1320 8500 1 0
6447300345 2015-04-06 00:00:00+00:00 1160000 4 3 2680 15438 2 0
6447300225 2014-11-06 00:00:00+00:00 1880000 3 2.75 2620 17919 1 0
2525049148 2014-10-07 00:00:00+00:00 3418800 5 5 5450 20412 2 0
view condition grade sqft_above sqft_basement yr_built yr_renovated zipcode lat
0 3 12 4860 0 1996 0 98039 47.61717049
0 3 11 4210 0 2001 0 98039 47.62060082
0 4 7 1220 0 1955 0 98039 47.63281908
0 4 7 1220 0 1955 0 98039 47.63281908
0 3 12 5760 1320 2008 0 98039 47.61512031
0 3 13 6200 3000 2001 0 98039 47.62888314
0 4 7 1320 0 1954 0 98039 47.61872888
2 3 8 2680 0 1902 1956 98039 47.61089438
1 4 9 2620 0 1949 0 98039 47.61435052
0 3 11 5450 0 2014 0 98039 47.62087993
long sqft_living15 sqft_lot15
-122.23040939 3580.0 16054.0
-122.2245047 3520.0 18564.0
-122.23554392 1910.0 8119.0
-122.23554392 1910.0 8119.0
-122.22420058 3140.0 15996.0
-122.23346379 3560.0 24345.0
-122.22643371 2790.0 10800.0
-122.22582388 4480.0 14406.0
-122.22772057 3400.0 14400.0
-122.23726918 3160.0 17825.0
[? rows x 21 columns]
Note: Only the head of the SFrame is printed. This SFrame is lazily evaluated.
You can use sf.materialize() to force materialization.
print house_heighest_avg['price'].mean()
2160606.6
house_range = sales[(sales['sqft_living'] > 2000) & (sales['sqft_living'] <=4000)]
house_range
id date price bedrooms bathrooms sqft_living sqft_lot floors waterfront
6414100192 2014-12-09 00:00:00+00:00 538000 3 2.25 2570 7242 2 0
1736800520 2015-04-03 00:00:00+00:00 662500 3 2.5 3560 9796 1 0
9297300055 2015-01-24 00:00:00+00:00 650000 4 3 2950 5000 2 0
2524049179 2014-08-26 00:00:00+00:00 2000000 3 2.75 3050 44867 1 0
7137970340 2014-07-03 00:00:00+00:00 285000 5 2.5 2270 6300 2 0
3814700200 2014-11-20 00:00:00+00:00 329000 3 2.25 2450 6500 2 0
1794500383 2014-06-26 00:00:00+00:00 937000 3 1.75 2450 2691 2 0
1873100390 2015-03-02 00:00:00+00:00 719000 4 2.5 2570 7173 2 0
8562750320 2014-11-10 00:00:00+00:00 580500 3 2.5 2320 3980 2 0
0461000390 2014-06-24 00:00:00+00:00 687500 4 1.75 2330 5000 1.5 0
view condition grade sqft_above sqft_basement yr_built yr_renovated zipcode lat
0 3 7 2170 400 1951 1991 98125 47.72102274
0 3 8 1860 1700 1965 0 98007 47.60065993
3 3 9 1980 970 1979 0 98126 47.57136955
4 3 9 2330 720 1968 0 98040 47.53164379
0 3 8 2270 0 1995 0 98092 47.32658071
0 4 8 2450 0 1985 0 98030 47.37386303
0 3 8 1750 700 1915 0 98119 47.63855772
0 3 8 2570 0 2005 0 98052 47.70732168
0 3 8 2320 0 2003 0 98027 47.5391103
0 4 7 1510 820 1929 0 98117 47.68228235
long sqft_living15 sqft_lot15
-122.3188624 1690.0 7639.0
-122.14529566 2210.0 8925.0
-122.37541218 2140.0 4000.0
-122.23345881 4110.0 20336.0
-122.16892624 2240.0 7005.0
-122.17228981 2200.0 6865.0
-122.35985573 1760.0 3573.0
-122.11029785 2630.0 6026.0
-122.06971484 2580.0 3980.0
-122.36760203 1460.0 5000.0
[? rows x 21 columns]
Note: Only the head of the SFrame is printed. This SFrame is lazily evaluated.
You can use sf.materialize() to force materialization.
frac_houses = house_range.num_rows() /  float(sales.num_rows())
print frac_houses
0.421875722945
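The fraction above comes from a half-open range filter: strictly greater than 2000 and at most 4000 square feet. A small plain-Python sketch of the same computation, applied only to the ten `sqft_living` values shown in the head of the table at the top (the function name is illustrative):

```python
def fraction_in_range(values, low, high):
    """Fraction of values v with low < v <= high."""
    selected = [v for v in values if low < v <= high]
    return len(selected) / float(len(values))  # float() avoids integer division in Python 2


# sqft_living of the first ten rows printed at the top of the notebook.
head_sqft_living = [1180, 2570, 770, 1960, 1680, 5420, 1715, 1060, 1780, 1890]
head_fraction = fraction_in_range(head_sqft_living, 2000, 4000)
```

Only one of these ten houses falls in the range, so the head of the table is not representative of the roughly 42% found over all rows.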
advanced_features = [
    'bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'zipcode',
    'condition',      # condition of house
    'grade',          # measure of quality of construction
    'waterfront',     # waterfront property
    'view',           # type of view
    'sqft_above',     # square feet above ground
    'sqft_basement',  # square feet in basement
    'yr_built',       # the year built
    'yr_renovated',   # the year renovated
    'lat', 'long',    # the lat-long of the parcel
    'sqft_living15',  # average sq.ft. of 15 nearest neighbors
    'sqft_lot15',     # average lot size of 15 nearest neighbors
]
advanced_features_model = graphlab.linear_regression.create(train_data, target='price', features=advanced_features, validation_set=None)
Linear regression:
--------------------------------------------------------
Number of examples          : 17384
Number of features          : 18
Number of unpacked features : 18
Number of coefficients    : 127
Starting Newton Method
--------------------------------------------------------
+-----------+----------+--------------+--------------------+---------------+
| Iteration | Passes   | Elapsed Time | Training-max_error | Training-rmse |
+-----------+----------+--------------+--------------------+---------------+
| 1         | 2        | 0.032030     | 3469012.450686     | 154580.940736 |
+-----------+----------+--------------+--------------------+---------------+
SUCCESS: Optimal solution found.

print my_features_model.evaluate(test_data)
print advanced_features_model.evaluate(test_data)
{'max_error': 3462350.260179847, 'rmse': 179400.60890613243}
{'max_error': 3556849.413858208, 'rmse': 156831.1168021901}
rmse_diff = my_features_model.evaluate(test_data)['rmse'] - advanced_features_model.evaluate(test_data)['rmse']
rmse_diff
22569.49210394232