import graphlab
import graphlab.aggregate as agg
Load some house sales data
sales = graphlab.SFrame('home_data.gl/')
| id | date | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront |
| 7129300520 | 2014-10-13 00:00:00+00:00 | 221900 | 3 | 1 | 1180 | 5650 | 1 | 0 |
| 6414100192 | 2014-12-09 00:00:00+00:00 | 538000 | 3 | 2.25 | 2570 | 7242 | 2 | 0 |
| 5631500400 | 2015-02-25 00:00:00+00:00 | 180000 | 2 | 1 | 770 | 10000 | 1 | 0 |
| 2487200875 | 2014-12-09 00:00:00+00:00 | 604000 | 4 | 3 | 1960 | 5000 | 1 | 0 |
| 1954400510 | 2015-02-18 00:00:00+00:00 | 510000 | 3 | 2 | 1680 | 8080 | 1 | 0 |
| 7237550310 | 2014-05-12 00:00:00+00:00 | 1225000 | 4 | 4.5 | 5420 | 101930 | 1 | 0 |
| 1321400060 | 2014-06-27 00:00:00+00:00 | 257500 | 3 | 2.25 | 1715 | 6819 | 2 | 0 |
| 2008000270 | 2015-01-15 00:00:00+00:00 | 291850 | 3 | 1.5 | 1060 | 9711 | 1 | 0 |
| 2414600126 | 2015-04-15 00:00:00+00:00 | 229500 | 3 | 1 | 1780 | 7470 | 1 | 0 |
| 3793500160 | 2015-03-12 00:00:00+00:00 | 323000 | 3 | 2.5 | 1890 | 6560 | 2 | 0 |

| view | condition | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat |
| 0 | 3 | 7 | 1180 | 0 | 1955 | 0 | 98178 | 47.51123398 |
| 0 | 3 | 7 | 2170 | 400 | 1951 | 1991 | 98125 | 47.72102274 |
| 0 | 3 | 6 | 770 | 0 | 1933 | 0 | 98028 | 47.73792661 |
| 0 | 5 | 7 | 1050 | 910 | 1965 | 0 | 98136 | 47.52082 |
| 0 | 3 | 8 | 1680 | 0 | 1987 | 0 | 98074 | 47.61681228 |
| 0 | 3 | 11 | 3890 | 1530 | 2001 | 0 | 98053 | 47.65611835 |
| 0 | 3 | 7 | 1715 | 0 | 1995 | 0 | 98003 | 47.30972002 |
| 0 | 3 | 7 | 1060 | 0 | 1963 | 0 | 98198 | 47.40949984 |
| 0 | 3 | 7 | 1050 | 730 | 1960 | 0 | 98146 | 47.51229381 |
| 0 | 3 | 7 | 1890 | 0 | 2003 | 0 | 98038 | 47.36840673 |

| long | sqft_living15 | sqft_lot15 |
| -122.25677536 | 1340.0 | 5650.0 |
| -122.3188624 | 1690.0 | 7639.0 |
| -122.23319601 | 2720.0 | 8062.0 |
| -122.39318505 | 1360.0 | 5000.0 |
| -122.04490059 | 1800.0 | 7503.0 |
| -122.00528655 | 4760.0 | 101930.0 |
| -122.32704857 | 2238.0 | 6819.0 |
| -122.31457273 | 1650.0 | 9711.0 |
| -122.33659507 | 1780.0 | 8113.0 |
| -122.0308176 | 2390.0 | 7570.0 |
[21613 rows x 21 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.
Exploring data for housing sales
graphlab.canvas.set_target('ipynb')
sales.show(view="Scatter Plot", x="sqft_living", y="price")
Create a simple regression model of sqft_living to price
train_data, test_data = sales.random_split(.8, seed=0)
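The split above sends roughly 80% of the rows to the training set and the rest to the test set, and fixing the seed makes the split repeatable. The following is a minimal pure-Python sketch of that idea, not GraphLab's actual implementation; the helper name `split_rows` and the toy row list are assumptions made for illustration.

```python
import random

def split_rows(rows, fraction, seed=0):
    """Hypothetical helper: send each row to the first list with
    probability `fraction`, otherwise to the second list."""
    rng = random.Random(seed)  # a private, seeded generator keeps the split repeatable
    first, second = [], []
    for row in rows:
        if rng.random() < fraction:
            first.append(row)
        else:
            second.append(row)
    return first, second

rows = list(range(1000))  # toy stand-in for the rows of a table
train_rows, test_rows = split_rows(rows, 0.8, seed=0)
train_again, test_again = split_rows(rows, 0.8, seed=0)  # same seed, same split
```

Because each row is assigned independently, the training set holds approximately, not exactly, 80% of the rows.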
Build the regression model
sqft_model = graphlab.linear_regression.create(train_data, target='price', features=['sqft_living'])
PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
You can set ``validation_set=None`` to disable validation tracking.
Linear regression:
--------------------------------------------------------
Number of examples : 16508
Number of features : 1
Number of unpacked features : 1
Number of coefficients : 2
Starting Newton Method
--------------------------------------------------------
+-----------+----------+--------------+--------------------+----------------------+---------------+-----------------+
| Iteration | Passes | Elapsed Time | Training-max_error | Validation-max_error | Training-rmse | Validation-rmse |
+-----------+----------+--------------+--------------------+----------------------+---------------+-----------------+
| 1 | 2 | 1.005147 | 4350281.148463 | 2246638.464824 | 263020.285748 | 261497.543973 |
+-----------+----------+--------------+--------------------+----------------------+---------------+-----------------+
SUCCESS: Optimal solution found.
Evaluate the simple model
print test_data['price'].mean()
print sqft_model.evaluate(test_data)
{'max_error': 4144088.343371106, 'rmse': 255184.85316374677}
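The two reported metrics can be read as follows: `max_error` is the largest absolute difference between a predicted and an actual price, and `rmse` is the square root of the mean squared difference. A minimal sketch of both, assuming plain Python lists; the function name `evaluate_predictions` and the three toy prices are illustrative and are not taken from the data set.

```python
import math

def evaluate_predictions(actual, predicted):
    """Hypothetical helper returning the same two keys as the output above."""
    errors = [abs(a - p) for a, p in zip(actual, predicted)]
    mean_squared = sum(e * e for e in errors) / float(len(errors))
    return {'max_error': max(errors), 'rmse': math.sqrt(mean_squared)}

# Toy numbers chosen only to make the arithmetic easy to follow.
actual = [200000.0, 350000.0, 500000.0]
predicted = [210000.0, 330000.0, 540000.0]
metrics = evaluate_predictions(actual, predicted)
```

Since RMSE is in the same units as the target (dollars here), it is easiest to interpret next to the mean test price printed in the same cell.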
Let's show what our predictions look like
import matplotlib.pyplot as plt
%matplotlib inline
plt.plot(test_data['sqft_living'], test_data['price'], '.',
         test_data['sqft_living'], sqft_model.predict(test_data), '-')
[<matplotlib.lines.Line2D at 0x1f0e6198>,
<matplotlib.lines.Line2D at 0x1f0e6240>]
sqft_model.get('coefficients')
| name | index | value | stderr |
| (intercept) | None | -46636.101539 | 5054.00343049 |
| sqft_living | None | 281.855182828 | 2.22196715615 |
[2 rows x 4 columns]
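The fitted line is price = intercept + slope * sqft_living, so each extra square foot of living area adds roughly $282 to the predicted price. A small sketch that applies the two coefficient values printed above by hand; the function name `predict_price` and the 2,400 sq ft input are illustrative assumptions.

```python
# Coefficient values copied from the table above.
INTERCEPT = -46636.101539
SLOPE = 281.855182828  # dollars per square foot of living area

def predict_price(sqft_living):
    """Hypothetical helper: evaluate the fitted line at a given living area."""
    return INTERCEPT + SLOPE * sqft_living

estimate = predict_price(2400)  # predicted price for a 2,400 sq ft house
```

For a single feature this should agree with what `sqft_model.predict` returns for a row with the same `sqft_living`, up to rounding of the printed coefficients.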
Explore other features in the data
my_features = ['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'zipcode']
sales[my_features].show()
sales.show(view="BoxWhisker Plot", x="zipcode", y="price")
Build a regression model with more features
my_features_model = graphlab.linear_regression.create(train_data, target='price', features=my_features, validation_set=None)
Linear regression:
--------------------------------------------------------
Number of examples : 17384
Number of features : 6
Number of unpacked features : 6
Number of coefficients : 115
Starting Newton Method
--------------------------------------------------------
+-----------+----------+--------------+--------------------+---------------+
| Iteration | Passes | Elapsed Time | Training-max_error | Training-rmse |
+-----------+----------+--------------+--------------------+---------------+
| 1 | 2 | 0.024566 | 3763208.270523 | 181908.848367 |
+-----------+----------+--------------+--------------------+---------------+
SUCCESS: Optimal solution found.
print sqft_model.evaluate(test_data)
print my_features_model.evaluate(test_data)
{'max_error': 4144088.343371106, 'rmse': 255184.85316374677}
{'max_error': 3486584.509381705, 'rmse': 179542.4333126903}
Apply learned models to predict prices of 3 houses
house1 = sales[sales['id'] == '5309101200']
| id | date | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront |
| 5309101200 | 2014-06-05 00:00:00+00:00 | 620000 | 4 | 2.25 | 2400 | 5350 | 1.5 | 0 |

| view | condition | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat |
| 0 | 4 | 7 | 1460 | 940 | 1929 | 0 | 98117 | 47.67632376 |

| long | sqft_living15 | sqft_lot15 |
| -122.37010126 | 1250.0 | 4880.0 |
[? rows x 21 columns]
Note: Only the head of the SFrame is printed. This SFrame is lazily evaluated.
You can use sf.materialize() to force materialization.
dtype: int
Rows: ?
[620000L, ... ]
print sqft_model.predict(house1)
print my_features_model.predict(house1)
house2 = sales[sales['id'] == '1925069082']
| id | date | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront |
| 1925069082 | 2015-05-11 00:00:00+00:00 | 2200000 | 5 | 4.25 | 4640 | 22703 | 2 | 1 |

| view | condition | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat |
| 4 | 5 | 8 | 2860 | 1780 | 1952 | 0 | 98052 | 47.63925783 |

| long | sqft_living15 | sqft_lot15 |
| -122.09722322 | 3140.0 | 14200.0 |
[? rows x 21 columns]
Note: Only the head of the SFrame is printed. This SFrame is lazily evaluated.
You can use sf.materialize() to force materialization.
print sqft_model.predict(house2)
print my_features_model.predict(house2)
bill_gates_house = {'bedrooms': [8],
                    'bathrooms': [25],
                    'sqft_living': [50000],
                    'sqft_lot': [225000],
                    'floors': [4],
                    'zipcode': ['98039'],
                    'condition': [10],
                    'grade': [10],
                    'waterfront': [1],
                    'view': [4],
                    'sqft_above': [37500],
                    'sqft_basement': [12500],
                    'yr_built': [1994],
                    'yr_renovated': [2010],
                    'lat': [47.627606],
                    'long': [-122.242054],
                    'sqft_living15': [5000],
                    'sqft_lot15': [40000]}
print my_features_model.predict(graphlab.SFrame(bill_gates_house))
avgpriceByZip = sales.groupby(key_columns='zipcode', operations={'avgPrice': agg.MEAN('price')}).sort('avgPrice', ascending=False)
| zipcode | avgPrice |
| 98039 | 2160606.6 |
| 98004 | 1355927.09779 |
| 98040 | 1194230.00355 |
| 98112 | 1095499.36803 |
| 98102 | 901258.238095 |
| 98109 | 879623.623853 |
| 98105 | 862825.231441 |
| 98006 | 859684.763052 |
| 98119 | 849448.01087 |
| 98005 | 810164.880952 |
[70 rows x 2 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.
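The group-by above collects the sale prices for each zipcode, averages them, and sorts the averages from highest to lowest. A minimal pure-Python sketch of the same computation, assuming rows are plain dicts; the helper name `mean_by_key` is hypothetical, and the four sample rows are a handful of sales copied from the tables in this notebook, so their averages are not the full-data averages.

```python
from collections import defaultdict

def mean_by_key(rows, key, value):
    """Hypothetical helper: average `value` per distinct `key`,
    returned as (key, mean) pairs sorted from highest to lowest mean."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        totals[row[key]] += row[value]
        counts[row[key]] += 1
    means = [(k, totals[k] / counts[k]) for k in totals]
    return sorted(means, key=lambda pair: pair[1], reverse=True)

# A few (zipcode, price) pairs copied from the printed tables.
sample_rows = [
    {'zipcode': '98039', 'price': 875000},
    {'zipcode': '98039', 'price': 940000},
    {'zipcode': '98178', 'price': 221900},
    {'zipcode': '98125', 'price': 538000},
]
avg_by_zip = mean_by_key(sample_rows, 'zipcode', 'price')
top_zip, top_avg = avg_by_zip[0]  # first entry has the highest average
```

Sorting in descending order means the first entry is the most expensive zipcode, which is what the next cells rely on.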
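avgpriceByZip['avgPrice'].max()
# Build the boolean mask from the column itself. Comparing the string 'avgPrice'
# with a number is simply False, which only happened to select the first row.
heightAvgZip = avgpriceByZip[avgpriceByZip['avgPrice'] == avgpriceByZip['avgPrice'].max()]['zipcode'][0]
sales.show(view="BoxWhisker Plot", x="zipcode", y="price")
house_heighest_avg = sales[sales['zipcode'] == heightAvgZip]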
| id | date | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront |
| 3625049014 | 2014-08-29 00:00:00+00:00 | 2950000 | 4 | 3.5 | 4860 | 23885 | 2 | 0 |
| 2540700110 | 2015-02-12 00:00:00+00:00 | 1905000 | 4 | 3.5 | 4210 | 18564 | 2 | 0 |
| 3262300940 | 2014-11-07 00:00:00+00:00 | 875000 | 3 | 1 | 1220 | 8119 | 1 | 0 |
| 3262300940 | 2015-02-10 00:00:00+00:00 | 940000 | 3 | 1 | 1220 | 8119 | 1 | 0 |
| 6447300265 | 2014-10-14 00:00:00+00:00 | 4000000 | 4 | 5.5 | 7080 | 16573 | 2 | 0 |
| 2470100110 | 2014-08-04 00:00:00+00:00 | 5570000 | 5 | 5.75 | 9200 | 35069 | 2 | 0 |
| 2210500019 | 2015-03-24 00:00:00+00:00 | 937500 | 3 | 1 | 1320 | 8500 | 1 | 0 |
| 6447300345 | 2015-04-06 00:00:00+00:00 | 1160000 | 4 | 3 | 2680 | 15438 | 2 | 0 |
| 6447300225 | 2014-11-06 00:00:00+00:00 | 1880000 | 3 | 2.75 | 2620 | 17919 | 1 | 0 |
| 2525049148 | 2014-10-07 00:00:00+00:00 | 3418800 | 5 | 5 | 5450 | 20412 | 2 | 0 |

| view | condition | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat |
| 0 | 3 | 12 | 4860 | 0 | 1996 | 0 | 98039 | 47.61717049 |
| 0 | 3 | 11 | 4210 | 0 | 2001 | 0 | 98039 | 47.62060082 |
| 0 | 4 | 7 | 1220 | 0 | 1955 | 0 | 98039 | 47.63281908 |
| 0 | 4 | 7 | 1220 | 0 | 1955 | 0 | 98039 | 47.63281908 |
| 0 | 3 | 12 | 5760 | 1320 | 2008 | 0 | 98039 | 47.61512031 |
| 0 | 3 | 13 | 6200 | 3000 | 2001 | 0 | 98039 | 47.62888314 |
| 0 | 4 | 7 | 1320 | 0 | 1954 | 0 | 98039 | 47.61872888 |
| 2 | 3 | 8 | 2680 | 0 | 1902 | 1956 | 98039 | 47.61089438 |
| 1 | 4 | 9 | 2620 | 0 | 1949 | 0 | 98039 | 47.61435052 |
| 0 | 3 | 11 | 5450 | 0 | 2014 | 0 | 98039 | 47.62087993 |

| long | sqft_living15 | sqft_lot15 |
| -122.23040939 | 3580.0 | 16054.0 |
| -122.2245047 | 3520.0 | 18564.0 |
| -122.23554392 | 1910.0 | 8119.0 |
| -122.23554392 | 1910.0 | 8119.0 |
| -122.22420058 | 3140.0 | 15996.0 |
| -122.23346379 | 3560.0 | 24345.0 |
| -122.22643371 | 2790.0 | 10800.0 |
| -122.22582388 | 4480.0 | 14406.0 |
| -122.22772057 | 3400.0 | 14400.0 |
| -122.23726918 | 3160.0 | 17825.0 |
[? rows x 21 columns]
Note: Only the head of the SFrame is printed. This SFrame is lazily evaluated.
You can use sf.materialize() to force materialization.
print house_heighest_avg['price'].mean()
house_range = sales[(sales['sqft_living'] > 2000) & (sales['sqft_living'] <= 4000)]
| id | date | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront |
| 6414100192 | 2014-12-09 00:00:00+00:00 | 538000 | 3 | 2.25 | 2570 | 7242 | 2 | 0 |
| 1736800520 | 2015-04-03 00:00:00+00:00 | 662500 | 3 | 2.5 | 3560 | 9796 | 1 | 0 |
| 9297300055 | 2015-01-24 00:00:00+00:00 | 650000 | 4 | 3 | 2950 | 5000 | 2 | 0 |
| 2524049179 | 2014-08-26 00:00:00+00:00 | 2000000 | 3 | 2.75 | 3050 | 44867 | 1 | 0 |
| 7137970340 | 2014-07-03 00:00:00+00:00 | 285000 | 5 | 2.5 | 2270 | 6300 | 2 | 0 |
| 3814700200 | 2014-11-20 00:00:00+00:00 | 329000 | 3 | 2.25 | 2450 | 6500 | 2 | 0 |
| 1794500383 | 2014-06-26 00:00:00+00:00 | 937000 | 3 | 1.75 | 2450 | 2691 | 2 | 0 |
| 1873100390 | 2015-03-02 00:00:00+00:00 | 719000 | 4 | 2.5 | 2570 | 7173 | 2 | 0 |
| 8562750320 | 2014-11-10 00:00:00+00:00 | 580500 | 3 | 2.5 | 2320 | 3980 | 2 | 0 |
| 0461000390 | 2014-06-24 00:00:00+00:00 | 687500 | 4 | 1.75 | 2330 | 5000 | 1.5 | 0 |

| view | condition | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat |
| 0 | 3 | 7 | 2170 | 400 | 1951 | 1991 | 98125 | 47.72102274 |
| 0 | 3 | 8 | 1860 | 1700 | 1965 | 0 | 98007 | 47.60065993 |
| 3 | 3 | 9 | 1980 | 970 | 1979 | 0 | 98126 | 47.57136955 |
| 4 | 3 | 9 | 2330 | 720 | 1968 | 0 | 98040 | 47.53164379 |
| 0 | 3 | 8 | 2270 | 0 | 1995 | 0 | 98092 | 47.32658071 |
| 0 | 4 | 8 | 2450 | 0 | 1985 | 0 | 98030 | 47.37386303 |
| 0 | 3 | 8 | 1750 | 700 | 1915 | 0 | 98119 | 47.63855772 |
| 0 | 3 | 8 | 2570 | 0 | 2005 | 0 | 98052 | 47.70732168 |
| 0 | 3 | 8 | 2320 | 0 | 2003 | 0 | 98027 | 47.5391103 |
| 0 | 4 | 7 | 1510 | 820 | 1929 | 0 | 98117 | 47.68228235 |

| long | sqft_living15 | sqft_lot15 |
| -122.3188624 | 1690.0 | 7639.0 |
| -122.14529566 | 2210.0 | 8925.0 |
| -122.37541218 | 2140.0 | 4000.0 |
| -122.23345881 | 4110.0 | 20336.0 |
| -122.16892624 | 2240.0 | 7005.0 |
| -122.17228981 | 2200.0 | 6865.0 |
| -122.35985573 | 1760.0 | 3573.0 |
| -122.11029785 | 2630.0 | 6026.0 |
| -122.06971484 | 2580.0 | 3980.0 |
| -122.36760203 | 1460.0 | 5000.0 |
[? rows x 21 columns]
Note: Only the head of the SFrame is printed. This SFrame is lazily evaluated.
You can use sf.materialize() to force materialization.
frac_houses = house_range.num_rows() / float(sales.num_rows())
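The fraction is the number of houses whose living area is greater than 2,000 sq ft and at most 4,000 sq ft, divided by the total number of houses; the `float(...)` call avoids integer division under Python 2. A minimal sketch of the same filter-and-divide step, assuming a plain list; the helper name `fraction_in_range` is hypothetical, and the ten values are the `sqft_living` entries from the head of the sales table, so the result describes only that sample.

```python
def fraction_in_range(values, low, high):
    """Hypothetical helper: share of values v with low < v <= high."""
    selected = [v for v in values if low < v <= high]
    return len(selected) / float(len(values))  # float() guards against integer division

# sqft_living for the first ten rows printed at the top of the notebook.
sqft_living_head = [1180, 2570, 770, 1960, 1680, 5420, 1715, 1060, 1780, 1890]
frac_head = fraction_in_range(sqft_living_head, 2000, 4000)
```

Note that the lower bound is exclusive and the upper bound inclusive, matching the `>` and `<=` comparisons in the filter above.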
advanced_features = [
    'bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'zipcode',
    'condition',      # condition of house
    'grade',          # measure of quality of construction
    'waterfront',     # waterfront property
    'view',           # type of view
    'sqft_above',     # square feet above ground
    'sqft_basement',  # square feet in basement
    'yr_built',       # the year built
    'yr_renovated',   # the year renovated
    'lat', 'long',    # the lat-long of the parcel
    'sqft_living15',  # average sq.ft. of 15 nearest neighbors
    'sqft_lot15',     # average lot size of 15 nearest neighbors
]
advanced_features_model = graphlab.linear_regression.create(train_data, target='price', features=advanced_features, validation_set=None)
Linear regression:
--------------------------------------------------------
Number of examples : 17384
Number of features : 18
Number of unpacked features : 18
Number of coefficients : 127
Starting Newton Method
--------------------------------------------------------
+-----------+----------+--------------+--------------------+---------------+
| Iteration | Passes | Elapsed Time | Training-max_error | Training-rmse |
+-----------+----------+--------------+--------------------+---------------+
| 1 | 2 | 0.032030 | 3469012.450686 | 154580.940736 |
+-----------+----------+--------------+--------------------+---------------+
SUCCESS: Optimal solution found.
print my_features_model.evaluate(test_data)
print advanced_features_model.evaluate(test_data)
{'max_error': 3462350.260179847, 'rmse': 179400.60890613243}
{'max_error': 3556849.413858208, 'rmse': 156831.1168021901}
rmse_diff = my_features_model.evaluate(test_data)['rmse'] - advanced_features_model.evaluate(test_data)['rmse']