Seattle housing market with a flavor of machine learning (October update)

Kostya
4 min read · Oct 22, 2018

Roughly 10 months ago I played with a Kaggle dataset to get an idea of the Seattle housing market. Now I have extended that analysis with the newest data, actual homes on sale as of October 2018, and added a new perspective by applying some sweet machine learning algorithms to the Seattle housing market.

All technical details are captured in a notebook; this article gives a higher-level overview without diving into every step.

The dataset reflects actual active real estate listings in the Seattle area, beyond King County. To begin with, let’s take a look at the details and compare them with the old Kaggle dataset. Keep in mind that the old summary includes only King County, while the October 2018 data covers a bigger area:

The median price in 2018 is a whopping $750k, and this money can buy a 4-bedroom property, which is a one-bedroom improvement compared to the 2014–2015 period. Most likely it also means an improvement in the size of the property.

Also, the median lot size is 0.19 acre and the median HOA due is about $65. Roughly half of the properties on the market were built after 1991, which is already 27 years ago at the time of writing.

To proceed, I decided to filter out everything that has:

  • fewer than 2 bathrooms
  • more than 4 bedrooms
  • a price above 1.25M
  • a size of less than 1500 sq ft
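A minimal sketch of this filter in pandas, just to make the thresholds concrete. The column names and the five sample rows are made up for illustration, and I read the price bullet as a cap at $1.25M; the real notebook may do this differently.

```python
import pandas as pd

# Hypothetical column names and made-up rows; the real listing export may differ.
listings = pd.DataFrame({
    "price": [450_000, 723_500, 1_400_000, 890_000, 610_000],
    "beds": [3, 4, 5, 4, 2],
    "baths": [1.5, 2.5, 3.0, 2.0, 2.0],
    "sqft": [1200, 2381, 3900, 2100, 1400],
})


def apply_filters(df):
    """Keep only rows that pass all four thresholds from the list above."""
    keep = (
        (df["baths"] >= 2)               # drop fewer than 2 bathrooms
        & (df["beds"] <= 4)              # drop more than 4 bedrooms
        & (df["price"] <= 1_250_000)     # drop prices above 1.25M (assumed cap)
        & (df["sqft"] >= 1500)           # drop homes smaller than 1500 sq ft
    )
    return df[keep].copy()


filtered = apply_filters(listings)
filtered["price_per_sqft"] = filtered["price"] / filtered["sqft"]
print(filtered)
```

Note that the median price per sq ft is the median of the per-home ratios, which is not necessarily the same as the median price divided by the median size.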

That left me with 1059 properties: only 81 (7.6%) are townhomes and the remaining 978 (92.4%) are detached homes. The median price for these is $723,500 and the median size is 2381 sq ft, with a median price per sq ft of $303.

Now it’s time to apply machine learning to the Seattle housing market. I decided to start with a simple exercise: let’s split all houses into 4 clusters (groups) based on listing price only. KMeans should work well and group the houses around four price centroids:

It gives an interesting result. Based on price only, I got the following chart. The Y-axis shows the year, but this value wasn’t used by the k-means algorithm; it is there only to simplify the visualization.

The blue group contains 230 items (21%) with prices up to $585k.

The green group is the biggest one, with 40% of items and a price range of $588–785k.

The maroon group has 294 items (27%), from $786k to $1M.

And last, the yellow group represents only 10% of the real estate, with prices north of $1M.
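The price-only clustering described above can be sketched with scikit-learn’s KMeans. The prices here are synthetic (four made-up bumps, not the real listings), and the parameters `n_init` and `random_state` are my choices for reproducibility, not something taken from the notebook.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic listing prices standing in for the real data (assumption).
rng = np.random.default_rng(0)
prices = np.concatenate([
    rng.normal(500_000, 30_000, 50),
    rng.normal(690_000, 30_000, 50),
    rng.normal(900_000, 30_000, 50),
    rng.normal(1_150_000, 30_000, 50),
])

# KMeans expects a 2-D array: one row per house, one column (price).
X = prices.reshape(-1, 1)
model = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
labels = model.labels_

# Report the size and price range of each cluster, cheapest first.
order = np.argsort(model.cluster_centers_.ravel())
for cluster in order:
    members = prices[labels == cluster]
    print(f"{len(members):3d} homes, ${members.min():,.0f} to ${members.max():,.0f}")
```

With a single feature, k-means simply cuts the price axis into contiguous intervals, which is why the groups above read as non-overlapping price ranges.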

Surprisingly, a lot of expensive properties are located north of Lake Washington. I would explain this by relatively easy access to both sides of the lake: I-405 to get to Kirkland/Redmond/Bellevue, and I-5 directly to Seattle.

At this point, I decided to add a combined feature, LL, defined as latitude × longitude, to investigate the influence of location, if any. The following plot gives an idea of which features influence price and price per sq ft:

Influence of home properties on price in Seattle area

It’s a surprise to me that the age of a home doesn’t influence the price, except for a spike for very new homes, maybe 2015 and newer. So there is no reason to look for an older home in order to save money. As expected, price grows very quickly with square footage, and logically a bigger house has a cheaper sq ft despite the bigger overall price. And obviously, location has a strong correlation with price.
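For illustration, here is a rough sketch of how the LL feature could be built and compared against price. The sample rows and column names are invented, and the Pearson correlation table is my stand-in for the plots, which is not necessarily how the notebook measures influence.

```python
import pandas as pd

# Tiny made-up sample; column names are hypothetical.
homes = pd.DataFrame({
    "latitude": [47.61, 47.67, 47.75, 47.53, 47.70],
    "longitude": [-122.33, -122.12, -122.20, -122.27, -122.38],
    "sqft": [1800, 2600, 3100, 1600, 2200],
    "year_built": [1965, 2004, 2016, 1948, 1991],
    "price": [700_000, 950_000, 1_150_000, 560_000, 780_000],
})

# Combine the two coordinates into a single location feature.
homes["LL"] = homes["latitude"] * homes["longitude"]
homes["price_per_sqft"] = homes["price"] / homes["sqft"]

# Pearson correlation of each feature with price and price per sq ft.
features = ["LL", "sqft", "year_built"]
corr = homes[features + ["price", "price_per_sqft"]].corr()
print(corr.loc[features, ["price", "price_per_sqft"]])
```

A linear correlation would miss effects like the spike for very new homes, so a scatter plot per feature is still worth looking at.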

Applying the AgglomerativeClustering algorithm helped me split the available houses into 10 groups and calculate the median price per group. It’s similar to calculating the median/average price per city, but cities are historically formed groups and I was more interested in seeing natural groups among the available houses. To me, the grouping proposed by the algorithm looks good and gives a good idea of price based on geographic location:

Two very small and isolated clusters are not depicted
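A minimal sketch of the grouping step, assuming the clustering runs on latitude and longitude only and that the median price is computed per group afterwards. The coordinates and prices are synthetic, the column names are hypothetical, and the default Ward linkage is my assumption.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import AgglomerativeClustering

# Synthetic coordinates and prices (assumption): 10 loose geographic blobs.
rng = np.random.default_rng(1)
centers = rng.uniform([47.3, -122.5], [47.9, -121.9], size=(10, 2))
coords = np.vstack([c + rng.normal(0, 0.01, size=(20, 2)) for c in centers])
homes = pd.DataFrame(coords, columns=["latitude", "longitude"])
homes["price"] = rng.integers(500_000, 1_250_000, size=len(homes))

# Group houses by location only; the default linkage is Ward.
clusterer = AgglomerativeClustering(n_clusters=10)
homes["group"] = clusterer.fit_predict(homes[["latitude", "longitude"]])

# Median listing price per geographic group, cheapest first.
median_by_group = homes.groupby("group")["price"].median().sort_values()
print(median_by_group)
```

Unlike city boundaries, these groups come purely from how the listings sit on the map, so very small isolated clusters can show up when a few homes are far from everything else.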

I’m planning to continue my analysis. Feel free to leave feedback on this one and/or suggestions for future directions, like other ways to slice and dice the data.
