This post was originally published on the IBM Center for Applied Insights website. The CAI website was officially sunset on 4/15/16. While it remains online, I’m moving some of my posts from the CAI site over here in case they decide to take the site offline at some point in the future.
Last year, I set out to buy a used car. This wasn’t just any car. It would be the vehicle for a long-planned, off-road expedition to Arizona and Utah with my brother, and for many other off-road adventures in the coming years.
I had set my sights on a 1998-2007 Toyota Land Cruiser or its Lexus sibling, the LX470, both well regarded in off-road circles. Both are relatively scarce on the used market, though, which makes it difficult to put an accurate price on them. As I began my search, I found listings that diverged from the standard pricing guides by as much as $10,000 in some cases.
What’s a data scientist to do? Get some data and build a model, of course.
What I learned not only helped me find the right car at the right price; it also offered lessons that apply to almost any data initiative.
But first, the car.
I began my search on cars.com, using open source tools to extract four data points for each listed vehicle: price, mileage (in units of 10,000 miles), model year, and type (Land Cruiser or LX470).
The resulting data set consisted of 390 vehicles in the U.S., 245 of which were LX470s. The mileage ranged from 38,000 to 321,000 miles. Prices ranged from $7,000 to over $40,000.
[A side note for the data scientists reading this: I conducted this analysis in R and used the rvest package to scrape the data from cars.com.]
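For the curious, here is a minimal sketch of what that scraping step could look like. The URL and the CSS selectors are placeholders; cars.com's markup has changed since I ran this, so you'd need to inspect the live page for the current ones.

```r
library(rvest)    # read_html(), html_nodes(), html_text()
library(stringr)  # string cleanup

# One page of search results. The URL and CSS selectors below are
# hypothetical placeholders, not the real ones.
url  <- "https://www.cars.com/for-sale/searchresults.action"
page <- read_html(url)

price <- page %>%
  html_nodes(".listing-row__price") %>%   # hypothetical selector
  html_text() %>%
  str_replace_all("[$,]", "") %>%
  as.numeric()

mileage <- page %>%
  html_nodes(".listing-row__mileage") %>% # hypothetical selector
  html_text() %>%
  str_replace_all("[^0-9]", "") %>%
  as.numeric()

title <- page %>%
  html_nodes(".listing-row__title") %>%   # hypothetical selector
  html_text()

cars <- data.frame(
  price = price,
  miles = mileage / 10000,                          # 10k-mile units
  year  = as.integer(str_extract(title, "\\d{4}")),
  type  = ifelse(str_detect(title, "LX"), "LX470", "Land Cruiser")
)
```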
Visualizing the data
In business intelligence circles, visualizing data is a standard way to uncover patterns, so I created graphs showing the relationship between miles, price and type, and between year, price and type.


The plots made it quite apparent, as expected, that newer models are priced higher and that prices fall as mileage climbs. They also showed that LX470s are typically priced slightly higher than Land Cruisers, and they made the outliers easy to spot.
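[For the data scientists: here's a sketch of how charts like these could be drawn with ggplot2, assuming the cars data frame from the scraping sketch above. Swapping miles for year gives the second chart.]

```r
library(ggplot2)
library(scales)  # dollar and comma axis labels

# Price vs. mileage, colored by vehicle type, with a linear trend per type.
ggplot(cars, aes(x = miles * 10000, y = price, color = type)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE) +
  scale_x_continuous(labels = comma) +
  scale_y_continuous(labels = dollar) +
  labs(x = "Miles", y = "Price", color = "Type")
```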
But the question remained: What should I offer for any given vehicle?
Understanding pricing variations
My next step was to build a regression model to understand more precisely how price relates to these variables. (The model isn’t perfect, but it does the job.)
It turns out that mileage, year and type explain about 87 percent of the variation in price, and half of the vehicles in the market will be within about $1,300 of the price that the model predicts.
All else being equal (condition, features and the like), each additional 10,000 miles reduces the price by about $620, and LX470s sell for about $635 more than comparable Land Cruisers.
The model also showed a roughly $3,000 price increase between 2002 and 2003 vehicles, and again between 2005 and 2006 vehicles, both of which are likely due to design improvements.
While the model predictions were close to the industry guide I consulted, there were instances where the estimates diverged. These differences showed the law of supply and demand at work.
[For the data scientists reading this, the fitted model is: predicted price = $22,555 − $621 × (miles / 10,000) + year price adjustment + type price adjustment.]
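In R, a model like this is a one-line call to lm(). Here's a sketch, again assuming the cars data frame from earlier; treating year as a categorical variable is what yields the per-year price adjustments.

```r
# Mileage enters linearly; year enters as a factor so each model year
# gets its own price adjustment; type contributes a fixed offset.
fit <- lm(price ~ miles + factor(year) + type, data = cars)

summary(fit)$r.squared        # share of price variation explained (~0.87)
median(abs(residuals(fit)))   # half of listings fall within this of the prediction

# Price a specific listing, say a 2006 LX470 with 96,000 miles:
predict(fit, newdata = data.frame(miles = 9.6, year = 2006, type = "LX470"))
```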
Putting it all in context
This exercise helped me narrow down pricing based on the market inventory at the time, giving me greater confidence as I negotiated with dealers. It also offered two lessons that apply to almost any enterprise data initiative.
Lesson 1: Focus on the data first
Models are only as good as the data you feed them, so it’s important to think about how you organize and manage the data. In this case, I spent several hours getting the data and preparing it for analysis. Creating the model only took a few minutes.
This ratio isn’t unusual. Data scientists are often jokingly referred to as “data janitors” because we spend 80 percent of our time cleaning up data.
As the amount of data enterprises collect has grown, so too has the importance of proper data management.
In fact, marketing scientists—progressive marketing leaders who use scientific methodology to effectively predict customer needs and prescribe solutions—identify effective structuring and management of data as a key pillar of their success. And they are nearly twice as proficient as traditional marketers in “architecting” the data so that it’s “digestible, dissectible, and easily retrieved” across their organizations.
Because of this, they’re better able to test new theories and to conduct more in-depth analysis than their peers.
How does your organization manage its data? Can you easily pursue new areas of inquiry, or do you constantly need to start from square one? If it’s the latter, it may be time to review your data architecture.
Lesson 2: Keep it simple
There are times when squeezing every drop of performance out of a model matters enough to justify having data scientists build “black box” models: models so complicated that they are difficult for anyone other than their creators to understand.
Trading is a great example of this. In an industry where one one-hundredth of a cent matters, the complexity of the model is irrelevant; performance is everything.
However, complexity can have negative consequences, as the 2010 “flash crash” in the U.S. stock market demonstrated.
In analytics, enterprises need to balance performance with complexity.
In my case, I could have built a slightly better model that accounted for the interaction between model year and mileage (see the sketch below). However, the improvement would have been modest, only about 1.5 percent, while the model would have become harder to explain.
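To make the comparison concrete, here's a sketch of how the two models stack up in R; the * in the second formula adds the year-by-mileage interaction on top of the main effects.

```r
# The richer model lets the per-10,000-mile discount vary by model year.
fit_simple      <- lm(price ~ miles + factor(year) + type, data = cars)
fit_interaction <- lm(price ~ miles * factor(year) + type, data = cars)

summary(fit_simple)$r.squared       # baseline fit
summary(fit_interaction)$r.squared  # only about 1.5 points higher in my data
```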
Can your data scientists explain their data models in business terms? Will the potential performance improvement justify the time and effort required to manage the additional complexity? Will increased complexity open your enterprise to greater risk?
Generally speaking, it’s often best for data scientists to use the simplest model that gets the job done.
My car search was successful. I purchased a 2006 LX470 with 96,000 miles at a local dealer, and the purchase price was within $1,000 of what the model predicted. And I just returned from the first of many off-road expeditions.
I’m sure this won’t be the last time I build a data model as part of an everyday endeavor. What about you? Have you ever wanted to leverage analytics to help guide a decision? As a data scientist, I’d love to hear your stories.
