The problem with Big Data is not the Data

There is a seemingly irrational obsession with how BIG your Big Data has to be before a magical unicorn appears and delivers the answers your business needs. Not a day goes by when I don’t see some swanky infographic reminding me that Facebook collects several Yottabytes of data every day.

Ok, so I may have embellished that a little, but you get the idea. The key thing is that the ‘size’ of the data is really quite arbitrary: I can easily generate a few terabytes of garbage data an hour – that doesn’t mean any of it is actually useful.

The real problem with Big Data is knowing what you want, and how to go about solving it.

Background

I’ve met a lot of really interesting people at conferences, meetups and the like. Most are at least interested in the promise of Big Data, and some are actively involved in Big Data initiatives where they work. If you talk to anyone about their aspirations for Big Data, you will almost certainly encounter statements like:

“We have 27 Petabytes of web logs, what can you do with that?” or, “We put all our customer data in MongoDB/Cassandra/Hadoop, now how do we get value from it?”

This attitude is not uncommon: from academia to startups and respected FTSE 100 companies, this perspective on the Big Data problem is everywhere. What follows next is often derided by “the business” as yet another case of “technology for technology’s sake”.

Yet, the solution is deceptively simple:

Start with a business challenge, or an opportunity – then look for the data you can use/collect to solve it.

When you begin with a clear business challenge/opportunity, you start to solve the real problem with Big Data – knowing what you want to achieve. Right from the start this also gives you an amazingly powerful tool that you won’t find anywhere else: the basis for a business justification that your bosses can understand.

Sadly, it doesn’t always happen this way – read on to see how not to be a Big Data Rockstar.

How to massively #FAIL at Big Data

So you’ve been to all the conferences, you’ve got the Datameer T-Shirt, the MapR baseball cap, the cute little Elephant from Informatica and all the other goodies. You’ve installed one of the major Hadoop distributions like Cloudera or Hortonworks and done your first ever Map/Reduce Wordcount. But this is all running on your laptop, and it’s hardly ‘Big Data’ if it’s on your 256GB hard drive – you need to scale up!

In an effort to find the biggest sources of data to play with, you’ve stumbled across terabytes of log files from your company’s web servers. Armed with this information you think to yourself:

“Great! I’ve got 60TB of log files covering the last 12 months, I want to scale this to hold at least another 3 years so that’s 60TB * 4 = 240TB, I’ll round that up to about 300TB, and not forgetting to multiply it by 3 because of HDFS replication so that’s 900TB – almost 1 Petabyte!?!?!?! (geekgasm)”

So you sketch up a business justification for your superiors and finance department which can be summarised as this:

We need Big Data to collect and store 3-4 years’ worth of log files from our web servers,
By mining these log files we will discover new insights about our customers and products/services just like Facebook does,
Hardware: 120 fully populated HP DL380s ~$980,000
Software: 120 licenses for Cloudera/Hortonworks/MapR ~$400,000/year
Total: ~$1.5 million for the first year, at least $400,000/year thereafter + people
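
To see how little is behind those headline numbers, the sizing and costing can be retraced in a few lines. This is only a sketch using the figures quoted above; the variable names are my own, and the exact sum of the two cost lines comes out a little under the rounded-up "~$1.5 million".

```python
# All figures come from the (deliberately bad) proposal above.
# Variable names are illustrative only.
LOGS_PER_YEAR_TB = 60        # last 12 months of web logs
YEARS_TO_HOLD = 4            # the existing year plus 3 more
ROUNDED_RAW_TB = 300         # the proposal rounds the raw figure up
HDFS_REPLICATION = 3         # three copies of every block

raw_tb = LOGS_PER_YEAR_TB * YEARS_TO_HOLD          # unrounded raw storage
cluster_tb = ROUNDED_RAW_TB * HDFS_REPLICATION     # storage actually requested

HARDWARE_USD = 980_000
SOFTWARE_USD_PER_YEAR = 400_000
first_year_usd = HARDWARE_USD + SOFTWARE_USD_PER_YEAR

print(f"raw logs: {raw_tb}TB, requested cluster: {cluster_tb}TB")
print(f"first year: ${first_year_usd:,}, then ${SOFTWARE_USD_PER_YEAR:,}/year")
```

Notice that nothing in this calculation says what the data is for: every input is a storage or hardware number, and none of them is a business outcome.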

I can tell you now, this sort of proposal will be DENIED by anyone with a shred of common sense, let alone business acumen.

Solving the real problem with Big Data

I’m going to let you in on a little secret: follow these steps and your project will be successful (disclaimer: unicorns not included).

1). Don’t go anywhere near the technology!?!?!?!

You cannot possibly hope to make a decision about the technology stack without first knowing what you hope to achieve. Indeed, going this route too early will likely compromise what you can/can’t do later. I know, I’ve been there: overcoming the urge to play with every shiny new Big Data widget is very difficult. So tempting as it might be: step away from the keyboard, now!

2). Talk to people, lots of people…

You may consider yourself the fountain of all knowledge in the place you work, and while that might well be true – sometimes it’s better to get other people’s perspectives (even if you think they’re wrong). This serves three very important purposes:

(a) it may reinforce your confidence about a challenge/opportunity you’re already aware of,
(b) you might discover new challenges/opportunities that you weren’t previously aware of,
(c) when it comes to your business justification you have more “buy-in” from your colleagues.

Remember, you’re hoping Big Data will give you data-driven decisions – so collect more data.

3). Don’t bite off more Big Data than you can chew

By now you’ll have talked to a cross-section of colleagues across your company and have an almost endless list of possible challenges/opportunities. Now the hard part is you have to decide which one(s) to do. This process is really important, and again I strongly recommend you use a data-driven approach. Ask yourself these questions for each of the potential projects on your shortlist:

(a) by solving this problem are there clear and measurable benefits (i.e. money), if so how much?
(b) can I achieve this with data readily available to me, or do I have to collect this data from scratch?
(c) do I have (or have access to) the domain expertise necessary to solve this problem?

These simple questions will quickly filter your options; ideally there should be just one or two clear winners.
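
If it helps to make the filtering concrete, the three questions can be written down as a tiny scoring exercise. This is a minimal sketch under my own assumptions: the project names and benefit figures are entirely made up, and treating "data not yet available" as a hard fail is a simplification you may well want to relax.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    annual_benefit_usd: int     # (a) measurable benefit; 0 if nobody can say
    data_available: bool        # (b) can we use data we already have?
    has_domain_expertise: bool  # (c) do we have access to people who know the problem?


def shortlist(candidates):
    """Keep projects that pass all three questions, biggest benefit first."""
    viable = [
        c for c in candidates
        if c.annual_benefit_usd > 0 and c.data_available and c.has_domain_expertise
    ]
    return sorted(viable, key=lambda c: c.annual_benefit_usd, reverse=True)


# Hypothetical shortlist gathered from talking to colleagues.
candidates = [
    Candidate("account sharing detection", 250_000, True, True),
    Candidate("mine all web logs for insights", 0, True, False),
    Candidate("churn prediction", 400_000, False, True),
]

winners = shortlist(candidates)
for c in winners:
    print(c.name, c.annual_benefit_usd)
```

The point is not the code but the discipline: if you cannot fill in the three fields for a project, it is not ready to be pitched.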

4). Thoroughly evaluate the possible solutions

Lastly, the technology part. The aim here is to decide on the most appropriate technology to solve the problem. Sure, the best choice might not be the sexiest, but if it solves the problem cost-effectively it’s far sexier (in business terms).

Does the data need to stay on-site, or can I use cloud resources?
Can I solve my problem with existing tools/libraries (Pig, Hive, Streaming, Cascading)?
If I go down this technology-path am I being tied into a long-term commitment?

By now you should have a clear goal, the backing/buy-in of colleagues, a good idea of how to solve it and most importantly – data to back up your decisions.

Be the Big Data Rockstar

Having taken on this knowledge and diligently applied the steps outlined above, you’re now ready to formulate an articulate description of the challenge and its solution for your superiors. Such a business justification could be summarised as follows:

Colleagues in our Fraud department are aware that “account sharing” is going on i.e. people sharing their username/password with others so they don’t need to subscribe themselves.
We are confident we can identify this activity from web server logs and have the last 12 months of logs readily available to us.
Because there is no personally-identifiable-information (names, addresses, payment details) in these logs, we recommend storing and analysing this data in a Cloud-based service.
Costs: ~$5,000/month ($60,000/year) + people
Benefits: potential to reduce fraud (reduce costs) and to offer subscriptions to the would-be fraudsters (increase revenue), with no ongoing costs/commitments should the project be shelved.

This sort of approach is almost a dead certainty to be APPROVED: the risk is minimal and the upside is clear (cost reduction and increased revenue). Better yet, if the original hypothesis (account sharing fraud) is proven wrong, there are no long-term commitments – you can cut your losses and turn the Cloud services off.
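
To give a feel for how small the first experiment for that account sharing hypothesis could be, here is a rough sketch. It assumes (my assumption, not a given) that each log line can be reduced to a date, an opaque account identifier and a client IP, and it uses a made-up threshold of more than 3 distinct IPs per account per day. Real detection would need far more care, since mobile networks and shared offices produce plenty of false positives.

```python
from collections import defaultdict

# Hypothetical, already-parsed log records: (date, account_id, client_ip).
SAMPLE_LOG = [
    ("2013-01-07", "acct-1", "10.0.0.1"),
    ("2013-01-07", "acct-1", "10.0.0.2"),
    ("2013-01-07", "acct-2", "10.0.1.1"),
    ("2013-01-07", "acct-2", "10.0.1.2"),
    ("2013-01-07", "acct-2", "10.0.1.3"),
    ("2013-01-07", "acct-2", "10.0.1.4"),
]


def flag_shared_accounts(records, max_ips_per_day=3):
    """Return accounts seen from more than max_ips_per_day distinct IPs on any day."""
    ips_seen = defaultdict(set)
    for day, account, ip in records:
        ips_seen[(day, account)].add(ip)
    flagged = {
        account
        for (day, account), ips in ips_seen.items()
        if len(ips) > max_ips_per_day
    }
    return sorted(flagged)


print(flag_shared_accounts(SAMPLE_LOG))
```

Something this size runs on a laptop against a sample of logs, which is exactly the kind of evidence worth having before asking for the monthly cloud budget.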

– –
In closing, I hope this article gave you a few ideas to put into practice. As I said at the beginning, the problem with Big Data is not the Data, nor is it the tools/software; it’s knowing what you want, and how to go about solving it. Please leave your comments or share your experiences below; alternatively, you can also find me on Twitter @cotdp.

Happy Big Data Problem Solving! :o)