Surviving the 1st Global Data Science Hackathon


If you haven’t heard already, this week is BigDataWeek – a coordinated set of community events around the globe to bring like-minded Data Geeks together.

One of the most exciting developments is the announcement of the 1st Global Data Science Hackathon.  It starts on Saturday and runs for 24 hours, with competitors around the globe vying to better predict the Air Quality Index in Cook County, Illinois, USA.

If you haven’t already signed up, it’s not too late: you can take part either on-site at one of seven venues around the world or remotely from wherever you may be.

The contest will be held on the Kaggle platform, where the leader-board will track and score the competition entries as they come in.

How it’s likely to work

If you take a look at other Kaggle competitions, the general approach is that you are given a “training dataset” containing several values and a result (or answer), which you use to develop an algorithm that can start to make predictions.

After training your algorithm, you apply it to the given “validation dataset”, which contains only the values (no answers).  The objective is to predict what each answer should be.

Generally you’re given a “submission format” to write your predictions into; you then upload this file to Kaggle, where it is scored.  The winner is the team or algorithm that makes the most accurate predictions (i.e. with the least error).
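To make that train, predict, submit loop concrete, here is a minimal sketch in Python using only the standard library. Everything specific in it is my own assumption for illustration: the column names (row_id, hour, temperature, target), the sample values, and the "model", which is just the average answer for each hour of the day. The real competition files and submission format will differ, so treat it as a skeleton rather than a recipe.

```python
import csv
import io

# Hypothetical data: column names and values are invented for illustration.
TRAIN_CSV = """hour,temperature,target
0,10.0,30.0
1,11.0,34.0
0,12.0,32.0
1,13.0,40.0
"""
TEST_CSV = """row_id,hour,temperature
101,0,11.5
102,1,12.5
103,2,9.0
"""


def train_hourly_mean(rows):
    """'Train' a baseline: mean target per hour, plus the overall mean."""
    by_hour = {}
    for row in rows:
        by_hour.setdefault(row["hour"], []).append(float(row["target"]))
    hourly = {hour: sum(vals) / len(vals) for hour, vals in by_hour.items()}
    all_targets = [t for vals in by_hour.values() for t in vals]
    overall = sum(all_targets) / len(all_targets)
    return hourly, overall


def predict(model, rows):
    """Predict each row's target, falling back to the overall mean for unseen hours."""
    hourly, overall = model
    return [(row["row_id"], hourly.get(row["hour"], overall)) for row in rows]


def write_submission(predictions, out):
    """Write predictions in a (hypothetical) two-column submission format."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["row_id", "target"])
    for row_id, value in predictions:
        writer.writerow([row_id, f"{value:.2f}"])


train_rows = list(csv.DictReader(io.StringIO(TRAIN_CSV)))
test_rows = list(csv.DictReader(io.StringIO(TEST_CSV)))

model = train_hourly_mean(train_rows)
predictions = predict(model, test_rows)

submission = io.StringIO()
write_submission(predictions, submission)
print(submission.getvalue())
```

A baseline this simple is unlikely to win anything, but getting one uploaded early proves your whole pipeline works before you spend hours on a cleverer model.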

The scoreboard generally measures your accuracy using root-mean-square error (RMSE): the square root of the average squared difference between your predictions and the actual answers.  The lower the RMSE, the more accurate your predictions; an RMSE of zero means you got every answer exactly right.
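If you want to score yourself locally before uploading, RMSE is only a few lines of Python. This is a sketch of the standard formula, not Kaggle’s own scoring code, and the sample numbers are made up.

```python
import math


def rmse(predictions, actuals):
    """Root-mean-square error between two equal-length sequences of numbers."""
    if len(predictions) != len(actuals):
        raise ValueError("predictions and actuals must have the same length")
    if not predictions:
        raise ValueError("need at least one prediction")
    squared_errors = [(p - a) ** 2 for p, a in zip(predictions, actuals)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up values: identical lists give a perfect score of 0.0.
perfect = rmse([40.0, 55.0, 62.0], [40.0, 55.0, 62.0])

# Errors of 2, -5 and 0: squaring means the big miss dominates the score.
off_by_some = rmse([42.0, 50.0, 62.0], [40.0, 55.0, 62.0])

print(perfect, off_by_some)
```

Because the errors are squared before averaging, one wildly wrong prediction hurts far more than several slightly wrong ones, which is worth remembering when you decide how to handle outliers.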

Preparation

It makes sense to prepare ahead of time.  If you’re participating on-site, you and possibly hundreds of other people are going to be sharing the venue’s internet connection.  It’s not the best time to be downloading tools.

The dataset files released on Kaggle are generally comma-separated values (CSV) files, which are supported by everything from Microsoft Excel to relational databases.
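Whatever tool you end up using, it pays to take a quick look at a CSV file before modelling with it. As a rough sketch, the Python below uses the standard csv module to report the column names, the number of rows, and how many empty cells each column has. The sample data and its column names are invented; for a real file you would pass open(path, newline="") instead of the in-memory string.

```python
import csv
import io

# Hypothetical sample; real competition files will have different columns.
SAMPLE_CSV = """row_id,hour,temperature,target
1,0,10.0,30.0
2,1,,34.0
3,2,12.0,
"""


def summarise_csv(handle):
    """Return the column names, row count, and number of empty cells per column."""
    reader = csv.DictReader(handle)
    columns = list(reader.fieldnames or [])
    missing = {name: 0 for name in columns}
    row_count = 0
    for row in reader:
        row_count += 1
        for name in columns:
            # Treat empty or whitespace-only cells as missing values.
            if not (row[name] or "").strip():
                missing[name] += 1
    return columns, row_count, missing


columns, row_count, missing = summarise_csv(io.StringIO(SAMPLE_CSV))
print(columns, row_count, missing)
```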

I’d heartily recommend that you stick to the tools you know and spend more time working on the problem than struggling with unfamiliar tools.

However, if you are totally new to these kinds of contests, you might want to experiment ahead of time with some freely available tools.

Signing up to Kaggle now rather than on the day is another good idea :o)

I wish you all the best of luck!  I’ll be participating from the comfort of my own home and will be following the event on social media (@cotdp).
