Challenge Overview

The HP Haven Big Data platform harnesses 100% of your data, structured and unstructured, to inform every decision and help you capitalize on opportunities and solve problems. Available on-premises or in the cloud, Haven offers Big Data analytics and next-generation applications at unmatched speed and scale.

Through a mix of fun and real-world challenges, HP is inviting the TopCoder community to learn how to build the next generation of Big Data and analytics apps using the HP Haven Big Data platform. We hope that this series will be interesting, challenging, and rewarding for developers of all levels who are looking to gain valuable new skills and experience. You can find the latest TopCoder challenges related to the HP Haven Big Data platform here:

http://hphaven.topcoder.com/

More information about the complete HP Haven offering can be found at the HP Haven web site:

http://www8.hp.com/us/en/software-solutions/big-data-platform-haven/

 

Gasoline Price Predictive Analytics Tutorial with HP Haven

In a previous challenge, we installed a local version of the Vertica Analytics platform.  For this challenge, you’ll use two key components of the HP Haven Big Data platform: Vertica and Distributed-R. The goal is to use these to create an application that:

  • Reads data from the data sets provided. These data sets are available as SQL scripts that can easily be loaded into Vertica.
  • Produces the input and output values in spreadsheet form as final output for the 2011-2015 time period. You should have a column on the spreadsheet for each input variable.
  • Applies a model (developed by you) for predicting gasoline prices using the provided data and other data sources that you may collect.
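The three requirements above can be sketched as a small pipeline. This is a minimal sketch under assumptions, not a required design: `conn` is any Python DB-API 2.0 connection (for Vertica it would come from a client library of your choice, which is not shown), and the table name `gas_weekly`, its columns, and the placeholder model `predict_price` are all hypothetical. An in-memory SQLite table stands in for Vertica so the example runs anywhere.

```python
import csv
import io
import sqlite3  # stand-in for Vertica so the sketch runs without a server


def read_rows(conn, query):
    """Run a query on any DB-API 2.0 connection; return (column_names, rows)."""
    cur = conn.cursor()
    cur.execute(query)
    columns = [d[0] for d in cur.description]
    return columns, cur.fetchall()


def predict_price(crude_price):
    """Hypothetical placeholder model; replace it with your own."""
    return 0.03 * crude_price + 1.0


def write_predictions(out, columns, rows, model, input_col):
    """Write one CSV column per input variable plus the predicted output."""
    idx = columns.index(input_col)
    writer = csv.writer(out)
    writer.writerow(list(columns) + ["predicted_gas_price"])
    for row in rows:
        writer.writerow(list(row) + [round(model(row[idx]), 4)])


# Demo data is invented for illustration; it is not from the challenge data sets.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE gas_weekly (week TEXT, crude_price REAL)")
conn.executemany(
    "INSERT INTO gas_weekly VALUES (?, ?)",
    [("2011-01-03", 90.0), ("2011-01-10", 100.0)],
)
columns, rows = read_rows(
    conn, "SELECT week, crude_price FROM gas_weekly ORDER BY week"
)
buf = io.StringIO()
write_predictions(buf, columns, rows, predict_price, "crude_price")
print(buf.getvalue())
```

Passing a file-like object to `write_predictions` keeps the function easy to test; for the real submission you would pass a file opened with `open(path, "w", newline="")`.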

You can choose whatever programming language you’d like to create your app.

 

Environment

To set up the environment for this challenge, you’ll need to download the Community Editions of Vertica and Distributed-R. For Vertica, you are strongly encouraged to use the Vertica VMware virtual disk image provided. Alternatively, you can set up a Community Edition of Vertica directly; it is available at no cost from the HP Vertica Community site and requires a Linux server. We're also attaching a Vertica lab manual that describes how to add users, create schemas, and load data into the system. The manual assumes that you have the Vertica Virtual Server instance installed and locally available.

  • Product documentation for Vertica is here.

Please download Distributed R from the HP Haven Marketplace here.

  • Product documentation for Distributed R is available here.

 

Input Data

We’ve provided data sets from various US Department of Energy sources that supply inputs such as Crude Oil Futures Contract Prices, Crude Oil Field Production statistics, Petroleum Product Storage data, and Sales and Delivery Data. You can use these data sets, as well as any others you find or obtain, to develop a model for gasoline price prediction.

The data sets, provided as SQL scripts, contain weekly data for the years 1990-2015.  Be sure to partition the data as follows:

  • Build and test your model with the 1990-2010 data.
  • Use the 2011-2015 data for your final submission.
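The partition described above can be sketched as follows. This is a minimal sketch that assumes each weekly record is a dict holding a `datetime.date` under a hypothetical key `week`; the sample records are invented for illustration.

```python
from datetime import date

MODEL_YEARS = range(1990, 2011)  # 1990-2010 inclusive: build and test the model
FINAL_YEARS = range(2011, 2016)  # 2011-2015 inclusive: final submission


def split_by_period(records, date_key="week"):
    """Split weekly records into (model_data, final_data) by calendar year."""
    model_data, final_data = [], []
    for rec in records:
        year = rec[date_key].year
        if year in MODEL_YEARS:
            model_data.append(rec)
        elif year in FINAL_YEARS:
            final_data.append(rec)
        # Records outside 1990-2015 are ignored.
    return model_data, final_data


# Invented sample records, not values from the provided data sets.
sample = [
    {"week": date(1990, 1, 1), "crude_price": 22.0},
    {"week": date(2010, 12, 27), "crude_price": 91.0},
    {"week": date(2011, 1, 3), "crude_price": 92.0},
    {"week": date(2015, 6, 1), "crude_price": 60.0},
]
model_data, final_data = split_by_period(sample)
print(len(model_data), len(final_data))
```

Splitting on the calendar year keeps every 2011-2015 record out of the data used to build the model, so the final predictions are made on data the model has not seen.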

 

OPTIONAL dataset augmentation using HP IDOL OnDemand

Optionally, you can try to improve the accuracy of your predictions by combining APIs from HP IDOL OnDemand to find events or topics in current news that are likely to affect near-term oil prices. For example, you could extract insights about entities (such as an oil and gas regulator's or company's reaction to a disaster) from the IDOL OnDemand news dataset (see the public indexes of the Query Text Index API), and then fine-tune your prediction based on these current real-world influences.
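One way to turn news results into a model input is to count relevant items per week. The sketch below does not call IDOL OnDemand; it assumes you have already retrieved news items as `(publication_date, text)` pairs by whatever means you choose. The function name, keywords, and headlines are hypothetical and invented for illustration.

```python
from collections import Counter
from datetime import date, timedelta


def weekly_event_counts(items, keywords):
    """Count items mentioning any keyword, grouped by the Monday of their week."""
    lowered = [k.lower() for k in keywords]
    counts = Counter()
    for published, text in items:
        if any(k in text.lower() for k in lowered):
            week_start = published - timedelta(days=published.weekday())
            counts[week_start] += 1
    return dict(counts)


# Invented headlines, used only to show the shape of the input.
news = [
    (date(2015, 3, 2), "Refinery fire disrupts output"),
    (date(2015, 3, 5), "Regulator responds to pipeline spill"),
    (date(2015, 3, 10), "Local sports results"),
    (date(2015, 3, 11), "New pipeline approved"),
]
event_counts = weekly_event_counts(news, ["refinery", "pipeline"])
print(event_counts)
```

The resulting weekly counts can be joined to the weekly price data as one more input column.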

 

Building Your Model

You can use any algorithm to construct your model: linear regression, decision trees, random forests, support vector machines, and so forth.   
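As an illustration of the simplest option listed, here is a minimal sketch of ordinary least squares with a single predictor, written in plain Python. It is not a recommended model for the challenge: the choice of crude oil price as the only input is an assumption, and the numbers are made up rather than taken from the provided data sets.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(
        sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    )


# Made-up values for illustration only.
crude = [40.0, 60.0, 80.0, 100.0]  # hypothetical crude oil price per barrel
gas = [1.5, 2.0, 2.5, 3.0]         # hypothetical gasoline price per gallon

slope, intercept = fit_line(crude, gas)
predictions = [slope * x + intercept for x in crude]
error = rmse(gas, predictions)
print(slope, intercept, error)
```

Reporting an error measure such as RMSE on data held out from fitting gives you a consistent way to compare this baseline against decision trees, random forests, or other algorithms in your blog post.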

 

Additional instructions/notes

This is a tutorial challenge. Your code should be clear and well documented. You have creative license over which language and platform to use.

You should include data definition scripts for any additional tables, beyond those provided, that you create to enhance your model.

In addition to developing your app, we’re asking that you produce the following materials:

  • A blog post describing your application. The blog post is an integral part of this challenge and may be featured on the HP Developer community. You should explain why you chose your algorithm and how you implemented it. Your blog post should also discuss the tradeoffs among the different potential algorithms.
  • A screencast video which explains your model, your data and how you used Vertica to solve this problem.


Final Submission Guidelines

Your application should use Vertica and Distributed-R. Using IDOL OnDemand is optional.

You must submit a single zip file containing:

  1. The source files for your application.
  2. Any SQL or DDL scripts used to create new database structures beyond the ones provided with this challenge. You should also provide your scripts or code for adding additional data to Vertica.
  3. A .csv file containing the input and output values for the 2011-2015 time period. Make sure the file includes a column for each input variable.
  4. A submission.txt file in the root folder of your submission zip with links to your blog post and video tutorial.
  5. Instructions on how to build and deploy your app.
Submissions will be evaluated on the quality of the models, the code, and the tutorial materials.
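The packaging requirements in the list above can be sketched with the standard `zipfile` module. This is a minimal sketch: the archive paths, file contents, and URLs are hypothetical placeholders, and the only layout rule taken from the guidelines is that `submission.txt` sits in the root of the zip.

```python
import io
import zipfile


def build_submission_zip(out, files, blog_url, video_url):
    """Write a zip with submission.txt in the root plus the given files.

    `files` maps archive paths to str or bytes content.
    """
    with zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr(
            "submission.txt",
            f"Blog post: {blog_url}\nVideo tutorial: {video_url}\n",
        )
        for arcname, content in files.items():
            zf.writestr(arcname, content)


# Placeholder contents and URLs for illustration only.
archive = io.BytesIO()
build_submission_zip(
    archive,
    {
        "src/app.py": "# application source goes here\n",
        "sql/extra_tables.sql": "-- DDL for additional tables goes here\n",
        "predictions_2011_2015.csv": "week,crude_price,predicted_gas_price\n",
        "README.txt": "Build and deployment instructions go here.\n",
    },
    "https://example.com/blog-post",
    "https://example.com/video",
)
names = zipfile.ZipFile(io.BytesIO(archive.getvalue())).namelist()
print(names)
```

For a real submission you would pass a file path instead of the `BytesIO` buffer and add your actual files with `ZipFile.write`.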

 

Employees and direct and indirect subcontractors of Hewlett-Packard Company and its subsidiaries and other affiliates (“HP”), and employees and direct and indirect subcontractors of HP’s partners (including TopCoder and its affiliates) are not eligible to participate in the challenge.

ELIGIBLE EVENTS: 2015 topcoder Open

REVIEW STYLE:

Final Review: Community Review Board

Approval: User Sign-Off

ID: 30048812