Challenge Overview
Topcoder recently developed a Product Inventory Audit web application. First we developed a web crawler to pull the raw HTML from the HP site, along with data extraction processes that parse the data and load it into the Vertica database platform. We've also developed a REST API that lets clients access the data in JSON format over HTTP, and the first client for this service, the Product Inventory Audit web application. HP is finding the extracted data quite useful and would like to expand the scope of the application.
In this challenge we need to create a new command line application to retrieve product data from the Best Buy site, and to modify the existing data extraction application to support the modified schema.
Note that Amazon support is NOT required in this challenge.
- We'll provide the existing schema file for the current data model and the new schema which (we think) will support saving product prices and product reviews from multiple sites. If you need to make additional changes to the Vertica data model to support the functionality requested then you'll need to provide the DDL scripts for your changes to the product extraction database.
- You can populate product data into your local database by using the current data extraction application, which is attached to this challenge and will be provided in the Code Document forums. It is straightforward to build and execute. A set of raw HTML pages from the HP web site (linked below) will facilitate this. You'll need to upload at least a few products into the Product table so that your application knows which products' prices and reviews to pull from Best Buy.
- Update the data extraction application to support the modified schema. The current application supports only one price and one set of reviews (for the default HP site). The updated application must support prices and reviews from different sites. Do not break existing functionality. Initial requirements were given here: https://www.topcoder.com/challenge-details/30050923/?type=develop&noncache=true. Keep in mind that the application has undergone a few modifications since then, so please ask if some features are not clear.
- The change is only about saving some of the data in different tables. You will not be concerned with the logic of extracting the data from the pages.
- Create a new command line application using Java to retrieve product data from the Best Buy site.
- The application should be configurable using command line arguments. In production, it will be executed on a daily basis using cron scripts.
- The new application should retrieve product data for the products that were added on the current day. This is the default behavior; the day should also be configurable, in case the user wants to run the application for previous days.
- Based on the products in the HP Product Extract Product table, the application should retrieve the price, review and rating data from Best Buy, parse it and save it in the Vertica database. We'd also like to retrieve aggregated rating data for the particular products if that is available. For example, on the HP site, a product page like this one - http://store.hp.com/us/en/pdp/business-solutions/hp-officejet-pro-8610-e-all-in-one-printer - shows that there are 95 reviews with an average of 4 stars.
- Best Buy API: https://developer.bestbuy.com/apis
- You can download a community edition of Vertica directly from HP: http://www.vertica.com/. You simply sign up for a free developer account. However, a direct Vertica installation requires a Unix/Linux server. The more straightforward way to stand up Vertica is to use VMware, which also has free trials available. A server image can be found at my.vertica.com. Topcoder is also providing a recent disk image file for Vertica at the following link. This is a large download (~2 GB). Vertica/VMware setup instructions are also attached to this challenge.
https://drive.google.com/file/d/0ByjxTGykXQjAWkkwTWUzcXJucjQ/view?usp=sharing
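The command line requirement above (process the current day by default, with the day overridable for back-fills) could be handled along the following lines. This is only a minimal sketch: the `RetrieverOptions` class and the `--date` option name are assumptions made for illustration and are not part of the existing codebase or the requirements.

```java
import java.time.LocalDate;
import java.time.format.DateTimeParseException;

/** Hypothetical holder for the retriever's command line options. */
final class RetrieverOptions {
    /** Day whose newly added products should be processed. */
    final LocalDate date;

    private RetrieverOptions(LocalDate date) {
        this.date = date;
    }

    /**
     * Parses the arguments. "today" is passed in rather than read from the
     * clock so that the default can be tested deterministically.
     */
    static RetrieverOptions parse(String[] args, LocalDate today) {
        LocalDate date = today; // default: products added on the current day
        for (int i = 0; i < args.length; i++) {
            if ("--date".equals(args[i])) { // assumed option name
                if (i + 1 >= args.length) {
                    throw new IllegalArgumentException("Missing value for --date");
                }
                date = parseDate(args[++i]);
            } else {
                throw new IllegalArgumentException("Unknown option: " + args[i]);
            }
        }
        return new RetrieverOptions(date);
    }

    private static LocalDate parseDate(String text) {
        try {
            return LocalDate.parse(text); // ISO format, e.g. 2015-03-01
        } catch (DateTimeParseException e) {
            throw new IllegalArgumentException(
                    "Expected yyyy-MM-dd for --date but got: " + text, e);
        }
    }

    public static void main(String[] args) {
        RetrieverOptions options = parse(args, LocalDate.now());
        System.out.println("Processing products added on " + options.date);
    }
}
```

Run without arguments, the sketch falls back to the current day, which suits a daily cron run; passing `--date` with a past day covers the back-fill case.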
JDBC Jar files for Vertica can be found here:
Here is a link to the raw HTML pages crawled from the HP site with product information: https://drive.google.com/file/d/0ByjxTGykXQjAcDZGMk5hYnhHa1k/view?usp=sharing. You can use the data extraction application to extract the data from these pages and populate the Vertica database.
The applications should be extensible to support other sites (e.g. Amazon). This aspect should be reflected in the database and in the design of the new retriever application.
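One possible way to keep the retriever open to additional sites is to hide each site behind a common interface and tag every result with the site it came from. The sketch below is an assumption-laden illustration only: `SiteRetriever`, `SiteProductData`, `RetrieverRegistry` and all field names are hypothetical and do not come from the existing application or the provided schema.

```java
import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical result: price and aggregated rating for one product on one site. */
final class SiteProductData {
    final String siteName;
    final String productNumber;
    final BigDecimal price;
    final int reviewCount;
    final double averageRating;

    SiteProductData(String siteName, String productNumber, BigDecimal price,
                    int reviewCount, double averageRating) {
        this.siteName = siteName;
        this.productNumber = productNumber;
        this.price = price;
        this.reviewCount = reviewCount;
        this.averageRating = averageRating;
    }
}

/** Hypothetical contract that each supported site (Best Buy now, others later) implements. */
interface SiteRetriever {
    /** Name used to associate saved prices and reviews with their source site. */
    String siteName();

    /** Fetches and parses the data for one product from this site. */
    SiteProductData retrieve(String productNumber);
}

/** Runs every registered site retriever for a product. */
final class RetrieverRegistry {
    private final Map<String, SiteRetriever> bySite = new LinkedHashMap<>();

    void register(SiteRetriever retriever) {
        bySite.put(retriever.siteName(), retriever);
    }

    List<SiteProductData> retrieveAll(String productNumber) {
        List<SiteProductData> results = new ArrayList<>();
        for (SiteRetriever retriever : bySite.values()) {
            results.add(retriever.retrieve(productNumber));
        }
        return results;
    }
}
```

Under this design, supporting another site would mean adding one more `SiteRetriever` implementation and registering it, while the persistence code keys each saved price and review by `siteName`.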
Note: The winner must fix the Required comments from Review.
Final Submission Guidelines
- Upload all your source code (including SQL scripts) in a zip file.
- Provide documentation for your application. It should contain complete build, deployment, and execution instructions.
- You should use the existing code (data extraction application) as a starting point for your solution. It uses Java, Hibernate, and Gradle. Please use a similar deployment, code structure and technologies for your application to keep our codebase consistent.
- Screen sharing video is not required for this application.