DS Challenge: Use IBM Watson to Predict Customer Reviews

Key Information

The challenge is finished.

Challenge Overview

Problem Statement

    

Prizes

  1. $10,000
  2. $ 6,000
  3. $ 4,000
  4. $ 3,000
  5. $ 2,000

Overview

A common feature of online products and services is the ability to leave a review. These reviews are commonly used by future (human) visitors as a means of assessing the expected quality.

Of course, since reviews are themselves written (presumably) by humans, any evaluation involves some degree, possibly substantial, of subjectivity and personal preference. Reviews typically encompass both a quantitative evaluation (e.g. a rating score or number of stars) and a more freeform subjective portion.

In this challenge, we will attempt to perform sentiment analysis on the review comments that customers have left and examine how they correlate with the quantitative review score for a given seller. The means by which competitors make such correlations is left completely open.

For our data set, we will be evaluating a service that allows individuals to rent part of their property for short-term, temporary residence by visitors. Several data fields about each listing are provided, including textual descriptions of the offering. Also, importantly, the reviews which have previously been left are provided for each listing.

For this challenge, each listing has an overall rating score (scored up to 100), as well as six sub-scores (each scored up to 10) for the following categories: accuracy, cleanliness, checkin, communication, location, and value.

Special Requirements

For this challenge, competitors are required to use IBM Cloud / Watson Studio for part of their solution. The exact ways in which you use it are left to you; however, in order to be eligible for a prize, you should be prepared to explain in your write-up how those services were a part of your solution.

Data

There are three main files of concern:

  • contestdata.zip: A zip of all listings, details, reviews, and geojson data for all cities.
  • train.csv: Ground truth review scores for many of the listings, which can be used for training.
  • test.csv: The listing IDs which should be submitted in a CSV for testing. This file lists only the IDs, but the actual submission should have each line in the form "listing_id,rating,accuracy,cleanliness,checkin,communication,location,value".
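As a rough illustration of the submission line format quoted above, the following Java sketch builds one row from a listing ID and seven scores. The class and method names are hypothetical, and the two-decimal precision is an assumption; the statement does not say how many decimals are expected or whether a header line is needed.

```java
import java.util.Locale;

/** Hypothetical helper that formats one submission row; not part of the official tooling. */
final class SubmissionRowSketch {
    /**
     * Builds "listing_id,rating,accuracy,cleanliness,checkin,communication,location,value".
     * The scores array must hold the seven values in exactly that order.
     */
    static String formatRow(String listingId, double[] scores) {
        if (scores.length != 7) {
            throw new IllegalArgumentException("expected 7 scores, got " + scores.length);
        }
        StringBuilder row = new StringBuilder(listingId);
        for (double s : scores) {
            // ASSUMPTION: two decimals; Locale.ROOT forces '.' as the decimal separator.
            row.append(',').append(String.format(Locale.ROOT, "%.2f", s));
        }
        return row.toString();
    }
}
```

For example, `SubmissionRowSketch.formatRow("42", new double[] {95, 9.5, 9, 10, 10, 9.25, 9})` yields `42,95.00,9.50,9.00,10.00,10.00,9.25,9.00`.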

Apart from the data provided as part of this competition, no external data sources should be used. While other external sources of data could certainly provide additional insights, the goal here is to analyze based only upon the information that would be immediately and readily available to a potential customer looking to make a rental, which is what the data files provide.

Possible Approaches

Although by no means an exhaustive list of possible avenues for investigation, the following are areas where one may find some data for correlation:

  • Keywords or sentiment analysis of the title and description associated with each listing
  • Sizing, pricing, or other details associated with each listing
  • Geographical considerations
  • Comparisons between similarly described/located properties
  • Number and/or recency of reviews that have been left
  • Keywords or sentiment analysis of the reviews themselves
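To make the keyword idea in the list above concrete, here is a minimal Java sketch of a lexicon-based sentiment feature for a single review. It is only a stand-in for illustration: the word lists and all names are hypothetical, and it is not the Watson service that the challenge requires for part of the solution.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical keyword-count feature; the tiny word lists are for illustration only. */
final class KeywordSentimentSketch {
    static final Set<String> POSITIVE =
            new HashSet<>(Arrays.asList("clean", "great", "friendly", "comfortable", "perfect"));
    static final Set<String> NEGATIVE =
            new HashSet<>(Arrays.asList("dirty", "noisy", "rude", "broken", "late"));

    /** Returns (positive hits - negative hits) / token count, or 0.0 for text with no tokens. */
    static double score(String review) {
        // Split on anything that is not a lowercase ASCII letter.
        String[] tokens = review.toLowerCase(Locale.ROOT).split("[^a-z]+");
        int positive = 0;
        int negative = 0;
        int total = 0;
        for (String token : tokens) {
            if (token.isEmpty()) {
                continue; // split can yield an empty leading token
            }
            total++;
            if (POSITIVE.contains(token)) {
                positive++;
            } else if (NEGATIVE.contains(token)) {
                negative++;
            }
        }
        return total == 0 ? 0.0 : (double) (positive - negative) / total;
    }
}
```

A value like this could be averaged over all reviews of a listing and used as one input feature alongside pricing, location, and review counts.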

Scoring

For each of the seven categories, the "Root Mean Squared Error" (RMSE) will be calculated. The seven RMSE values will be summed to a grand total. As the primary rating comprises a greater range of values, it will contribute the most to the overall total.
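The per-category RMSE and its sum can be sketched in Java as follows. The class name, method names, and array layout (one array per category) are assumptions made for the example, not the tester's actual code.

```java
/** Hypothetical sketch of the total-RMSE computation described above. */
final class RmseSketch {
    /** Root mean squared error between predictions and ground truth for one category. */
    static double rmse(double[] predicted, double[] actual) {
        if (predicted.length != actual.length || predicted.length == 0) {
            throw new IllegalArgumentException("arrays must be non-empty and of equal length");
        }
        double sumOfSquares = 0.0;
        for (int i = 0; i < predicted.length; i++) {
            double diff = predicted[i] - actual[i];
            sumOfSquares += diff * diff;
        }
        return Math.sqrt(sumOfSquares / predicted.length);
    }

    /** Sums the RMSE over all categories; predicted[c] and actual[c] hold category c. */
    static double totalRmse(double[][] predicted, double[][] actual) {
        if (predicted.length != actual.length) {
            throw new IllegalArgumentException("category counts differ");
        }
        double total = 0.0;
        for (int c = 0; c < predicted.length; c++) {
            total += rmse(predicted[c], actual[c]);
        }
        return total;
    }
}
```

Because the overall rating runs up to 100 while the sub-scores run up to 10, errors of the same relative size produce a much larger RMSE term for the overall rating.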

Your score will then be computed as MAX(20 - TotalRMSE, 0), and scaled to 1000000. That is, only submissions with a total RMSE of less than 20 will get a positive score. (Note that this is not overly hard to achieve with a naive solution that makes the same prediction for all listings.)
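A minimal Java sketch of the score transformation and of the naive constant prediction mentioned above follows. All names are hypothetical. The scaling is an assumption: the statement only says "scaled to 1000000", and this sketch takes that to mean a total RMSE of 0 maps to 1,000,000.

```java
/** Hypothetical sketch of the score formula and a constant-prediction baseline. */
final class ScoreSketch {
    /** MAX(20 - totalRmse, 0), rescaled so that totalRmse == 0 gives 1,000,000 (ASSUMPTION). */
    static double score(double totalRmse) {
        return Math.max(20.0 - totalRmse, 0.0) / 20.0 * 1_000_000.0;
    }

    /**
     * Mean of one training column. Among all constant predictions, the mean is the one
     * that minimizes RMSE on that column, so it is a natural "same prediction for all
     * listings" baseline.
     */
    static double constantPrediction(double[] trainColumn) {
        if (trainColumn.length == 0) {
            throw new IllegalArgumentException("training column is empty");
        }
        double sum = 0.0;
        for (double value : trainColumn) {
            sum += value;
        }
        return sum / trainColumn.length;
    }
}
```

Under this assumed scaling, a total RMSE of 10 would score 500,000, and any total RMSE of 20 or more would score 0.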

Submission Requirements

During the course of the contest, it will only be necessary to submit a CSV in the format described in the Data section. The "stub" Java code you submit will have a single method that returns the URL at which your CSV can be downloaded by the tester. (You can use the linked example CSV to confirm what a valid submission should look like. It is the naive approach described above.)

public class ProductReviews {
  public String getUrl() { return "http://timk1980-001-site1.ctempurl.com/average_test.csv"; }
}

Following the competition, the top 5 submissions will be invited for final testing. The top submissions as a result of final testing will then need to set up an IBM Cloud VM (provided) with a working implementation of their code that is capable of producing the same results as were previously provided.

  • As the data sets are all based upon publicly available "open data", this last step is essential to verify the performance of the actual algorithms used to generate the provided results.
  • As many data science approaches have random elements, we define "same results" to be substantially similar, or with any differences that can easily be explained by the nature of the algorithm(s).

Those top 5 winning submissions will be required to submit a write-up of their solution, documenting how the code works, any considerations that a future user wishing to run it should know, and some explanation of the overall approach and methodology. Keep in mind that using IBM Cloud / Watson for at least some portion of the solution remains an important requirement here.

Special Note

This match is valid towards the TCO18 Cognitive trip. You will be awarded points according to the criteria on the page: https://tco18.topcoder.com/win-a-trip/cognitive-community/

 

Definition

    
Class: ProductReviews
Method: getUrl
Parameters: (none)
Returns: String
Method signature: String getUrl()
(be sure your method is public)
    
 

Examples

0)
    
"E"
Returns: ""

This problem statement is the exclusive and proprietary property of TopCoder, Inc. Any unauthorized use or reproduction of this information without the prior written consent of TopCoder, Inc. is strictly prohibited. (c)2020, TopCoder, Inc. All rights reserved.