Sandesh - Data Privatization Challenge
    Challenge Overview

    Challenge Objective

    In this challenge, you have to implement differential-privacy-preserving techniques on the given dataset so that no correlation to real-world objects, people, or entities is possible, using any available open source libraries.

    Project Background

    The client is exploring the possibilities of using data science challenges for various use cases of their business. As part of the data preparation for data science work, we need to protect privileged information and prevent linkage attacks before opening it to the community. Multiple levels of masking might be required for this. We need to come up with a data masking solution that can provide high scalability and ease of use for the dataset.

    Development Assets

    • The sample xls file containing the required columns that need to be masked will be shared in the challenge forums.

    Technology Stack

    • Python 2.7
    • RAPPOR (https://github.com/google/rappor)
    You are free to research and use other open source libraries after getting approval in the forum.
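    For context on the recommended stack: RAPPOR is built on randomized response. The sketch below shows plain randomized response for a single bit, not RAPPOR's actual Bloom-filter encoding; the probability parameter is illustrative only.

    ```python
    import random

    def randomized_response(true_bit, p=0.75):
        """Report the true bit with probability p, otherwise a uniformly
        random bit. Each individual report has plausible deniability, yet
        aggregate frequencies can still be estimated from many reports."""
        if random.random() < p:
            return true_bit
        return random.randint(0, 1)
    ```

    RAPPOR generalizes this idea with Bloom-filter encoding and permanent plus instantaneous randomization; see the linked repository for the real client API.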

    Individual Requirements

    • You have to implement a masking program in Python that can reasonably prevent linkage attacks, using the recommended libraries or other proven anonymization software. Masking should produce a statistical twin of the data rather than purely random noise.
    • The model used for masking should be properly documented for review purposes. Include a doc or PDF that describes your approach.
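    As an illustration of the kind of differential-privacy masking the requirement describes, here is a minimal sketch using the Laplace mechanism on numeric columns. The column names, epsilon value, and sensitivity estimate are all hypothetical; the real columns come from the xls file shared in the forums.

    ```python
    import numpy as np
    import pandas as pd

    def laplace_mask(series, epsilon, sensitivity):
        """Add Laplace noise calibrated to epsilon-differential privacy."""
        scale = sensitivity / epsilon
        noise = np.random.laplace(loc=0.0, scale=scale, size=len(series))
        return series + noise

    # Hypothetical data; the actual schema comes from the shared xls file.
    df = pd.DataFrame({"YEAR1": [120.0, 95.0, 310.0],
                       "YEAR2": [130.0, 90.0, 305.0]})

    # Mask every YEAR* column so the code scales to any number of year columns.
    for col in [c for c in df.columns if c.startswith("YEAR")]:
        sensitivity = df[col].max() - df[col].min()  # crude sensitivity proxy
        df[col] = laplace_mask(df[col], epsilon=1.0, sensitivity=sensitivity)
    ```

    Note that additive noise alone does not produce a statistical twin; a full solution would also need to handle categorical columns and preserve cross-column structure.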
    Deployment Guide and Validation Document

    Make sure to provide two separate documents for validation.

    A README.md that covers:
    • Deployment - how to build and test your submission.
    • Configuration - document the configuration used by the submission.
    • Dependency Installation - a clear, up-to-date, step-by-step guide for installing dependencies.
    A Validation.md that covers:
    • Validation of each requirement, mapping each requirement to your submission so reviewers can easily verify coverage.

    Important Notes
    • The dataset provided has only a limited set of rows; however, the review will be done against a bigger dataset with more than 1000 rows. The year columns in the dataset will hold data for up to 10 years (through a YEAR10 column), so make sure your code handles all columns.
    • If, after examining the data, you feel that differential privacy is not the right approach for this privatization, you can suggest and implement a better way to create a statistical twin.
    • The review will be subjective, based on the criteria detailed below.
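    The "statistical twin" alternative mentioned above could be approached by sampling a synthetic dataset from per-column distributions fitted to the original. The sketch below is a deliberately simple illustration: it preserves marginal statistics only and ignores cross-column correlations, which a real solution would need to model.

    ```python
    import numpy as np
    import pandas as pd

    def statistical_twin(df, n_rows=None):
        """Sample a synthetic dataset preserving per-column marginals.

        Numeric columns are drawn from a normal fit (mean/std); other
        columns are resampled by observed category frequencies. Rows have
        no linkage to real entities, but cross-column correlations are
        not preserved in this simplified sketch.
        """
        n = n_rows or len(df)
        twin = {}
        for col in df.columns:
            if pd.api.types.is_numeric_dtype(df[col]):
                twin[col] = np.random.normal(df[col].mean(),
                                             df[col].std(ddof=0), n)
            else:
                probs = df[col].value_counts(normalize=True)
                twin[col] = np.random.choice(probs.index, size=n,
                                             p=probs.values)
        return pd.DataFrame(twin)
    ```

    Because the twin can be sampled at any size, this also scales naturally to the larger review dataset.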
    Scorecard Review

    This submission will be subjectively reviewed; however, the following criteria will be taken into account when picking the best submission. Your submission will be reviewed on these requirements:
    • Challenge Spec Requirements (40%)
      • Requirements Coverage
    • Coding Standards (10%)
      • Best Practices
      • Code Quality
    • Development Requirements (40%)
      • Testing against bigger dataset
      • Performance
      • Deployment
    • Documentation (10%)

    Final Submission Guidelines

    • All original source code.
    • Documentation

    Reliability Rating and Bonus

    For challenges that have a reliability bonus, the bonus depends on the reliability rating at the moment of registration for that project. A participant with no previous projects is considered to have no reliability rating, and therefore gets no bonus. Reliability bonus does not apply to Digital Run winnings. Since reliability rating is based on the past 15 projects, it can only have 15 discrete values.

    Final Review:
    • Community Review Board
    • User Sign-Off
    • Review Scorecard