Hestia - Basket Item Set Mismatch Analysis













    Challenge Overview


    In this challenge, you are provided with many months of invoice data for a parts supply company.  The company supplies all the builders in North America with plumbing parts.


    The company’s customers range from small family-owned plumbing businesses to very large national builders of homes, offices, and other structures.  In other words, its customers build everything from outhouses to skyscrapers.


    The company is aware that, for a given job, its customers buy a large percentage of the required plumbing materials from it, but not 100%; some portion is supplied by competitors instead.  In some cases the reason is simply that inventory is not immediately available in sufficient quantity (see the notes below).  In other cases, the company believes inventory is not the reason, and customers are buying materials from competitors for other reasons.  These could include price, color, or manufacturer, but the exact reasons are unknown to the company.


    Regardless of the reason, the company wants to know which items are being left out of orders that should otherwise include them.


    Additional notes:

    • The company’s customers take exactly-on-time delivery very seriously.  Customers schedule delivery with the company for an exact time and an exact place, and expect the company to meet that requirement.  If it does not, the customers must pay their labor force to sit idle at the building site until the delivery arrives.  For this reason, sufficient inventory and reliable delivery are both very important, and the unavailability of a certain type of item is one reason why a customer may order that item from another supplier.

    • An “order” may be represented by several invoices over time.  For example, imagine a hypothetical customer who installs only bathrooms.  They might buy all the faucets and pipes on one invoice, and then buy the toilets on a later invoice.  Provided both invoices are sent to the same shipping location, have the same contract ID, or are otherwise meant for the same job, they are part of the same order, and the toilets would not be considered left out (that is, bought somewhere else).

    • Continuing the example above, the type of condition we ARE looking for is one where bathroom-only customers typically buy the faucets, pipes, and toilets together, but a subset of customers consistently buys only the pipes and faucets and never, or almost never, buys the toilets.  We would assume the toilets are being supplied by somebody else.
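The note on multi-invoice orders can be sketched in code. The following is a minimal illustration, not the company's method: the field names (`invoice`, `customer`, `contract_id`, `ship_to`, `items`) and the sample records are hypothetical, and the rule "prefer the contract ID, fall back to the ship-to address" is only one possible way to decide that invoices belong to the same job.

```python
from collections import defaultdict

# Hypothetical invoice records; every field name and value here is an assumption.
invoices = [
    {"invoice": "A1", "customer": "C1", "contract_id": "K9",
     "ship_to": "12 Elm St", "items": {"faucet", "pipe"}},
    {"invoice": "A2", "customer": "C1", "contract_id": "K9",
     "ship_to": "12 Elm St", "items": {"toilet"}},
    {"invoice": "B1", "customer": "C2", "contract_id": None,
     "ship_to": "5 Oak Ave", "items": {"faucet", "pipe"}},
]

def job_key(inv):
    """Build a key identifying the job: contract ID if present, else ship-to."""
    if inv["contract_id"]:
        return (inv["customer"], "contract", inv["contract_id"])
    return (inv["customer"], "ship_to", inv["ship_to"].strip().lower())

def consolidate(invoices):
    """Merge the item sets of all invoices that share the same job key."""
    orders = defaultdict(set)
    for inv in invoices:
        orders[job_key(inv)] |= inv["items"]
    return dict(orders)

orders = consolidate(invoices)
```

Here customer C1's two invoices collapse into one order containing faucets, pipes, and toilets, so the toilets are not flagged, while C2's order contains no toilets and becomes a candidate for further analysis.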



    In this challenge, your goal is to help the company understand which items their customers are buying elsewhere that they COULD be buying from them instead.


    Your input data consists of two datasets.  The first is the list of invoice headers.  The second is the list of line item details for each invoice header.
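As a starting point, the two datasets can be joined so that each invoice header carries its own line items. The sketch below assumes both datasets share an `invoice_id` key; that key and the other column names (`customer_id`, `invoice_date`, `ship_to`, `item_id`, `quantity`) are hypothetical, as are the sample rows, since the actual schema is not described here.

```python
from collections import defaultdict

# Hypothetical rows; the real column names may differ.
headers = [
    {"invoice_id": "A1", "customer_id": "C1",
     "invoice_date": "2020-01-05", "ship_to": "12 Elm St"},
    {"invoice_id": "A2", "customer_id": "C1",
     "invoice_date": "2020-01-20", "ship_to": "12 Elm St"},
]
line_items = [
    {"invoice_id": "A1", "item_id": "faucet", "quantity": 4},
    {"invoice_id": "A1", "item_id": "pipe", "quantity": 30},
    {"invoice_id": "A2", "item_id": "toilet", "quantity": 4},
]

def attach_line_items(headers, line_items):
    """Return a copy of each header with its line items under the key 'lines'."""
    lines_by_invoice = defaultdict(list)
    for line in line_items:
        lines_by_invoice[line["invoice_id"]].append(line)
    return [
        {**header, "lines": lines_by_invoice.get(header["invoice_id"], [])}
        for header in headers
    ]

baskets = attach_line_items(headers, line_items)
```

A header with no matching line items gets an empty list rather than being dropped, which keeps empty or incomplete invoices visible during data checks.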


    Your input data also includes tips and suggestions from the Company on which customer types or buyer profiles to prioritize in your work.


    Your output should be:

    • Groups of “buying profiles” according to similarities in orders over the period of time.

    • For each group, the list of customer membership: for each order that is considered included, the list should name the customer, invoice number, date, and ship-to address (these are provided as information; it’s up to you whether or not they should inform the grouping)

    • For each group, the list of items that are expected when a purchase of this group type is placed

    • For each group, the list of items that were expected but not included in specific orders, along with the customer ID, contract ID, quantity that was expected (if possible), and date of the exception
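The last two outputs can be illustrated with a small sketch. It assumes orders have already been assigned to a group and reduced to sets of item names; the order IDs, the item names, and the 0.75 support threshold are all hypothetical. Treating "expected" as "present in at least a given share of the group's orders" is one simple definition among several that could be used.

```python
from collections import Counter

# Hypothetical group of consolidated orders: order ID -> set of items bought.
group_orders = {
    "O1": {"faucet", "pipe", "toilet"},
    "O2": {"faucet", "pipe", "toilet"},
    "O3": {"faucet", "pipe", "toilet"},
    "O4": {"faucet", "pipe", "toilet"},
    "O5": {"faucet", "pipe"},
}

def expected_items(orders, min_support=0.75):
    """Items that appear in at least min_support of the group's orders."""
    counts = Counter(item for items in orders.values() for item in items)
    return {item for item, c in counts.items() if c / len(orders) >= min_support}

def find_exceptions(orders, expected):
    """Map each order to the expected items it lacks (orders lacking none are omitted)."""
    result = {}
    for order_id, items in orders.items():
        missing = expected - items
        if missing:
            result[order_id] = sorted(missing)
    return result

expected = expected_items(group_orders)
exceptions = find_exceptions(group_orders, expected)
```

In this toy group all three items are expected, and order O5 is reported as missing toilets. A real report would attach the customer ID, contract ID, and date to each exception.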


    How Winners will be Identified

    You will need to provide a write-up as described below.  You will need to perform analysis on the data and develop an approach that produces output as described above, and provide a POC implementation.  Most important: you need to provide statistical explanations for items you determine were “left out.” For example, you could show that the exclusion of item X from Buyer Group 1 can’t be explained by random chance.  The contestants who perform these tasks best and provide the strongest reasoning for “left-out” items will rank highest.
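One possible form for such a statistical explanation is an exact binomial tail probability. The sketch below is an illustration under stated assumptions, not a required method: it assumes a group-wide inclusion rate `p` for an item, treats a customer's `n` orders as independent trials, and asks how likely it is to see the item in `k` or fewer of them by chance. The function name and the example numbers are hypothetical.

```python
from math import comb

def binomial_lower_tail(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p): the chance of k or fewer inclusions."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

# Hypothetical example: toilets appear in 90% of the group's orders,
# but one customer included them in 0 of 10 orders.
p_value = binomial_lower_tail(k=0, n=10, p=0.9)
```

A very small value suggests the omissions are unlikely to be random. Because many customer and item pairs would be tested at once, some correction for multiple comparisons would be needed before drawing conclusions, and the independence assumption should be checked against the data.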


    Final Submission Guidelines

    Submission Requirements


    Your submission should include a text, .doc, PPT, or PDF document that includes the following sections and descriptions:

    • Overview: describe your approach in “layman’s terms”

    • Methods: describe what you did to come up with this approach, e.g. literature search, experimental testing, etc.

    • Materials: did your approach use a specific technology?  Any libraries? List all tools and libraries you used

    • Discussion: Explain what you attempted, considered, or reviewed that worked, and especially what didn’t work or what you rejected.  For anything that didn’t work or was rejected, briefly explain why (e.g. such-and-such needs more data than we have).  If you are pointing to somebody else’s work (e.g. you’re citing a well-known implementation or literature), describe in detail how that work relates to this work, and what would have to be modified

    • Data:  What other data should one consider?  Is it in the public domain? Is it derived?  Is it necessary in order to achieve the aims?  Also, what about the data described/provided - is it enough?

    • Assumptions and Risks: what are the main risks of this approach, and what assumptions are you or the model making?  What are the pitfalls of the data set and approach?

    • Results: Did you implement your approach?  How’d it perform? If you’re not providing an implementation, use this section to explain the EXPECTED results.

    • Other: Discuss any other issues or attributes that don’t fit neatly above that you’d also like to include.


    Proof of Concept

    • Include PoC code in Python to illustrate your particular approaches / solutions.

    • Other languages (like R) might possibly be acceptable, but please seek permission in the forums before you begin working, and expect 24 hours for a response (longer during weekends).


    Judging Criteria

    There are 5 awards.  One first-place award is guaranteed, provided the statistical explanation is sufficient and complete.


    These submissions will be evaluated subjectively by the client based on the following criteria:

    1. Completeness and Effectiveness (50%)

      1. Did you complete the sections as required above?

      2. What are the key insights we can get from your analysis?

      3. How will these discovered insights benefit the client?

    2. Feasibility (50%)

      1. Does your submission include enough detail for us to understand if this approach is feasible?

      2. Is your solution more likely to be feasible than other submissions to the challenge?

    Proof of concept (PoC) solutions are appreciated and will be favored over otherwise similar submissions.

    Reliability Rating and Bonus

    For challenges that have a reliability bonus, the bonus depends on the reliability rating at the moment of registration for that project. A participant with no previous projects is considered to have no reliability rating, and therefore gets no bonus. Reliability bonus does not apply to Digital Run winnings. Since reliability rating is based on the past 15 projects, it can only have 15 discrete values.


    Final Review:

    Community Review Board


    User Sign-Off

