The objective of this challenge is to build a script that matches a patient to their nearest clinic.
The script primarily needs to do the following:
- Take as input a pair of CSV files: a pair of CSV files will be passed to the script, say a patients.csv file and a clinics.csv file, which will contain the address details of each patient and each clinic respectively.
- Each CSV will contain address details: each CSV will contain the address details of its subject, i.e. the patient's address in the case of patients.csv and the clinic's address in the case of clinics.csv.
- The script should first find the geocode of each address: the script is expected to find the geocode of each row in both CSV files. The geocode here refers to the latitude and longitude coordinates of the address.
- Each patient should be matched to their nearest clinic: using the geocodes generated in the previous step, the script should find the clinic nearest to each individual patient in terms of travel/commute distance (and not the straight-line distance).
The output should be a CSV file with the same number of rows as patients.csv, with each row indicating the nearest clinic to the corresponding patient.
The goal of the project, and hence of the architecture, is to meet the client's need for a script that can suggest the nearest clinic to a patient using their location details as input. Right now, the location details are available as natural language addresses.
The script should be able to efficiently find the closest clinic to each patient. By efficiently, we mean that something more efficient than brute-force matching should be attempted, although brute force is acceptable if no other strategy seems possible. The exact strategy is left to the participant to explore.
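One possible way to cut down on brute-force matching is to shortlist clinics by straight-line (great-circle) distance first, and request travel distances only for that shortlist. The sketch below is illustrative only: the function names and the shortlist size `k` are hypothetical, and it assumes the clinic nearest by road is among the `k` nearest in a straight line, which is not guaranteed (rivers, highways and similar barriers can break that assumption).

```python
import heapq
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(a, b):
    """Great-circle distance in kilometres between two (lat, lon) pairs in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def shortlist_clinics(patient, clinics, k=5):
    """Return the IDs of the k clinics closest to `patient` in a straight line.

    patient: (lat, lon); clinics: dict mapping clinic ID -> (lat, lon).
    """
    return heapq.nsmallest(
        k, clinics, key=lambda cid: haversine_km(patient, clinics[cid])
    )
```

The straight-line pass still compares every patient with every clinic, but it is cheap arithmetic; the saving is in the number of travel-distance requests, which drops from one per patient/clinic pair to at most `k` per patient.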
The following are the requirements of the challenge, in a bit more detail:
Take as input a pair of CSV files: a pair of CSV files will be passed to the script, say a patients.csv file and a clinics.csv file, which will contain the address details of each patient and each clinic respectively. Here, patients.csv will have these columns: ID,Address,Postal Code,FSA,City,Province. And clinics.csv will have these columns: Clinic ID,Clinic Name,Clinic Address,Postal Code,FSA,City,Province.
Each CSV will contain address details: notice that each file has multiple columns related to the address of the patient or the clinic. The contestants are encouraged to do their own research on the web to understand the meaning of terms such as FSA and Province in the context of this dataset. The contestants are free to use all of the columns, or only selected ones, to find the geocode.
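As an illustration of using a chosen combination of columns, the sketch below joins the non-empty values of the selected columns into a single query string that could be passed to a geocoder. The helper name `build_query` and the sample row are hypothetical; only the column names come from the file description above.

```python
import csv
import io


def build_query(row, columns):
    """Join the non-empty values of `columns` into one geocoder query string."""
    parts = [(row.get(col) or "").strip() for col in columns]
    return ", ".join(part for part in parts if part)


# Hypothetical sample row using the patients.csv address columns.
sample = io.StringIO(
    "Address,Postal Code,FSA,City,Province\n"
    "123 Example St,M5V 2T6,M5V,Toronto,ON\n"
)
row = next(csv.DictReader(sample))
print(build_query(row, ["Address", "City", "Province"]))
# prints: 123 Example St, Toronto, ON
```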
The script should first find the geocode of each address: using the desired combination of address columns, the geocode of the patient or clinic address should be extracted. The participants are encouraged to research this topic and find the service/tool that provides the most accurate geocode outputs for the provided dataset. Some examples are Google's APIs and libraries such as geopy or geocoder.
Important Note - The contestants are expected to try various services to find which of them performs best for their submission. Note that preference will be given to free services (more on this in the Review Criteria section below).
Note in case geocode detection fails - If a geocode cannot be located, that should be indicated in the code. The code should not break when an address is not detected. This is important because the script should be able to recommend clinics even when a geocode was not found (more details on this in the following points).
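One way to keep the script from breaking on a failed lookup is to wrap the geocoder call so that any failure yields `None` instead of an exception. The sketch below is a minimal example, not a prescribed design: `safe_geocode` is a hypothetical helper, and it assumes the supplied `geocode` callable takes a query string and returns either `None` or an object with `latitude` and `longitude` attributes, which is how the `geocode` method of geopy geocoders behaves.

```python
import logging

logger = logging.getLogger(__name__)


def safe_geocode(geocode, query):
    """Return (lat, lon) for `query`, or None if the lookup fails or finds nothing."""
    if not query:
        return None
    try:
        location = geocode(query)
    except Exception as exc:  # broad on purpose: one bad row must not stop the run
        logger.warning("Geocoding failed for %r: %s", query, exc)
        return None
    if location is None:
        logger.info("No geocode found for %r", query)
        return None
    return (location.latitude, location.longitude)
```

With geopy, for example, the callable could be `Nominatim(user_agent="...").geocode`, optionally wrapped in `geopy.extra.rate_limiter.RateLimiter` to respect the service's request limits. Which service works best for this dataset is for the contestant to evaluate.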
Each patient should be matched to their nearest clinic: using the geocodes generated in the previous step, the script should find the clinic nearest to each individual patient in terms of travel/commute distance (and not the straight-line distance). The emphasis on the shortest travel or commute distance should be noted. The code should not return the clinic that is nearest in terms of straight-line Cartesian distance, or the one with the least commute time. It should be the nearest in terms of shortest travel distance. Note that the distance should be in kilometers, not miles. Here are some potential resources that can be used as a starting point for your research:
- Driving distance between two or more places in python
- Calculating distance between two geo-locations in Python
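As one possible starting point alongside the resources above, the sketch below requests a driving route from an OSRM server and converts the reported length from meters to kilometers. Treat it as an assumption-laden example: the choice of OSRM, the public demo server URL, and the helper names are not part of the challenge. Also note that OSRM's default car profile selects the fastest route, so the distance it reports is the length of that route and may differ from the strictly shortest-distance route.

```python
import json
import urllib.request

# Public OSRM demo server; an assumption, any OSRM instance would do.
OSRM_BASE_URL = "https://router.project-osrm.org"


def route_url(origin, destination, base_url=OSRM_BASE_URL):
    """Build an OSRM route URL. Points are (lat, lon); OSRM expects lon,lat order."""
    (lat1, lon1), (lat2, lon2) = origin, destination
    return f"{base_url}/route/v1/driving/{lon1},{lat1};{lon2},{lat2}?overview=false"


def distance_km_from_response(payload):
    """Extract the route length in kilometers from a parsed OSRM response, or None."""
    if payload.get("code") != "Ok" or not payload.get("routes"):
        return None
    return payload["routes"][0]["distance"] / 1000.0  # OSRM reports meters


def driving_distance_km(origin, destination, timeout=10):
    """Query the server and return the driving distance in kilometers, or None."""
    with urllib.request.urlopen(route_url(origin, destination), timeout=timeout) as resp:
        return distance_km_from_response(json.load(resp))
```

Keeping URL construction and response parsing separate from the network call makes both easy to check without a connection.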
Output format: The output should be a CSV with the following columns: Patient_ID, Pat_Geo_Cols, Pat_Geocode, Pat_Address, Pat_Postal_Code, Pat_FSA, Nearest_Clinic_ID, Clinic_Geo_Cols, Clinic_Geocode, Clinic_Address, Clinic_Postal Code, Clinic_FSA, Clinic_Distance.
Here, the meaning of each column is as follows:
- Patient_ID - Patient ID
- Pat_Geo_Cols - The columns used to find the geocode of the patient.
- Pat_Geocode - The calculated lat/long coordinates of the patient.
- Pat_Address - Copy the address details of the patient.
- Pat_Postal_Code - Copy the Postal code of the patient.
- Pat_FSA - Copy the FSA of the patient.
- Nearest_Clinic_ID - ID of the nearest clinic found by the solution.
- Clinic_Geo_Cols - The columns used to find the geocode of the clinic.
- Clinic_Geocode - The calculated lat/long coordinates of the clinic.
- Clinic_Address - Copy the address details of the clinic.
- Clinic_Postal Code - Copy the postal code of the clinic.
- Clinic_FSA - Copy the FSA of the clinic.
- Clinic_Distance - The calculated travel distance between the patient and the nearest clinic with ID = Nearest_Clinic_ID.
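The column list above can be turned into a small writer, sketched below with the standard `csv` module. The helper name `write_output` is hypothetical, and the representation of a geocode as a `"lat,lon"` string in the usage example is an assumption, since the exact format of the geocode cells is not specified here.

```python
import csv

OUTPUT_COLUMNS = [
    "Patient_ID", "Pat_Geo_Cols", "Pat_Geocode", "Pat_Address", "Pat_Postal_Code",
    "Pat_FSA", "Nearest_Clinic_ID", "Clinic_Geo_Cols", "Clinic_Geocode",
    "Clinic_Address", "Clinic_Postal Code", "Clinic_FSA", "Clinic_Distance",
]


def write_output(rows, fh):
    """Write result rows (dicts keyed by OUTPUT_COLUMNS) to an open text file.

    Columns missing from a row are written as empty cells.
    """
    writer = csv.DictWriter(fh, fieldnames=OUTPUT_COLUMNS, restval="")
    writer.writeheader()
    writer.writerows(rows)
```

A call might look like `write_output(rows, open("output.csv", "w", newline="", encoding="utf-8"))`, with one dict per patient so that the row count matches patients.csv.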
Note that, in addition to recommending the nearest clinic, the output must include the shortest travel distance between the patient and that clinic.
Important - In case geocode is not found: if the geocode of an address cannot be found, other details of the patient or the clinic, such as the FSA, should be used to find the geocode. For example, if the geocode for a patient X could not be found, then the geocode of their FSA (or whatever the smallest available location indicator is) should be used to find the nearest clinic. The point here is that the geocode should be calculated one way or the other, with preference given to more precise locations. Similarly, if the geocode of a clinic could not be ascertained, the clinic should still be matched to a patient living in the same or a nearby FSA, provided that no clinic with a more precise geocode is found closer than the approximate distance between the two FSAs. In conflicting situations, the participants will need to use their own judgement to implement the solution so that it achieves the best results as per the review criteria, and the geocode used should be recorded in each case.
A note on columns Pat_Geo_Cols and Clinic_Geo_Cols: these columns should record which columns from patients.csv and clinics.csv were used to find the geocode. For instance, if all the address columns were used, then the entry in Pat_Geo_Cols would be: Address,Postal Code,FSA,City,Province. Similarly, if only the FSA was used, then the entry should be FSA.
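The fallback rule and the Geo_Cols bookkeeping can be combined in one loop that tries progressively coarser column sets and records which columns produced the geocode. The sketch below is one hypothetical arrangement: the ordering in `FALLBACK_COLUMN_SETS` is only an example for patients.csv (clinics.csv would use `Clinic Address` in place of `Address`), and `lookup` stands for any callable that returns `(lat, lon)` or `None`.

```python
# Column combinations to try, most precise first (an illustrative ordering).
FALLBACK_COLUMN_SETS = [
    ["Address", "Postal Code", "FSA", "City", "Province"],
    ["Postal Code"],
    ["FSA"],
]


def geocode_with_fallback(row, lookup, column_sets=FALLBACK_COLUMN_SETS):
    """Try each column set in order and return (geocode, geo_cols) for the first hit.

    lookup: callable taking a query string and returning (lat, lon) or None.
    geo_cols is the comma-separated list of columns that formed the query.
    Returns (None, "") when every attempt fails.
    """
    for columns in column_sets:
        used = [col for col in columns if (row.get(col) or "").strip()]
        if not used:
            continue
        query = ", ".join(row[col].strip() for col in used)
        geocode = lookup(query)
        if geocode is not None:
            return geocode, ",".join(used)
    return None, ""
```

The second element of the result can be written directly into Pat_Geo_Cols or Clinic_Geo_Cols, for example `FSA` when only the FSA lookup succeeded.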
Error Handling and Configurability
The code should have appropriate checks to ensure that execution continues even if there is a detection error for a particular row. For example, if an API request to some service fails because of a network issue while a row is being processed, the script should retry a few times and, if the request still fails, proceed to the next row. Alternatively, execution can be paused until the connection is re-established and only then resumed. The exact method of error handling is left for the contestant to decide, but the overall goal is to ensure that, if a large dataset is used in the future, the whole execution does not halt because of an issue in one of the rows or a temporary network issue.
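The retry-then-skip behaviour described above could look like the following minimal sketch. The helper name `call_with_retries`, its parameters, and the linear wait between attempts are all illustrative choices, not requirements.

```python
import logging
import time

logger = logging.getLogger(__name__)


def call_with_retries(func, *args, retries=3, wait_seconds=2.0, sleep=time.sleep):
    """Call func(*args), making up to `retries` attempts in total.

    Waits a little longer after each failed attempt. Returns the result of the
    first successful call, or None once every attempt has failed, so that the
    caller can move on to the next row.
    """
    for attempt in range(1, retries + 1):
        try:
            return func(*args)
        except Exception as exc:  # broad on purpose: keep the batch running
            logger.warning("Attempt %d/%d failed: %s", attempt, retries, exc)
            if attempt < retries:
                sleep(wait_seconds * attempt)
    return None
```

Because `None` doubles as the failure signal, this sketch suits calls whose successful result is never `None`; a caller that needs to tell the two apart would have to return a separate status flag.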
Note that if configurable values are used, such as the number of retries or the wait time before a connection check, these values should be configurable through a config.json file. In general, any value that can be configured by the user should be placed in the config.json file.
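A config.json loader can be as small as the sketch below, which merges user-supplied values over built-in defaults. The key names (`retries`, `wait_seconds`, `request_timeout`) and their default values are hypothetical; the challenge only requires that such values live in config.json.

```python
import json

# Hypothetical defaults, used when config.json is missing or omits a key.
DEFAULT_CONFIG = {
    "retries": 3,
    "wait_seconds": 2.0,
    "request_timeout": 10,
}


def load_config(path="config.json"):
    """Return the defaults overridden by any values found in the JSON file at `path`."""
    config = dict(DEFAULT_CONFIG)
    try:
        with open(path, encoding="utf-8") as fh:
            config.update(json.load(fh))
    except FileNotFoundError:
        pass  # no config file: fall back to the defaults
    return config
```

A matching config.json might contain `{"retries": 5, "wait_seconds": 1.5}`, leaving the remaining keys at their defaults.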
The review will combine objective, subjective, and Free/Not Free review criteria:
In the objective review, a custom review strategy will be used. A test set of 40 labels (which is potentially noisy) will be used to filter out submissions that don't cross a particular threshold. The outputs of the remaining submissions will then be ensembled to generate the conditional ground truth labels. The threshold will be decided subjectively, based on the final number of submissions received.
This ground truth will then be used as the de facto actual labels against which the objective scoring will be performed. These de facto labels will be used to classify the outputs of each submission as True or False, and the final score will be calculated using the AUC ROC scoring metric on these True/False values.
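To make the filter-then-ensemble idea concrete, the sketch below shows one plausible reading of it: submissions are kept if their accuracy on the labelled test set meets a threshold, and the ensemble label for each patient is the clinic chosen by the most remaining submissions. This is an assumption for illustration only. The actual filtering rule, ensembling method, and scoring code used by the reviewers are not specified here, and both function names are hypothetical.

```python
from collections import Counter


def passes_threshold(submission, test_labels, threshold):
    """True if the share of test patients matched to the labelled clinic meets `threshold`.

    submission and test_labels both map patient ID -> clinic ID.
    """
    hits = sum(submission.get(pid) == clinic for pid, clinic in test_labels.items())
    return hits / len(test_labels) >= threshold


def ensemble_labels(submissions):
    """Majority vote per patient across submissions (dicts of patient ID -> clinic ID)."""
    votes = {}
    for submission in submissions:
        for pid, clinic in submission.items():
            votes.setdefault(pid, Counter())[clinic] += 1
    return {pid: counter.most_common(1)[0][0] for pid, counter in votes.items()}
```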
In the subjective review, the documentation will be reviewed to understand the strategy used to generate the output. The code will also be reviewed to gauge overall code quality, such as code structure and an appropriate amount of comments. The submitters are expected to write readable code with appropriate comments, particularly in non-trivial sections of the code base.
In the Free/Not Free review criteria, full score will be given to a contestant whose solution uses only services that are completely free to use. If any service used is not free, the submission will receive a zero score in this section.
The weightage is as follows:
- Objective Review - 75%
- Subjective Review - 20%
- Free/Not Free - 5%
Deployment Environment - Deployable on any platform that can execute Python via a terminal or command line
Language - Python 3.9.2 (mandatory)
Dataset & Other Resource access
The dataset and some additional ancillary resources have been shared in the challenge forum.
What To Submit
- Source code - The source code of the script
- Documentation - Detailed instructions to deploy the solution from scratch. It should also detail the approach used in the solution. Note - The documentation must explain how the case of a geocode not being found has been handled, and how the solution suggests the nearest clinic in a conflicting situation, for example when there is a nearby clinic with a known geocode versus a clinic with an unknown geocode in the same or a nearby FSA.
- Verification Instructions - Steps to verify the solution should be communicated via a demo video.