As healthcare systems evolve through mergers, acquisitions, and partnerships, there is a large problem identifying and recognizing duplicate and erroneous information on entities such as doctors, practices, and clinics when the data from various sources is combined. There is a growing need to establish a single “source of truth” in a process known as Master Data Management.
We created and staggered 2 different types of data science challenges to arrive at our winning data solutions.
Members were tasked with creating an algorithm to identify and recognize duplicate and erroneous information. The first challenge allowed members to compete without library or software constraints. We followed the first challenge up with a Data Science Ideation Contest, testing the 5 winning algorithms from the previous challenge. on a great variety of types of errors to determine their relative strengths.
The winning solution first reduced the dimension of the problem using a clever hashing to create a subset of records most likely containing duplicates. Then, it created features from the text fields of the data set with which to build a predictive model. Next, it used a forest of extremely randomized trees to predict the duplicate records.