The Problem: Bad Data Equals Garbage In and Poor Analytics Results

According to surveys conducted by Anaconda and Figure Eight, data scientists spend 45% of their time preparing data, and data cleaning can take a quarter of that time. Data cleaning fixes or discards anomalous or incorrect values and otherwise ensures that the data accurately represents what the analytics model is meant to measure. Automating the task is challenging because different data sets require different types of cleaning, and common-sense judgment calls about objects in the world are often needed (e.g., deciding which Davis in the city of Denver, CO, a record refers to).

Poor data quality is likely caused by four factors:

  • No integration of data sources. In healthcare, one challenge is making sure the same data elements carry the same meaning in every source system. This can be offset by using coded data that is captured in the EHR or claims database. If data has to be rekeyed, transcription errors are introduced.
  • Inconsistent data-capture protocols. In some systems, data is captured as encoded data. In other systems, the data may be in free-text formats.
  • Poor data migration or system consolidations. Inconsistent data formats can negatively impact the migration of data from one system to another.
  • Data decay. How often is the data updated? If data is not routinely evaluated for consistency and usage, it may lead to inaccurate analytics.
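
The capture-protocol and data-decay factors above lend themselves to simple automated checks. The following is a minimal sketch, assuming a hypothetical record layout (the field names `code`, `result`, and `updated`, the sample values, and the 365-day staleness threshold are all illustrative and not drawn from any real system). It flags records that lack a code, hold free text where a number is expected, or have not been updated recently.

```python
from datetime import date

# Hypothetical records; field names and values are illustrative only.
records = [
    {"id": 1, "code": "LAB-001", "result": "98", "updated": date(2024, 5, 1)},
    {"id": 2, "code": None, "result": "glucose normal", "updated": date(2019, 3, 12)},
    {"id": 3, "code": "LAB-002", "result": "13.5", "updated": date(2022, 11, 20)},
]


def is_numeric(text):
    """Return True if the text parses as a number."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def profile(records, as_of, max_age_days=365):
    """Return ids of records that are uncoded, free-text, or stale."""
    report = {"uncoded": [], "free_text": [], "stale": []}
    for rec in records:
        if not rec["code"]:
            report["uncoded"].append(rec["id"])
        if not is_numeric(rec["result"]):
            report["free_text"].append(rec["id"])
        # Assumed staleness rule: not updated within max_age_days.
        if (as_of - rec["updated"]).days > max_age_days:
            report["stale"].append(rec["id"])
    return report


report = profile(records, as_of=date(2024, 6, 1))
print(report)
```

A report like this does not clean anything by itself; it only shows where capture protocols diverge and which records may have decayed.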

Remember the old computer science adage—“Garbage In, Garbage Out”!

The Solution: Applying Bayesian Logic to Data Cleaning

Bayesian inference is a method of statistical inference in which Bayes' theorem is used to update the probability of a hypothesis as more evidence or information becomes available. MIT researchers have created a new system, called PClean, that automatically cleans the “dirty data” dreaded by data analysts, data engineers, and data scientists: typos, duplicates, missing values, misspellings, and inconsistencies. PClean provides generic common-sense models for judgment calls that can be customized to specific databases and types of errors. It is the first Bayesian data-cleaning system that can combine domain expertise with common-sense reasoning to automatically clean databases of millions of records. PClean also incorporates an AI programming model developed by the MIT Probabilistic Computing Project. PClean achieves this scale via three innovations:

  • First, PClean's scripting language lets users encode what they know. This yields accurate models, even for complex databases.
  • Second, PClean's inference algorithm uses a two-phase approach based on processing records one at a time to make informed guesses about how to clean them and then revisiting the judgment calls to fix mistakes. This yields robust, accurate inference results.
  • Third, PClean provides a custom compiler that generates fast inference code. This allows PClean to run on million-record databases with greater speed than multiple competing approaches.
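
To make the Bayesian idea concrete, the sketch below repairs a single misspelled field by applying Bayes' theorem: P(clean value | observed value) is proportional to P(observed value | clean value) × P(clean value). This is a toy illustration only, not PClean's scripting language, API, or inference algorithm. The candidate values, the prior counts, and the error model (each character edit multiplies the likelihood by an assumed `typo_rate`) are all hypothetical.

```python
# Toy Bayesian repair of one field; not PClean's API or algorithm.
# Prior: how often each clean value appears (hypothetical counts).
prior_counts = {"Denver": 90, "Dover": 10}


def edit_distance(a, b):
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def likelihood(observed, clean, typo_rate=0.1):
    """Assumed error model: each edit scales the probability by typo_rate."""
    return typo_rate ** edit_distance(observed, clean)


def posterior(observed):
    """Bayes' theorem: normalize prior x likelihood over all candidates."""
    total = sum(prior_counts.values())
    unnorm = {
        clean: (count / total) * likelihood(observed, clean)
        for clean, count in prior_counts.items()
    }
    z = sum(unnorm.values())
    return {clean: p / z for clean, p in unnorm.items()}


post = posterior("Denvr")
best = max(post, key=post.get)
print(best, post)
```

The prior is where domain knowledge enters (which values are common), and the likelihood is where the error model enters (how values get corrupted). A production system would have to learn or specify both for every field and handle relationships across records.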

Other data-cleaning solutions are likely to use probabilistic programming models. These models are evolving and will compete with the MIT solution while the market evaluates the ROI of both approaches.

The Justification: Data-Cleaning Solutions Will Enable Evaluations of Data-Capture Processes

Healthcare organizations will continue to invest large amounts of informaticist and data scientist time in preparing and cleaning data for use with analytics, business intelligence, and artificial intelligence applications. This time would be better spent on upstream data acquisition processes for all enterprise applications. Data governance boards should focus on guiding the organization to standard data models that dictate the capture of data in defined formats and time frames. Once most enterprise applications meet the data governance standards, data cleaning will take little if any time. As a result, analytics and artificial intelligence programs will generate results that improve the organization's business efficiency, medical outcomes, and quality of care.

The Players: Emerging and Existing Data-Cleaning Solution Competition Will Benefit Healthcare

Data-cleaning solutions based on probabilistic programming applications range from focused functions to broad data-management capabilities.

  • PClean—An emerging Bayesian and AI probabilistic programming model.
  • Informatica—A vendor that supports end-to-end data governance and management.
  • Domo—An Integration Platform as a Service solution that helps you connect and transform data from any source and doesn’t require any coding.
  • Gen—A probabilistic programming system embedded in the Julia language.

Success Factors

  1. Evaluate data-cleaning solutions based on their incorporation of probabilistic programming into an AI platform. This will likely ensure higher flexibility and faster learning processes.
  2. Determine whether the data-cleaning solution should be deployed as a front-end solution for the analytics/AI process, or implemented as part of the data-capture streams in the cloud.
  3. Involve informaticists and data scientists in the review and selection process for data-cleaning solutions.


Data cleaning represents significant overhead for highly skilled and expensive resources. Reducing the time spent cleaning data to ensure accurate analytics and AI results will free those resources to update existing models or create new ones.

Reducing data-cleaning effort will also allow healthcare organizations to focus data governance efforts on capturing clean data and/or driving improvements in standardizing data formats across enterprise systems.

The healthcare industry still has data sets that are not standardized, such as symptoms or laboratory tests. Laboratories could quickly standardize their data with the use of LOINC codes. Symptoms are not standardized, although some are associated with ICD-10 codes.

These examples demonstrate the challenges that informaticists and data scientists face when evaluating data sets from different applications and different organizations. Data-cleaning solutions will eliminate the time wasted on matching data, freeing that time for creating effective data models.
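
The laboratory case mentioned above can be sketched as a simple lookup from local test names to LOINC codes. This is a minimal sketch under stated assumptions: the local names are invented, the mapping table is hand-built for illustration, and any code used in practice should be verified against the official LOINC release.

```python
# Hypothetical local-name-to-LOINC map; local names are invented and the
# codes should be verified against the official LOINC release before use.
LOCAL_TO_LOINC = {
    "glucose": "2345-7",     # Glucose [Mass/volume] in Serum or Plasma
    "glu": "2345-7",
    "hemoglobin": "718-7",   # Hemoglobin [Mass/volume] in Blood
    "hgb": "718-7",
}


def normalize(name):
    """Lowercase the name and collapse surrounding or repeated whitespace."""
    return " ".join(name.lower().split())


def to_loinc(local_name):
    """Return the mapped LOINC code, or None when no mapping is known."""
    return LOCAL_TO_LOINC.get(normalize(local_name))


results = [to_loinc(name) for name in ["GLU", " Hemoglobin ", "Serum K"]]
print(results)
```

Names that return None are the ones that still need a human judgment call or a richer matching approach, which is where most of the matching time tends to go.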

Photo credit: nattaphol, Adobe Stock