The Problem: Balancing Patient Data Privacy Regulations and Needed Data Analytics

The US established patient privacy rules for healthcare data with HIPAA regulations, and the European Union has similar patient data privacy regulations with the General Data Protection Regulation (GDPR). The purpose of these regulations is to ensure that patient medical data is not accessed or used without patient permission other than for healthcare treatments.

Several healthcare projects are attempting to build large data repositories of de-identified patient data to advance healthcare data analytics and associated AI deep learning models and algorithms. These advancements are expected to provide beneficial insights into patient treatments, outcomes, medication efficacy, and protocols for treating disease and chronic illness. Examples of these healthcare projects include Truveta, Google and Ascension, Google and Mayo Clinic, and, globally, the Health Data Collaborative.

The challenge in all these efforts is to create an effective de-identification process for patient data that ensures patient privacy regulations are met. Several de-identification tools are available to process structured data, unstructured data, and images. Many of these tools have been developed by leading medical universities and are not commercial, off-the-shelf applications that are supported and frequently updated with new capabilities. Using these tools will likely create a significant amount of overhead for the data informaticists or scientists who are working to create large healthcare data lakes that can be supported and maintained for long periods of time. New commercial solutions for de-identifying data are emerging, and these new solutions may dramatically increase the value of healthcare data analytics and AI.

The Solution: Synthetic Data Emerges to Resolve Healthcare Data De-Identification Challenges

Synthetic data solutions are emerging to assist data scientists with preparing de-identified data that can be used to create large aggregate databases to generate more accurate data analytics, AI models, and AI algorithms. Synthetic data is annotated information that computer simulations or algorithms generate as an alternative to real-world data. Synthetic data may be artificial, but it mathematically or statistically reflects real-world data. Several studies attest to the benefits of using synthetic data for AI models:

  • A team at Deloitte trained an AI model on a data set that was 80% synthetic. The model provided the same level of accuracy that a model trained on real data would have provided.
  • A 2020 study from the American University of Beirut shows that using synthetic data improves machine learning model performance by up to 20% when categorizing actions in videos.
  • Research published by De Gruyter demonstrates the ability to identify drivers of cars with 87% accuracy by analyzing synthesized sensor data generated by vehicles.
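
To make the idea that synthetic data "statistically reflects real-world data" concrete, the following is a minimal sketch, not a description of how any commercial product works. It assumes a hypothetical two-column data set (age and systolic blood pressure, with invented values), fits a simple bivariate Gaussian to it, and samples artificial records from that fit. The function names and the toy records are illustrative assumptions. Matching summary statistics alone does not establish that a data set meets privacy regulations.

```python
import math
import random
import statistics

# Hypothetical toy "real" records: (age in years, systolic blood pressure in mmHg).
# These values are invented for illustration and are not patient data.
REAL_RECORDS = [
    (34, 118), (45, 125), (52, 131), (61, 140), (29, 112),
    (70, 148), (48, 127), (57, 135), (39, 121), (66, 144),
]


def fit_bivariate_gaussian(records):
    """Estimate the means, standard deviations, and correlation of two columns."""
    xs = [r[0] for r in records]
    ys = [r[1] for r in records]
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    sd_x, sd_y = statistics.stdev(xs), statistics.stdev(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in records) / (len(records) - 1)
    return mean_x, mean_y, sd_x, sd_y, cov / (sd_x * sd_y)


def sample_synthetic(params, n, seed=0):
    """Draw n artificial records from the fitted distribution."""
    mean_x, mean_y, sd_x, sd_y, rho = params
    rng = random.Random(seed)
    # Guard against tiny negative values caused by floating-point rounding.
    residual = math.sqrt(max(0.0, 1.0 - rho ** 2))
    synthetic = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = mean_x + sd_x * z1
        # Mix z1 into y so the synthetic columns keep the fitted correlation.
        y = mean_y + sd_y * (rho * z1 + residual * z2)
        synthetic.append((x, y))
    return synthetic


if __name__ == "__main__":
    real_params = fit_bivariate_gaussian(REAL_RECORDS)
    synthetic_records = sample_synthetic(real_params, n=5000)
    synthetic_params = fit_bivariate_gaussian(synthetic_records)
    print("real     :", [round(p, 2) for p in real_params])
    print("synthetic:", [round(p, 2) for p in synthetic_params])
```

Running the script prints the fitted parameters of the toy records next to those of the sampled records, so the two can be compared directly. None of the sampled records is copied from the input; each is drawn from the fitted distribution. Real patient data has many more columns, mixed data types, and non-Gaussian structure, so a production approach would need a far richer generative model than this sketch.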

Synthetic data will become increasingly valuable for supporting deep learning AI models. Deep learning that supports neural programming, bioinformatics, and natural language processing will benefit from large-volume synthetic data sets.

Gartner estimates that by 2024, 60% of the data used for the development of AI and analytics projects will be synthetically generated. This projection suggests that the market is about to engage in a rapid uptake and utilization of synthetic data, but healthcare providers tend to lag behind the adoption curves of other industries. Emerging commercial synthetic data solutions will drive higher adoption.

The Justification: Synthetic Data Will Drive Higher Success Rates for AI Projects

Synthetic data solutions will allow healthcare organizations to generate the large data sets that are needed to produce more accurate analytics and AI models, which in turn yield algorithms whose output continues to improve. Synthetic data solutions will perform these functions while keeping the underlying patient data confidential and preventing the identification of individual patients. This approach for creating large patient data sets will also protect consumers from unauthorized use of the data by large technology companies (e.g., Google, Microsoft, and Amazon). We expect that some of Google's existing data collaborations with Mayo Clinic and Ascension will convert to using synthetic data if they are not doing so already. Synthetic data will be a catalyst for high success rates with AI projects.

The Players: Emerging Commercial Synthetic Data Companies

While synthetic data solutions have been developed by universities for their needs, the healthcare provider market will require commercial solutions to provide the support functions expected by the market segment. The following are some of the emerging vendors.

  • Replica Analytics: Supports EHR data sharing between collaborators.
  • MDClone: Enables collaboration across teams, organizations, and external third parties with the use of synthetic data.
  • Statice: Minimizes privacy risk for patient data analysis.

Success Factors:

  1. Provider organizations that desire to expand or improve their AI capabilities should first identify similar organizations to collaborate with to create large-modeled healthcare data sets.
  2. Once collaboration partners have been identified, provider organizations should test the synthetic data solutions for generating de-identified patient data against deep learning AI algorithms that will improve healthcare delivery. This testing should be conducted in well-controlled environments such as innovation centers.
  3. Once new AI algorithms have been validated, the collaboration can expand the creation and testing of new algorithms that benefit the collaborative partners.


Using large patient data sets while complying with patient privacy regulations makes it significantly harder for many organizations to implement and expand AI projects. Synthetic data provides a solution for overcoming these challenges. Large healthcare organizations and medical centers can create custom synthetic data solutions because they are able to hire skilled programmers and informaticists. Most healthcare provider organizations do not have the budgets to recruit and retain skilled resources to support their AI deep learning projects. Commercial synthetic data vendors are emerging and will allow many healthcare organizations to recruit data collaboration partners for sharing synthesized patient data to drive higher levels of AI success.

We hope that existing data collaboration partnerships between technology companies and healthcare organizations will convert patient data to synthetic data to ensure the privacy and confidentiality of patient data.

Synthetic data solutions will drive data sharing among growing healthcare provider organizations, which will help them achieve the AI benefits that the industry envisions.

Photo Credit: fedrunovan, Adobe Stock