Big Data Environments Using Hadoop Have Peaked

Apache Hadoop is an open-source software framework for storing data and running applications on clusters of commodity hardware. It provides massive storage for any kind of data (structured and unstructured), enormous processing power, and the ability to handle a very large number of concurrent tasks or jobs. Hadoop automatically replicates data across the machines in a cluster, so the failure of an individual machine does not cause data loss. Much of the cloud environment for storing and processing large data sets was developed on Hadoop. However, Hadoop has some limitations:

  • It is inefficient for smaller data sets.
  • Security measures are not enabled by default and must be set up by the team implementing it.
  • It runs on Java, which is not an easy language for beginner application developers.
  • It is poor at real-time analytics.
  • It doesn’t provide the visuals and reporting needed to support robust business intelligence.
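
As a rough illustration of why the automatic data copies described above protect against a failure, the following Python sketch places each block of data on three different machines and then checks which blocks remain readable after some machines fail. The node names, block names, and round-robin placement are assumptions made for this example; a real Hadoop cluster uses its own, more sophisticated placement logic.

```python
REPLICATION_FACTOR = 3  # assumed here; Hadoop's replication factor is configurable


def place_blocks(block_ids, nodes, replication=REPLICATION_FACTOR):
    """Assign each block to `replication` distinct nodes, round-robin.

    Toy placement for illustration only, not Hadoop's actual policy.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for i, block in enumerate(block_ids):
        # Consecutive offsets modulo the node count give distinct nodes.
        placement[block] = {nodes[(i + k) % len(nodes)] for k in range(replication)}
    return placement


def readable_blocks(placement, failed_nodes):
    """Return the blocks that still have at least one copy on a live node."""
    failed = set(failed_nodes)
    return {block for block, holders in placement.items() if holders - failed}


# Hypothetical four-machine cluster holding six blocks of data.
nodes = ["node-a", "node-b", "node-c", "node-d"]
blocks = [f"block-{n}" for n in range(6)]
placement = place_blocks(blocks, nodes)

survivors = readable_blocks(placement, failed_nodes=["node-b"])
print(f"{len(survivors)} of {len(blocks)} blocks readable after one failure")
```

In this sketch, data only becomes unreadable once every machine holding a copy of the same block has failed, which is the trade-off replication makes: extra storage in exchange for resilience.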

Tools such as Spark, another Apache open-source project, can be used to speed up processing on Hadoop and offer better programming support, since Spark provides a SQL interface alongside APIs for languages such as Python and Scala. However, there are currently no tools for Hadoop that offer comprehensive data standardization, data management, and data governance.

As healthcare moves to value-based care service and reimbursement structures, the ability to gather, store, manipulate, and analyze large sets of data from across myriad modalities of care will be a requirement for long-term success and viability for healthcare providers. Legacy enterprise data warehouse solutions based on Hadoop will likely not be competitive for managing big data environments with various data formats.

Kubernetes Emerges Out of Necessity to Improve Data Lake Analytics

The healthcare market has moved from siloed large databases to data lake environments that both store and analyze structured and unstructured data from many internal and external sources. As ever-larger data sets strained data storage and analytics, Google developed Kubernetes, a container orchestration platform, out of necessity to better manage the applications that process petabytes of information.

Kubernetes is an ecosystem of components and tools that improve the efficiency of developing and running applications in public and private clouds. IT teams can implement and manage applications quickly and predictably, scale them in real time, roll out new features without application disruption, and optimize hardware resources as needed for the applications. The advantages of Kubernetes include:

  • Supporting cloud-native microservice applications that change frequently and benefit from dynamic, cloud-like scaling.
  • Modernizing existing applications, for example by putting them into containers to improve agility and combining them with modern cloud application services.
  • Supporting existing applications at lower cost and without the CPU overhead of virtualization.
  • Running most AI/ML frameworks.
  • Providing a broad set of data-centric and security-centric applications that run in highly automated environments.
  • Enabling edge computing, where applications run in containers on low-cost devices.
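
To make the idea of scaling applications in real time more concrete, here is a minimal Python sketch of the replica calculation documented for the Kubernetes Horizontal Pod Autoscaler: the desired number of copies of an application is the current number multiplied by the ratio of the observed load to the target load, rounded up. The function name, the replica limits, and the utilization figures are assumptions chosen for illustration, and the sketch leaves out details that the real autoscaler handles.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Estimate how many replicas are needed to bring load back to target.

    Follows the documented autoscaler formula:
        desired = ceil(current_replicas * current_metric / target_metric)
    The min/max limits and the metric values passed in are hypothetical.
    """
    ratio = current_metric / target_metric
    # Skip scaling when the load is already close enough to the target.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Keep the result inside the configured limits.
    return max(min_replicas, min(max_replicas, desired))


# Hypothetical example: 4 replicas running at 90% CPU against a 60% target.
print(desired_replicas(4, current_metric=90, target_metric=60))
```

When load falls, the same calculation shrinks the application, which is how hardware resources are released for other workloads.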

Emerging data lake solutions from Amazon and Snowflake use Kubernetes, and we expect the use of Kubernetes for cloud solutions to increase dramatically in the next few years because Kubernetes is open source.

The Cloud Becomes the Borg Collective for Healthcare Computing

Healthcare IT solutions are increasingly becoming cloud-based to lower costs, increase performance, improve fault tolerance, and likely improve security. In Star Trek, the Borg are cyborgs that are all connected to the collective, which drives a controlled and very efficient society. Healthcare is likewise becoming a more data-driven environment for evaluating patient risk, evidence-based medicine, standardized care, clinical outcomes, and financial risk. Large volumes of data will need to be processed and analyzed as quickly as possible to ensure that appropriate care and financial decisions are made. Data lakes that combine data from several healthcare entities will be needed to generate the more accurate analytics and business intelligence that organizations require to survive the value-based care transformation.

Big Technology Players Drive Kubernetes Adoption

Several large technology companies use Kubernetes for their cloud services and applications. Representative companies and offerings include:

  • Google, the original developer of Kubernetes.
  • Amazon HealthLake, a recently announced data lake solution for healthcare.
  • Snowflake, an emerging cloud solution that supports data lakes.
  • Microsoft Azure, Microsoft’s suite of cloud-based solutions.

Success Factors

  1. Provider organizations should evaluate whether their current enterprise data warehouse solutions can meet their data storage and data analytics needs as they begin accessing new external data sources (e.g., Epic Health Research Network).
  2. If current data warehouse solutions are on-premises, an analysis of data acquisition costs, data storage costs, and data analytics capabilities should be conducted for comparison to cloud-based data management solutions.
  3. Provider organizations should monitor the uptake of Kubernetes by cloud-based healthcare solutions to determine when adoption becomes feasible given their requirements for stability and risk.

Summary

Technology companies with large client bases on their cloud-based solutions have been driven by necessity to move from data architectures based on Hadoop to the newer architecture of Kubernetes to deliver more efficient data management and analytics services. Higher application performance, more efficient allocation of computing resources, broader support for existing AI/ML frameworks, and the ability to use edge servers to contain computing costs all align well with the needs of healthcare organizations that must manage and analyze several types of internal and external data formats. Amazon's launch of a data lake solution specifically for healthcare suggests that large technology companies are advancing cloud-based data management to appropriately support healthcare clients. Provider organizations will need to perform a strategic analysis of their data environments to determine the most cost-effective and painless path for moving to the newer data lake solutions based on Kubernetes. Resistance is futile.

Photo Credit: Adobe Stock, Kittiphat