How Automation Can Help with Healthcare’s “Messy” Data Problem

Dr. Bob Lindner, Co-Founder & Chief Science Officer of Veda

Efficient data processing and data sharing are essential functions across healthcare, from patient care, clinical research, and health services planning to billing and government reporting for funding and research. But most of the data the industry processes is human-generated: messy, riddled with errors, and often manually entered from Excel spreadsheets into various disparate platforms and technology systems. In some cases, our healthcare system is running on infrastructure that is more than 20 years old.

Many processes across the system are overly burdensome from an administrative standpoint, pulling crucial resources away from patient care. For example, a single national U.S. insurer that I worked with had more than 500 employees manually keying in provider and facility data: the information members need to find doctors covered by their insurance. Yet the average accuracy of that manually keyed information was less than 60%. The time it takes to manually enter and clean this type of data is a clear driver of the $1 trillion in annual administrative costs that plague healthcare.

Automation seems like a great solution, for this and other use cases, but not all automation solutions are created equal. Solutions that work for other industries aren't equipped to handle healthcare's distinctive data sets. Healthcare's messy, inconsistent data is extremely difficult to process efficiently, which is why it has been so hard to pull insights from the industry's stores of data, let alone solve its interoperability problems. The result? Healthcare stakeholders are stuck with highly inefficient systems that hinder care.

Fortunately, new approaches to automation specifically designed to work with healthcare's unique "messy" data can help cut superfluous administrative costs and wasted time, leading to more efficient care, fewer errors, and compliance with new legislation such as the No Surprises Act, which went into effect on January 1, 2022 and requires plans to make provider roster data updates within 48 hours. If we're ever going to tackle the challenges that plague the industry, healthcare leaders need to look to this new era of automation tools purpose-built to learn and understand health data problems.

Here are three considerations for assessing the ability of a given automation solution to efficiently deal with a health plan’s “messy” data:

1. The types of data you have and its quality 

There are generally two types of data—inside data and outside data—and the type that an organization is dealing with directly correlates to the quality (or “cleanliness”) of that data. 

Inside data is created by users interacting with a web platform; think of the user data created by retail companies like Amazon, Uber, and DoorDash. These convenient in-app processes are easily deployed and produce clean data logged entirely by machines, without the missing values, misspellings, or other outliers seen in manually entered data. Inside data is well behaved because it is created by a system with a strict data schema. In contrast, outside data conforms to no schema and is subject to the free will of the creative minds entering it. In outside data you might find a phone number in the address field or see doctors' credentials appended to their first or last names. Outside data is at the heart of the healthcare industry's data problems. This is the human-generated "messy" data (think emojis instead of text, blank cells, and so on) that must somehow travel across multiple disparate systems, from provider to payer and beyond.
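To make the contrast concrete, here is a minimal sketch of how a strict schema catches the kinds of "outside data" errors described above. The field names and validation rules are illustrative assumptions, not any plan's or vendor's actual schema:

```python
import re

PHONE_PATTERN = r"\d{3}-\d{3}-\d{4}"

# Hypothetical provider-record schema: each field maps to a validation rule.
SCHEMA = {
    # A name should not carry credentials like "MD" appended to it.
    "first_name": lambda v: bool(v) and not re.search(r"\b(MD|DO|PhD)\b", v or ""),
    # A phone number must match the expected format exactly.
    "phone": lambda v: re.fullmatch(PHONE_PATTERN, v or "") is not None,
    # An address should not be a phone number that wandered into the wrong field.
    "address": lambda v: bool(v) and not re.fullmatch(PHONE_PATTERN, v or ""),
}

def validate(record):
    """Return the list of fields that violate the schema."""
    return [field for field, rule in SCHEMA.items()
            if not rule(record.get(field))]

# "Outside" data: credentials in the name field, a phone number in the address.
messy = {"first_name": "Jane MD", "phone": "555-867-5309",
         "address": "608-555-0100"}
print(validate(messy))  # ['first_name', 'address']
```

An "inside" system enforces rules like these at entry time, so bad values never get stored; with outside data, validation like this can only flag problems after the fact.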

2. Existing human knowledge and processes

Once data quality is assessed, it's equally important to determine the context in which the data is received and ingested in order to set realistic goals for more efficient processing. Payers, for example, have unique goals and data processing needs. The volume of provider data received by national health plans is far greater than that of plans operating in a single market. While some plans have started to document the data they're processing, others haven't even begun to evaluate their data. Payers need to set realistic goals for themselves; those that are further ahead will likely be able to comply beyond what the No Surprises Act requires.

What is required upfront to power these tools is the codification of the decision-making processes in place at the plan under its current manual data entry infrastructure. The knowledge residing in human brains (or, in some organizations, in formal manuals and documentation) needs to be taught to the automation solution, which can then replicate human decision-making. This means the first step in building the "new" solution is to deeply understand the business process of the "old" one. After all, it's no good to speed a process up 100x if a lack of well-understood process means the errors and fallout multiply 100x too!
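One way to picture this codification step: each judgment a data-entry clerk makes implicitly becomes an explicit, testable rule. The rules below are illustrative assumptions about what such judgments might look like, not an actual plan's logic:

```python
import re

def strip_credentials(name):
    """Split 'Jane Smith, MD' into ('Jane Smith', 'MD'), the way a clerk
    mentally separates a name from its trailing credential."""
    m = re.search(r"\s*,?\s*(MD|DO|PhD|NP)$", name)
    if m:
        return name[:m.start()].strip(), m.group(1)
    return name.strip(), None

def normalize_phone(raw):
    """A clerk reads '(555) 867 5309' and keys in 555-867-5309; this rule
    does the same, returning None when the value isn't a 10-digit number."""
    digits = re.sub(r"\D", "", raw)
    return f"{digits[:3]}-{digits[3:6]}-{digits[6:]}" if len(digits) == 10 else None

print(strip_credentials("Jane Smith, MD"))  # ('Jane Smith', 'MD')
print(normalize_phone("(555) 867 5309"))    # 555-867-5309
```

Written this way, the old process becomes something the team can review, test, and hand to an automation system, rather than knowledge that lives only in individual employees' heads.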

There is a significant amount of upfront work to do here, but the payoff is huge. Once provider roster data ingestion is automated, instead of scrolling through upwards of 40 inaccurate entries in their plan's provider directory, patients can locate the right medical professional on the first try.

3. The appropriate level of collaboration between humans and machines 

Healthcare organizations have struggled for years to get their arms around human-generated data, without success. While human-centered data processes are slow and error-prone, complete automation is impractical.

To analyze and automate “messy” data, smart human-in-the-loop AI solutions—artificial intelligence that leverages both human and machine intelligence—are necessary. This approach offers a perfect middle ground for healthcare plans and vendors to use AI and human decision-making to improve data and, in return, offer better patient care, health outcomes, and medical research.

Human-in-the-loop automation means the automation technology acts as a translator, sitting between existing systems and constantly translating data from one format to another so it can flow easily without disrupting current systems. As mentioned above, the automation solutions must be taught how humans make decisions from the start, and they must also "learn" when to flag a human during the data processing workflow itself. Ultimately, AI doesn't replace humans in this case; it slots in where there are repetitive tasks it can be taught, and humans remain plugged into the workflows at key decision points.
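A common way to implement the "flag a human" idea is confidence-based routing: the system handles records it is sure about and escalates the rest. This is a minimal sketch under assumed names and an assumed threshold, not a description of any particular product:

```python
# Hypothetical human-in-the-loop routing: records the model is confident
# about flow through automatically; uncertain ones go to a reviewer.
REVIEW_THRESHOLD = 0.90  # assumed cutoff; a real system would tune this

def route(record, confidence):
    """Decide the next step for a processed record based on model confidence."""
    if confidence >= REVIEW_THRESHOLD:
        return ("auto_accept", record)
    return ("human_review", record)

print(route({"npi": "1234567890"}, 0.97))  # handled automatically
print(route({"npi": "1234567890"}, 0.62))  # escalated to a human
```

The threshold is the dial that sets the division of labor: raise it and more records reach human reviewers; lower it and more flow through untouched.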

When it comes to solving its messy data problems, the healthcare industry clearly still has a ways to go. By bringing new automation solutions and approaches to the table, healthcare organizations can dramatically reduce unnecessary costs, enable compliance, and modernize so they can deliver better patient care and a streamlined healthcare experience for everyone.


About Bob Lindner, PhD

As Co-Founder and Chief Science Officer, Dr. Bob Lindner oversees Veda’s scientific and research teams. He provides strategic vision, builds innovative technologies, and connects Veda’s scientists to its Scientific Advisory Board.

Bob fell hard for data science and has a passion for solving big problems. With over ten years’ experience, he is a published and acclaimed astrophysicist with expertise in modeling data and designing and building cloud-based machine learning systems.

Bob has made a number of significant discoveries and "first detections" in his years of research. Most notably, he created machine learning code that automates and accelerates scientists' analysis of data from next-generation telescopes. That program, Gausspy, continues to advance scientists' understanding of our galaxy's origins. He earned his Ph.D. in physics from Rutgers University and was a postdoctoral researcher at UW-Madison, where he led the development of Gausspy.