The Challenges of Medical Data

In recent years, there have been many developments in applications of machine learning to the medical industry. We have heard news of machine learning systems outperforming seasoned physicians on diagnostic accuracy, chatbots that present recommendations based on your symptoms, and algorithms that can identify body parts from transverse image slices, just to name a few.

And yet, despite the astounding success of some of these applications, adoption isn't widespread yet. Odds are that your local hospital, pharmacy or medical institution's definition of being data-driven is keeping files in labelled file cabinets, as opposed to one single drawer.

There are several hurdles to introducing these systems to the healthcare industry, such as:

  • People are very sensitive when it comes to their health, and they don't want to depend on algorithms for their well-being.
  • This industry is slow to adopt new practices, due to being highly regulated and driven by policies (which is understandable, given that people's private data and health is at stake).
  • The incentives of the medical practitioners are sometimes misaligned with the incentives of the patients/clients.
  • The healthcare infrastructure is expensive, silo-based, and hard to replace.
  • Deep learning models are vulnerable to malicious adversarial examples.

One paper suggests that there is a need for a re-orientation of the healthcare industry to be more "patient-centric". In this more transparent system, the patients are empowered with information, procedures are re-designed and simplified, and the preferences of patients are incorporated into the treatments.

Furthermore, clean and accessible data, along with data-driven automation, can assist medical professionals in taking this patient-centric approach by freeing them from some time-consuming processes.

With these points in mind, I argue that the biggest hurdle to the widespread adoption of these advanced techniques in the healthcare industry is not intrinsic to the industry itself, or in any way related to its practitioners or patients, but simply the current lack of high-quality data pipelines.

Example: Dr. Lecter, a nutritionist, had an appointment with a new client to assess his diet. He took some notes which he needed to store for future reference, and to compare with other patients. The next day, he noticed that he had forgotten to fill in one of the fields, so he just wrote it down from memory. He keeps everything in loosely organized hand-written files and some Excel spreadsheets on his personal computer, and wishes that his daily workflow was better.

His patient, meanwhile, has to wait to be contacted by Dr. Lecter, or has to schedule another appointment to inform the doctor of any changes.

What makes a good Data Pipeline?

A "data pipeline" is a broad term that refers to a process or series of processes that move data from one system to another, possibly transforming it along the way, or augmenting it other data from external sources.

A simple example of a data pipeline, transforming raw data, and converting it into a dashboard.
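
To make this concrete, here is a minimal sketch of such a pipeline in Python; the file names and columns are made up for illustration. It reads raw appointment records from a CSV file, aggregates them, and writes a small table that a dashboard tool could pick up:

    import csv
    from collections import defaultdict

    # Hypothetical file names: raw appointment records in, an aggregated table out.
    RAW_FILE = "appointments_raw.csv"      # columns: patient_id, date, weight_kg
    OUTPUT_FILE = "weight_by_patient.csv"  # table a dashboard tool could read

    def extract(path):
        """Read the raw rows from the source file."""
        with open(path, newline="") as f:
            return list(csv.DictReader(f))

    def transform(rows):
        """Compute the average weight per patient, skipping malformed rows."""
        totals, counts = defaultdict(float), defaultdict(int)
        for row in rows:
            try:
                weight = float(row["weight_kg"])
            except (KeyError, ValueError):
                continue  # drop bad rows instead of crashing the whole run
            totals[row["patient_id"]] += weight
            counts[row["patient_id"]] += 1
        return [
            {"patient_id": pid, "avg_weight_kg": round(totals[pid] / counts[pid], 1)}
            for pid in totals
        ]

    def load(rows, path):
        """Write the aggregated table where the dashboard expects it."""
        with open(path, "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=["patient_id", "avg_weight_kg"])
            writer.writeheader()
            writer.writerows(rows)

    if __name__ == "__main__":
        load(transform(extract(RAW_FILE)), OUTPUT_FILE)

Real pipelines are rarely this simple, but the extract-transform-load shape stays the same as the volume and the number of sources grow.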

Good data pipelines are essential for any data-driven company. In an ideal situation, a company's data pipelines must efficiently process its data, so that:

  • Its employees can be informed, and can base their work decisions on fresh and accurate numerical information, as opposed to "gut feeling";
  • Its customers are empowered to make good decisions, and a relationship of transparency is cultivated between both parties.

There are several challenges to building good data pipelines. Some are more general and exist across all industries, such as:

  • Developing idempotent data pipelines: applying them multiple times in succession does not change the result of the initial application, which avoids, for example, loading duplicate data (a minimal sketch follows this list).
  • Correctly scheduling the data pipelines.
  • Handling cascading failures.
  • Scaling to large amounts of data.
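
To illustrate the first point, one common way to make a load step idempotent is to upsert on a natural key instead of blindly appending rows. A minimal sketch using SQLite, with hypothetical table and column names:

    import sqlite3

    # Hypothetical target table, keyed by (patient_id, measured_on).
    conn = sqlite3.connect("clinic.db")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS daily_weight (
            patient_id  TEXT NOT NULL,
            measured_on TEXT NOT NULL,   -- ISO date
            weight_kg   REAL NOT NULL,
            PRIMARY KEY (patient_id, measured_on)
        )
    """)

    batch = [
        ("p-001", "2023-05-01", 82.4),
        ("p-002", "2023-05-01", 67.9),
    ]

    # INSERT OR REPLACE makes the load idempotent: running the same batch twice
    # still leaves exactly one row per (patient_id, measured_on), no duplicates.
    conn.executemany(
        "INSERT OR REPLACE INTO daily_weight (patient_id, measured_on, weight_kg) "
        "VALUES (?, ?, ?)",
        batch,
    )
    conn.commit()
    conn.close()

Most warehouses offer an equivalent (MERGE, upsert, or delete-then-insert on the batch key); the important part is that a retried run leaves the table in the same state as a single successful run.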

However, each industry has its own types of data, which brings about particular challenges when it comes to building data pipelines. In the healthcare industry, the data is often characterized by being extremely private, sensitive, and important to the patients' well-being.

In many applications, data errors could lead to serious problems with the patient's health. As such, medical data pipelines absolutely require that measures be taken to ensure:

  • privacy
  • data availability (no delays when trying to retrieve the data)
  • robustness to errors (minimal downtime)
  • thorough testing (no errors in the data; a small validation sketch follows this list)
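
As a small illustration of the last point, a pipeline can reject records that fail basic plausibility checks before they ever reach the database. The field names and ranges below are made up for the sake of the example:

    # Hypothetical plausibility checks for a nutritionist's patient record.
    PLAUSIBLE_RANGES = {
        "age_years": (0, 110),
        "weight_kg": (2, 400),
        "height_cm": (40, 230),
    }

    def validate_record(record):
        """Return a list of problems; an empty list means the record may be loaded."""
        problems = []
        for field, (low, high) in PLAUSIBLE_RANGES.items():
            value = record.get(field)
            if value is None:
                problems.append(f"missing field: {field}")
            elif not low <= value <= high:
                problems.append(f"{field}={value} outside plausible range [{low}, {high}]")
        return problems

    # An implausible record is rejected before it ever reaches the database:
    print(validate_record({"age_years": 120, "weight_kg": 35, "height_cm": 170}))
    # -> ['age_years=120 outside plausible range [0, 110]']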

Having worked briefly with medical data during my master's thesis, I am familiar with the struggles posed by this type of data. Being exposed to these challenges helped me in my work: they are often not immediately obvious when tackling a new project, but they need to be kept in mind at all times when designing a data pipeline.

Furthermore, the different data pipelines in a company can interact in a myriad of complex ways, and a data engineer must be able to look at the system in a holistic way to detect potential data issues, security problems, or sources of errors, and select the right tools for the job.

This is why I believe experience plays an important role in developing good data pipelines. However, it's also very important that the data engineer takes a genuine interest in the domain and has access to the experts who work in the field. This context and industry-specific interest will help ensure that the final deliverable actually helps someone do their job, rather than just introducing a theoretically cool system that only makes jobs more difficult.

To summarize, this is what I believe to be the missing attribute in many of the successful healthcare applications that didn't reach widespread adoption: they work well in isolation and achieve state-of-the-art results, but no effort is made to develop robust data pipelines with which they can be integrated.

Example: Instead of taking hand-written notes, Dr. Lecter has a simple application which provides a template he can fill in during his appointments. This application warns him if any field is missing or has invalid values, and also uploads an anonymized version of this data to a remote database, which can be accessed by a handful of other nutritionists working for Dr. Lecter's company.

It mostly works well, although sometimes he accidentally inserts the same data twice, and once he came across a patient in the database aged 120 and weighing 35 kg, which he found very odd. Despite this, he is definitely happier, as he feels that his workflow is much more organized and that he has easier access to information. Still, he wishes that there was more he could do with all this data, instead of just looking at a bunch of rows in a database.

His patients, meanwhile, are happy with these recent technological developments, as they feel like everything is more organized. However, they still have to wait for the next appointment to give feedback to or receive feedback from the doctor.

Production-grade Applications

Even if the core of your application is giving state-of-the-art results, this might not be enough for it to gain trust and be widely adopted or integrated within larger systems. This is because your application's peripheral systems - such as the data pipelines that transport data in and out of it - also carry a substantial risk of failure, which can cascade through the whole system. Furthermore, even if your algorithm is perfect, it won't have much use if:

  • the data that is fed to it is not reliable, or degrades over time;
  • the data that comes out of it is not useful, or is not accessible to the people who need to use it.

When developing production-grade data applications, there are several questions you should be asking yourself, and they should drive the design process. Some examples:

  • am I using tools which allow scaling to larger amounts of data or processing? Or: can I easily employ horizontal scaling if I reach the limits of a single machine when my system is used in production?
  • am I following good software engineering guidelines for my application (such as these ones)? Or: is it clear how different data-sources bring data in and out of my application, and are the different concerns of my application separated in the code?
  • can my application withstand time-erosion? Or: will extended usage of my application eventually lead to critical failures due to memory limits being reached, updates to peripheral systems, data drift, improper horizontal scaling, or other issues?
  • does my application require a lot of ad-hoc human intervention? Or: how many opportunities for human error due to lack of attention, fatigue, or typing mistakes am I unnecessarily introducing in my application?
  • is my application fulfilling an actual business need? Or: is the output reaching the right people on time, and does it have the right format?
  • can I immediately detect errors in my application? Or: are there CI/CD procedures in place that can signal any potential data or code errors as soon as they start to appear? (A small example follows this list.)
  • can my application gain the trust of its users? Or: am I giving enough emphasis to security and privacy?
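
As a concrete example of the CI/CD point above, data quality checks can live in the same test suite that gates every deployment. Here is a minimal pytest sketch that reads back the hypothetical SQLite table from the earlier idempotency example; in practice this would be a query against your own warehouse:

    # test_data_quality.py -- run with `pytest` on every deploy and after each pipeline run.
    import sqlite3

    import pytest

    @pytest.fixture()
    def records():
        """Fetch the freshly loaded rows that the checks below inspect."""
        conn = sqlite3.connect("clinic.db")
        rows = conn.execute(
            "SELECT patient_id, measured_on, weight_kg FROM daily_weight"
        ).fetchall()
        conn.close()
        return rows

    def test_no_duplicate_measurements(records):
        keys = [(patient_id, measured_on) for patient_id, measured_on, _ in records]
        assert len(keys) == len(set(keys)), "duplicate (patient, date) rows found"

    def test_weights_are_plausible(records):
        bad = [row for row in records if not 2 <= row[2] <= 400]
        assert not bad, f"{len(bad)} rows have implausible weights"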

We consider these questions to be as important as the core algorithm itself, because without a clear answer to them, it's unlikely that your application will reach widespread adoption!

However, these are not the only problems: even if a perfect "data periphery" is guaranteed, deploying machine learning models to production brings its own set of problems that could fill an entire blog post, such as concept drift, fairness and discrimination, and self-reinforcing bias...

Example: The application used by Dr. Lecter has grown to hundreds of nutritionists across the country, and is now shared with the patients, who can access it with their own credentials using a "patient mode". They can keep their nutritionist updated with their daily weight, their exercise regimen and their food macros; they can also use a wearable device to track their heart rate during an exercise session.

Both types of data (daily values and heart rate sensor streams) are uploaded to a data lake in the cloud, which includes redundancy (to prevent loss of data), and is built to scale horizontally to thousands of patients. Security measures have been taken to ensure that this data is private and can't be accessed by anyone other than the patient and their doctor. This data is also transformed, anonymized, and copied to a data warehouse, where several "data marts" are created to feed live dashboards. These can be accessed by the doctors (to keep track of their patients), but also by the patients, to compare their progress with the average, for example.
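
As a rough sketch of what that anonymization step could look like, direct identifiers can be dropped and the patient key replaced with a salted hash before anything is copied to the warehouse. This is pseudonymization rather than a complete de-identification scheme, and the field names and salt handling below are purely illustrative:

    import hashlib
    import os

    # The salt should come from a secret store; an environment variable is used
    # here only to keep the sketch self-contained.
    SALT = os.environ.get("ANON_SALT", "change-me")

    def pseudonymize(patient_id: str) -> str:
        """Replace the patient identifier with a salted hash before it leaves the data lake."""
        return hashlib.sha256((SALT + patient_id).encode()).hexdigest()[:16]

    def anonymize_record(record: dict) -> dict:
        """Keep only analytic fields and a pseudonymous key; drop direct identifiers."""
        return {
            "patient_key": pseudonymize(record["patient_id"]),
            "age_years": record["age_years"],
            "weight_kg": record["weight_kg"],
            # name, address, and contact details are intentionally not copied
        }

    print(anonymize_record({"patient_id": "p-001", "name": "A. Client",
                            "age_years": 42, "weight_kg": 82.4}))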

The application was developed following good software engineering principles and has a robust test suite, so it is easy to add new functionalities without the risk of breaking something in production. The developers are thinking of developing an ML-based recommendation engine that can suggest diet changes to the patients, based on their nutritional needs.

The patients are much more relaxed, as they feel like they have a very dynamic relationship with their doctor. Dr. Lecter feels empowered, because he can base his recommendations on hard data and has a much easier time keeping everything organized. He will now celebrate this success with his favorite meal: liver with some fava beans and a nice chianti.

Did you relate to this post?

At DareData, we specialize in helping companies to become data-driven. As such, building stable data pipelines that are adapted to our clients' needs is something that we do on a regular basis (and enjoy doing!). We can help you improve or develop any part of your data pipeline by:

  • Storing your raw data in the cloud: using cloud solutions (such as AWS S3), we can store your data in the cloud, so that it is kept in a safe place and recoverable in case of a disaster, while minimizing storage costs.
  • Cleaning, moving and supplying you with data: using SQL and a data warehousing tool (such as Snowflake data warehouse), we can create the data pipelines that process, clean and make your data available to everyone you want, taking the necessary measures to ensure data privacy and control visibility.
  • Processing your data: we can implement more specialized algorithms if complex data transformations are required (using Python, for example), and integrate them with your data pipelines.
  • Testing your data pipelines: ensuring that the quality of the data pipelines is continuously and automatically tested.

If any of the issues mentioned in this blog post resonated with you, please contact us; we would love to help your company achieve its data goals!

Thanks for reading, and we hope to hear from you soon.