How a Pharma Company Applied Machine Learning to Patient Data

OCTOBER 25, 2018

The growing availability of real-world data has generated tremendous excitement in health care. By some estimates, health data volumes are increasing by 48% annually, and the last decade has seen a boom in the collection and aggregation of this information. Among these data, electronic health records (EHRs) offer one of the biggest opportunities to produce novel insights and disrupt the current understanding of patient care.

But analyzing EHR data requires tools that can process vast amounts of information in short order. Enter artificial intelligence and, more specifically, machine learning, which is already disrupting fields such as drug discovery and medical imaging but is only beginning to scratch the surface of what is possible in health care.

Let’s look at the case of a pharmaceutical company we worked with. It applied machine learning to EHR and other data to study the characteristics or triggers that presage the need for patients with a type of non-Hodgkin’s lymphoma to transition to a later line of therapy. The company wanted to better understand the clinical progression of the disease and what treatment best suits patients at each stage of it. The company’s story highlights three guiding principles other pharma companies can use to successfully deploy advanced analytics in their own organizations.

Generating meaningful hypotheses (and organizational buy-in) requires engaging the right stakeholders. While the impulse might be to rush straight to the data and begin analysis, a critical preliminary step is to lay out the key business questions that must be answered and generate hypotheses. Building a comprehensive list of addressable hypotheses will allow the analytics team to determine which types of data will be necessary to test and prove (or disprove) the hypotheses.

It’s important to pull in the perspectives of key stakeholders on functional teams across the business to ensure hypotheses incorporate the right expertise and provide the highest value to the company. This also helps build buy-in and trust in analytics.

In this case, the pharma company brought in teams from its brand, medical, and business intelligence groups to generate hypotheses about what predicts a patient's move from one therapy to another and what triggers those transitions. For example, in trying to hypothesize what drives fast or slow disease progression, the medical group contributed its clinical understanding of the disease, the brand team offered its detailed understanding of the company’s treatment offerings and how physicians use them, and the business intelligence team presented the analytical methods and datasets it had already used to shape the current understanding of treatment and disease courses.

The best data set might be a combination of data sets. It’s critical to identify a data set that is extensive and rich enough to properly train a machine learning algorithm. This is especially true in oncology, where a large number of variables — including age, gender, diagnosis history, medication and treatment history, laboratory values, and hospital encounters — collected on many patients over a sufficiently long historical stretch are needed for an effective analysis.

The pharma company’s analytics group realized that its internal data didn’t capture the variables likely to predict patient transitions in sufficient depth. The group therefore pursued a strategy in which it used internal and external data, combining an oncology-specific, integrated, structured EHR data set with some analysis replicated and validated on claims data.

All the data were stitched together and fed into an automated-feature-discovery (AFD) machine learning engine that allowed the company to test millions of hypotheses within hours. The engine explored every possible variation of the patient data to see if any variables had a statistically significant correlation with the transition to a later line of therapy. The insights gleaned from subject-matter experts helped ensure that the AFD results were clinically relevant. For example, when results indicated that an elevated liver function marker correlated with disease progression, medical officers confirmed that, although it wasn’t a factor they’d previously considered, it was clinically possible.
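
To make the screening step concrete, here is a minimal sketch in Python of how one might test many candidate variables against the transition outcome. It is not the company's AFD engine, whose internals are not described here. The sketch assumes binary features, uses Pearson's chi-square test on a 2x2 table, and adds a Benjamini-Hochberg correction, since testing many hypotheses at once inflates false positives. The feature names, the synthetic cohort, and the functions `chi2_2x2_pvalue`, `benjamini_hochberg`, and `screen_features` are all hypothetical.

```python
import math
from typing import Dict, List


def chi2_2x2_pvalue(a: int, b: int, c: int, d: int) -> float:
    """P-value of Pearson's chi-square test (1 degree of freedom,
    no continuity correction) for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if min(row1, row2, col1, col2) == 0:
        return 1.0  # degenerate table: treat as no evidence of association
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # Survival function of a chi-square variable with 1 degree of freedom.
    return math.erfc(math.sqrt(stat / 2.0))


def benjamini_hochberg(pvalues: Dict[str, float], alpha: float = 0.05) -> List[str]:
    """Return the feature names retained at false discovery rate alpha."""
    ranked = sorted(pvalues.items(), key=lambda kv: kv[1])
    m = len(ranked)
    cutoff = 0
    for rank, (_, p) in enumerate(ranked, start=1):
        if p <= alpha * rank / m:
            cutoff = rank  # keep the largest rank that passes its threshold
    return [name for name, _ in ranked[:cutoff]]


def screen_features(patients: List[dict], outcome_key: str = "transitioned",
                    alpha: float = 0.05) -> List[str]:
    """Test every binary feature against the outcome and keep the survivors."""
    features = [k for k in patients[0] if k != outcome_key]
    pvalues = {}
    for f in features:
        a = sum(1 for p in patients if p[f] and p[outcome_key])
        b = sum(1 for p in patients if p[f] and not p[outcome_key])
        c = sum(1 for p in patients if not p[f] and p[outcome_key])
        d = sum(1 for p in patients if not p[f] and not p[outcome_key])
        pvalues[f] = chi2_2x2_pvalue(a, b, c, d)
    return benjamini_hochberg(pvalues, alpha)


# Synthetic cohort (invented for illustration): one feature tied to the
# outcome, one feature unrelated to it.
cohort = [
    {
        "elevated_liver_marker": i < 100,
        "even_record_number": i % 2 == 0,
        "transitioned": i < 80 or i >= 180,
    }
    for i in range(200)
]

significant = screen_features(cohort)
print(significant)
```

A statistically significant association found this way is only a candidate. As the article describes, subject-matter experts still have to judge whether it is clinically plausible.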

Feedback loops (many times over) are the key to great results. An iterative test-and-learn process is critical to developing an accurate model. The pharma company’s analytics group tested more than 200 lab values, major chronic comorbidities, and elements of medical history. Machine learning helped identify and isolate the critical variable combinations that predict transitions. Models were validated and refined to avoid noise and reduce the number of variables.

After weeks of iteratively learning and validating, a model was successfully developed to predict progression from initial diagnosis to later lines of therapy. Specifically, machine learning was used to extract features and triggers from the patient’s treatment, lab, and medication history, and the validated features were used to score and rank patients by expected likelihood of transition.
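
The scoring-and-ranking step can be sketched as follows. This is an illustration under stated assumptions, not the company's model: it assumes a simple logistic score over binary features, and the coefficients in `WEIGHTS`, the `INTERCEPT`, the patient IDs, and the function names are all invented. Only the direction of the two effects (an elevated liver marker raising the score, maintenance therapy lowering it) follows the findings reported in this article.

```python
import math
from typing import Dict, List, Tuple

# Hypothetical coefficients, invented for illustration only.
WEIGHTS = {"elevated_liver_marker": 0.9, "on_maintenance_therapy": -0.3}
INTERCEPT = -1.0


def transition_score(features: Dict[str, int]) -> float:
    """Map a patient's binary features to a score between 0 and 1."""
    z = INTERCEPT + sum(w * float(features.get(name, 0))
                        for name, w in WEIGHTS.items())
    return 1.0 / (1.0 + math.exp(-z))


def rank_patients(patients: Dict[str, Dict[str, int]]) -> List[Tuple[str, float]]:
    """Return (patient_id, score) pairs, highest expected likelihood first."""
    scored = [(pid, transition_score(f)) for pid, f in patients.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical patients.
example_patients = {
    "P1": {"elevated_liver_marker": 1, "on_maintenance_therapy": 0},
    "P2": {"elevated_liver_marker": 0, "on_maintenance_therapy": 1},
    "P3": {"elevated_liver_marker": 0, "on_maintenance_therapy": 0},
}

ranking = rank_patients(example_patients)
for patient_id, score in ranking:
    print(patient_id, round(score, 3))
```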

The models uncovered many critical insights, including:

  • Abnormalities in select lab results, such as the elevated liver function marker, increased the likelihood of a patient transitioning to the next line of therapy by as much as 140% in some cases.
  • Patients on maintenance therapy were 20% less likely to transition to the next line of therapy.
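
Percentages like these are relative changes, so their practical meaning depends on the baseline rate. The short sketch below shows the arithmetic. The rates are hypothetical, chosen only so the results match the 140% and 20% figures above, and `relative_change_pct` is an illustrative helper, not part of any library.

```python
def relative_change_pct(rate_group: float, rate_reference: float) -> float:
    """Percent change in likelihood for a group relative to a reference group."""
    return (rate_group / rate_reference - 1.0) * 100.0


# Hypothetical transition rates, not taken from the study.
baseline_rate = 0.10          # patients without the abnormal lab result
abnormal_lab_rate = 0.24      # patients with the abnormal lab result
maintenance_rate = 0.08       # patients on maintenance therapy

lab_change = relative_change_pct(abnormal_lab_rate, baseline_rate)
maintenance_change = relative_change_pct(maintenance_rate, baseline_rate)

# A 140% increase means the rate is 2.4 times the baseline;
# a 20% decrease means it is 0.8 times the baseline.
print(round(lab_change), round(maintenance_change))
```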


With the right data, organizational processes, and clinical knowledge applied, machine learning and artificial intelligence can make a significant difference in pharma and health care today, despite some limitations that still exist. It can, for example, be difficult to understand why some complex models reach their conclusions, and labeling the massive datasets required by the most data-hungry models can be painfully laborious.

However, limitations like these are being addressed. Techniques like LIME (local interpretable model-agnostic explanations) help show model reasoning, and efforts are underway to use machine learning itself to label datasets. As limitations lift, the opportunities for pharma and health care will greatly expand. Those companies that have already begun leveraging machine learning will have the established base of infrastructure and processes needed to take advantage of these opportunities.

Rafiq Ajani is a partner in McKinsey’s office in Waltham, Massachusetts, and leads the firm’s North America Knowledge Center.

Arnaub Chatterjee is a senior expert in McKinsey’s North America Knowledge Center and a teaching associate at Harvard Medical School.

Aniketh Talwai is an expert in McKinsey’s North America Knowledge Center.

Jack Zhang is an expert in McKinsey’s North America Knowledge Center.