The surprising story behind the Apple Watch’s ECG ability

Deep Medicine: How Artificial Intelligence Can Make Healthcare Human Again
by Eric Topol


The Apple Watch produced a seismic shift in the public’s acceptance of biometric monitoring. Sure, we’ve had step counters, heart rate monitors and sleep monitors for years, but the Apple Watch made wearing one hip and cool. In Deep Medicine, author Eric Topol examines how recent advances in AI and machine learning techniques can be leveraged to bring (at least the American) healthcare system out of its current dark age and create a more efficient, more effective system that better serves both its doctors and its patients. In the excerpt below, Topol examines the efforts by startup AliveCor and the Mayo Clinic to cram an ECG’s functionality into a wristwatch-sized device without (and this is the important part) generating potentially lethal false positive results.

In February 2016, a small start-up company called AliveCor hired Frank Petterson and Simon Prakash, two Googlers with AI expertise, to transform their business of smartphone electrocardiograms (ECG). The company was struggling. They had developed the first smartphone app capable of single-lead ECG, and, by 2015, they were even able to display the ECG on an Apple Watch. The app had a “wow” factor but otherwise seemed to be of little practical value. The company faced an existential threat, despite extensive venture capital investment from Khosla Ventures and others.

But Petterson, Prakash, and their team of only three other AI talents had an ambitious, twofold mission. One objective was to develop an algorithm that would passively detect a heart-rhythm disorder, the other to determine the level of potassium in the blood, simply from the ECG captured by the watch. It wasn’t a crazy idea, given whom AliveCor had just hired. Petterson, AliveCor’s VP of engineering, is tall, blue-eyed, dark-haired with frontal balding, and, like most engineers, a bit introverted. At Google, he headed up YouTube Live, Gaming, and led engineering for Hangouts. He previously had won an Academy Award and nine feature film credits for his design and development software for movies including the Transformers, Star Trek, the Harry Potter series, and Avatar. Prakash, the VP of products and design, is not as tall as Petterson, without an Academy Award, but is especially handsome, dark-haired, and brown-eyed, looking like he’s right out of a Hollywood movie set. His youthful appearance doesn’t jibe with a track record of twenty years of experience in product development, which included leading the Google Glass design project. He also worked at Apple for nine years, directly involved in the development of the first iPhone and iPad. That background might, in retrospect, be considered ironic.

Meanwhile, a team of more than twenty engineers and computer scientists at Apple, located just six miles away, had its sights set on diagnosing atrial fibrillation via their watch. They benefited from Apple’s seemingly unlimited resources and strong corporate support: the company’s chief operating officer, Jeff Williams, responsible for the Apple Watch development and release, had articulated a strong vision for it as an essential medical device of the future. There wasn’t any question about the importance and priority of this project when I had the chance to visit Apple as an advisor and review its progress. It seemed their goal would be a shoo-in.

The Apple goal certainly seemed more attainable on the face of it. Determining the level of potassium in the blood might not be something you would expect to be possible with a watch. But the era of deep learning, as we’ll review, has upended a lot of expectations.

The idea to do this didn’t come from AliveCor. At the Mayo Clinic, Paul Friedman and his colleagues were busy studying details of a part of an ECG known as the T wave and how it correlated with blood levels of potassium. In medicine, we’ve known for decades that tall T waves could signify high potassium levels and that a potassium level over 5.0 mEq/L is dangerous. People with kidney disease are at risk for developing these levels of potassium. The higher the blood level over 5, the greater the risk of sudden death due to heart arrhythmias, especially for patients with advanced kidney disease or those who undergo hemodialysis. Friedman’s findings were based on correlating the ECG and potassium levels in just twelve patients before, during, and after dialysis. They published their findings in an obscure heart electrophysiology journal in 2015; the paper’s subtitle was “Proof of Concept for a Novel ‘Blood-Less’ Blood Test.” They reported that with potassium level changes even in the normal range (3.5–5.0), differences as low as 0.2 mEq/L could be machine detected by the ECG, but not by a human-eye review of the tracing.

Friedman and his team were keen to pursue this idea with the new way of obtaining ECGs, via smartphones or smartwatches, and incorporate AI tools. Instead of approaching big companies such as Medtronic or Apple, they chose to approach AliveCor’s CEO, Vic Gundotra, in February 2016, just before Petterson and Prakash had joined. Gundotra is another former Google engineer who told me that he had joined AliveCor because he believed there were many signals waiting to be found in an ECG. Eventually, by year’s end, the Mayo Clinic and AliveCor ratified an agreement to move forward together.

The Mayo Clinic has a remarkable number of patients, which gave AliveCor a training set of more than 1.3 million twelve-lead ECGs gathered from more than twenty years of patients, along with corresponding blood potassium levels obtained within one to three hours of the ECG, for developing an algorithm. But when these data were analyzed it was a bust.

Here, the “ground truths,” the actual potassium (K+) blood levels, are plotted on the x-axis, while the algorithm-predicted values are on the y-axis. They’re all over the place. A true K+ value of nearly 7 was predicted to be 4.5; the error rate was unacceptable. The AliveCor team, having made multiple trips to Rochester, Minnesota, to work with the big dataset, many in the dead of winter, sank into what Gundotra called “three months in the valley of despair” as they tried to figure out what had gone wrong.

Petterson and Prakash and their team dissected the data. At first, they thought it was likely a postmortem autopsy, until they had an idea for a potential comeback. The Mayo Clinic had filtered its massive ECG database to provide only outpatients, which skewed the sample to healthier individuals and, as you would expect for people walking around, a fairly limited number with high potassium levels. What if all the patients who were hospitalized at the time were analyzed? Not only would this yield a higher proportion of people with high potassium levels, but the blood levels would have been taken closer to the time of the ECG.
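The sampling problem described above can be illustrated with a minimal sketch. The records, the `setting` field, and the helper name below are all hypothetical and made up for illustration; only the 5.0 mEq/L danger level comes from the text. The point is simply that restricting a cohort to outpatients can shrink the share of high-potassium examples an algorithm gets to learn from.

```python
from dataclasses import dataclass

HIGH_K_THRESHOLD = 5.0  # mEq/L, the danger level cited in the text


@dataclass
class Record:
    setting: str      # "outpatient" or "inpatient" (hypothetical field)
    potassium: float  # blood potassium in mEq/L (made-up values)


def high_k_fraction(records):
    """Fraction of records whose potassium exceeds the danger level."""
    if not records:
        return 0.0
    high = sum(1 for r in records if r.potassium > HIGH_K_THRESHOLD)
    return high / len(records)


# Entirely invented toy cohort, not Mayo Clinic data.
records = [
    Record("outpatient", 4.1), Record("outpatient", 4.4),
    Record("outpatient", 3.9), Record("outpatient", 5.2),
    Record("inpatient", 5.6), Record("inpatient", 4.8),
    Record("inpatient", 6.3), Record("inpatient", 5.9),
]

outpatients_only = [r for r in records if r.setting == "outpatient"]

print(high_k_fraction(outpatients_only))  # 0.25
print(high_k_fraction(records))           # 0.5
```

In this toy cohort, filtering to outpatients halves the proportion of high-potassium cases, which is the kind of skew the team suspected in the first dataset.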

They also thought that maybe not all the key information was in the T wave, as Friedman’s team had thought. So why not analyze the whole ECG signal and override the human assumption that all the useful information would have been encoded in the T wave? They asked the Mayo Clinic to come up with a better, broader dataset to work with. And Mayo came through. Now their algorithm could be tested with 2.8 million ECGs and 4.28 million potassium levels, incorporating the whole ECG pattern instead of just the T wave. And what happened?


The receiver operating characteristic (ROC) curves of true versus false positive rates, with examples of worthless, good, and excellent plotted. Source: Wikipedia (2018)

Eureka! The error rate dropped to 1 percent, and the area under the receiver operating characteristic (ROC) curve, a measure of predictive accuracy where 1.0 is perfect, rose from 0.63 at the time of the scatterplot to 0.86. We’ll be referring to ROC curves a lot throughout the book, since they are considered one of the best ways to show and quantify accuracy by plotting the true positive rate against the false positive rate (Figure 4.2). (I underscore “one of” because the method has been sharply criticized and there are ongoing efforts to develop better performance metrics.) The value denoting accuracy is the area under the curve, whereby 1.0 is perfect and 0.50, the diagonal line, is “worthless,” the equivalent of a coin toss. The area of 0.63 that AliveCor initially obtained is deemed poor. Generally, 0.80–0.90 is considered good, 0.70–0.80 fair. They further prospectively validated their algorithm in forty dialysis patients with simultaneous ECGs and potassium levels. AliveCor now had the data and algorithm to present to the FDA to get clearance to market the algorithm for detecting high potassium levels on a smartwatch.
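For readers who want to see what the area under the curve measures, here is a minimal sketch in Python. It is not AliveCor’s method; it uses a standard equivalent definition of the area: the probability that a randomly chosen positive case gets a higher score than a randomly chosen negative one, with ties counted as half. The function name, labels, and scores are made up for illustration.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive case receives
    a higher score than a randomly chosen negative case (ties count half).
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up labels (1 = high potassium) and made-up model scores.
labels = [0, 0, 1, 1]
perfect = roc_auc(labels, [0.1, 0.2, 0.8, 0.9])    # every positive outranks every negative
coin_toss = roc_auc(labels, [0.5, 0.5, 0.5, 0.5])  # scores carry no information
partial = roc_auc(labels, [0.1, 0.4, 0.35, 0.8])   # one positive ranked below a negative

print(perfect, coin_toss, partial)  # 1.0 0.5 0.75
```

The three toy cases line up with the scale in the text: 1.0 is perfect, 0.5 is the coin toss, and values in between reflect how often the scores rank a true case above a non-case.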

There were vital lessons in AliveCor’s experience for anyone seeking to apply AI to medicine. When I asked Petterson what he learned, he said, “Don’t filter the data too early. . . . I was at Google. Vic was at Google. Simon was at Google. We have learned this lesson before, but sometimes you have to learn the lesson multiple times. Machine learning tends to work best if you give it enough data and the rawest data you can. Because if you have enough of it, then it should be able to filter out the noise by itself.”

“In medicine, you tend not to have enough. This is not search queries. There’s not a billion of them coming in every minute. . . . When you have a dataset of a million entries in medicine, it’s a giant dataset. And so, the order of magnitude that Google works at is not just a thousand times bigger but a million times bigger.” Filtering the data so that a person can manually annotate it is a terrible idea. Most AI applications in medicine don’t recognize that, but, he told me, “That’s kind of a seismic shift that I think needs to come to this industry.”

Excerpted from Deep Medicine: How Artificial Intelligence Can Make Healthcare Human Again. Copyright © 2019 by Eric Topol. Available from Basic Books.
