Translating AI for health care: Myths and misconceptions

Lily Peng, MD, PhD, a physician-scientist and product manager at Google, discusses myths and misconceptions that surround translating AI for health care.


Visionaries standing at the intersection of deep learning and health care see infinite applications to enhance and speed the delivery of high-quality health care. Yet it takes more than a brilliant idea, a mountain of data and the power of machine learning to make tangible advances. A yawning chasm separates vision from product translation, and the pace of fully realized progress is slow.

Lily Peng, MD, PhD, a physician-scientist and product manager at Google, directs a team focused on pathbreaking approaches aimed at boosting the availability and accuracy of health care through deep learning. She discusses her ideas in greater detail in Harvard Medical School's executive education program, Designing and Implementing AI Solutions for Health Care.

Edited and condensed for clarity

AI is widely viewed as a data-hungry technology. As teams create AI models that harness deep learning and work toward translation, is more data inevitably better?

Deep learning is a really interesting tool that does rely on a large amount of data to achieve good performance and accuracy. This allows you to be a little lazy as a machine learning scientist: you don’t have to find rules, you don’t have to specify rules, and you can actually have the computer system learn from examples.

For example, we’re not specifying “if you find this kind of nodule, and it’s spiculated, and it looks like this, and it’s larger than this size, then that’s cancer.” We just give the system lots of examples: “here are cases of cancer” and “here are cases of not cancer” and have the system figure this out. You only have to give the image and the label. But how we get labels can have a really, really big impact on the performance of models. The right labels can enable new capabilities and discoveries.

Three major buckets describe the way we go about getting labels. First, you might use the single read that came with the case, sometimes using an ICD-9 or ICD-10 code as a proxy for that read. Second, you might use multiple reads from multiple qualified clinicians, and then use majority voting or adjudication — not unlike a tumor board, where a group of people looks at the evidence and arrives at a final diagnosis.

The third bucket requires additional data from the case to establish a better ground truth. In breast cancer, for example, that might be one pathologist’s reading of a biopsy, or multiple pathologists’ readings — basically, a label requiring additional analysis of data derived from a health encounter or event separate from the original diagnosis. Given the same amount of data, we think the second approach is better than the first, and the third is better than the first two, because at the end of the day, patients care about the final diagnosis, the treatment they received and the survival outcome.

As you go down each of these buckets, the labels become harder and harder to get. We have to be strategic in balancing the amount of data and types of labels.
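The second bucket’s majority-vote step can be sketched in a few lines. This is a minimal illustration, not a description of Google’s actual pipeline; the label values and the tie-handling choice (flagging ties for adjudication, as a tumor board would) are assumptions for the example.

```python
from collections import Counter

def majority_vote(reads):
    """Aggregate multiple clinician reads into one label.

    Returns the most common label; a tie returns None to flag the
    case for adjudication (hypothetical convention for this sketch).
    """
    top_two = Counter(reads).most_common(2)
    if len(top_two) > 1 and top_two[0][1] == top_two[1][1]:
        return None  # tie: escalate to adjudication
    return top_two[0][0]

# Hypothetical reads from three independent graders
print(majority_vote(["cancer", "not cancer", "cancer"]))  # -> cancer
print(majority_vote(["cancer", "not cancer"]))            # -> None
```

In practice the disagreement rate itself is informative: cases that split the graders are often exactly the hard cases worth sending to adjudication.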


Many people believe an accurate model is all you need to create a useful product. Where do you stand on this?

I would argue that it’s not about accuracy, but about usability in a real-world setting. Building an ML model is just one step in preventing blindness or detecting cancer. Let’s take the example of a model designed to screen for diabetic retinopathy (DR), a complication of diabetes that can lead to vision loss if not caught early. The model needs to be incorporated into a product that is usable by doctors and nurses. It’s critical to think about how AI can fit into a clinical workflow. And it’s deceptively difficult, because if you think about an ophthalmologist’s workflow versus a primary care physician’s workflow, they could be very different. We’re starting to gather more evidence of how to do this.

An example is work in Thailand with Dr. Paisan Ruamviboonsuk from Rajavithi Hospital, a large hospital where he runs the national screening program for DR. Working with these partners, we first conducted retrospective validation studies to make sure that the model itself is generalizable. It is. Then in late 2018, we launched our prospective study to evaluate the performance and feasibility of deploying AI into existing diabetic retinopathy screening clinics across the country. One interesting tidbit is that even within the same training program, how training happens in one clinic or region can be totally different from how it happens in screening clinics elsewhere.

In 2020, we closed recruitment of about 7,600 participants, all of whom were screened using AI across nine different sites. We’re currently analyzing data and will have more to share soon. But we learned that the human-centered approach is really useful in building these products. We worked with user researchers and human-computer interaction (HCI) folks within the screening program. We mapped every step of the patient’s journey and the physician or ophthalmic nurse’s journey. We looked at what they were doing each time and how long it took from the time the patient entered the clinic to when they were discharged with follow-up. This really helped us identify potential inefficiencies and bottlenecks, and surface procedures that patients disliked enough to avoid in clinics, or that nurses were very uncomfortable performing.
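The journey-mapping exercise described above boils down to timing the gaps between visit events to find bottlenecks. A minimal sketch, with entirely hypothetical step names and timestamps:

```python
from datetime import datetime

# Hypothetical timestamped events from one patient's screening visit
visit = [
    ("check-in",       "09:00"),
    ("visual acuity",  "09:25"),
    ("fundus photo",   "09:40"),
    ("grading result", "10:55"),
    ("discharge",      "11:05"),
]

def step_durations(events):
    """Minutes spent waiting for each step after check-in."""
    times = [datetime.strptime(t, "%H:%M") for _, t in events]
    return [
        (events[i + 1][0], int((times[i + 1] - times[i]).total_seconds() // 60))
        for i in range(len(times) - 1)
    ]

for step, minutes in step_durations(visit):
    print(f"{step}: {minutes} min")
```

With these made-up numbers, the 75-minute wait for the grading result dominates the visit, which is the kind of gap an AI grader at the point of care could shrink.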

Will a great product always deliver clinical impact?

I think most people would probably agree that what we really need to evaluate is the impact on the system as a whole. We can have the best product in the world, but patients have to get access to it. One of the reasons why patients don’t show up for screening in a specialty eye hospital doesn’t really have anything to do with how good the product is, right? For many people in India, the trek to the hospital can take a day. People may be lining up for care outside one of our partner hospitals. That means finding someone to take care of their kids, coping with lost wages. This significantly impacts lots of folks in this space. A lot of times, what we want to do is put screening, AI-enabled or not, closer to where patients live, so they don’t have to choose between getting care for themselves and providing for their loved ones.

— Francesca Coltrera

Continue the conversation by joining Lily Peng at the upcoming Designing and Implementing AI Solutions for Health Care program, or connect with us on Twitter @HMS_ExecEd.