What is Natural Language Processing and Topic Modeling?
Over the spring and summer, I published a series of articles on extracting quality information from FDA enforcement initiatives, such as warning letters, recalls and inspections. But of course, FDA enforcement actions aren’t the only potential sources of quality data the FDA maintains. The FDA now has a massive medical device report (or “MDR”) dataset that can be mined for quality data. Medical device companies can, indeed, learn from the experiences of their competitors about the kinds of problems that can go wrong with medical devices.
The problem, of course, is that the interesting data in MDRs is in what a data scientist would call unstructured data, in this case English text describing a product problem, where the information or insights cannot be easily extracted given the considerable volume of the reports. In calendar year 2021, for example, the FDA received nearly 2 million MDRs. It is simply not possible for a human to read them all.
This is where a form of machine learning, natural language processing, or more specifically topic modeling, comes into play. I used topic modeling last November for an article on the main trends over a decade in MDRs. Now I want to show how the same topic modeling can be used to find more specific experiences with specific types of medical devices to inform quality improvement.
As I explained last November, topic modeling requires the exercise of judgment to find what is relevant to the business. In this article, I choose once again to use Latent Dirichlet Allocation, a form of unsupervised learning, but this time implemented via the Python scikit-learn (sklearn) library. In unsupervised learning, a data scientist must make decisions about what information to include in and what to exclude from the data set to find the most significant topics.
For example, when preparing the data set, I chose to exclude the most common MDR, that of the dental DZE product code “Implant, Endosseous, Root-Form”. As you may know, this particular product code is responsible for a huge number of MDRs, around half a million in calendar year 2021. Including this data skews the other data and, frankly, doesn’t add much information because most of these MDRs are very similar.
As in most data science exercises, the goal is to find the signal among the noise. I also removed words that were too common in MDRs to eliminate noise, which helped focus the signal better. Noise words included ‘MDR’, ‘investigation’, ‘problem’, ‘test’, ‘information’, ‘conclusion’, ‘product’, ‘cause’, ‘(B)(4)’ (which is the language that the FDA inserts wherever a document is redacted for confidential information), ‘medwatch’ and ‘CFR’.
By running a few tests on the topics, I settled on 20 as the number of topics for these purposes. Twenty topics seemed to capture the most significant and common topics covered in all the remaining MDRs. Then, although a given MDR might cover multiple topics, my algorithm went through each MDR and classified it according to the main topic covered.
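As a rough sketch of that last step, the snippet below fits a 20-topic Latent Dirichlet Allocation model with scikit-learn and labels each document with its highest-weighted topic. The function name `assign_main_topics`, the sample narratives, the use of built-in English stop words, and the fixed random seed are assumptions made for illustration; the article does not show the actual code.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer


def assign_main_topics(texts, n_topics=20, random_state=0):
    """Fit LDA and return (doc_topic, main_topic) for the given texts.

    doc_topic has one row per document and one column per topic;
    main_topic holds the index of the largest entry in each row.
    """
    vectorizer = CountVectorizer(stop_words="english")
    counts = vectorizer.fit_transform(texts)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=random_state)
    doc_topic = lda.fit_transform(counts)  # each row is a topic distribution
    main_topic = np.argmax(doc_topic, axis=1)
    return doc_topic, main_topic


# Hypothetical narratives standing in for real MDR text.
sample_texts = [
    "bent cannula reported with hyperglycemia",
    "pump delivered extra flow and patient had low oxygen",
    "lead fracture required surgery to replace the implant",
    "filter penetrated the vena cava wall",
]

doc_topic, main_topic = assign_main_topics(sample_texts, n_topics=20)
```

Taking only the largest entry discards the rest of each row, which is why an MDR that touches several topics still ends up under a single label.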
My goal in this analysis was to find a particularly good visual representation of a topic, and then a good visual representation of exactly what types of products (by product code) the MDRs dealing with that topic were filed for. Below, I share a topic that was relatively focused on a given product, and another topic that cuts across many different product categories. I like to use word clouds to present the topics themselves, as they give a good idea of the most important words for the topic. The bigger the word, the more mathematically important it is for the topic.
To really understand the topic, it’s important to look at some MDR examples that are summarized by this topic. Here is one:
ACCORDING TO THE COMPLAINANT, THE DEVICE WILL NOT BE REFERRED FOR INVESTIGATION. WE ARE UNABLE TO CONFIRM THE BENT CANNULA OR DETERMINE IF IT MAY HAVE CONTRIBUTED TO THE REPORTED HYPERGLYCEMIA. NO BATCH RELEASE DOCUMENT WAS REVIEWED BECAUSE THE PRODUCT LOT NUMBER WAS NOT PROVIDED.
Here are the product codes where this topic was found in calendar year 2021:
Although this cannula topic occurs elsewhere, it is mainly found in insulin pumps.
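A breakdown like this amounts to counting product codes among the MDRs assigned to one topic. A minimal sketch, assuming each MDR already carries a product code and a main-topic label; the function name, the placeholder codes "XXA" and "XXB", and the sample labels are hypothetical.

```python
from collections import Counter


def product_codes_for_topic(product_codes, main_topics, topic):
    """Count how often each product code appears among MDRs with the given main topic."""
    return Counter(
        code for code, assigned in zip(product_codes, main_topics) if assigned == topic
    )


# Hypothetical data: one product code and one main-topic label per MDR.
product_codes = ["XXA", "XXA", "XXB", "XXA", "XXB"]
main_topics = [3, 3, 3, 7, 7]

topic_3_counts = product_codes_for_topic(product_codes, main_topics, topic=3)
ranked = topic_3_counts.most_common()  # most frequent product code first
```

Sorting with `most_common()` puts the dominant product code first, which is what a bar chart of the results would show.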
Now let’s look at a topic that cuts across multiple product codes.
Again, to understand the topic, it helps to look at an example MDR.
INFORMATION HAS BEEN RECEIVED THAT THE CADD LEGACY PCA PUMP IS MALFUNCTIONING. THE PUMP CAUSED EXTRA FLOW AND THE PATIENT HAD BREATHING DIFFICULTIES AND HAD LOW OXYGEN LEVELS, WHICH LED TO A VISIT TO THE EMERGENCY ROOM. ADDITIONAL INFORMATION IS COLLECTED.
Note that not all of a topic’s words appear in every MDR where that topic is assigned as the main topic. In the MDR above, the word “lead”, for example, does not appear. And sometimes when the word “lead” does appear, it is not the noun meaning an electrical wire but the verb, as in something “leads” to something else.
I want to emphasize this point. Topic modeling is a high-level approximation. It shouldn’t be taken too literally to mean that each of the MDRs under a topic means the same thing. They don’t. Indeed, in my review of the actual MDRs under this topic, they represented quite a variety. Here are a few others that make this point:
CONCOMITANT MEDICAL PRODUCTS: C6TR01 CRTP, IMPLEMENTED: (B)(6) 2015. IF INFORMATION IS PROVIDED IN THE FUTURE, AN ADDITIONAL REPORT WILL BE ISSUED.
CUSTOMER (PERSON) NOT PROVIDED, INFORMATION PROVIDED BY (B)(6). PROFESSION: NON-HEALTH PROFESSIONAL. THE FILTER INTERACTS WITH THE IVC WALL, FOR EXAMPLE PENETRATION/PERFORATION/EMBEDDING. THIS CAN BE EITHER SYMPTOMATIC OR ASYMPTOMATIC. POTENTIAL CAUSES MAY INCLUDE INCORRECT DEPLOYMENT; AND (OR) EXCESSIVE FORCE OR HANDLING NEAR A FILTER IN SITU (FOR EXAMPLE, A SURGICAL OR ENDOVASCULAR PROCEDURE NEAR A FILTER). POTENTIAL ADVERSE EVENTS THAT CAN OCCUR INCLUDE BUT ARE NOT LIMITED TO THE FOLLOWING: TRAUMA TO ADJACENT STRUCTURES, VASCULAR TRAUMA, PERFORATION OF THE VENA CAVA, PENETRATION OF THE VENA CAVA. THIS REPORT INCLUDES INFORMATION KNOWN AT THIS TIME. A FOLLOW-UP REPORT WILL BE SUBMITTED IF FURTHER INFORMATION IS AVAILABLE.
The fact that these topics are higher level and probably not the way a human would construct them if a human read the millions of MDRs does not mean they are worthless. It simply means that they must be appreciated in the proper context. There are ideas to be gleaned, but those ideas may not be as literal as if a human had designed the topics.
Here are the product codes in which topic 15 was found.
Notice that I deliberately kept the analysis high level to illustrate how this is done. If I worked for a particular company, and on a particular product code or set of product codes, the analysis would be much more precise. But I wanted to show how it works without embarrassing a company or a set of companies as to the quality of their products.
The purpose of this is to illustrate how quality information can be gleaned from the text portion of MDRs. Topic modeling will capture very specific words that are unique to a given product type and summarize the number of MDRs where that topic is discussed in that product code. This can be useful for companies that manufacture medical devices that produce, industry-wide, many MDRs. Rather than reading every MDR filed by your competitors, modeling the topics will give you an idea of what the topics are in that product code.
Topic modeling will also show where topics are common across multiple product categories. The usefulness of finding these topics is to then examine how the products in the other categories solve the quality problem. Companies that manufacture other products that suffer from the same problem may have offered helpful solutions that you can import for your own products.
Granted, topic 15 seems to focus on where implants may need surgery as a result of malfunctioning. Some of the products in this product set may have defects involving electronic cables, while others may not. It is certainly possible to sort MDRs by including specific words such as “lead”. Additionally, given the natural language processing tools available, we can sort MDRs where the word “lead” is used as a noun versus those where it is used as a verb. For this, we use the Python library called spaCy.
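The noun-versus-verb sort could be sketched along the following lines. This is an illustration under assumptions rather than the code used here: the function names are hypothetical, the pipeline name "en_core_web_sm" is simply one commonly used English spaCy model that must be installed separately, and lowercasing the all-caps narratives before tagging is an assumption that may or may not improve tagging on real MDR text.

```python
def load_pipeline(model_name="en_core_web_sm"):
    """Load a spaCy pipeline (assumes spaCy and the named model are installed)."""
    import spacy  # imported lazily so the sorting helper can be used without it

    return spacy.load(model_name)


def split_by_lead_usage(nlp, texts):
    """Group texts by whether "lead" is tagged as a noun or as a verb.

    nlp is any callable that returns tokens exposing .lemma_ and .pos_,
    such as a loaded spaCy pipeline. A text can land in both groups.
    """
    groups = {"NOUN": [], "VERB": []}
    for text in texts:
        doc = nlp(text.lower())  # assumption: lowercase the all-caps narrative first
        tags = {token.pos_ for token in doc if token.lemma_.lower() == "lead"}
        for tag in ("NOUN", "VERB"):
            if tag in tags:
                groups[tag].append(text)
    return groups
```

With a model installed, usage would be something like `groups = split_by_lead_usage(load_pipeline(), narratives)`, where `narratives` is a list of MDR text strings; `groups["NOUN"]` would then hold the reports that are more likely to concern an electrical lead.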
Natural language processing offers a variety of tools that help us glean information from large databases, in this case MDRs, which can number in the millions in any given year. This information allows companies to learn from the experiences of others, both competitors who manufacture the same product and other medical device companies who manufacture products with similar functionality or at least similar problems. This post is just the tip of the iceberg. In the context of a particular product, I have tried to explain how it is possible to access increasingly specific information that becomes actionable in a quality improvement initiative.
©2022 Epstein Becker & Green, PC All rights reserved. National Law Review, Volume XII, Number 276