Extracting Phenotype Details from Electronic Medical Records

May 5, 2022
Despite the progress we have made in science—and especially in medicine—over the last few centuries, there remain things we know absolutely nothing about. In fact, the things we have learned thus far are nothing compared to the things we are yet to learn.

As one famous philosopher once put it:

“Our knowledge can only be finite, while our ignorance must necessarily be infinite.”

— Karl Popper, Austrian-British philosopher, academic.

This also extends to our understanding of disease or its variants. While we might be able to figure out symptoms and identify some of the more common diseases such as chickenpox and AIDS, we are still struggling with more obscure ones—especially with genetic diseases, where symptoms might not manifest clearly and can easily go unnoticed.

Buckle your seatbelt and get ready as we are about to go deep into the rabbit hole of modern medical research: how researchers make use of electronic health records (EHRs), and how the data can help us achieve breakthroughs in modern medicine.

What Is An Electronic Health Record?

First of all, we need to understand what EHRs are, what their use in modern medicine is, and then we can move on to the main topic of today—extracting phenotype details from them.

As the term suggests, EHRs (sometimes referred to as electronic medical records, or EMRs) are digital health records that store a patient’s information and known conditions. Below is a sample electronic health record for illustration:

Image by Proxet, Electronic Health Record

As you can see from the example above, EHRs contain vital information about the patient, such as their vital signs during their previous visits, a description of their conditions (both structured and unstructured—we will get to that later), and in general, their past medical history.

If we are to explain the relationship between paper forms and EHRs, they are essentially the same thing—but storing them digitally means that they are easily accessible irrespective of locations, and everything is centralized within a single system for easy access and analysis, which offers a great opportunity for future medical research.

Imagine the paperwork generated by a 50-year-old patient who has visited the hospital twice a year, multiplied across millions of other patients just like him. Modern EHRs avoid many of the issues that arise with old paper-based systems.

“We have seen the shift of old paper-based systems to digital ones in recent years, even in smaller organizations. EHRs significantly improve clinical workflows during the pandemic, and in turn cutting unnecessary costs.”

Vlad Medvedovsky, CEO at Proxet (ex – Rails Reactor) – a custom software development company.

How Is Information Properly Inserted Into A Medical Record?

To insert information properly into a medical record, there are a few rules everyone should abide by:

  • Medical records must be complete, legible, and timely.
  • All information in records must be objective, and each entry must be initialed and dated.
  • Errors should never be erased or covered with correction fluid. Instead, a single line should be drawn through an error so that the error is still readable.

A robust set of data is the foundation of easy data extraction and processing. With the shift of medical institutions toward digital solutions, the global EHR market is also expected to grow significantly in the coming years. According to one study, it is expected to grow six percent per year through 2025.

Now, let’s take a brief look at the comparison of EHR systems, and see how they differ from one another.

Comparison Of Different EHR Systems

Luckily, if you are looking for an EHR solution, there are plenty of options on the market. While they are not the focus of this article, let’s go through some of them and look at their differences.


EpicCare

EpicCare is one of the more complex EHR systems on the market, suitable for larger organizations with a high number of patients and intricate clinical flows.

While it is a comprehensive system, the user interface can be overwhelming because of the amount of data displayed to users. As with other complex systems, it comes with a rather steep learning curve.


eClinicalWorks

eClinicalWorks is another EHR system on the market that helps medical institutions streamline and automate their clinical workflows.

Image by Proxet, eClinicalWorks

While eClinicalWorks’ interface is less complex than EpicCare’s and offers more customization, it still has a rather steep learning curve, as its numerous functions can be difficult to navigate for inexperienced users.

PT Practice Pro EHR

An EHR for physical therapy requires different functionality from a general-purpose EHR system. While many functions overlap, such as scheduling and billing, physical therapy EHRs focus on automating the workflow of physical therapists specifically.

Image by Proxet, Practice Pro for physical therapist

Among those available on the market today is Practice Pro EHR, an EMR system developed specifically for physical therapy. While it might not be as comprehensive as generic EHR systems, it is easier to use because it lacks the clutter present in some of them.

Choosing the right EHR system is a topic of its own, which we have covered in our previous article. Apart from choosing an existing EHR system in the market, it is also possible to have one developed by companies that provide software development services, tailored to your needs.

With that out of the way, let’s get back on track—what’s the point of extracting data from EHRs, what exactly are phenotype details, and how can we make use of the data?

What Is The Purpose Of Extracting Data?

The short answer is to use the data to identify conditions and predict outcomes, much as we would with any application of artificial intelligence (AI) and machine learning (ML). In reality, of course, the process involves many steps.

Phenotype Details

If you clicked on the article, we assume that you already know the meaning of phenotype details, but in case you don’t, this is what the term means, according to Merriam-Webster:

“The observable characteristics or traits of an organism that are produced by the interaction of the genotype and the environment: the physical expression of one or more genes.”

In other words, the term describes observable traits, whether they arise from certain genes, the environment, or both. While the genotype is the genetic composition, the phenotype is its physical manifestation. For example, different gene combinations (genotype) can produce different eye colors, but the phenotype is the actual eye color we observe.

Structured Vs. Unstructured

As with other applications that involve data, there is a difference between structured and unstructured data. In the case of EHRs, there are occasions when the data are already tagged and structured, such as the following:

  • Age
  • Gender
  • Nationality
  • Ethnicity

However, EHRs also contain unstructured data that can provide valuable insights into the patient’s condition, namely the notes written by the doctor during diagnosis. Imagine a note in an EHR that looks something like this:

“Smoker. 30 years of tobacco use. Developed coughing three weeks ago. Shortness of breath.”

Details like these might not be tagged and remain unstructured, yet they provide valuable insights into the patient’s condition and, alongside other details, can help identify the disease the patient is suffering from. Hence the need to extract unstructured data.
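As a rough illustration of what "extracting" means here, the sketch below pulls a few fields out of the sample note using plain regular expressions from Python's standard library. The field names, the tiny symptom vocabulary, and the patterns are all assumptions made for this example; real clinical NLP relies on far more robust methods than hand-written patterns.

```python
import re

# The sample note from above.
NOTE = (
    "Smoker. 30 years of tobacco use. "
    "Developed coughing three weeks ago. Shortness of breath."
)

# Hypothetical symptom vocabulary for this example only; a real system would
# draw on a curated medical terminology instead of a hand-written list.
SYMPTOM_PATTERNS = {
    "cough": re.compile(r"\bcough(?:ing|s)?\b", re.IGNORECASE),
    "shortness of breath": re.compile(r"\bshortness of breath\b", re.IGNORECASE),
}


def extract_features(note: str) -> dict:
    """Turn a free-text note into a small dictionary of structured fields."""
    smoker = re.search(r"\bsmoker\b", note, re.IGNORECASE) is not None
    years = re.search(r"(\d+)\s+years? of tobacco use", note, re.IGNORECASE)
    symptoms = [
        name for name, pattern in SYMPTOM_PATTERNS.items() if pattern.search(note)
    ]
    return {
        "smoker": smoker,
        "tobacco_years": int(years.group(1)) if years else None,
        "symptoms": symptoms,
    }


features = extract_features(NOTE)
print(features)
```

Patterns this simple ignore context such as negation (a note reading "non-smoker" would still be flagged as a smoker), which is one reason dedicated NLP tooling is needed for real notes.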

Extract Information From Unstructured Text

We are going to use Dr. Katherine Liao’s research as a basis for our examples here.

A brief background of Dr. Liao’s research: In one of her recent studies, she, along with her team, explored the use of unsupervised approaches for phenotyping using data from EHRs while incorporating modern informatics and biostatistics methods.

In the research, she focused on rheumatoid arthritis, a condition with a prevalence of only around one percent in the population, which means existing data can be limited. In fact, using traditional methods, it would take 15 years to gather enough genetic data for the research.

“Using data from the EHR provided us with an opportunity to—maybe not rapidly, but efficiently—identify more patients with RA (rheumatoid arthritis) for these studies.”

Dr. Katherine Liao, Associate Professor of Medicine, Brigham and Women’s Hospital and Associate Professor of Biomedical Informatics, Harvard Medical School.

That still presented an issue, however, as the existing data in EHRs have a limited positive predictive value (PPV): too many of the patients they flag do not actually have the condition, which is not good enough for genetic studies. This meant she needed other ways to extract meaningful data from EHRs.
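For readers unfamiliar with the metric, PPV is the share of flagged patients who truly have the condition, that is, true positives divided by all positives. The sketch below computes it in Python. The counts are hypothetical and chosen only to make the arithmetic easy to follow; they do not come from Dr. Liao's study.

```python
def positive_predictive_value(true_positives: int, false_positives: int) -> float:
    """PPV = TP / (TP + FP): the share of flagged patients who truly have the condition."""
    flagged = true_positives + false_positives
    if flagged == 0:
        raise ValueError("PPV is undefined when no patients are flagged")
    return true_positives / flagged


# Hypothetical numbers for illustration only.
# Suppose a query over EHR data flags 200 patients as having RA,
# and manual chart review confirms the diagnosis for 110 of them.
ppv = positive_predictive_value(true_positives=110, false_positives=90)
print(f"PPV = {ppv:.2f}")  # PPV = 0.55
```

With a PPV like this, nearly half of the flagged cohort would not have the disease, which would dilute any downstream analysis.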

Image by Proxet, The Tapestry of Potentially High-Value Information Sources That May be Linked to an Individual for Use in Health Care

Structured data refers to the fields that are directly searchable in a database (as depicted in the image above). However, there are also unstructured data that are not fully utilized, such as the numerous reports and medical notes, and these are what the computer needs to extract.

One of the key ways to do so is natural language processing (NLP), a means of extracting meaningful data from unprocessed text.

“And the goal of this then is to transform these data into types of data we can analyze together with the structured data.”

Dr. Katherine Liao, Associate Professor of Medicine, Brigham and Women’s Hospital and Associate Professor of Biomedical Informatics, Harvard Medical School.
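To make the idea of analyzing both kinds of data together more concrete, here is a minimal Python sketch that merges hypothetical structured fields with counts of concepts mentioned in a patient's notes. The field names, the patient record, and the list of mentions are invented for illustration, and this is not a reproduction of the method used in Dr. Liao's research; it only shows the general shape of the combined data.

```python
from collections import Counter

# Hypothetical structured fields for one patient, as they might come from a database.
structured = {"patient_id": "P001", "age": 58, "ra_diagnosis_codes": 3}

# Hypothetical concept mentions that an NLP step found in the same patient's notes.
nlp_mentions = [
    "rheumatoid arthritis",
    "joint swelling",
    "rheumatoid arthritis",
    "methotrexate",
]


def build_feature_row(structured_fields: dict, mentions: list) -> dict:
    """Combine structured fields and NLP mention counts into one flat record."""
    row = dict(structured_fields)  # copy so the input is left untouched
    for concept, count in Counter(mentions).items():
        row["nlp_" + concept.replace(" ", "_")] = count
    return row


row = build_feature_row(structured, nlp_mentions)
print(row)
```

Once every patient is represented by a row like this, the NLP-derived columns can be fed into the same statistical or machine learning methods as the structured ones.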

Popular Python Libraries For Data Extraction

Now another issue arises—how do we identify all the medical terms used in the notes? We need to bear in mind that medicine is a complex topic with ever more complex terms, dedicated to describing different conditions.

Data extraction is a topic of its own, and the different extraction methods used in EHR data mining deserve a dedicated article. Here, we will focus only on Python tools for data analysis, especially as they relate to NLP and phenotype extraction from EHRs.

There are numerous Python data science packages and data analysis tools out there, and covering them all is beyond the scope of this article, so treat what follows as a brief overview of popular Python data mining libraries.


Med7

Med7 is a freely available Python package for spaCy that extracts medication-related information through NLP. The “7” in its name refers to the seven medication-related concepts it recognizes: dosage, drug names, duration, form, frequency, route of administration, and strength.

Image by Proxet, Med7 in a nutshell

The open-source package is trained on a mix of gold-standard and noisy data to improve accuracy. It lists spaCy 2.2.3 and Python 3.6+ as requirements.
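To give a feel for the seven concept types, the sketch below tags a made-up prescription sentence using a handful of regular expressions. It does not use Med7 or spaCy: the patterns, the two-drug lexicon, and the label names are hypothetical stand-ins chosen to mirror the seven concepts listed above, whereas Med7 itself relies on a trained model rather than rules.

```python
import re

# Toy patterns for the seven concept types. These are hypothetical and far too
# narrow for real use; they only illustrate what each concept looks like in text.
PATTERNS = {
    "DRUG": r"\b(?:aspirin|ibuprofen)\b",
    "STRENGTH": r"\b\d+(?:\.\d+)?\s?(?:mg|mcg|g|ml)\b",
    "DOSAGE": r"\b\d+(?=\s+(?:tablets?|capsules?)\b)",
    "FORM": r"\b(?:tablets?|capsules?)\b",
    "ROUTE": r"\b(?:orally|by mouth|intravenously)\b",
    "FREQUENCY": r"\b(?:once|twice)\s+(?:daily|a day)\b",
    "DURATION": r"\bfor\s+\d+\s+(?:days?|weeks?|months?)\b",
}


def tag_medication_text(text: str) -> list:
    """Return (matched text, label) pairs in order of appearance."""
    found = []
    for label, pattern in PATTERNS.items():
        for match in re.finditer(pattern, text, re.IGNORECASE):
            found.append((match.start(), match.group(0), label))
    found.sort()  # order by position in the text
    return [(span, label) for _, span, label in found]


entities = tag_medication_text(
    "Aspirin 75 mg tablet, take 1 tablet orally once daily for 2 weeks."
)
for span, label in entities:
    print(f"{label:<10}{span}")
```

Rules like these break down quickly on real clinical text with its abbreviations, misspellings, and varied phrasing, which is the gap that trained models such as Med7 aim to fill.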


MedaCy

MedaCy is another Python tool that can aid medical-related work. Built on top of spaCy, it is a text processing and learning framework designed to “support the lightning fast prototyping, training, and application of highly predictive medical NLP models.”

According to its GitHub page description, it is “designed to streamline researcher workflow by providing utilities for model training, prediction and organization while insuring the replicability of systems.”

Build Your Own System

While there are numerous existing EHR systems out there on the market (as we have mentioned previously in this article), building your own EHR system still has certain advantages that mass-market solutions might not be able to replicate.

Depending on your clinical workflow and the size of your project, an existing solution might not be able to cater to your needs; on top of that, if you would like to extract medical records from EHRs and conduct further research, a tailored EHR system might be a more robust choice further down the road.

Proxet is a software solution provider with years of expertise in both AI and machine learning, developing AI-powered solutions for various medical institutions. From automating clinical workflows to EHR analyzers, all is possible with AI-based solutions. If you are interested in how your institution too can harness the power of AI, do not hesitate to contact us today.
