Medical Named Entity Recognition in tweets

By Kavya, Aditi and Udbhav


This project aims to recognize medical named entities in tweets and classify them into the relevant categories, namely drugs, diseases, symptoms, side effects, treatments, etc. Twitter data is the input. Using existing medical data from databases and ontologies, relevant medical terms are identified and classified by the category they belong to (for example, a drug, a disease or a cure).

Problem Statement:

The task of a Medical Named Entity Recognizer is to identify medical entities in text. Medical entities can be diseases, drugs, symptoms, etc. Previously, researchers in the field have used hand-crafted features to identify medical entities in medical literature. In this work, we wish to extend medical entity recognition to tweets. We are expected to use NLP toolkits designed for processing tweets, along with medical ontologies (or databases), to exploit a range of semantic features for this task.

Challenges we faced:



Toolkits we used:


The algorithm we used for classifying the medical text goes like this:

Features Used:

Analysis from experiments:

Initially we trained with a single feature in the 1-gram setting: the token itself, paired with its corresponding label. In the 5-gram setting we used the token, the previous two words and the next two words. Later we decided to add more features, in particular ones that might affect the current word's label, since, as noted before, tweets are highly contextual. The features used are mentioned above.
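The two initial settings described above can be sketched as follows. This is a minimal illustration only, not the project's actual code: the function names, the feature keys, the `PAD` marker and the sample tweet are all assumptions made for the example.

```python
from typing import Dict, List

# Hypothetical marker for window positions that fall outside the tweet.
PAD = "<PAD>"


def unigram_features(tokens: List[str], i: int) -> Dict[str, str]:
    """1-gram setting: the current token is the only feature."""
    return {"token": tokens[i]}


def window_features(tokens: List[str], i: int, size: int = 2) -> Dict[str, str]:
    """5-gram setting: the current token plus `size` tokens on each side."""
    feats = {"token": tokens[i]}
    for offset in range(1, size + 1):
        left = i - offset
        right = i + offset
        # Pad when the window runs past the start or end of the tweet.
        feats[f"prev_{offset}"] = tokens[left] if left >= 0 else PAD
        feats[f"next_{offset}"] = tokens[right] if right < len(tokens) else PAD
    return feats


# Made-up example tweet, already tokenized.
tweet = ["took", "ibuprofen", "for", "my", "headache"]
print(unigram_features(tweet, 1))
print(window_features(tweet, 1))
```

Each token's feature dictionary would then be paired with its label and passed to whichever sequence classifier is being trained. The additional features mentioned above could be added as further keys in the same dictionary.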


Before adding the features mentioned above (using only the current token):

After adding all the features:


The source code can be viewed at:

A video describing the procedure and results can be found at:

A slideshare ppt can be found at:

The ppt, video and report can be found at: