This project recognizes medical named entities in tweets and classifies them into the relevant categories, such as drugs, diseases, symptoms, side effects, and treatments. The input is Twitter data; using existing medical data from databases and ontologies, relevant medical terms are identified and assigned to the category they belong to (for example, a drug, a disease, or a cure).
The task of a Medical Named Entity Recognizer is to identify medical entities in text, such as diseases, drugs, and symptoms. Previous work in the field has used hand-crafted features to identify medical entities in medical literature. In this work, we extend medical entity recognition to tweets. The approach relies on NLP toolkits designed for processing tweets, together with medical ontologies (or databases), to exploit a range of semantic features for this task.
Challenges we faced:
- Tweets are very noisy and highly contextual.
- Not all tweets containing the keyword 'asthma' are about the disease asthma.
- Learning distributed representations for medical tweets.
- Entity linking for exploiting semantic features from ontologies (UMLS, MetaMap).
Applications:
- Pharmaceutical companies can use the results of analyzing such data to boost their sales and to learn about the sales of drugs made by other companies for the same disease.
- The results also help estimate the presence and prevalence of a disease in a particular region.
Dataset:
- We have a dataset covering one year of tweets about 4 diseases and 32 drugs.
- A team of domain experts has annotated about 2000 tweets with entities (around 20 types, e.g. diseases, drugs, symptoms) and relations (around 40 types, e.g. cures, causes).
ToolKits we used:
The pipeline we used for classifying the medical text is as follows:
- Parse and tokenize the tweets.
- Use the training data labels to generate the training feature files (for both the 1-gram and 5-gram models).
- Run the crf_learn command on the training feature file and the template file to generate a model file (for both 1-gram and 5-gram).
- Generate the feature files for the test data, excluding the labels.
- Run the crf_test command on the test feature file and the model file to predict labels for the test data.
- Compare the predicted labels with the actual test labels to compute the percentage accuracy.
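The file-handling parts of this pipeline can be sketched in Python. This is a minimal illustration, not the project's actual code: the function names, the file names in the comments, and the `DRUG`/`DISEASE`/`O` labels are assumptions made for the example. It writes tokens in the column format CRF++ reads (one token per line, whitespace-separated columns, a blank line between sequences) and computes token-level accuracy from two label lists.

```python
from typing import List, Optional, Sequence, Tuple

# A token paired with its label; the label is None for unlabeled test data.
Token = Tuple[str, Optional[str]]


def to_crfpp_lines(tweets: Sequence[Sequence[Token]]) -> List[str]:
    """Render tweets in CRF++ column format.

    One token per line, columns separated by a tab, and an empty line
    after each tweet to mark the end of the sequence.
    """
    lines: List[str] = []
    for tweet in tweets:
        for word, label in tweet:
            lines.append(word if label is None else f"{word}\t{label}")
        lines.append("")
    return lines


def token_accuracy(gold: Sequence[str], predicted: Sequence[str]) -> float:
    """Percentage of tokens whose predicted label equals the gold label."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        return 0.0
    correct = sum(g == p for g, p in zip(gold, predicted))
    return 100.0 * correct / len(gold)


# Hypothetical example data; the label names are placeholders.
train = [[("took", "O"), ("albuterol", "DRUG"), ("for", "O"), ("asthma", "DISEASE")]]
train_lines = to_crfpp_lines(train)

# The CRF++ commands would then be run on the written files, e.g.
# (file names are hypothetical):
#   crf_learn template train.data model
#   crf_test -m model test.data

accuracy = token_accuracy(["O", "DRUG", "O", "DISEASE"], ["O", "DRUG", "O", "O"])
```

In practice each token line would carry additional feature columns between the word and the label, and every line in a file must have the same number of columns.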
Word features: The word itself, the two words before and the three words after, along with their lemmas (the root form of each token).
Morphosyntactic features: POS tags of the word itself, the two words before and the three words after.
Semantic features: Semantic category of the word, provided by MetaMap+.
Other features: Next noun, previous verb, previous adjective, next verb.
Orthographic features: Whether the word contains -, +, &, etc.; whether it is a number, letter, punctuation mark, etc.; whether it is in upper case, capitalized, etc.; prefixes of different lengths (from 1 to 4); suffixes of different lengths (from 1 to 4).
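As an illustration of the word-window and orthographic features listed above, here is a small Python sketch. The function names, the feature keys, and the `<PAD>` placeholder are assumptions for this example; lemmas, POS tags, and the MetaMap+ semantic category are left out because they depend on external tools.

```python
import string
from typing import Dict, List


def window_features(tokens: List[str], i: int, before: int = 2,
                    after: int = 3, pad: str = "<PAD>") -> Dict[str, str]:
    """Words in a window around position i (two before, three after by default)."""
    feats: Dict[str, str] = {}
    for offset in range(-before, after + 1):
        j = i + offset
        # Positions outside the tweet are filled with a padding symbol.
        feats[f"w[{offset}]"] = tokens[j] if 0 <= j < len(tokens) else pad
    return feats


def orthographic_features(token: str) -> Dict[str, object]:
    """Surface-form features of a single token."""
    feats: Dict[str, object] = {
        "has_symbol": any(ch in "-+&" for ch in token),
        "is_number": token.isdigit(),
        "is_punct": bool(token) and all(ch in string.punctuation for ch in token),
        "is_upper": token.isupper(),
        "is_capitalized": token[:1].isupper(),
    }
    # Prefixes and suffixes of length 1 to 4; for tokens shorter than n,
    # slicing simply returns the whole token.
    for n in range(1, 5):
        feats[f"prefix{n}"] = token[:n]
        feats[f"suffix{n}"] = token[-n:]
    return feats


tokens = ["took", "Albuterol", "for", "asthma"]
context = window_features(tokens, 1)
ortho = orthographic_features("Albuterol")
```

Each resulting value would become one column of the token's line in the feature file.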
Analysis from experiments:
Initially, we trained each model on tokens only: for the 1-gram model, just the token and its corresponding label; for the 5-gram model, the token together with the previous two words and the next two words.
Later, we added more features, including ones that might affect the current word's label, since, as noted earlier, tweets are highly contextual. The features used are listed above.
- We used word features because the current word depends linguistically on its neighbors.
- We used POS tags to incorporate grammatical rules.
- We used the semantic feature of each word to capture how strongly it is related to the label.
- We used orthographic features because medical terms often have long biological names (this is where a length feature comes into the picture) and show other similar surface patterns.
- We also used other features, such as the nearest previous adjective, because adjectives give more insight into the disease or symptom.
Accuracy before adding the features mentioned above (tokens only):
- 5-gram: 75.10%
- 1-gram: 62.22%
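The two token-only baselines above differ only in how many neighboring tokens the CRF sees. One way to express that in CRF++ is through the template file, where a line such as `U02:%x[0,0]` refers to column 0 of the current token. The sketch below generates such unigram template lines in Python; the helper name is hypothetical and the project's actual template file is not shown in this document, so treat this as an assumed reconstruction.

```python
from typing import List


def unigram_template(before: int, after: int, column: int = 0) -> List[str]:
    """Build CRF++ unigram template lines for a token window.

    Each line has the form Uxx:%x[row,col], where row is the offset
    relative to the current token and col is the feature column.
    """
    lines: List[str] = []
    for idx, offset in enumerate(range(-before, after + 1)):
        lines.append(f"U{idx:02d}:%x[{offset},{column}]")
    return lines


# 1-gram baseline: current token only.
one_gram = unigram_template(0, 0)
# 5-gram baseline: current token plus two tokens on each side.
five_gram = unigram_template(2, 2)
```

Writing each list to a file, one entry per line, gives a template that can be passed to crf_learn.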
After adding all the features:
Keywords:
- Information retrieval and extraction.
- Major Project.
- Medical entity recognition.
- IIIT Hyderabad.
- Conditional random fields.
- Feature extraction.
The source code can be viewed at:
A video describing the procedure and results can be found at:
A SlideShare presentation can be found at:
The slides, video, and report can be found at: