Supervised Named Entity Recognition for Twitter data
Objective
The goals were to:
Increase the accuracy of the NER tagger by using new features.
Analyze the performance of CRF and logistic regression labelers.
What I did
Entities to be recognised: company, person, product, facility, geo-loc, movie, musicartist, tvshow, sportsteam.
Engineered lexical features such as word shape and compact word shape.
Engineered orthographic features flagging tokens that consist of punctuation, are in mixed case, contain hyphens or ampersands, begin with a capital letter, end with a digit, etc.
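The lexical and orthographic features above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names, the X/x/d shape alphabet, and the definition of "mixed case" (an uppercase letter after the first character plus at least one lowercase letter) are all assumptions.

```python
import re
import string


def word_shape(token):
    """Map uppercase to 'X', lowercase to 'x', digits to 'd'; keep other characters."""
    shape = []
    for ch in token:
        if ch.isupper():
            shape.append("X")
        elif ch.islower():
            shape.append("x")
        elif ch.isdigit():
            shape.append("d")
        else:
            shape.append(ch)
    return "".join(shape)


def compact_word_shape(token):
    """Collapse each run of identical shape characters into a single character."""
    return re.sub(r"(.)\1+", r"\1", word_shape(token))


def orthographic_features(token):
    """Boolean orthographic cues for one token (feature names are hypothetical)."""
    return {
        "all_punct": bool(token) and all(ch in string.punctuation for ch in token),
        # Assumed definition: an uppercase letter past position 0 plus a lowercase letter.
        "mixed_case": any(ch.isupper() for ch in token[1:])
        and any(ch.islower() for ch in token),
        "has_hyphen": "-" in token,
        "has_ampersand": "&" in token,
        "init_cap": token[:1].isupper(),
        "ends_with_digit": token[-1:].isdigit(),
    }
```

For a token such as iPhone6, the word shape is xXxxxxd and the compact shape is xXxd; the compact form keeps the case pattern while ignoring token length.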
Used the following gazetteer features:
familynames and lastnames lists for improving person identification
location.country and location lexicon lists for identifying the geo-loc entity
Consumer_product list for identifying the product entity
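A gazetteer feature is essentially a membership test against a lexicon. The sketch below assumes case-insensitive, single-token matching; the helper names and the toy entries are hypothetical stand-ins for the real lexicon files, which are not reproduced here.

```python
def load_gazetteer(lines):
    """Build a lowercase lookup set from an iterable of lexicon entries."""
    return {line.strip().lower() for line in lines if line.strip()}


# Toy stand-ins for the real lists; entries are illustrative only.
GAZETTEERS = {
    "person_name": load_gazetteer(["smith", "garcia"]),
    "country": load_gazetteer(["india", "france"]),
    "product": load_gazetteer(["iphone", "xbox"]),
}


def gazetteer_features(token, gazetteers=GAZETTEERS):
    """Return one boolean feature per gazetteer: is the token listed in it?"""
    key = token.lower()
    return {f"in_{name}": key in entries for name, entries in gazetteers.items()}
```

In practice each set would be loaded from its lexicon file, for example by passing an open file handle to load_gazetteer.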
Implemented the Viterbi algorithm, which the CRF tagger used during the training phase.
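For reference, a generic Viterbi decoder over additive (log-space) scores might look like the sketch below. How the CRF tagger called it during training is not described here, so the function signature and the score layout (per-position emission scores, a label-to-label transition matrix, and start scores) are assumptions.

```python
def viterbi(emissions, transitions, start):
    """Return (best label index sequence, its total score).

    emissions[t][j]   : score of label j at position t
    transitions[i][j] : score of moving from label i to label j
    start[j]          : score of starting in label j
    All scores are additive, e.g. log-probabilities or CRF potentials.
    """
    if not emissions:
        return [], 0.0
    n_labels = len(start)
    # Best score of any path ending in each label at position 0.
    scores = [start[j] + emissions[0][j] for j in range(n_labels)]
    backpointers = []
    for t in range(1, len(emissions)):
        new_scores, pointers = [], []
        for j in range(n_labels):
            # Best previous label for reaching label j at position t.
            best_i = max(range(n_labels), key=lambda i: scores[i] + transitions[i][j])
            new_scores.append(scores[best_i] + transitions[best_i][j] + emissions[t][j])
            pointers.append(best_i)
        scores = new_scores
        backpointers.append(pointers)
    # Follow the backpointers from the best final label.
    best_last = max(range(n_labels), key=lambda j: scores[j])
    path = [best_last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, scores[best_last]
```

The running time is O(T * K^2) for T tokens and K labels, which is what makes exact sequence-level inference affordable compared with enumerating all K^T label sequences.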
Results
Gazetteer features produced a drastic increase in F1 score for the person and geo-loc entities.
Word shape features behaved erratically on Twitter data because the data was too noisy.
Punctuation-related features weren't of much help.
Technologies Used:
Python
scikit-learn
matplotlib
Role: Developer
Event: Coursework, CSCI 544 Applied NLP
Location: University of Southern California, Los Angeles