Authorship Identification using Naive Bayes Text Classification
Objective
Train a classifier that identifies the author of a given text.
What I did
- Trained a Naive Bayes classifier using a bag-of-words approach to identify whether a line was written by Shakespeare or Emily Bronte.
- Analyzed the training data with shell commands such as awk to find the most frequent words, excluding stop words.
- Featurized the text using lexical features such as:
  - Punctuation per line.
  - Average length of a line.
  - Average length of words.
  - Sentences beginning with prepositions or conjunctions.
  - Presence of hyphenated words.
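
The analysis and featurization steps above could be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the word counting is done in Python here rather than awk, and the names `top_words`, `lexical_features`, `STOP_WORDS`, and `OPENERS` are hypothetical, as are the short word lists. The per-line length stands in for the "average length of a line" feature, which would be averaged over a corpus.

```python
import re
import string
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; a real run would use a fuller one.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "that"}

# Hypothetical set of prepositions and conjunctions that may open a line.
OPENERS = {"and", "but", "or", "for", "nor", "yet", "so",
           "in", "on", "at", "by", "with", "of", "to", "from"}


def top_words(lines, n=10):
    """Return the n most frequent words across lines, ignoring stop words."""
    counts = Counter()
    for line in lines:
        for word in re.findall(r"[a-z']+", line.lower()):
            if word not in STOP_WORDS:
                counts[word] += 1
    return counts.most_common(n)


def lexical_features(line):
    """Map one line of text to the lexical features listed above."""
    # Split on whitespace, then trim punctuation from the edges of each word.
    words = [w.strip(string.punctuation) for w in line.split()]
    words = [w for w in words if w]
    first = words[0].lower() if words else ""
    return {
        "punctuation_count": sum(ch in string.punctuation for ch in line),
        "line_length": len(line),
        "avg_word_length": sum(len(w) for w in words) / len(words) if words else 0.0,
        "starts_with_opener": first in OPENERS,
        "has_hyphenated_word": bool(re.search(r"\w-\w", line)),
    }
```

Calling `lexical_features` on every line of a corpus and averaging each numeric field per author gives figures that can be compared directly between the two authors.
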
Results
- Achieved an accuracy of 88% on the test set.
- Identified grammatical patterns that distinguish the two authors, such as:
  - Shakespeare's sentences were, on average, 1.5 times as long as Emily Bronte's.
  - Emily Bronte used punctuation more frequently.
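
As a rough sketch of how a test-set accuracy like the one above might be measured with scikit-learn, the snippet below fits a bag-of-words Naive Bayes pipeline and scores it on held-out lines. Several things here are assumptions: the source does not say which Naive Bayes variant was used, so `MultinomialNB` over `CountVectorizer` counts is a guess at a typical setup, and every text line is an invented placeholder rather than a quotation from either author. The toy score it prints says nothing about the reported figure.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented placeholder lines, NOT quotations from either author.
train_lines = [
    "thou art a noble lord",
    "what say you my lord",
    "thy sword is drawn",
    "the moor was cold and dark",
    "she walked across the moor",
    "the wind blew over the heath",
]
train_labels = ["shakespeare"] * 3 + ["bronte"] * 3

# Held-out placeholder lines, kept separate from the training data.
test_lines = ["my lord thou art", "the cold wind on the moor"]
test_labels = ["shakespeare", "bronte"]

# CountVectorizer builds the bag-of-words counts; MultinomialNB (an assumed
# variant) models those counts per author.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_lines, train_labels)

predictions = model.predict(test_lines)
accuracy = accuracy_score(test_labels, predictions)
print(f"Test accuracy: {accuracy:.2f}")
```

On a real corpus the held-out lines would come from a proper split of the data, for example with `sklearn.model_selection.train_test_split`, rather than a hand-written list.
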
Technologies Used
- Python
- scikit-learn
- matplotlib
- Bash shell