In this paper, we address the problem of author attribution through unsupervised clustering on lexical and syntactic features and through a novel deep-learning-based stylometric model. For this purpose, we downloaded all 158,918 publications available as of 1 July 2015 from PLOS.org, an open-access digital repository of full-text publications. After pre-processing, we use 803 of these: single-authored publications written by 203 unique authors.
For unsupervised modeling, stylometric markers such as lexical and syntactic features are used to compute distances between publications, which are then grouped with the k-means clustering algorithm. For supervised modeling, we present a novel long short-term memory (LSTM) based deep learning model that predicts the author of a given publication. Our unsupervised model assigns 88.17% of authors to the correct cluster (all papers written by the same author grouped together) with an entropy error coefficient of at most 0.2, while our deep-learning-based model consistently shows above 95% accuracy across all test samples of publications written by an author, with an average loss of 0.21.
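As an illustration of how a per-author entropy criterion of this kind could be evaluated, the following is a minimal sketch rather than the exact measure used in this work. It assumes the entropy error is the Shannon entropy of an author's cluster assignments, normalized by the maximum entropy possible for that author's number of papers; the function names `author_entropy` and `fraction_within_threshold`, the normalization, and the toy assignments are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def author_entropy(cluster_ids):
    """Normalized Shannon entropy of one author's cluster assignments.

    Returns 0.0 when all papers fall in a single cluster and 1.0 when
    every paper falls in a different cluster (assumed normalization).
    """
    counts = Counter(cluster_ids)
    n = len(cluster_ids)
    if len(counts) <= 1:
        return 0.0
    h = -sum((c / n) * math.log(c / n) for c in counts.values())
    return h / math.log(n)  # log(n) is the maximum entropy for n papers


def fraction_within_threshold(assignments, threshold=0.2):
    """Share of authors whose normalized entropy is at most `threshold`.

    `assignments` is an iterable of (author, cluster_id) pairs, one per paper.
    """
    by_author = defaultdict(list)
    for author, cluster in assignments:
        by_author[author].append(cluster)
    scores = {a: author_entropy(c) for a, c in by_author.items()}
    within = sum(1 for s in scores.values() if s <= threshold)
    return within / len(scores), scores


# Toy data (hypothetical): authors A and C are clustered consistently, B is split.
toy_assignments = [
    ("A", 0), ("A", 0), ("A", 0),
    ("B", 1), ("B", 2),
    ("C", 1), ("C", 1),
]
fraction, per_author = fraction_within_threshold(toy_assignments)
```

Under this assumed definition, an author whose papers all land in one cluster scores 0, so the threshold of 0.2 tolerates only a small amount of scatter across clusters.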
Saeed-Ul Hassan, Mubashir Imran, Tehreem Iftikhar and Iqra Safder, “Deep Stylometry and Lexical & Syntactic Features based Author Attribution on PLOS Full Text Digital Repository”