A stylometric analysis of the corpus of Nikolai Gogol’s letters
Yuliya Ilchuk
Project Overview
“Stylometric analysis” was originally developed for forensic authorship attribution and has more recently been applied to the “distant reading” of literary corpora. Many aspects of style are conscious and deliberate, and can therefore be imitated easily. Computational stylometry instead prioritizes subconscious elements, which are harder to imitate because they belong to the deep structures of language. An author’s individual style, or “stylistic fingerprint,” can be captured with several quantitative criteria, or discriminators, such as the author’s most distinctive words, function words, or even sentence types.

In Professor Ilchuk’s application of the method to Gogol, the “bag of words” approach was used. She manually divided Gogol’s epistolary legacy by individual correspondent and into three periods: early letters (1820–1835), mid-life letters (1836–1846), and late letters (1847–1852). Then she built a “Term Document Matrix” of the most frequent words (MFW) in Gogol’s letters. Given the relatively small size of the epistolary corpus, the most frequent words offer the most reliable stylistic features because they are considered topic-neutral. She chose the first 100 MFW for Gogol’s early letters and the first 200 MFW for his mid-life and late letters, and used Eder’s Delta as the distance measure.

For visualization, the R package “stylo” was used. The results of hierarchical clustering are presented in the dendrograms. The arrangement of the “leaves” among the “clades” shows which groups of letters resemble one another, and the distance at which two clades join measures their degree of likeness: the closer to “0” they join, the more alike they are.
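The pipeline described above (most frequent words, a term-document matrix, and a Delta distance between groups of letters) can be illustrated with a minimal Python sketch. It is not the procedure used in the project, which relied on “stylo” in R. The function names `tokenize` and `delta_distances` are hypothetical, and the rank weighting `(n - rank + 1) / n` is an assumption about how Eder’s Delta rescales less frequent words. With `rank_weighted=False` the function reduces to the classic Burrows’s Delta. The sketch assumes at least two non-empty documents.

```python
import re
from collections import Counter
from statistics import mean, stdev


def tokenize(text):
    # Lowercased word tokens; in Python 3, \w also matches Cyrillic letters.
    return re.findall(r"\w+", text.lower())


def delta_distances(docs, n_mfw=100, rank_weighted=True):
    """Return (mfw, dist) for docs, a dict mapping a label to raw text.

    dist[a][b] is a Delta-style distance: the mean absolute difference of
    z-scored relative frequencies over the n_mfw most frequent words.
    With rank_weighted=True each word is also scaled by (n - rank + 1) / n,
    the weighting assumed here for Eder's Delta.
    """
    counts = {name: Counter(tokenize(text)) for name, text in docs.items()}
    sizes = {name: sum(c.values()) for name, c in counts.items()}
    names = list(counts)

    # The most frequent words are chosen over the pooled corpus.
    pooled = Counter()
    for c in counts.values():
        pooled.update(c)
    mfw = [w for w, _ in pooled.most_common(n_mfw)]
    n = len(mfw)

    # Term-document matrix of relative frequencies, z-scored word by word.
    z = {name: [] for name in names}
    for w in mfw:
        col = [counts[name][w] / sizes[name] for name in names]
        mu, sd = mean(col), stdev(col)
        for name, f in zip(names, col):
            # A word used at the same rate everywhere carries no signal.
            z[name].append((f - mu) / sd if sd > 0 else 0.0)

    # Rank 1 (the most frequent word) gets weight 1; weights fall linearly.
    weights = [(n - i) / n if rank_weighted else 1.0 for i in range(n)]
    dist = {a: {} for a in names}
    for a in names:
        for b in names:
            dist[a][b] = sum(
                w * abs(x - y) for w, x, y in zip(weights, z[a], z[b])
            ) / n
    return mfw, dist
```

Calling `delta_distances({"early": text_a, "late": text_b, ...}, n_mfw=200)` on the grouped letters would yield the pairwise distance matrix that a hierarchical clustering routine turns into a dendrogram. A distance of 0 means the two groups have identical profiles over the chosen words.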