Historical informatics
Reference:

Determining the authorship of the "Notes of the Decembrist I.I. Gorbachevsky" by machine learning methods

Latonov Vasilii Vasilyevich

ORCID: 0000-0002-7810-8033

PhD in Physics and Mathematics

Head of the Data Research Department; Sberbank PJSC

41 Ostrovityanova str., room 1, 172, Moscow, 117342, Russia

WLatonov@gmail.com

 
Latonova Anastasiia Vyacheslavovna

PhD in History

independent researcher

41 Ostrovityanova str., room 1, 172, Moscow, 117342, Russia

iskrenne_vasha_aa@mail.ru

 

DOI:

10.7256/2585-7797.2025.1.72805

EDN:

QALGAU

Received:

22-12-2024


Published:

17-04-2025


Abstract: The object of this study is the "Notes of the Decembrist I.I. Gorbachevsky", one of the most valuable sources on the history of the Decembrist movement created by its participants themselves. The "Notes" describe the formation and development of the Society of United Slavs, a Decembrist organization that later joined the Southern Society of Decembrists. Written in exile in Siberia, they offer not only factual material but also an original conception of the secret society's development and a retrospective inside view of the mistakes made by the conspirators. The "Notes" are remarkable for another reason as well. Despite the title established in the literature, it cannot be asserted unequivocally that the Decembrist I.I. Gorbachevsky himself was their author: the first publication of the "Notes", in the journal "Russian Archive" in 1882, appeared under the heading "Notes of an Unknown Person from the Society of the United Slavs." The subject of the study is the question of the authorship of the "Notes", to which historians have no clear answer today. In this paper, we propose a solution to the problem of determining the authorship of the "Notes of the Decembrist I.I. Gorbachevsky" using machine learning methods. I.I. Gorbachevsky himself and the Decembrist P.I. Borisov are considered as possible authors. The novelty of the research lies in applying machine learning methods to the attribution of the "Notes". The authors trained four types of models to predict the authorship of each sentence of the "Notes". As a result, most sentences of the "Notes" were classified as written by Gorbachevsky. The largest share of sentences, 69.2%, was attributed to Gorbachevsky by the Count Vectorizer + SVC model. The accuracy of all models exceeded 80% on average, while the models based on BERT encoding averaged close to 90%. The main conclusion of the work is therefore that the "Notes" are more likely to have been written by I.I. Gorbachevsky than by P.I. Borisov. The methods used in this study provide another argument in favor of this version. The code and dataset are available at https://github.com/WLatonov/Gorbachevskiy_notes .


Keywords:

Authorship determination, Attribution, Stylometry, Machine learning, Neural networks, Binary classification, BERT, The Decembrists, Gorbachevskiy's notes, Gorbachevskiy's letters

This article is automatically translated.

Introduction


The "Notes" of I.I. Gorbachevsky are one of the most valuable sources on the history of the Decembrist movement created by its participants themselves. They describe the formation and development of the Society of United Slavs, a Decembrist organization that later joined the Southern Society of Decembrists. Written in exile in Siberia, these notes offer not only factual material but also an original conception of the secret society's development and a retrospective inside view of the mistakes made by the conspirators. A common thread running through the "Notes" is the exposure of the blunders of the Southern Society, whose members' views and actions are contrasted with those of the "United Slavs" themselves (not without some idealization of the latter).

In print, the "Notes" occupy little more than 100 pages and are divided into three sections. The first relates the end of the history of the Society of the United Slavs, when it merged with the Southern Society of the Decembrists. The second covers the uprising of the Chernigov infantry regiment, in which former "Slavs" played a significant role. The third tells of the subsequent fate of the rebels: their trial, exile, and the unsuccessful rebellion of convicts organized by I.I. Sukhinov. Thus the facts contained in "Gorbachevsky's Notes" go far beyond the limits of their author's biography, whoever he may be, which testifies to his undoubted historical self-awareness.

The last caveat is not accidental. Despite the title of the "Notes" established in the literature, we cannot unequivocally assert that the Decembrist I.I. Gorbachevsky himself was their author. The first publication of this source, in the journal "Russian Archive" in 1882 (issue 2), appeared under the heading "Notes of an Unknown Person from the Society of the United Slavs." The publisher, P.I. Bartenev, printed them from an anonymous manuscript received from Siberia, accompanied by a note: "It seems that these 'Notes' were compiled by former lieutenant of the 8th Artillery Brigade Ivan Ivanovich Gorbachevsky; but it is impossible to guarantee this."

An overview of the existing versions of the authorship of the "Notes"

The anonymity of the first publication gave rise to doubts about Gorbachevsky's authorship, which were formulated in detail by one of the founders of Soviet Decembrist studies, M.V. Nechkina. Having shown in her analysis that the "Notes" could equally well have been written by one of the comrades of Gorbachevsky (whose own name first arose only as a conjecture of the publisher who found the manuscript), she concluded that the author of the source was most likely another member of the Society of the United Slavs, P.I. Borisov [1, pp. 136-143].

Gorbachevsky's biographer, G.P. Shatrova, likewise did not consider him the author of the "Notes" (or at least not the sole author). In the monograph "The Decembrists and Siberia" (1962) [2], she named P.I. Borisov as the creator of the "Notes", citing the opinion of M.V. Nechkina; in her monograph on Gorbachevsky himself, she concluded that the "Notes" were compiled from collective recollections with subsequent literary editing by a single author, and that I.I. Gorbachevsky took an active part in this joint work. G.P. Shatrova saw the main argument against Gorbachevsky's sole authorship in a significant discrepancy between the position of the "Notes" and the views Gorbachevsky expressed in his letters of the 1850s and 1860s from the settlement [3, p. 75]. Another researcher, N.P. Matkhanova, wrote of the "Notes" as the product of a concept developed jointly through an exchange of opinions within the group. In this regard, we read: "in the casemate community, individual memory was transformed into a social, group-identifying one, and a 'collective history' was developed: that common version of the past, a conventional scheme, a general idea that consisted of images of events" [4, p. 160].

At the same time, this point of view on the authorship of Gorbachevsky's "Notes" (and M.V. Nechkina's initial arguments themselves) was criticized by a number of other historians. B.E. Syroechkovsky, L.A. Sokolsky and I.V. Porokh found the inconsistencies between the "Notes" and other texts of Gorbachevsky (investigative testimony and letters) to be insignificant and explicable, and Gorbachevsky's authorship to be absolutely indisputable (for the detailed argumentation, see [5]; see also [6]). E.V. Zlobin later reached similar conclusions on the basis of textual analysis [7]. In the post-Soviet period, I.I. Gorbachevsky has been officially listed as the author of the "Notes", for example in the Great Russian Encyclopedia [8]. At the same time, the opposite point of view, according to which Gorbachevsky is not the author of the "Notes" that bear his name, remains current in the historiography (see [9]).

Thus, the question of the authorship of Gorbachevsky's "Notes" has not been fully resolved to date. However, the development of information technology now opens new opportunities for studying it. We agree with our predecessors that the authorship of the "Notes" can hardly be considered without comparing them with other texts by Gorbachevsky, and we attempt to analyze all these texts using stylometry [10]. Unlike the traditional approach, in which only sources that overlap in content can be meaningfully compared, we can include any texts that are known to have been written by the persons of interest. Whereas B.E. Syroechkovsky, L.A. Sokolsky and I.V. Porokh considered only Gorbachevsky's letter to M.A. Bestuzhev of June 12, 1861, which contains much material about the Decembrist movement as a whole and about individual Decembrists, to be worthy of comparison with the "Notes", we can now bring into the comparison even those letters of Gorbachevsky that contain no judgments on Decembrist topics. The same applies to the materials of other members of the Society of the United Slavs who may have participated in the creation of the "Notes", above all those of P.I. Borisov.

There is also a version according to which the "Gorbachevsky Notes" were written by a group of authors. However, there is no reliable information about which authors could have participated in the creation of the "Notes" and in what proportion. Consequently, in this setting a well-posed stylometric task cannot be formulated, and no meaningful conclusions can be drawn. In this work we therefore focus on testing two versions of the authorship of the "Notes": that of I.I. Gorbachevsky and that of P.I. Borisov.

Review of works on stylometry and problem statement

Stylometry is a discipline that deals with measuring stylistic characteristics in order to organize and systematize texts [10]. These characteristics can be calculated for any sufficiently large author's text, and for each author's style these characteristics will be unique. Thus, stylometry can be used to determine the authorship of a text if there are samples of texts by possible authors large enough to calculate characteristics.

The task of determining authorship has existed for many centuries, but it was apparently first formalized in the work of N.A. Morozov [11], which proposed a method for identifying authorship through graphs of word-usage frequencies. Among the first works on the mathematical study of authorial style, the article [12] by A.A. Markov should also be mentioned. In it, A.A. Markov applied a statistical analysis that he had described earlier in another paper [13], where the probability of a letter being a vowel was estimated depending on the two preceding letters.
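The kind of estimate described for [13] can be sketched in a few lines. This is only an illustration, not Markov's original procedure: the function name `vowel_probabilities`, the English vowel set, and the sample word are assumptions made for the example.

```python
from collections import Counter

VOWELS = set("aeiou")  # assumption: English vowels, for illustration only


def vowel_probabilities(text):
    """Estimate P(next letter is a vowel | classes of the two previous letters)."""
    # Keep letters only and map each one to 'V' (vowel) or 'C' (consonant).
    classes = ["V" if ch in VOWELS else "C" for ch in text.lower() if ch.isalpha()]
    context_counts = Counter()
    vowel_counts = Counter()
    for i in range(2, len(classes)):
        context = classes[i - 2] + classes[i - 1]
        context_counts[context] += 1
        if classes[i] == "V":
            vowel_counts[context] += 1
    return {ctx: vowel_counts[ctx] / n for ctx, n in context_counts.items()}


probs = vowel_probabilities("banana")
print(probs)
```

On a long text, the resulting table of conditional probabilities is one simple numeric description of the text that can be compared across authors.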

In [14], the idea of chains of two-letter combinations from [13] was applied, and it was shown that this approach identifies the true author with a probability of 84% when 80 possible candidates are considered. A generalization of this method was proposed in [15], where not only two-letter combinations, but also single grammatical classes of words, as well as pairs of words, were considered as units of analysis. This work is of particular interest because, according to the authors' conclusions, determining the authorship of a text from two-letter combinations is more accurate than from single words or pairs of words. A similar result was obtained in [16], where it was shown that three-letter combinations make it possible to establish authorship more accurately than words.

In [17, 18], the probabilities of occurrence of different n-letter combinations for n > 2, called n-grams, were considered for establishing authorship. In some works, the term n-gram also denotes a sequence of n words, which is likewise suitable for determining authorship, for example in the article [19]. Its authors found that for English, 6-grams give the best author recognition results, while for Greek, for example, the best result is achieved with 3-grams.
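As a minimal sketch of what a character n-gram profile looks like, the following counts overlapping n-letter combinations in a string. The function name `char_ngrams` and the sample text are illustrative assumptions; the cited works build more elaborate models on top of such counts.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count overlapping character n-grams in a lower-cased text."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


profile = char_ngrams("the theme", 3)
print(profile["the"], sum(profile.values()))
```

Dividing each count by the total gives the relative frequencies on which n-gram attribution methods typically operate.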

Another well-known approach to authorship attribution is Burrows's Delta [20], a method published in 2002. In his work, John Burrows introduced a metric called Delta for measuring the distance between texts [21]. Delta is calculated over the entire vocabulary of words used in all the texts between which the distance is measured. The metric takes into account the frequency of each word in a single text and its frequency in the entire set of texts. Burrows's Delta is widely used in linguistic research [22, 23], including for measuring the authors' styles in collaborative works [24].
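A simplified sketch of a Delta-style distance is given below. It follows the common formulation (mean absolute difference of z-scored relative word frequencies), but makes simplifying assumptions that should not be read as Burrows's exact procedure: it uses the full vocabulary rather than only the most frequent words, the population standard deviation, and whitespace tokenization. The names `relative_frequencies` and `burrows_delta` and the toy texts are hypothetical.

```python
from collections import Counter
from statistics import mean, pstdev


def relative_frequencies(text):
    """Relative frequency of every word in one text."""
    words = text.lower().split()
    counts = Counter(words)
    return {w: c / len(words) for w, c in counts.items()}


def burrows_delta(texts, a, b):
    """Mean absolute difference of z-scored word frequencies of texts a and b."""
    freqs = {name: relative_frequencies(t) for name, t in texts.items()}
    vocabulary = set().union(*freqs.values())
    diffs = []
    for word in vocabulary:
        values = [f.get(word, 0.0) for f in freqs.values()]
        sigma = pstdev(values)
        if sigma == 0:  # word equally frequent everywhere: carries no signal
            continue
        mu = mean(values)
        z_a = (freqs[a].get(word, 0.0) - mu) / sigma
        z_b = (freqs[b].get(word, 0.0) - mu) / sigma
        diffs.append(abs(z_a - z_b))
    return mean(diffs) if diffs else 0.0


texts = {
    "x": "the cat and the dog",
    "y": "the cat and the dog",
    "z": "a bird or a fish",
}
print(burrows_delta(texts, "x", "y"), burrows_delta(texts, "x", "z"))
```

A smaller value means the two texts use words at more similar rates relative to the whole collection; identical texts are at distance zero.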

Alongside this, a number of works in Russian historiography are devoted to identifying the features of an author's style. The work of L.V. Milov, L.I. Borodkin and other historians in the 1970s and 1980s resulted in studies [25-27] in which network analysis of grammatical structures was used for style analysis and attribution. The objects of these studies were largely medieval Russian texts.

Machine learning (ML) methods are often used in modern works to determine authorship. For example, in [28] the authors applied a number of classical methods to authorship attribution using literary texts as an example. Among others, the k-nearest neighbors (KNN) method [29] and the support vector machine (SVC) method [30] were tested. The authors pay special attention to text preprocessing: they consider learning both from the source text and from a text with rarely occurring words removed.

Deep learning is also used in authorship attribution. For example, the authors of [31] used a convolutional neural network (CNN) and compared its accuracy with other approaches, in particular the multilayer perceptron (MLP) [32], as well as the KNN and SVC methods mentioned above. In [33], the authors determined the authorship of Russian-language texts, training on texts from classical literature and short social network posts. Along with classical ML methods, they used neural networks based on architectures such as LSTM [34] and BERT [35].

This paper addresses the problem of determining the authorship of Gorbachevsky's "Notes". The 1963 edition of the "Notes" in the Literary Monuments series was used as the source material, since a digital copy of it is available on the Internet, which makes it more convenient to use. This edition differs from the original 1882 publication in places by a different paragraph division, which is not essential for our task, and by the presence of brief annotations before the chapters (we removed these annotations during text processing). I.I. Gorbachevsky and P.I. Borisov were considered as candidates for the authorship of the "Notes", and samples of texts written by these two Decembrists were used for the solution. To study Gorbachevsky's style, we used his letters, which primarily record information about relations among the exiled Decembrists and their living conditions in the settlement. At the same time, these letters contain a sufficient number of Gorbachevsky's judgments about the Decembrist conspiracy. Eighty-one of his letters, from the period 1839-1868, are available for analysis. Letters were likewise used to study Borisov's style. Relatively few letters by Borisov are at our disposal: 20. All of them were written in 1838-1847, during his life in the settlement, and are very close in subject matter to Gorbachevsky's letters.

Problem solving

To solve this problem, we used classical ML methods: SVC and logistic regression (LR) [36]. The latter was chosen because it is designed for binary classification, and in our formulation there are only two possible authors. These methods cannot be applied directly to text data, so each sentence is first encoded as a vector of numeric features.

Two methods were used for encoding. The first of them, Count Vectorizer (CV), belongs to the classical ML approaches. Each sentence was preprocessed before being encoded with CV:

1. Converting upper case to lower case;

2. Removing punctuation marks, round and square brackets, and quotation marks;

3. Removing function words: prepositions, conjunctions, particles, interjections.
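The three preprocessing steps above can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name `preprocess`, the tiny `FUNCTION_WORDS` list, and the example sentence are assumptions, and the actual list of function words used in the study is not given in the article.

```python
import string

# Assumption: a tiny illustrative list of Russian function words; the
# list actually used in the study is not given in the article.
FUNCTION_WORDS = {"и", "в", "на", "но", "же", "не", "с", "к", "о", "а", "ли", "бы"}

# Characters to strip: ASCII punctuation (which includes round and square
# brackets and straight quotes) plus typographic quotes, ellipsis, dashes.
PUNCTUATION = string.punctuation + "\u00ab\u00bb\u201e\u201c\u201d\u2026\u2014\u2013"


def preprocess(sentence):
    """Lower-case, strip punctuation, and drop function words."""
    lowered = sentence.lower()
    cleaned = lowered.translate(str.maketrans("", "", PUNCTUATION))
    return " ".join(w for w in cleaned.split() if w not in FUNCTION_WORDS)


result = preprocess("Но он (Борисов) не писал «Записок» в Сибири.")
print(result)
```

The output of such a function is what a count-based encoder would then turn into a vector of word counts.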

The second encoding method is the BERT model, pre-trained on Russian-language texts (we used the model available at https://huggingface.co/papers/2408.12503 ). This approach makes it possible to achieve higher accuracy than encoding with classical methods, since the pre-trained model already contains information about Russian-language texts and how to encode them effectively. Thus, four prediction models were trained:

1. Count Vectorizer + SVC;

2. Count Vectorizer + LR;

3. BERT + SVC;

4. BERT + LR.

Models 1-2 are diagrammed in Figure 1, and models 3-4 in Figure 2.

Figure 1.

Figure 2.

Each model takes one sentence as input and classifies it as written by Gorbachevsky or by Borisov. Text preprocessing and model training were implemented in Python. The implementations of SVC, LR, and Count Vectorizer were taken from the sklearn library, and the pandas and numpy libraries were used for preprocessing.
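The two count-based models can be sketched with sklearn as below. This is a toy illustration under stated assumptions, not the authors' implementation: the training sentences are invented English stand-ins for letter sentences, the linear kernel and default hyperparameters are arbitrary choices (the study selected hyperparameters by grid search), and the BERT-based variants are omitted.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Toy training data (assumption): invented sentences standing in for
# sentences from the letters; labels name the two candidate authors.
sentences = [
    "regiment marched south", "regiment marched", "marched south",
    "letters arrived late", "letters arrived", "arrived late",
]
authors = ["gorbachevsky"] * 3 + ["borisov"] * 3

models = {
    "Count Vectorizer + SVC": make_pipeline(CountVectorizer(), SVC(kernel="linear")),
    "Count Vectorizer + LR": make_pipeline(CountVectorizer(), LogisticRegression()),
}

new_sentences = ["south marched regiment", "late letters arrived"]
predictions = {}
for name, model in models.items():
    model.fit(sentences, authors)
    # Each sentence is classified on its own, one label per sentence.
    predictions[name] = [str(label) for label in model.predict(new_sentences)]

print(predictions)
```

Wrapping the encoder and the classifier in one pipeline keeps the vocabulary learned from the training sentences tied to the classifier that uses it.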

Borisov's letters contain a total of 411 sentences, while Gorbachevsky's letters contain 2,620. Training requires an equal number of sentences from both authors, so 411 sentences had to be selected from Gorbachevsky's letters. The sample was drawn with the random number generator of the numpy library, and one hundred different samples were taken by varying the seed parameter from 0 to 99. Each of the four models was trained on each of the samples of Gorbachevsky's sentences. During training, 80% of the sentences were allocated to the training set and 20% to the test set. The best hyperparameters for each model were selected using the Grid Search method.
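The sampling scheme described here can be sketched as follows. To keep the example dependency-free it uses Python's standard `random` module instead of numpy, so the concrete samples differ from those in the study; the function name `balanced_split`, the use of sentence indices in place of real sentences, and the unstratified 80/20 cut are assumptions.

```python
import random

N_BORISOV = 411        # sentences in Borisov's letters (from the article)
N_GORBACHEVSKY = 2620  # sentences in Gorbachevsky's letters (from the article)


def balanced_split(seed, train_share=0.8):
    """Return (train, test) lists of (author, sentence_index) pairs for one seed."""
    rng = random.Random(seed)
    # Undersample the larger class down to the size of the smaller one.
    sampled = rng.sample(range(N_GORBACHEVSKY), N_BORISOV)
    data = [("gorbachevsky", i) for i in sampled]
    data += [("borisov", i) for i in range(N_BORISOV)]
    rng.shuffle(data)
    cut = int(len(data) * train_share)
    return data[:cut], data[cut:]


# One split per seed 0..99, i.e. one training set per model in a block.
splits = [balanced_split(seed) for seed in range(100)]
train, test = splits[0]
print(len(train), len(test))
```

Fixing the seed makes every one of the hundred samples reproducible, which is what allows the spread across samples to be reported.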

Results

We obtained four blocks of 100 models each. After training, each model was used to predict the author of each sentence of the "Notes" individually. The shares of sentences attributed to Gorbachevsky by the models of each block are shown in Figures 3-6. In these graphs, a crimson dotted line marks the average share of sentences attributed to Gorbachevsky, taken over all models within the same block, and solid crimson lines mark the maximum and minimum shares. For example, as Figure 4 shows, the Count Vectorizer + LR model trained on one of the hundred samples classified more than 80% of the sentences of the "Notes" as written by Gorbachevsky, while the minimum share for this block exceeds 57%. The Count Vectorizer + SVC and BERT + SVC models have maximum shares of just over 79%; their minimum shares exceed 51% and 48%, respectively. Table 1 shows examples of confusion matrices of four models on test samples, one model from each block. The accuracy of the models using BERT is higher than that of the others. For all models in all blocks, the accuracy on the training and test samples differed by 1-3%. Models that do not use BERT have higher accuracy on Gorbachevsky's sentences, while models with BERT have approximately the same accuracy for both authors. Table 2 shows the average shares of sentences attributed to each author for all models, as well as the average training accuracy within each block.
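The per-block statistics described here (average, minimum and maximum share) can be computed as in the following sketch. The function names and the three short prediction lists are hypothetical; in the study each block contains 100 models and each model labels every sentence of the "Notes".

```python
from statistics import mean


def gorbachevsky_share(predicted_authors):
    """Fraction of sentences that one model attributes to Gorbachevsky."""
    return predicted_authors.count("gorbachevsky") / len(predicted_authors)


def summarize_block(predictions_per_model):
    """Mean, minimum and maximum share across all models of one block."""
    shares = [gorbachevsky_share(p) for p in predictions_per_model]
    return {"mean": mean(shares), "min": min(shares), "max": max(shares)}


# Hypothetical predictions of three models for four sentences each.
block = [
    ["gorbachevsky", "gorbachevsky", "borisov", "gorbachevsky"],
    ["gorbachevsky", "borisov", "borisov", "gorbachevsky"],
    ["gorbachevsky", "gorbachevsky", "gorbachevsky", "gorbachevsky"],
]
summary = summarize_block(block)
print(summary)
```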

Figure 3.

Figure 4.

Figure 5.

Figure 6.

Table 1. Examples of model confusion matrices on test samples.

| Model | Average accuracy in training | Average share of sentences: Borisov | Average share of sentences: Gorbachevsky |
|---|---|---|---|
| Count Vectorizer + SVC | 0.80 | 30.8% | 69.2% |
| Count Vectorizer + LR | 0.81 | 31.8% | 68.2% |
| BERT + SVC | 0.89 | 34.0% | 66.0% |
| BERT + LR | 0.88 | 37.6% | 62.4% |

Table 2. Accuracy of the models and predicted authorship of sentences from the "Notes".

Most trained models classify about 70% of the sentences of the "Notes" as written by Gorbachevsky. The more accurate models (those using BERT) attribute slightly fewer sentences of the "Notes" to Gorbachevsky, but still more than 64% on average. The variation in classification results visible in Figures 3-6 is explained by the fact that one hundred different samples were used to train the one hundred models in each block. Nevertheless, this spread does not affect the interpretation of the result, since the graphs show a significant advantage in favor of Gorbachevsky in each of the four blocks.
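The averages quoted in this paragraph can be checked directly against the shares reported in Table 2. The short calculation below uses only those four published values; the variable names are illustrative.

```python
# Average share of sentences attributed to Gorbachevsky, in percent (Table 2).
shares = {
    "Count Vectorizer + SVC": 69.2,
    "Count Vectorizer + LR": 68.2,
    "BERT + SVC": 66.0,
    "BERT + LR": 62.4,
}

# Mean over the two BERT-based blocks and over the two count-based blocks.
bert_mean = (shares["BERT + SVC"] + shares["BERT + LR"]) / 2
cv_mean = (shares["Count Vectorizer + SVC"] + shares["Count Vectorizer + LR"]) / 2
print(round(bert_mean, 1), round(cv_mean, 1))
```

The two BERT-based blocks average 64.2%, which is the "more than 64%" figure, and the two count-based blocks average 68.7%, which matches the "about 70%" statement.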

Conclusions

The paper considers the problem of determining the authorship of the "Gorbachevsky Notes" under the assumption that the author could be either I.I. Gorbachevsky himself or P.I. Borisov. Machine learning methods, which in recent years have proven to be the most accurate in attribution tasks, showed an accuracy of more than 80% in our experiments (about 90% for BERT + SVC and BERT + LR). Almost all trained models classified about 70% of the sentences of the "Notes" as written by Gorbachevsky. Thus, if the choice is restricted to these two authors, the "Gorbachevsky Notes" should be attributed to I.I. Gorbachevsky himself rather than to P.I. Borisov.

References
1. Nechkina, M.V. (1955). The Decembrist movement. Moscow: Publishing House of the Academy of Sciences of the USSR.
2. Shatrova, G.P. (1962). The Decembrists and Siberia. Tomsk: Tomsk University Press.
3. Shatrova, G.P. (1973). The Decembrist I.I. Gorbachevsky. Krasnoyarsk.
4. Matkhanova, N.P. (2010). Siberian memoiristics of the 19th century. Novosibirsk: Publishing House of the Siberian Branch of the Russian Academy of Sciences.
5. Syroechkovsky, B.E., Sokolsky, L.A., & Porokh, I.V. (1963). The Decembrist Gorbachevsky and his "Notes". I.I. Gorbachevsky. Notes; Letters, 257-305. Moscow: Publishing House of the USSR Academy of Sciences.
6. Mironenko, M.P. (1976). The memoir heritage of the Decembrists in the journal "Russian Archive". Archeographic Yearbook for 1975, 112-114.
7. Zlobin, E.V. (1990). On the question of the authorship of the "Notes" of the Decembrist I.I. Gorbachevsky. History of the USSR, 2, 140-155.
8. The Great Russian Encyclopedia: [in 35 volumes]. (2007). Moscow: The Great Russian Encyclopedia.
9. Tumanik, E.N. (2020). The role of the memoir heritage of the Decembrists in the scientific concept of G.P. Shatrova. Humanities in Siberia, 27, 50-57.
10. Martynenko, G. Ya., & Grebennikov, A. O. (2018). Fundamentals of stylometry: a textbook and methodological manual. St. Petersburg: Publishing House of St. Petersburg University.
11. Morozov, N.A. (1915). Linguistic spectra: a means to distinguish plagiarism from the true works of a famous author. A stylometric study. Izv. Otd. russkogo yazyka i slovesnosti Imp. akad. nauk, 20, 93-134.
12. Markov, A.A. (1916). On one application of the statistical method. Izv. Imp. akad. nauk. Ser. 6, 4, 239-242.
13. Markov, A.A. (1913). An example of statistical research on the text of "Eugene Onegin", illustrating the connection of trials in a chain. Izv. Imp. akad. nauk. Ser. 6, 3, 153-162.
14. Khmelev, D.V. (2000). Recognition of the author of a text using A.A. Markov's chains. Vestn. MGU. Ser. 9. Philology, 2, 115-126.
15. Kukushkina, O. V., Polikarpov, A. A., & Khmelev, D. V. (2001). Determining the authorship of a text using letter and grammatical information. Probl. peredachi inform., 37, 96-109.
16. Stamatatos, P. D. et al. (2013). On the robustness of authorship attribution based on character n-gram features. Journal of Law and Policy, 21, 7.
17. Burrows, S., Tahaghoghi, S. M. M. (2007). Source code authorship attribution using n-grams. In: Proceedings of the twelfth Australasian document computing symposium (pp. 32-39). Melbourne, Australia: RMIT University.
18. Sapkota, U. et al. (2015). Not all character n-grams are created equal: A study in authorship attribution. In: Proceedings of the 2015 conference of the North American chapter of the association for computational linguistics: Human language technologies (pp. 93-102).
19. Peng, F. et al. (2003). Language independent authorship attribution with character level n-grams. In: 10th Conference of the European Chapter of the Association for Computational Linguistics.
20. Burrows, J. (2002). ‘Delta’: a measure of stylistic difference and a guide to likely authorship. In: Literary and linguistic computing (pp. 267-287). Oxford University Press.
21. Hoover, D. (2004). Testing Burrows’ Delta. Literary and Linguistic Computing, 19, 453-475.
22. Evert, S., Proisl, T., Jannidis, F., Reger, I., Pielström, S., Schöch, C., Vitt, T. (2017). Understanding and explaining Delta measures for authorship attribution. Digital Scholarship in the Humanities, 32, 4-16.
23. Jannidis, F. et al. (2015). Improving Burrows’ Delta. An empirical evaluation of text distance measures. Digital Humanities Conference.
24. Kovalev, B.V. (2024). The Birth of the Third author: a stylometric analysis of the stories of Honorio Bustos Domecq. Literature of the Two Americas, 16, 120-146.
25. Borodkin, L.I., Milov, L.V., & Morozova, L.E. (1977). On the question of the formal analysis of the author's style features in the works of Ancient Russia. Mathematical methods in historical, economic and historical and cultural studies, 298-326.
26. Borodkin, L., & Milov, L. (1984). Some Aspects of the Application of Quantitative Methods and Computers in the Analysis of Narrative Texts. Soviet Quantitative History. Sage Publications: Beverly Hills/London/New Delhi.
27. Milov, L. V., Borodkin, L. I., & Ivanova, T. V. et al. (1994). From Nestor to Fonvizin: New methods for determining authorship. Moscow.
28. Jockers, M. L., & Witten, D. M. (2010). A comparative study of machine learning methods for authorship attribution. Literary and Linguistic Computing, 25, 215-223.
29. Fix, E., & Hodges, J. L. (1989). Discriminatory analysis, nonparametric discrimination. International Statistical Review, 57, 233-238.
30. Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20, 273-297.
31. Boumber, D., Zhang, Y., & Mukherjee, A. (2018). Experiments with convolutional neural networks for multi-label authorship attribution. In: Proceedings of the Eleventh International Conference on Language Resources and Evaluation.
32. Rosenblatt, F. (1958). The perceptron: a probabilistic model for information storage and organization in the brain. Psychological review, 65, 386.
33. Fedotova, N. (2021). Virtual exhibition as a means of implementing cultural function of the library. Litera, 6, 55-63. doi:10.25136/2409-8698.2021.6.35726 Retrieved from http://en.e-notabene.ru/fil/article_35726.html
34. Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735-1780.
35. Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of NAACL-HLT.
36. Hosmer, D. W., & Lemeshow, S. (2013). Applied Logistic Regression. John Wiley & Sons.