Translate this page:
Please select your language to translate the article


You can just close the window to don't translate
Library
Your profile

Back to contents

Historical informatics
Reference:

Application of Software Methods for Automated Processing of Sources of Personal Origin

Prigodich Nikita Dmitrievich

PhD in History

Senior Lecturer, National Research ITMO University; Senior Researcher, Saint Petersburg State University

197101, Russia, Saint Petersburg, Saint Petersburg, Kronverksky Ave., 49, letter A

ndprigodich@gmail.com
Other publications by this author
 

 
Korobko Semen Sergeevich

Bachelor, ITMO University

197101, Russia, Saint Petersburg, Saint Petersburg, Kronverksky Ave., 49, letter A

semenkorobko2@gmail.com

DOI:

10.7256/2585-7797.2023.1.40376

EDN:

OJJZUU

Received:

01-04-2023


Published:

08-04-2023


Abstract: The subject of this research is software methods of automated preprocessing of historical sources and the development of effective solutions to problems when working with sources of personal origin. The article analyzes the current situation in the use of modern software methods. The authors demonstrate the main range of arguments for which such historical sources from a technical point of view should be considered separately. A methodological analysis of the features of the application of optical character recognition based on preprocessed data is carried out. Special attention is paid to the advantages and key parameters of the effectiveness of the final result of work when using automated text processing, including the further use of OCR methods. The scientific novelty of the research lies in the proposal and detailed description of a software solution to the current problem based on machine learning methods. The developed program has three phases of working with digital copies of sources of personal origin. It is based on the use of the OpenCV library and solving a number of problems using the Hough transform. Based on the general analysis of the study, we can highlight the main advantages of automated preprocessing of scanned documents: reducing time, improving accuracy, combating distortion and optimizing the process. The presented results of successful testing of the developed solution allow us to judge the possible areas of its effective application.


Keywords:

Sources of personal origin, machine learning, artificial intelligence, OpenCV library, Hough transform, preprocessing, method OCR, recognition, archiving, digitization

This article is automatically translated.

IntroductionIn modern realities, the volumes of scanned historical sources in state, regional and private archives, museums, libraries and many other organizations have significantly increased.

This systematic work is valuable both for researchers and for the institutions themselves. At the same time, the transition is gradually being made to the next stage, technically more complex, for mechanical or automatic text recognition to enable new forms of analysis. First of all, this applies to the inventories of archival funds, the most valuable sources. Researchers of the process of digitization, reception and storage of electronic documents in domestic archives described in detail the history and modern features of this process [1-4].

An example of continuous processing and translation into electronic format of handwritten and typewritten texts is the work of the center for the study of ego documents "Lived". The center's staff and volunteers are engaged in updating the database of sources of personal origin, mainly in Russian. This activity requires a number of measures to pre-process the scanned document. Automation of the preparatory processing process (cropping, pagination, visual noise cleaning, rotation) significantly increases the speed and efficiency of further work on text recognition both programmatically and manually [5].

A certain emphasis placed on the automated processing of sources of personal origin can be explained by several features of working with such documents. Firstly, the format of such records often consists of notebook pages, sheets of notebooks, personal diaries of various sizes that are not subject to direct unification in comparison with office documentation and archival sources. Secondly, most of the texts under consideration are handwritten, which imposes additional introductory when setting tasks for automation. Finally, sources of personal origin often have overlays of visual noise, the separation of which requires additional time when processed manually.

Èçîáðàæåíèå âûãëÿäèò êàê òåêñò  Àâòîìàòè÷åñêè ñîçäàííîå îïèñàíèå

Fig. 1. An example of a source of personal origin.

 

The present study is devoted to the analysis of the current state of the preparation of scanned copies of historical sources for the recognition process, as well as the search for a software solution for automating a number of tasks and bringing them to a single format using artificial intelligence and machine learning methods. In this regard, it is necessary to identify the most significant tasks to increase efficiency in the development of the program: identify the document on the scanned copy, classify depending on the position of the document, automate the rotation of the text, divide by pages and crop.

 

Methodology and features of useAutomated processing of scanned images before text recognition is a necessary step in the process of converting paper documents into digital format.

This process, which includes optical character recognition (OCR), allows you to convert paper documents into a text format for further work with them in electronic form [6]. Of course, the need for a qualitative solution to such a problem exists not only in historical science. For example, banks, medical institutions, government agencies and other organizations need to convert paper documents into electronic format to improve document management processes and improve overall work efficiency [7]. However, when working with historical sources, there are a number of characteristic features that require an adapted approach.

One of the main advantages of automated processing of scanned images is to reduce the number of errors in character recognition. This process allows you to remove unnecessary outlines and background, improving image quality and reducing the likelihood of errors in OCR that may occur when recognizing fuzzy characters [6, p. 47]. Automated processing of scanned images also increases the speed of converting paper documents into electronic format. This is possible thanks to the use of automatic recognition of the text area and the removal of noise throughout the image, which reduces the time for manual document processing. In addition, automated processing of scanned images allows you to process large volumes of documents. This is especially true for working with large arrays of historical sources.

Considering this specificity, it is necessary to highlight a number of characteristic features, which, as a result, are facilitated by the solution of the tasks set. Thus, automatic text recognition based on preprocessed samples helps researchers to analyze texts more accurately, quickly gain access to a wide range of documents based on specific words and phrases [8, p. 214]. Moreover, automatic text recognition provides a refined interpretation of the meaning of phrases and sentences in historical documents. It also contributes to the preservation of cultural heritage, as it allows you to store an electronic version of a historical document, which can be stored in a database or archive, which guarantees the protection of documents for future generations. Finally, automatic text recognition increases the effectiveness of historical research. First of all, by simplifying the collection and analysis of materials.

 

Software solutionBefore starting the development of a software solution to the designated problem, a two-dimensional typologization should be carried out.

We will assume that there are only two types of photographs of historical documents, depending on orientation: landscape orientation documents (spreads of books, notebooks, newspapers, magazines, horizontally arranged sheets) and book orientation documents (vertically arranged sheets, notebooks, individual pages of notebooks, etc.). Such a division is necessary from the point of view of determining the division into separate pages or its absence.  Based on the established tasks, it should be determined that the program being developed should consist of three phases: recognition of the document itself on the scanned copy, classification of the document depending on orientation and subsequent processing, including rotation, cropping and pagination.

Recognizing objects in a photograph, outside the context of scanned sources of personal origin, is a well–studied task in the field of Machine Learning [9]. Today there are many different libraries in the field of Computer Vision: fastai, PyTorchCV, OpenCV and some others. In this case, the OpenCV library was responsible for the most effective solution of the problem. Among its obvious advantages, it is necessary to designate high performance, support for a number of programming languages (including Python, Java and C++), which will be useful when including this program in loaded systems, as well as a large array of already implemented algorithms [10-11].  

In the course of obtaining the results of the first phase of the program, a set of coordinates of the minimum rectangle in area should be obtained, which would cover all the points of the document in the photo. In this context, there may be a problem that the resulting rectangle will not be rotated as the user requires, whether it is 90 or 180 degrees. This problem is designed to solve the second phase, in which the calculation of the desired rotation is possible with an introductory knowledge of the orientation of the document.

The first phase can be implemented in two ways:

1) Development and training of a model that would recognize a document in an image.

2) Using the Hough transform method used inside the OpenCV library. This method is designed to identify geometric objects on raster images [12]. In our case, such an object is a rectangle.

As part of the study, the second way was chosen, due to the relative simplicity and performance compared to the first option.

            During the preliminary testing, it turned out that the second phase of the software solution can also be implemented using the Hough transform method [12, p. 829]. It is necessary to elaborate in more detail on exactly how to do this. We will assume that the first phase has already been completed, that is, the second phase is performed within the rectangle containing the document. Then, if we apply the Hough transform again, and take the average number of slopes of the obtained lines, then we can calculate the desired orientation.

            At the same time, it should be pointed out separately that the presented method of implementing the second phase has a significant drawback: it is impossible to distinguish a document from the same one, but inverted by 180 degrees. However, upon closer examination, it becomes clear that such a problem is also relevant for a person who cannot understand whether the manuscript is oriented correctly, provided that he does not have the initial data about the language in which the text is written. As a result, it turns out that the only way to eliminate this disadvantage is to solve the problem of optical character recognition in handwritten texts, which today does not have a sufficiently accurate solution.

            Finally, the final, third phase of the software solution can be implemented using the auxiliary tools of the OpenCV library for working with images [11, p. 18]. To do this, you need to "cut" the rectangle obtained at the first stage, rotate it and "cut" into pages. The last action is carried out depending on the orientation that was obtained in the second phase.

Èçîáðàæåíèå âûãëÿäèò êàê òåêñò  Àâòîìàòè÷åñêè ñîçäàííîå îïèñàíèå

Fig. 2. An example of the program.

 

ConclusionsAs a result of solving a practical problem, a software solution was developed that allows simplifying and speeding up the process of digitizing scanned sources of personal origin.

 

Based on the general analysis of the study, we can highlight the main advantages of automated preprocessing of scanned documents:

1. Reduction of recognition time. Automatic image preprocessing with text allows you to reduce the number of errors in character recognition and significantly reduce the time required for this process.

2. Improved recognition accuracy. Automatic image preprocessing helps to get rid of noise and other artifacts, which improves the quality of character recognition.

3. The fight against distortion. The technology of automatic image preprocessing allows you to deal with various distortions, such as perspective distortions, distortions due to scanning, distortions due to poor print quality.

4. Optimization of the recognition process. Image preprocessing helps to optimize the recognition process, reducing the cost of computing resources and increasing the speed of recognition.

5. Improving the efficiency of the system. The combination of these factors leads to an overall increase in the effectiveness of the results.

An analysis of the current state of automatic preprocessing of digital copies of sources of personal origin shows that none of the existing developed solutions allows us to fully respond to the requests that face this specific area. Certain features that exist in this kind of work force these actions to be performed manually, which leads to a slowdown in the overall process. The development of a specific solution based on machine learning and artificial intelligence methods, with the targeted use of open libraries, contributes to achieving a better result. This provision has passed the stage of successful testing on a massive volume of scanned copies from the funds of the project for digitizing ego documents "Lived" at the European University in St. Petersburg and received positive feedback from specialists.

As a result, the developed software solution for automated processing of scanned images before text recognition is a necessary stage for converting paper documents into electronic format. This process not only improves image quality, reducing the likelihood of OCR errors and improving the speed of document conversion, but also simplifies the work of organizations and allows you to work with large volumes of paper documents. At the same time, the developed solution has some limitations that can be corrected with further optimization using more advanced methods of trainable optical character recognition.

References
1. Miroshnichenko, M. A., Shevchenko, YU. V., Ohrimenko, R. S. (2020). Preservation of the historical heritage of state archives by digitizing archival documents. Vestnik Akademii znanij, 37, 188-194. doi:10.24411/2304-6139-2020-10163.
2. Kutkin, A. V., Nazarov, A. N. (2022). Digitization of documents in the archives of the Russian Federation: analysis of the equipment and software used. Vestnik VNIIDAD, 6, 41-52. doi:10.55970/26191601_2022_6_41.
3. Reshet'ko, K. M., Halamej, K. N. (2021) Application of artificial intelligence in the banking sector. Potencial rossijskoj ekonomiki i innovacionnye puti ego realizacii, 2, 87-89.
4. Chursina, A. A. (2022) Russian practice of digital processing of historical sources: directions and results. Cifrovoe izmerenie novoj social'noj real'nosti: sbornik nauchnyh studencheskih statej (pp. 167-176). Moscow: Finansovyj universitet pri Pravitel'stve Rossijskoj Federacii.
5. Murakas, R. (2021). Digitization of historical materials of social sciences research as a source of modern research data. Kommunikaciya v social'no-gumanitarnom znanii, ekonomike, obrazovanii: Materialy V Mezhdunarodnoj nauchno-prakticheskoj konferencii (pp. 107-110). Minsk: Belorusskij gosudarstvennyj universitet.
6. Vaksina, I. R., Kanev, A. I., Latypova, K. N. (2022) Optical character recognition of handwritten texts and tabular data. The trend of development of science and education, 86-1, 45-49. doi:10.18411/trnio-06-2022-15.
7. Nesterov, A. S. (2020). Market analysis of modern optical character recognition (OCR) information systems. Studencheskij vestnik, 25-3, 82-85.
8. Shabanov, A. V. (2013). Image processing when creating digital copies of manuscripts with fading text. Trudy GPNTB SO RAN, 5, 213-218.
9. Maksimov, V. YU., Klyshinskij, E. S., Antonov, N. V. (2016). The problem of understanding in artificial intelligence systems. Novye informacionnye tekhnologii v avtomatizirovannyh sistemah, 19, 43-60.
10. Gevorkyan, M. N., Demidova, A. V., Demidova, T. S., Sobolev, A. A. (2019). Review and comparative analysis of machine learning libraries for machine learning. Discrete and Continuous Models and Applied Computational Science, 27-4, 305-315. doi:10.22363/2658-4670-2019-27-4-305-315.
11. Burmistrov, A. V., Il'ichev, V. YU. (2021). Recognition of objects in images using basic Python tools and the opencv library. Nauchnoe obozrenie. Tekhnicheskie nauki, 5, 15-19.
12. Favorskaya, M. N. (2016). Hough transform for recognition tasks. DSPA: Voprosy primeneniya cifrovoj obrabotki signalov, 6-4, 826-830.

Peer Review

Peer reviewers' evaluations remain confidential and are not disclosed to the public. Only external reviews, authorized for publication by the article's author(s), are made public. Typically, these final reviews are conducted after the manuscript's revision. Adhering to our double-blind review policy, the reviewer's identity is kept confidential.
The list of publisher reviewers can be found here.

Review of the article Application of software methods for automated processing of sources of personal origin Journal: Historical Informatics The subject of research in the presented article is the application of software methods for automated processing of sources stored in various funds. The article is devoted to the problem of processing and converting handwritten and typewritten texts of personal origin into electronic format. The author describes the specifics of working with such documents based on specific examples. In fact, the study is devoted to the analysis of the current state of the preparation of scanned copies of historical sources for the recognition process, which is a serious problem, since it is the sources of personal origin that differ in a significant number of visual "noises" that complicate the machine processing process. The author considers the search for a software solution for automating the scanning and processing process to be an equally important task, and also identifies the main categories of tasks necessary to solve this problem. The research methodology is reasonable and optimal for analyzing the problem. A set of methods, both general scientific and highly professional, was used, which allowed us to offer a software solution to simplify the process of digitizing documents of personal origin. The relevance of this study is due to the fact that in modern realities, the volume of scanned historical sources in state, regional and private archives, museums, libraries and many other organizations has significantly increased. This systematic work is valuable both for researchers and for the institutions themselves. The emergence of new, more advanced software tools requires constant changes in working methods, and therefore the presented article is also of great practical importance. The scientific novelty of the work is unconditional. The author develops and presents the results of testing a new software solution that simplifies and speeds up the process of digitizing scanned sources of personal origin. The author convincingly demonstrates that the development of a specific solution based on machine learning and artificial intelligence methods, with the targeted use of open libraries, contributes to achieving a better result. This is confirmed by the result of processing an array of ego documents "Lived" at the European University in St. Petersburg. The automated preprocessing of documents, carried out in the way proposed by the author, contributes to improving the quality of the created digital copies. The style of work meets the high requirements of a scientific approach to the presentation of research results. It is characterized by consistency, strict consistency of presentation, semantic accuracy, informative saturation, and objectivity. The structure of the presentation does not cause complaints and is characterized by the interconnectedness of the parts, the logical transitions from one section to another. The proposed illustrations are informative, important for understanding the conclusions of the article, and demonstrate the sequence of proposed actions. In terms of content, this article is a logically completed study of an urgent problem, carried out through the use of a set of scientific methods. The article contains a reference to previous scientific papers on this topic, provides a qualified assessment of the previously obtained results – both positive and negative. The author correctly identifies the deficiencies of previous research and offers his own methodology for solving the problem of automated processing of sources of personal origin. The software solution proposed by the author is justified and confirmed by practice. The author convincingly demonstrates the fact that the development of a specific solution based on machine learning and artificial intelligence methods, with the targeted use of open libraries, contributes to achieving a better result, while emphasizing at the same time that the proposed solution has limitations that can be eliminated in further work using the methods of trainable optical character recognition. The bibliography includes 12 sources devoted to a specific problem. The sources are quoted in the text of the article. The article may be useful to specialists, as the conclusions are of undoubted practical importance. Also, the article will undoubtedly arouse the interest of a wide range of readers.