Sociodynamics
Reference:

Trace as a Sign: Epistemology and Methodology of Digital Data in Sociology

Matveev Mikhail Sergeevich

ORCID: 0000-0002-5378-6559

Postgraduate student, Department of Sociology and Political Science, St. Petersburg State Technical University 'LETI'

195276, Russia, Saint Petersburg, Kultury, 25, sq. 184

mikhail.matveev97@gmail.com

DOI:

10.25136/2409-7144.2025.4.74248

EDN:

MUOPRO

Received:

24-04-2025


Published:

01-05-2025


Abstract: The article is dedicated to the analysis of epistemological and methodological problems related to the integration of digital traces into empirical sociology. The relevance of the topic is determined by the growth in the volume of digital data and the necessity for their meaningful integration into the social sciences. As experience from other researchers shows, the analysis of digital behavioral data currently tends to attract criticism. This directly leads to the goal of this article—to analyze and identify the epistemological and methodological complexities of integrating digital traces into the sociological tradition, as well as to demonstrate that working with such data types is situated within a much broader historical and theoretical framework. Furthermore, the task is to justify the necessity of an interpretative approach to any volumes of digital data and emphasize the need for contextualization and critical reflection at all stages of research. The methodological basis of the article consists of general scientific methods—theoretical and methodological analysis, comparison, and generalization of scientific sources on the research problem. As a result of the research, the necessity for a comprehensive approach has been substantiated, which suggests a diverse basis for sources of digital behavioral data, a combination of quantitative analysis of digital data with interpretation, consideration of platform specificity, possible algorithmic selection in the formation of the sample of extracted data, and the need to check the authenticity of the data for automated activity. It was concluded that a critical attitude towards digital data is necessary at all stages of research and that these data should be understood as signs requiring scientific reflection rather than as ready-made empirical facts. Nevertheless, it has been identified that such a perspective can also be traced in earlier key works on non-reactive research strategies in sociology. 
Based on the conducted work, recommendations and examples of studies using an interpretative framework were proposed, both at the level of qualitative strategies for digital research and on scales understood in the broader discourse as Big Data.


Keywords:

digital footprints, digital sociology, data interpretation, methodology of sociological research, epistemology of digital data, Big Data, social media, non-reactive strategy, empirical sociology, digital behavioral data

This article is automatically translated.

Introduction

There is no need for loud declarations that digital data has become a fundamental feature of modern society. It is enough to cite one estimate: individuals interacting with computer technologies generate about 463 exabytes of data every day [1]. The intention to analyze this data has already arisen in various institutions of modern society. Digital behavioral data is already used by marketers in the context of "knowing capitalism", as well as in government. There has even been a discourse about an impending crisis in sociology in the spirit of an epistemological "end of theory". In 2008, C. Anderson published the article "The End of Theory", in which he argued that the focus should henceforth be on data, which is now so easily accessible and so easily analyzed by computational methods. On this view, theories and hypotheses are no longer needed, and correlations in large data samples come to the fore [2].

Techno-optimistic publications on the subject describe all the potential advantages and proclaim that finally "the sociologist's dream has come true - to get to the traces that remain from people's actions, regardless of the intentions of researchers". Yet, after a long period of time, such narratives have not changed the theoretical and methodological skepticism about the use of digital traces in empirical sociology. As K. Guba noted, researchers tend to treat digital traces as an object of criticism rather than contribute to their development [3].

The purpose of this article is to analyze and identify the epistemological and methodological difficulties of integrating digital traces into the sociological tradition, to show that working with this type of data belongs to a much broader historical and theoretical framework, and to justify the need for an interpretative approach.

Discussion

First of all, it is worth giving a working definition of digital traces (also rendered as "digital footprints"): data characterizing an individual's network activity [4]. As an empirical source, they arise independently of the researcher's will, unlike, for example, survey results: clicks, posts, comments, likes, visible interactions on social networks, and so on. For this reason, a number of works classify digital behavioral data, by the source of its origin, as non-reactive data [5],[6]. Such sociological information appears without the individual's deliberate participation in a study, so the individual is unaware of being studied, which reduces the risk of the Hawthorne effect (according to A. Giddens). Even so, digital traces are better classified as conditionally reactive or low-reactive. Users of social networks may not know that the traces they leave will be used for a specific scientific purpose, but in their daily practices they manage their self-presentation on the Web. Their behavior can therefore also be "reactive", only in relation to an audience rather than to the researcher. Nevertheless, the "naturalness" of such a snapshot, and with it its social validity, is preserved.

This leads to the first difficulty: the non-reactive tradition as such has not yet become part of the canonical research arsenal in sociology. The reasons are the lack of methodological "controllability" of the data obtained, the scarcity of methodological work on their use, and a certain methodological conservatism: "The prevailing scientific style of sociology ... presupposes the use of survey tools and statistics to test pre-formulated hypotheses" [3].

The transition to the digital age has only intensified this problem. More and more data is accumulated, made visible and analyzed, and marketers and business analysts have already integrated digital traces into their non-reactive research practices. According to M. Savage and R. Burrows, they are in a more advantageous position than academic sociology for two reasons. First, they are the custodians of the data: in the case of a large technology company, for example, the entirety of the data is aggregated on its servers. Second, they are less "responsible researchers", since commercial structures evaluate data from a practical point of view rather than by strictly academic standards of verification [7].

Sociologists now have to build up interdisciplinary training, including learning programming languages in order to work with APIs, scrape the web, and run topic models. In other words, they literally have to rediscover access to sociological information.

Another important issue is the validity and representativeness of digital traces, since they rarely contain information about gender, age, or education. In classical methodology, such data is the basis of any survey. To this should be added the presence of bots and fake accounts, which end up in the sample when large volumes are analyzed automatically, and the difficulty of combining the results of digital analysis with other data (for example, survey data) for verification and cross-validation [6]. Finally, there is a problem related to the essential difference between virtual and physical reality: can we say, for example, that two people being listed as "friends" on a social media platform is a sign of real social interaction? M. B. Bogdanov and I. B. Smirnov described this as a problem of construct validity. That is, throughout the research the researcher must keep asking whether the data really reflects what he or she wants to study [5].

Let us set aside the problems associated with the lack of competence of academic researchers and the small number of research centers dealing with digital methods, since they will resolve themselves once the subject area overcomes its theoretical and methodological problems and is able to make a substantial contribution to sociological knowledge. To answer the question posed above, it then seems necessary to recognize that digital traces are not just a "ready-made" empirical artifact that combines scale with a preserved unit of social information, but always an object of interpretation, contextualization and recognition.

For this, it is worth adopting a broader historical lens. One of the first to use the concept of "trace" in sociology was S. Rokkan. He understood it as data created by the social process itself: from simple material evidence, through various artifacts, to a variety of symbolic representations and ideas recorded in drawings, stories, or messages [8].

One of the key empirical researchers of that era, the sociologist P. Lazarsfeld, did not use the concept of "traces" directly. Nevertheless, he followed the logic of relating "indicators", "symptoms", "hints" and "signs" to a more fundamental state of affairs, the social reality that gave rise to them: "This is the very same process whereby men throughout history have asked their lovers, 'Do you really love me?'" [9].

In their seminal work on non-reactive measures, the sociologist E. Webb and colleagues described three types of data resulting from such a strategy: found data, extracted data, and captured data [10]. By "found data" the authors mean material inadvertently left behind by subjects and groups in the course of their lives. Found data (trampled grass, discarded objects, etc.) are defined precisely as "traces". "Extracted data" are characterized by Webb and colleagues as materials intentionally created by individuals and groups to achieve their goals. They can be public (laws, ordinances, newspaper articles, billboards, songs) or private (family photos, letters to friends, personal notes). The authors distinguish between "running records", which are archival materials covering long periods (data collected for administrative purposes, sales data, media materials that appear regularly), and "episodic records", which are discontinuous (a sentence, several letters, several novels). Extracted data, as the authors conceive them, show how events and meanings are socially constructed.

"Captured data" is defined as behavior and non-verbal cues such as movements, postures, gestures, and even conversations "on the spot" recorded using surveillance techniques. These are not permanent, but ephemeral data that arise during social interactions and disappear instantly, so they need to be recorded by researchers. Examples are: analyzing nonverbal behavior to understand the social dynamics of a group, listening to conversations in the market between sellers and buyers.

In another subject area, among researchers who follow a similar logic, one can mention the historian C. Ginzburg, who developed a paradigm of detecting chains of evidential traces, or any clues, for scientific reflection and interpretation [11]. It is worth noting that in historical science, theorizing on this subject has a longer history: back in the 19th century, researchers proposed a distinction between unintentional traces and intentionally created evidence that is transmitted consciously, literally "tradition" [12]. Of course, because of the nature of the research object, the trace becomes a methodological "bridge" to something that cannot be observed directly but has to be reconstructed.

At the epistemological level, all the above examples share something that proponents of purely computational methods, automation and datafication often overlook: a trace is not self-explanatory. Its meaning, the part useful to the researcher, is created in context and grasped through interpretation. It is the researcher's active work that makes it possible to bring together the artifacts of social behavior and knowledge about that behavior.

That is, digital traces should be understood not as "meaning", but as "data potentially containing it": a "like" does not always equal "support", and clicking on a link does not always mean interest. From a philosophical point of view, a trace is a sign that is directly related to an event or object, but at the same time it is a substitution: it speaks of something absent, which the researcher does not see directly. G. Rava, drawing on the concepts of J. Derrida and U. Eco, studied the phenomenon of Internet trolling and postulated a difference between intentional and unintentional signs. The former are signs in the strict sense; the latter cannot be decoded at the moment, but potentially contain meaning that may reveal itself at any time. According to the researcher, the digital traces of trolls seem meaningless, a collection of provocative comments or teasing, but the analysis showed that their goal is to "incite a sense of insecurity and suspicion within the virtual community" [13, p. 269]. In this sense, the study of digital silence is extremely interesting: what does silence mean in public digital spaces? How do the disappearance of a post, a message "read but not replied to", or a change of avatar without comment become socially significant actions that affect communication? Automated quantitative content analysis will miss the absence of a post and thereby lose an important unit of meaning. From an epistemological point of view, it is important to note that the classical empirical paradigm cannot be abandoned here: the theoretical framework constructs models of explanation, and without them digital data does not become scientific knowledge, because it needs context and interpretation. Finding a correlation is not the same as explaining causation, and the data itself may well be distorted.
Anderson's logic mentioned above, "data first, then conceptual models", is not a liberation from theory but a rejection of understanding.

Such an interpretive framework seems to be especially effective "in studying closed or poorly studied communities" in the virtual space [4]. Ethnographic approaches that combine qualitative methods with quantitative analysis of digital traces already exist: in a study of a professional online community, network analysis, topic modeling, and ethnographic observation made it possible to read discussions in less purely pragmatic terms and to capture the emotional nuances of communication [14].

This approach fits fully into the dual strategy proposed by B. Latour and his colleagues: to grasp the seamless fabric of social life, one must use qualitative and quantitative methods together, observing and understanding the logic of the local while mapping it in mathematical categories and identifying patterns with computational methods. That is, not "reading by numbers", but using the numbers as a grid to focus the reading. This makes it possible to observe not ready-made structures, but the formation of meaning and interaction in real time [15].

But does this mean that a contextual methodology becomes impossible once the scale of analysis and conclusions grows? No, because the reconstruction of meaning remains necessary; otherwise digital traces lose their heuristic potential. T. Shelton and colleagues analyzed social media data on people's movements in a city. A sample of 500,000 posts showed that purely digital data represents social groups unevenly: richer areas create more traces. This raises another important research question: what does the "digital silence" of residents of certain areas mean? In this sense, digital traces, or their absence, already carry potential meaning [16].
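The uneven representation described here can be made visible before any interpretation begins. The following is only a minimal sketch, not the procedure used in the cited study: the district names, post counts, population figures and the 0.5 cut-off are all invented for illustration. It normalizes the number of posts by the number of residents and lists the districts that are unusually "quiet", which a researcher would then have to interpret in context.

```python
from collections import Counter
from statistics import median


def posts_per_thousand(post_districts, population):
    """Posts per 1,000 residents for every district in `population`."""
    counts = Counter(post_districts)
    return {d: 1000 * counts[d] / residents for d, residents in population.items()}


def quiet_districts(rates, ratio=0.5):
    """Districts whose rate is below `ratio` times the median rate.

    The cut-off is an arbitrary assumption; it marks candidates for closer
    reading, it does not explain why they are quiet.
    """
    threshold = ratio * median(rates.values())
    return sorted(d for d, r in rates.items() if r < threshold)


# Hypothetical data: one district label per geotagged post.
posts = ["centre"] * 60 + ["north"] * 30 + ["south"] * 2
population = {"centre": 20_000, "north": 30_000, "south": 25_000}

rates = posts_per_thousand(posts, population)
print(rates)
print(quiet_districts(rates))
```

A low rate is itself a trace to be read: it may reflect a lack of access, a different platform, or a deliberate withdrawal, and the numbers alone cannot tell these apart.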

The next aspect to consider in automated data collection and analysis is a critical approach to the source of digital traces: the platform, its algorithms and its technical protocols of interaction. As W. Chun convincingly showed in a study of user practices, a "like" and a "repost" initially have different meanings on different social media: from a random action, agreement or evaluation to a gesture of presence. These then become ritualized actions that form and maintain identity within one platform: for example, a user of a given platform is always expected to evaluate the profile of a virtual interlocutor before starting to communicate [17]. It therefore seems necessary to differentiate between types of traces, at least between intentional and unintentional ones, and to build further interpretation on this distinction. In this sense, a more general classification proposed by a number of researchers is of interest: traces are distinguished by whether they arise naturally or are designed for research purposes (A/B testing and controlled experiments on online platforms), or by whether they are unintended or intended. Under such a classification, logs, clicks, search queries, and cookies act as remnants of other activities, while images, texts, and videos uploaded by the user act as communicative acts [18],[19].

In this context, it is important to take algorithmic issues into account. Research on social media shows that technical structures, including so-called news feeds, create information bubbles that exclude entire categories of content from users' view. An absence of communicative traces may therefore be unrelated to the activity of the sample under study, and the behavioral data being analyzed has already been pre-selected and transformed by the digital environment. In other words, the visible data must be compared with the mechanisms that produced it [20]. In the same vein, the collected data needs additional processing to address automated accounts (bots of all kinds), which significantly distort the structure of communication between real users.
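One way to approach the screening of automated accounts is a simple rule-based pass over the collected posts. The sketch below is an assumption-laden illustration, not a validated bot detector and not a method taken from the cited sources: the function name, the record layout (account, day, text) and both thresholds are hypothetical. It flags accounts that post implausibly often or repeat the same text, so that they can be checked by hand rather than silently removed.

```python
from collections import defaultdict


def flag_suspected_automation(posts, max_posts_per_day=50, max_duplicate_share=0.8):
    """Return {account: [reasons]} for accounts that look automated.

    `posts` is an iterable of (account_id, day, text) tuples.
    Both thresholds are illustrative assumptions and need calibration
    for each platform and research question.
    """
    by_account = defaultdict(list)
    for account, day, text in posts:
        by_account[account].append((day, text))

    flagged = {}
    for account, items in by_account.items():
        active_days = {day for day, _ in items}
        rate = len(items) / len(active_days)
        texts = [text.strip().lower() for _, text in items]
        duplicate_share = 1 - len(set(texts)) / len(texts)

        reasons = []
        if rate > max_posts_per_day:
            reasons.append("posting rate")
        if len(texts) > 1 and duplicate_share > max_duplicate_share:
            reasons.append("duplicate texts")
        if reasons:
            flagged[account] = reasons
    return flagged


# Hypothetical sample: account "a" behaves like a person, "b" repeats one message.
sample = [("a", 1, "hello"), ("a", 1, "nice photo"), ("a", 2, "see you")]
sample += [("b", 1, "Buy now!")] * 10

print(flag_suspected_automation(sample, max_posts_per_day=5))
```

Flagged accounts are candidates for manual inspection: a very active human or a coordinated campaign may look the same to such rules, which is exactly why the result remains a sign to interpret rather than a fact.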

Finally, it is worth using mixed methods, including at the level of the real-virtual dichotomy. One example is the study by Z. Tufekci, in which quantitative digital traces became the basis for hypotheses, and field observations tested how far online activity correlated with participants' actual involvement in social movements [21]. An alternative is to pair automatic analysis with manual verification, where, for example, topic modeling (extracting topics from large text corpora) is only one stage before manual interpretation that clarifies the meanings within the topics of interest to the researcher. Another strategy is automatic network analysis (for example, with the NetworkX library for Python), which helps to see patterns of interaction, gauge the degree of engagement, and identify opinion leaders. The specific communication practices and contexts of those leaders, however, still need to be verified through observation: manual sampling and analysis of micro-cases.
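The last strategy, automatic network analysis followed by manual reading, can be sketched in a few lines. This is a simplified illustration under stated assumptions: it uses only the Python standard library instead of NetworkX to stay self-contained, treats the number of replies received (in-degree) as a rough proxy for being an opinion leader, and all account names, edges and posts are invented.

```python
import random
from collections import Counter


def in_degree(edges):
    """Count replies received; each edge is (source, target): source replied to target."""
    return Counter(target for _, target in edges)


def top_accounts(edges, k=2):
    """The k accounts that received the most replies (a crude 'opinion leader' proxy)."""
    return [account for account, _ in in_degree(edges).most_common(k)]


def sample_for_manual_reading(posts, accounts, n=2, seed=0):
    """Draw a reproducible random sample of posts by the given accounts."""
    pool = [post for post in posts if post["author"] in accounts]
    rng = random.Random(seed)
    return rng.sample(pool, min(n, len(pool)))


# Hypothetical reply network and posts.
edges = [("a", "c"), ("b", "c"), ("d", "c"), ("a", "b"), ("c", "b"), ("d", "a")]
posts = [
    {"author": "c", "text": "first post by c"},
    {"author": "c", "text": "second post by c"},
    {"author": "b", "text": "post by b"},
    {"author": "a", "text": "post by a"},
]

leaders = top_accounts(edges, k=2)
to_read = sample_for_manual_reading(posts, leaders, n=2)
print(leaders)
print(to_read)
```

The quantitative step only narrows down where to look; the sampled posts still have to be read and interpreted in their communicative context.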

Conclusion

Thus, digital traces and their analysis in modern sociology are not the complete opposite of classical reactive data, but rather their continuation in a different environment: technologically mediated, algorithmically controlled and fragmented. The main thing here is to abandon the idea of such data as "ready-made knowledge" and to move to a research position in which analysis is always an act of interpretation, contextualization and recognition. A trace is not a fact, but a potential sign that becomes significant only in a certain situation, through a certain lens.

It is possible that some of the institutional inertia in modern sociology stems from a lack of awareness of emerging practices for working with this type of data. Yet the historical and methodological review has shown that working with traces, from a physical remnant to a digital artifact, has always been part of humanities and sociological knowledge. The main thesis can be traced everywhere: a researcher is not just a data collector, but a reader of traces.

Downloading tens of thousands of comments from social media can be useful, but doing so without understanding what exactly makes these comments traces is not sociology. It is, for example, the practice of training neural language models to imitate human communication, and belongs rather to the field of computer science. Digital sociology, in this sense, appears to be not just a new direction, but also a point of return to the key questions of the entire discipline: What are we observing? How do we understand it? What makes an observation meaningful?

References
1. World Economic Forum. (2019). How much data is generated each day? Retrieved April 11, 2025, from https://www.weforum.org/stories/2019/04/how-much-data-is-generated-each-day-cf4bddf29f/
2. Anderson, C. (2008, June 23). The end of theory: The data deluge makes the scientific method obsolete. Wired.
3. Guba, K. (2018). Big data in sociology: New data, new sociology? Sociological Review, 1, 213-236. https://doi.org/10.17323/1728-192X-2018-1-213-236
4. Nikolaenko, G. A., & Fedorova, A. A. (2017). Non-reactive strategy: The applicability of unobtrusive methods of sociological information collection in the conditions of Web 2.0 using digital ethnography and big data as an example. Sociology of Power, 4, 39-53. https://doi.org/10.22394/2074-0492-2017-4-36-54
5. Bogdanov, M. B., & Smirnov, I. B. (2021). Opportunities and limitations of digital traces and machine learning methods in sociology. Public Opinion Monitoring: Economic and Social Changes, 1, 27-48. https://doi.org/10.14515/monitoring.2021.1.1760
6. Saponova, A. V., & Kulikov, S. P. (2021). Integration of survey data and digital traces: Overview of main methodological approaches. Sociology: 4M, 53, 117-147. https://doi.org/10.19181/4m.2021.53.4
7. Savage, M., & Burrows, R. (2007). The coming crisis of empirical sociology. Sociology, 41(5), 885-899. https://doi.org/10.1177/0038038507080443
8. Rokkan, S. (1966). Comparing nations: The use of quantitative data in cross-national research. In R. L. Merritt & S. Rokkan (Eds.), Comparative cross-national research: The context of current efforts (pp. 3-25). Yale University Press.
9. Lazarsfeld, P. F. (1953). A conceptual introduction to latent structure analysis. In P. F. Lazarsfeld (Ed.), Mathematical thinking in the social sciences (pp. 349-387). The Free Press.
10. Webb, E. J., Campbell, D. T., Schwartz, R. D., & Sechrest, L. (1966). Unobtrusive measures: Nonreactive research in the social sciences. Rand McNally.
11. Ginzburg, C. (1979). Clues: Roots of a scientific paradigm. Theory and Society, 7(3), 273-288. https://doi.org/10.1007/bf00207323
12. Bernheim, E. (1908). Lehrbuch der historischen Methode und der Geschichtsphilosophie: mit Nachweis der wichtigsten Quellen und Hilfsmittel zum Studium der Geschichte (6th ed.). Duncker & Humblot.
13. Rava, G. (2023). Traces and their (in)significance. In C. Ciborra & A. Caliandro (Eds.), What people leave behind: Digital footprints as socio-material resource in the age of big data (pp. 329-343). Palgrave Macmillan.
14. Barkhatova, L. A. (2020). Structural features of communication among Russian sociologists: A case study of an online community. Public Opinion Monitoring: Economic and Social Changes, 5, 204-221. https://doi.org/10.14515/monitoring.2020.5.1656
15. Venturini, T., & Latour, B. (2010). The social fabric: Digital traces and quali-quantitative methods. In Proceedings of Future en Seine. Cap Digital.
16. Shelton, T., Poorthuis, A., & Zook, M. (2015). Social media and the city: Rethinking urban socio-spatial inequality using user-generated geographic information. Landscape and Urban Planning, 142, 198-211.
17. Chun, W. H. K. (2016). Updating to remain the same: Habitual new media. MIT Press.
18. Golder, S. A., & Macy, M. W. (2014). Digital footprints: Opportunities and challenges for online social research. Annual Review of Sociology, 40(1), 129-152. https://doi.org/10.1146/annurev-soc-071913-043145
19. Arosio, L. (2022). What people leave behind online: Digital traces and web-mediated documents for social research. In F. Comunello, F. Martire, & L. Sabetta (Eds.), What people leave behind: Digital traces in context (pp. 311-324). Springer. https://doi.org/10.1007/978-3-031-11756-5_20
20. Gillespie, T. (2014). The relevance of algorithms. In T. Gillespie, P. J. Boczkowski, & K. A. Foot (Eds.), Media technologies: Essays on communication, materiality, and society (pp. 167-194). MIT Press.
21. Tufekci, Z. (2014). Big questions for social media big data: Representativeness, validity and other methodological pitfalls. In Proceedings of the Eighth International AAAI Conference on Weblogs and Social Media (ICWSM-14) (pp. 505-514). AAAI Press.

Peer Review

Peer reviewers' evaluations remain confidential and are not disclosed to the public. Only external reviews, authorized for publication by the article's author(s), are made public. Typically, these final reviews are conducted after the manuscript's revision. Adhering to our double-blind review policy, the reviewer's identity is kept confidential.

The peer-reviewed article "Trace as a Sign: Epistemology and Methodology of Digital Data in Sociology" deals with the epistemological and methodological difficulties associated with the integration of digital footprints into sociological research. The author considers digital footprints as an object of interpretation, emphasizing the need to create a theoretical framework for their comprehension within modern sociology. Neither the methodology nor the research methods are stated in the paper. It can be assumed that the author uses the analysis of scientific literature as a research method, including classical and modern works on sociology, philosophy of knowledge, methodology and digital research. It is these works that form the theoretical and methodological foundation for the study of non-reactive data and the interpretation of traces, reflect discussions about the role of big data and digital methods in sociology, and highlight modern challenges, in particular data ethics, algorithmic bias, and the interpretation of "silence" in the digital environment. The topic of the article is extremely relevant in the context of the digitalization of society and the growing volume of data generated by users. Modern social life is increasingly moving online: social networks, the platform economy, and digital public services. This leads to an explosive growth in the amount of data that can be used to study human behavior. However, their analysis requires new methodological approaches, which the author discusses. The scientific novelty of the work lies in the comprehensive analysis of digital footprints through the prism of a historical and methodological perspective, as well as in the proposal of an interpretive approach that combines qualitative and quantitative methods. The author critically rethinks C. Anderson's thesis of "the end of theory", emphasizing the importance of theoretical reflection even in the age of big data.
The article defines the concept of digital footprints as a set of complex artifacts of digital activity that need to be interpreted in relation to a specific context and depend on a variety of technological and social factors. This definition highlights the dual nature of digital footprints — objective, due to the nature of the data itself, and subjective, due to the need for further understanding and decoding. The paper consistently examines the problems of trace classification, validity, representativeness, and algorithmic distortions. In general, it seems possible to state that this article has undoubted scientific significance and significantly enriches the methodological apparatus of sociological analysis of digital empiricism. The article has a clear structure: introduction, discussion and conclusion. The strong point of the study is the examples of studies demonstrating the effectiveness of mixed methods. The article is written in an academic style and meets the requirements of a scientific publication. The list of works includes 21 references to the works of Russian and foreign researchers. The bibliography combines works on sociology, history, methodology of science, and digital research. The sources are structured and given with complete bibliographic descriptions. In general, it is worth recognizing that the appeal to the main opponents is present in due measure. The article will be of interest to sociologists interested in research methodology and data analysis. Thus, the article "Trace as a sign: epistemology and methodology of digital data in sociology" has scientific and theoretical significance. The work can be published.