
COVID-19 Preprints: How the Topics Change (May Update)

From disease symptoms to asymptomatic transmission modeling based on mobile data

When society faces a global emergency, researchers are all the more motivated to publish their findings quickly and in open access. One option is to publish a preprint: a paper that has not yet been peer-reviewed but becomes available online quickly. In this post we show how the preprints on the novel coronavirus SARS-CoV-2 and COVID-19 (the disease it causes) split into topics, and how these topics changed between January and May 2020. See page 3 for our methods and data description.

In this overview, we refer to systematic reviews and meta-analyses where possible. Still, we should stress that preprints report research that has not been certified by peer review and thus should not be used to guide policy or practice.

In the April overview, we identified three recurring areas of research on the novel coronavirus:

  • virology and molecular biology, discussing the virus itself;
  • clinical medicine, discussing the virus-related diseases and clinical characteristics;
  • epidemiology and public health, discussing the virus transmission and containment measures.

A month later, more preprint data is available, and we can examine it in more detail. We apply topic modeling to the preprints’ titles and abstracts. The algorithm statistically estimates how closely the words in these texts are related and automatically groups them into clusters. We then interpret these clusters as substantive topics.

We use the structural topic modeling algorithm. For each document, it shows which topics are characteristic of it, and for each topic, which words are most relevant to it. Thus, we can evaluate how prominent a topic is in our texts. The algorithm also lets us see how the distribution of topics depends on characteristics of the text: in our case, the platform where the preprint is published and the date of publication.

Topics of Preprints

We fitted a model to the titles and abstracts, and it identified 18 topics in the pool of preprints. In Figure 1, these topics are sorted by how prominent they are in the data. Each topic is accompanied by its five most relevant terms.

Figure 1. Prevalence of the topics distinguished from the descriptions of preprints on the novel coronavirus, published from January 15 to May 17, 2020
The topics are described with the five most relevant terms, reduced to stems.
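To make the ranking in Figure 1 more concrete, here is a minimal sketch of how topic prevalence and top terms can be derived from a fitted topic model. It assumes that prevalence is the average share of a topic across documents, which is a common convention but not necessarily the exact computation behind the figure. The matrices `theta` and `beta`, the three toy topics, and all numbers are hypothetical.

```python
# Hypothetical document-topic proportions: one row per preprint,
# one column per topic; each row sums to 1.
theta = [
    [0.70, 0.20, 0.10],
    [0.10, 0.60, 0.30],
    [0.50, 0.25, 0.25],
    [0.20, 0.10, 0.70],
]

# Hypothetical topic-word probabilities over stems (illustrative only).
beta = {
    0: {"health": 0.30, "econom": 0.25, "countri": 0.20, "crime": 0.15, "polici": 0.10},
    1: {"mask": 0.35, "temperatur": 0.25, "humid": 0.20, "protect": 0.12, "air": 0.08},
    2: {"contact": 0.30, "trace": 0.28, "mobil": 0.22, "app": 0.12, "data": 0.08},
}


def topic_prevalence(doc_topic):
    """Average each topic's proportion over all documents."""
    n_docs = len(doc_topic)
    n_topics = len(doc_topic[0])
    return [sum(row[k] for row in doc_topic) / n_docs for k in range(n_topics)]


def top_terms(word_probs, n=5):
    """Return the n stems with the highest probability in a topic."""
    ordered = sorted(word_probs.items(), key=lambda kv: kv[1], reverse=True)
    return [word for word, _ in ordered[:n]]


prevalence = topic_prevalence(theta)

# Topic indices sorted from the most to the least prevalent.
ranking = sorted(range(len(prevalence)), key=lambda k: prevalence[k], reverse=True)

for k in ranking:
    print(f"topic {k}: prevalence={prevalence[k]:.3f}, terms={top_terms(beta[k])}")
```

In a real analysis, `theta` and `beta` would come from the fitted model rather than being typed in by hand.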

The most popular topic can be interpreted as the socio-economic contexts and consequences of the pandemic and quarantine. Preprints featuring it discuss, for example, the preparedness of national health systems for the epidemic, as Craig, Kalanxhi, and Hauck (2020) do in their study of the equipment and number of intensive care units in Africa.

Research is also emerging on how quarantine affects crime. Campedelli, Aziani, and Favarin (2020) and Ashby (2020) report a decrease in crime rates in U.S. cities. As might be expected, the number of robberies falls, but the number of thefts and burglaries is almost unchanged. The number of domestic violence cases does not change either, and in the UK it is even decreasing; however, Halford et al. (2020) attribute this to victims being unable to report crimes to the police while forced to stay constantly near their abusers.

The narrowest topic covers the conditions affecting virus transmission: the use of personal protective equipment, especially masks, as well as environmental factors such as temperature and humidity. Authors of systematic reviews emphasize that the effectiveness of cloth masks depends directly on whether they are used appropriately and standardized to fit tightly to the face (Mondal, Das, and Goswami 2020). Since no clinical trials have yet examined the effectiveness of masks in the context of the coronavirus epidemic, Wei et al. (2020) review such studies for flu-like diseases. The authors find that wearing a mask reduces the risk of developing the disease, especially when everyone does so, regardless of the presence of symptoms.

The model also distinguishes the topic of contact tracing via mobile applications and mobility data. During an epidemic, it is important to identify infected people as early as possible, and mobile applications can provide data on interactions between people much faster. Even if only 20% of the population uses the application, it still turns out to be more effective than traditional contact tracing through interviews with patients (Kretzschmar et al. 2020).

Another topic can be interpreted as the influence of social media on behavior – through the dissemination of (mis)information. Milani (2020), using Facebook data, explores how adopting physical distancing practices depends on cross-border social networks. The author shows that risk perception and social behavior are influenced by stories from abroad that people read on social media, especially from Italy and the USA.

Thematic Contexts

The contexts of topics discussed in the preprints can be estimated from Figure 2. The size of the label in this network corresponds to the relative popularity of the respective topic in the corpus of preprints, as in Figure 1. The thickness of lines shows the strength of the association between the topics, based on their common occurrence in preprints.

Figure 2. Correlation network of topics distinguished from the descriptions of preprints on the novel coronavirus published from January 19 to May 17, 2020
Links indicate that the topics have appeared together in the same preprint abstract(s), with the width of the links corresponding to the strength of the association (only the links weighted more than 0.05 are shown). Label sizes correspond to the overall popularity of the topics.
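A network like the one in Figure 2 can be sketched as follows. We assume here that the link weight is the Pearson correlation between topic proportions across documents, and that only positive correlations above the 0.05 cutoff are drawn; the actual weighting behind the figure may differ. The function names and the toy matrix `theta` are hypothetical.

```python
import math

# Hypothetical document-topic proportions (rows: preprints, columns: topics).
# Topics 0 and 1 tend to co-occur, as do topics 2 and 3.
theta = [
    [0.40, 0.40, 0.10, 0.10],
    [0.35, 0.45, 0.10, 0.10],
    [0.10, 0.10, 0.40, 0.40],
    [0.05, 0.15, 0.45, 0.35],
    [0.45, 0.35, 0.05, 0.15],
]


def pearson(x, y):
    """Pearson correlation of two equally long lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # a constant column carries no association
    return cov / (sd_x * sd_y)


def topic_edges(doc_topic, cutoff=0.05):
    """Return {(i, j): weight} for topic pairs correlated above the cutoff."""
    n_topics = len(doc_topic[0])
    columns = [[row[k] for row in doc_topic] for k in range(n_topics)]
    edges = {}
    for i in range(n_topics):
        for j in range(i + 1, n_topics):
            weight = pearson(columns[i], columns[j])
            if weight > cutoff:
                edges[(i, j)] = weight
    return edges


edges = topic_edges(theta)
for (i, j), weight in sorted(edges.items()):
    print(f"topic {i} -- topic {j}: weight={weight:.2f}")
```

Because the proportions in each document sum to one, most topic pairs correlate negatively, so a positive cutoff keeps only the pairs that co-occur more often than expected.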

For instance, we can see that issues of mental health – anxiety, perceived risks – are sometimes also discussed in the context of (mis)information coming from Twitter and other social media.

The topic of modeling is noticeably related to the topic of non-pharmaceutical interventions (NPIs), in the context of modeling the effects of quarantine and social isolation. It is also related to the already mentioned topic of contact tracing via mobile devices, since mobile data is also used to build models.

As an example, in modeling NPIs for Boston, Aleta et al. (2020) combine smartphone mobility data with demographic data. The researchers conclude that a combination of simultaneously introduced non-pharmaceutical measures, virus testing, and contact tracing makes it possible to identify and quarantine 9% of asymptomatic infection spreaders. In turn, lower rates of virus transmission make it possible to lift restrictions on economic activity while avoiding overload of the healthcare system.

A nearly separate group is formed by clinical medicine topics – disease severity, comorbidities and risk factors, and infection symptoms. Biological topics about the virus genome, molecular mechanisms of virus binding in the cell, and its inhibition also stand apart.

You can also interactively explore the content of the topics and the closeness between them. The visualization via the link shows which topics exclusively cover certain terms and which terms are the most relevant for each topic. For example, in Figure 3 we highlight the stem isol, which appears to be most characteristic of topic 5 (NPIs), where it refers to social distancing measures, and of topic 14 (viral genome), where it refers to the isolation of the virus.

Figure 3. Map of topics proximity (multidimensional scaling, example)
The sizes of the circles correspond to the prominence of the topic in the data (in this case, how much the topic is represented in preprints containing the term isol-; highlighted is the topic that corresponds to this word the most).
The bar graph on the right shows the 30 words that best characterize the highlighted topic, given the exclusivity parameter λ. The lower the value of the parameter, the rarer the words displayed on the right: terms that are unique to the highlighted topic. The higher the value, the more frequent and general the words displayed. The red bars show how often the word appears in the highlighted topic, and the blue bars show how often it appears in the whole corpus.
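The role of λ can be illustrated with a small sketch. We assume the parameter works as in the relevance measure commonly used in LDAvis-style visualizations: relevance = λ · log p(word | topic) + (1 − λ) · log(p(word | topic) / p(word)). Whether the interactive map uses exactly this definition is an assumption, and the probabilities below are hypothetical.

```python
import math

# Hypothetical probabilities of stems within one topic and in the whole corpus.
topic_probs = {"isol": 0.08, "patient": 0.20, "distanc": 0.05, "studi": 0.15}
corpus_probs = {"isol": 0.01, "patient": 0.18, "distanc": 0.005, "studi": 0.16}


def relevance(p_topic, p_corpus, lam):
    """Blend in-topic frequency with lift (in-topic vs. corpus frequency)."""
    return lam * math.log(p_topic) + (1 - lam) * math.log(p_topic / p_corpus)


def rank_terms(in_topic, in_corpus, lam, n=30):
    """Return up to n terms ordered from most to least relevant."""
    scores = {w: relevance(p, in_corpus[w], lam) for w, p in in_topic.items()}
    return sorted(scores, key=scores.get, reverse=True)[:n]


# lam = 1: ordering by frequency within the topic (frequent, general words first).
print(rank_terms(topic_probs, corpus_probs, lam=1.0))

# lam = 0: ordering by lift (words nearly exclusive to the topic first).
print(rank_terms(topic_probs, corpus_probs, lam=0.0))
```

With λ close to 1 the list is led by words that are frequent everywhere, such as "patient"; with λ close to 0 it is led by words that occur almost only in the highlighted topic.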

By May, among the preprints about the new coronavirus, the topics related to clinical medicine and virology are still prominent. The block of topics on the pandemic spread is now split into two. Some researchers apply epidemiological models to study the virus transmission and the effects of non-pharmaceutical interventions. Other preprints are based on statistical data, analyzing the socio-economic and psychological contexts and consequences of the epidemic.

Read on page 2 how the relative prominence of topics changes over time, and what topics are special to the preprints published in May.
