Data Due Diligence and COVID-19

The COVID-19 pandemic, like all crises, has been revelatory. It has uncovered capacity limits in public health systems around the world, while also highlighting how quickly people can mobilize to help each other. The pandemic has also ushered in a golden age of data dashboards and rapid-fire analysis. We want to know how many people are infected, how many people have died from the virus, what groups are being infected, where they’re located, and how the virus spreads. You can track the number of cases on your phone and get a sense of a country’s vulnerability broadly based on other risk factors.

Moving beyond initial epidemiological metrics, more data do not mean we’re more informed. It makes sense that countries with many informal sector workers, few doctors, or serious budgetary constraints may be less able to tackle the disease and, therefore, may be at higher risk. Yet, it is also true that the quality of those metrics—employment sectors, health coverage, and public finance—varies. And sometimes, it varies a lot.

A key role for those of us who work with social science data on a daily basis (but aren’t public health officials or epidemiologists) is to ensure we’re adding insight and nuance to our understanding of how the novel coronavirus is likely to affect people, institutions, and the environment. The relevance of the data to a given research question or decision is just as critical now as it was before the pandemic.

As the international development community ramps up its response to the global pandemic, it’s worth revisiting some best practices when it comes to being a savvy data consumer.

  • Maps look neat, but they aren’t always the best way to convey information. Simply looking at a map of where people live doesn’t really tell us anything more than what we already know. Yes, population density is relevant to COVID-19; that’s why we need social distancing. However, you don’t need an interactive map of Nairobi or Manila to tell you that a lot of people live there. It may be better to know whether the data can be disaggregated by age, ethnicity, or income. That kind of information can really inform decision making.
  • “When” can be just as important as “what,” because many demographic metrics are collected infrequently. Specifying a good model or making a useful graph requires relevant variables. In some countries, we only get detailed demographic data (the kind of information that is critical for understanding COVID-19 risk and future impacts) every few years or just once a decade. If you’re given a data product containing variables like the poverty rate or asset ownership, it should include the collection periods. When it comes to data from different sources collected at different points in time, you want to know whether you’re working with grapes or raisins.
  • It’s easy to combine (and confuse) macro- and micro-level data. Are you looking at national, county, village, or individual measures? Macro-level measures such as religious tolerance rankings or GDP might be relevant within the context of how much trust people have in certain authorities or the overall capacity of a country to support health systems. Simply overlaying such metrics with individual-level data ignores how much variation there is within countries. It’s critical to deal with this issue for inferential analysis, but it is just as important if you’re reporting descriptive statistics.
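To make the last two points concrete, here is a minimal Python sketch of the kind of check a data consumer might run before combining sources. Everything in it is hypothetical and for illustration only: the countries, the district poverty rates, the doctors-per-1,000 figures, the collection years, and the three-year tolerance are all made-up assumptions, not real data or an established standard.

```python
from statistics import mean, pstdev

# Hypothetical sub-national records:
# (country, district, poverty_rate, collection_year)
district_poverty = [
    ("A", "north", 0.12, 2019),
    ("A", "south", 0.48, 2019),
    ("B", "east", 0.30, 2011),
    ("B", "west", 0.32, 2011),
]

# Hypothetical national indicator:
# country -> (doctors per 1,000 people, collection_year)
national_doctors = {"A": (0.9, 2018), "B": (0.2, 2019)}

# Assumed tolerance for how far apart two collection years may be.
MAX_YEAR_GAP = 3


def summarize(country):
    """Summarize one country's district data next to its national indicator."""
    rows = [r for r in district_poverty if r[0] == country]
    rates = [r[2] for r in rows]
    micro_year = max(r[3] for r in rows)
    _, macro_year = national_doctors[country]
    year_gap = abs(macro_year - micro_year)
    return {
        # The national average alone hides differences between districts.
        "mean_poverty": round(mean(rates), 2),
        # Spread across districts: the within-country variation.
        "spread": round(pstdev(rates), 2),
        # Gap between the two sources' collection periods.
        "year_gap": year_gap,
        "stale": year_gap > MAX_YEAR_GAP,
    }


for country in sorted(national_doctors):
    print(country, summarize(country))
```

In this made-up example the two countries have nearly the same average poverty rate, but country A's districts differ widely while country B's do not, and country B's district data were collected eight years before its national indicator. Both facts would disappear if the national figures were simply overlaid on a map.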

As with any data-driven analysis, the question you’re asking shapes how you treat the data. This isn’t an exhaustive list of best practices—I didn’t get into sampling, various researcher biases, or other important issues. The key point is that there’s a role for social scientists as the pandemic continues to affect non-health outcomes. Given the stakes, we need to make sure we’re all up to the task.