Space Apps COVID-19 Challenge

Human Factors



1. Introduction

The Challenge

The emergence and spread of infectious diseases, like COVID-19, are on the rise. Can you identify patterns between population density and COVID-19 cases and identify factors that could help predict hotspots of disease spread?

Explanation

The emergence and spread of infectious diseases, like COVID-19, may well continue. Many factors, both environmental and anthropogenic, can contribute to this trend. This challenge explores human activities that may be directly or indirectly related to the spread of COVID-19 locally and around the globe.

Numerous factors can contribute to the spread of infectious diseases, including but not limited to: trade and travel, social activities that increase one’s risk of exposure, and the lack of proper hygiene infrastructure. Do geographic or temporal patterns from COVID-19 disease mapping reveal insights into human factors that may be related to the spread of the disease? Could human activities that impact the environment play an indirect role in furthering COVID-19 spread? Are certain activities correlated with specific disease presentations or increased severity?

Your challenge is to identify patterns between human activity and COVID-19 cases and identify factors that could help predict hotspots of disease spread.

Considerations

  • Consider measuring density during COVID-19 by integrating space-based assets (such as satellite communications and Earth observations) with Earth-based infrastructure (such as buildings) to identify potential COVID-19 hotspots
  • You may consider clustering in urban cities versus rural areas
  • Consider derived social determinants of health (SDOH), population activity densities due to weather and weather events, and the related SDOH effects on the transmission and predictions of COVID-19.
  • Are specific activities related to an increased number of susceptible or exposed populations? What factors increase a population’s vulnerability to disease?

2. Requirements

This cell imports the packages and libraries needed to work on the problem.

All of the libraries must be installed; most of them can be installed with the `pip` command in a terminal.
In [ ]:
##!pip install descartes
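
Before running the import cell, it can help to check which packages are missing. The following is a minimal sketch that uses only the standard library; the `REQUIRED` list is an assumption based on the imports used in this notebook, and `missing_packages` is a hypothetical helper name.

```python
import importlib.util

# Assumed list: top-level packages imported by this notebook.
REQUIRED = ["pandas", "numpy", "matplotlib", "seaborn", "plotly", "geopandas"]


def missing_packages(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Install with: pip install " + " ".join(missing))
else:
    print("All required packages are available.")
```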
In [1]:
# Data processing libraries
import pandas as pd
import numpy as np
import math

# Libraries for the graphing of data
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

import plotly.express as px
import plotly.graph_objects as go

# Allows graphics to be generated at higher resolution
%config InlineBackend.figure_format = 'retina'
%matplotlib inline

# Allows you to ignore the warnings in the notebook
import warnings
warnings.simplefilter('ignore')

# Set a wider notebook width
from IPython.display import display, HTML
display(HTML("<style>.container { width:80% !important; }</style>"))

# We import the library that allows us to work with dates
from datetime import datetime, timedelta

# Library to work with geo data
import geopandas as gpd
from geopandas.tools import geocode

We load the countries dataset from a GeoJSON file into a GeoDataFrame (the geopandas extension of a pandas DataFrame). This lets us work with the data quickly and efficiently and makes a thorough exploratory analysis possible.

In [2]:
world_map = gpd.read_file('./data/countries.geojson')
world_map.rename(columns={"ADMIN": "country", "ISO_A3": "country_code"}, inplace=True)
# Delete Antarctica
world_map = world_map[world_map['country'] != 'Antarctica'].copy()
# Replace spaces with underscores in country names
world_map['country'] = world_map['country'].str.replace(" ", "_", regex=False)

We collected data from several sources to analyze and understand the spread of the coronavirus: https://coronavirus.jhu.edu/map.html, https://data.europa.eu/euodp/en/data/dataset/covid-19-coronavirus-data and https://geohealth.hhs.gov/

In [3]:
df = pd.read_csv('./data/covid19-cases-worldwide.csv')
df['dateRep'] = df['dateRep'].apply(lambda d: datetime.strptime(d, "%d/%m/%Y").date())
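
As an alternative to parsing each date with `apply`, the whole column can be converted in one vectorised call. This is a minimal sketch on made-up rows that follow the same day/month/year format as `dateRep`; the `sample` frame is an assumption for illustration only.

```python
import pandas as pd

# Toy data in the same day/month/year format as the `dateRep` column.
sample = pd.DataFrame({"dateRep": ["30/05/2020", "01/03/2020"]})

# An explicit format avoids day/month ambiguity (e.g. 01/03 vs 03/01).
sample["dateRep"] = pd.to_datetime(sample["dateRep"], format="%d/%m/%Y").dt.date

print(sample["dateRep"].tolist())
```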

Drop the rows that contain null or NaN values.

In [4]:
df.dropna(axis=0, inplace=True)
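
Dropping every row that has any missing value can silently remove whole territories, so it may be worth inspecting what will be lost first. The sketch below runs on made-up data; the column names are borrowed from this dataset, and the values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame: territory "B" has no country code and no population figure.
sample = pd.DataFrame({
    "countries": ["A", "A", "B"],
    "cases": [10, 5, 7],
    "countryterritoryCode": ["AAA", "AAA", np.nan],
    "popData2018": [1000.0, 1000.0, np.nan],
})

# Number of missing values in each column.
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Rows that dropna(axis=0) would remove, counted per country.
dropped = sample[sample.isna().any(axis=1)]
print(dropped["countries"].value_counts())

# Same operation as the cell above, without modifying the original frame.
cleaned = sample.dropna(axis=0)
```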

We rename some of the columns to shorter, clearer names.

In [5]:
df.rename(columns={"dateRep": "date", "countriesAndTerritories": "countries", "popData2018": "population", "continentExp": "continent"}, inplace=True)
All cells must be executed to display the graphics, since most of them are interactive.

3. Human Factors

Below are some of the records that make up our data. Each record holds the reporting date, the number of cases and deaths, the country where they were recorded and the continent that country belongs to. We also have geographical information, such as the country code used later to link each record to a map, and the country's population. The population figure refers to 2018, so it is not fully up to date, but it gives a useful approximation.

In [6]:
df.head()
Out[6]:
         date  day  month  year  cases  deaths    countries  geoId  countryterritoryCode  population  continent
0  2020-05-30   30      5  2020    623      11  Afghanistan     AF                   AFG  37172386.0       Asia
1  2020-05-29   29      5  2020    580       8  Afghanistan     AF                   AFG  37172386.0       Asia
2  2020-05-28   28      5  2020    625       7  Afghanistan     AF                   AFG  37172386.0       Asia
3  2020-05-27   27      5  2020    658       1  Afghanistan     AF                   AFG  37172386.0       Asia
4  2020-05-26   26      5  2020    591       1  Afghanistan     AF                   AFG  37172386.0       Asia

The next cell shows the distribution of the numerical variables, so we can see at a glance whether any values fall outside the expected range. For example, both the cases and the deaths variables have minimum values below 0, due to corrections reported by some countries. These values must be set to 0 before the data are used in the graphs that rely on them.

In [7]:
df.describe()
Out[7]:
                day         month          year         cases        deaths    population
count  19866.000000  19866.000000  19866.000000  19866.000000  19866.000000  1.986600e+04
mean      16.415685      3.639787   2019.996678    296.942616     18.367210  5.018976e+07
std        8.743235      1.375458      0.057545   1771.117081    125.668596  1.727964e+08
min        1.000000      1.000000   2019.000000  -2461.000000  -1918.000000  1.000000e+03
25%        9.000000      3.000000   2020.000000      0.000000      0.000000  2.119275e+06
50%       17.000000      4.000000   2020.000000      3.000000      0.000000  9.630959e+06
75%       24.000000      5.000000   2020.000000     50.000000      1.000000  3.369995e+07
max       31.000000     12.000000   2020.000000  48529.000000   4928.000000  1.392730e+09
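
The negative minima in `cases` and `deaths` can be set to 0 on a copy of the data before plotting. This is a minimal sketch on a toy frame; the `sample` and `plot_data` names are assumptions, and the two negative values are the minima reported in the summary above.

```python
import pandas as pd

# Toy rows, including the negative corrections seen in the summary statistics.
sample = pd.DataFrame({"cases": [623, -2461, 0], "deaths": [11, -1918, 3]})

# Work on a copy so the raw corrections remain available for other analyses.
plot_data = sample.copy()

# clip(lower=0) replaces every value below 0 with 0 and leaves the rest unchanged.
plot_data[["cases", "deaths"]] = plot_data[["cases", "deaths"]].clip(lower=0)

print(plot_data)
```

Clipping on a copy keeps the original corrections intact, which matters if cumulative totals are computed later from the raw values.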

We will now conduct an analysis of the cases that have been confirmed on each of the continents.

In [8]:
ax = sns.relplot(x="date", y="cases", col="continent", col_wrap=3, kind="line", data=df)
ax.set(xlabel='Date', ylabel='Cases');

In these graphs we can observe that the continents with the most cases over time are Asia, Europe and America, which we associate with a higher population density. The remaining continents, where we assume a lower population density, show fewer cases, but the counts are not zero, so the virus has had an impact there as well.
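
When several rows share the same x value, seaborn's line plot draws their mean with a confidence band by default, so the panels above show the average daily cases per country within each continent rather than continental totals. If totals are wanted, the data can be summed per continent and date first. This is a sketch on made-up rows; the `sample` frame and its values are assumptions.

```python
import pandas as pd

# Toy data: two Asian countries and one European country over two days.
sample = pd.DataFrame({
    "date": ["2020-05-29", "2020-05-29", "2020-05-30", "2020-05-30", "2020-05-30"],
    "continent": ["Asia", "Asia", "Asia", "Asia", "Europe"],
    "countries": ["X", "Y", "X", "Y", "Z"],
    "cases": [580, 20, 623, 17, 100],
})

# One row per continent and date, holding the sum over all its countries.
daily_totals = sample.groupby(["continent", "date"], as_index=False)["cases"].sum()

print(daily_totals)
```

The resulting frame has a single observation per continent and date, so a line plot of it shows totals with no averaging involved.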

In [9]:
ax = sns.relplot(x="date", y="deaths", col="continent", col_wrap=3, kind="line", data=df)
ax.set(xlabel='Date', ylabel='Deaths');

The same pattern appears for deaths: the more cases a continent reports, the more deaths it records. The continents with the highest number of infections also have the highest number of deaths.

3.1 Population Density

The idea is to analyze the population of each continent, together with its cases and deaths, in order to check whether the relationship described above actually holds. Note that the dataset provides total population rather than a true density (population per unit area).

In [10]:
# Totals per continent: population, cases and deaths
dic_pop = {}
dic_cases = {}
dic_deaths = {}

for c in df.continent.unique():
    by_country = df[df.continent == c].groupby('countries')
    # Population is repeated on every daily row, so take one value per country
    dic_pop[c] = by_country['population'].max().sum()
    # Cases and deaths are daily counts, so sum them over all dates
    dic_cases[c] = by_country['cases'].sum().sum()
    dic_deaths[c] = by_country['deaths'].sum().sum()

df_population = pd.DataFrame.from_dict(dic_pop, orient='index', columns=['population']).reset_index()
df_population.rename(columns={"index": "Continent", "population": 'Population'}, inplace=True)

df_cases = pd.DataFrame.from_dict(dic_cases, orient='index', columns=['cases']).reset_index()
df_cases.rename(columns={"index": "Continent", "cases": 'Cases'}, inplace=True)

df_death = pd.DataFrame.from_dict(dic_deaths, orient='index', columns=['deaths']).reset_index()
df_death.rename(columns={"index": "Continent", "deaths": 'Deaths'}, inplace=True)

df_continent = df_population.merge(df_cases, on="Continent")
df_continent = df_continent.merge(df_death, on="Continent")
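
The same continent-level table can also be built with two `groupby` steps instead of three dictionaries and two merges. This is a sketch on made-up rows, not a replacement for the cell above; the `sample`, `per_country` and `per_continent` names and all values are assumptions.

```python
import pandas as pd

# Toy data: population is repeated on every daily row of a country.
sample = pd.DataFrame({
    "continent": ["Asia", "Asia", "Asia", "Europe", "Europe"],
    "countries": ["X", "X", "Y", "Z", "Z"],
    "population": [100.0, 100.0, 50.0, 80.0, 80.0],
    "cases": [5, 7, 1, 10, 20],
    "deaths": [0, 1, 0, 2, 3],
})

# Step 1: one row per country (population once, cases and deaths summed).
per_country = sample.groupby(["continent", "countries"]).agg(
    population=("population", "max"),
    cases=("cases", "sum"),
    deaths=("deaths", "sum"),
)

# Step 2: sum the country rows within each continent.
per_continent = (
    per_country.groupby(level="continent").sum().reset_index()
    .rename(columns={"continent": "Continent", "population": "Population",
                     "cases": "Cases", "deaths": "Deaths"})
)

print(per_continent)
```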

Once all the information is grouped together, we obtain the correlation between population, cases and deaths across the continents.

In [11]:
df_continent
Out[11]:
   Continent    Population    Cases  Deaths
0       Asia  4.468916e+09  1068544   29542
1     Europe  7.644023e+08  1917491  172421
2     Africa  1.268520e+09   135064    3922
3    America  1.005667e+09  2769337  158866
4    Oceania  4.015206e+07     8626     132
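
Because the continents differ greatly in population, absolute counts are hard to compare directly. One possible normalisation, shown here as a sketch rather than as part of the original analysis, is the number of cases and deaths per 100,000 inhabitants. The figures are copied from the table above; the `continent_table` and `rates` names are assumptions.

```python
import pandas as pd

# Values copied from the continent table shown above.
continent_table = pd.DataFrame({
    "Continent": ["Asia", "Europe", "Africa", "America", "Oceania"],
    "Population": [4.468916e9, 7.644023e8, 1.268520e9, 1.005667e9, 4.015206e7],
    "Cases": [1068544, 1917491, 135064, 2769337, 8626],
    "Deaths": [29542, 172421, 3922, 158866, 132],
})

# Rates per 100,000 inhabitants make continents of different size comparable.
rates = continent_table.assign(
    cases_per_100k=lambda d: d["Cases"] / d["Population"] * 1e5,
    deaths_per_100k=lambda d: d["Deaths"] / d["Population"] * 1e5,
)

print(rates[["Continent", "cases_per_100k", "deaths_per_100k"]].round(2))
```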
In [12]:
plt.figure(figsize=(12,6))
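# Select the numeric columns explicitly: recent pandas versions raise an error
# if corr() is called on a frame that still contains the text column 'Continent'
sns.heatmap(df_continent[['Population', 'Cases', 'Deaths']].corr(), square=True, annot=True)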
plt.show()

The correlation matrix for population, cases and deaths shows a direct correlation between cases and deaths: the more infections occur, the more likely deaths become, whereas when there are few or no infections, deaths are very unlikely. We cannot say the same for the population variable. It seems reasonable that a greater population density implies a higher probability of infection or death, but this need not be the case. A continent may have a large population that lives mostly in rural areas, spread across its whole geography, which hinders the spread of the virus and reduces infections and deaths.
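
The heatmap displays Pearson correlation coefficients, which is the default method of `DataFrame.corr()`. The sketch below recomputes them with NumPy from the five continent totals shown earlier, which makes it easy to inspect individual coefficients; the variable names are assumptions. With only five data points, the coefficients should be read as indicative rather than conclusive.

```python
import numpy as np

# Continent totals copied from the table above
# (order: Asia, Europe, Africa, America, Oceania).
population = np.array([4.468916e9, 7.644023e8, 1.268520e9, 1.005667e9, 4.015206e7])
cases = np.array([1068544, 1917491, 135064, 2769337, 8626], dtype=float)
deaths = np.array([29542, 172421, 3922, 158866, 132], dtype=float)

# Each row passed to corrcoef is treated as one variable.
corr = np.corrcoef([population, cases, deaths])

labels = ["Population", "Cases", "Deaths"]
for i, row_label in enumerate(labels):
    for j, col_label in enumerate(labels):
        if i < j:
            print(f"{row_label} vs {col_label}: {corr[i, j]:.3f}")
```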

We will now analyze each of the populations of the different continents and the cases of infection and death, to see if we can obtain more information about them.

In [13]:
# plotly.express is already imported as px in the requirements cell
total_population = df_continent.Population.sum()
fig = px.pie(df_continent, values='Population', names='Continent', title=f'World population recorded in 2018: {total_population:,.0f}')
fig.update_traces(textposition='outside', textinfo='percent+label')
fig.show()