World Health Organization provides a wide range of data available for download in different formats.
The data is accessible through their website: http://www.who.int/gho/en/
In this example we will be working with Pandas DataFrame to organize the data. As an example I am going to work on suicide rates throughout the world.

In [1]:

import pandas as pd

By using ‘read_csv’ function, suicide crude rates (per 100,000 people) data is assigned to pandas object.

In [2]:

suicideData = pd.read_csv('SuicBoth.csv')
suicideData.head()

Out[2]:

	Country	Sex	2015	2010	2005	2000
0	Afghanistan	Both sexes	5.5	5.2	5.4	4.8
1	Albania	Both sexes	4.3	5.3	6.3	6.0
2	Algeria	Both sexes	3.1	3.4	3.6	3.0
3	Angola	Both sexes	20.5	20.7	20.0	18.4
4	Antigua and Barbuda	Both sexes	0.0	0.2	1.6	2.3

Total population (in thousands).

In [3]:

populData = pd.read_csv('Population.csv')
populData.head()

Out[3]:

	Country	2015	2013
0	Afghanistan	32526.6	30682.5
1	Albania	2896.7	2883.3
2	Algeria	39666.5	38186.1
3	Andorra	70.5	75.9
4	Angola	25022.0	23448.2

Now that we have two pandas objects, we can start merging the data together by the countries. We are using ‘inner’ method, the intersection of keys from both frames, while preserving the order of the left keys.

In [4]:

mergedData = pd.merge(suicideData, populData, on=['Country'], how='inner')
mergedData.head()

Out[4]:

	Country	Sex	2015_x	2010	2005	2000	2015_y	2013
0	Afghanistan	Both sexes	5.5	5.2	5.4	4.8	32526.6	30682.5
1	Albania	Both sexes	4.3	5.3	6.3	6.0	2896.7	2883.3
2	Algeria	Both sexes	3.1	3.4	3.6	3.0	39666.5	38186.1
3	Angola	Both sexes	20.5	20.7	20.0	18.4	25022.0	23448.2
4	Antigua and Barbuda	Both sexes	0.0	0.2	1.6	2.3	91.8	90.0

I separately downloaded the population data from 2010.

In [5]:

populData2010 = pd.read_csv('Population2010.csv')
populData2010.head()

Out[5]:

	Country	Year	Population (in thousands) total
0	Afghanistan	2010	27962.2
1	Albania	2010	2901.9
2	Algeria	2010	36036.2
3	Andorra	2010	84.4
4	Angola	2010	21220.0

In [6]:

mergedData = pd.merge(mergedData, populData2010, on=['Country'], how='inner')
mergedData.head()

Out[6]:

	Country	Sex	2015_x	2010	2005	2000	2015_y	2013	Year	Population (in thousands) total
0	Afghanistan	Both sexes	5.5	5.2	5.4	4.8	32526.6	30682.5	2010	27962.2
1	Albania	Both sexes	4.3	5.3	6.3	6.0	2896.7	2883.3	2010	2901.9
2	Algeria	Both sexes	3.1	3.4	3.6	3.0	39666.5	38186.1	2010	36036.2
3	Angola	Both sexes	20.5	20.7	20.0	18.4	25022.0	23448.2	2010	21220.0
4	Antigua and Barbuda	Both sexes	0.0	0.2	1.6	2.3	91.8	90.0	2010	87.2

Overall death rate (per thousand people).

In [7]:

deathData = pd.read_csv('DeathRates.csv')
deathData.head()

Out[7]:

	Country	2013
0	Afghanistan	7.7
1	Albania	9.4
2	Algeria	5.7
3	Andorra	8.4
4	Angola	13.9

In [8]:

mergedData = pd.merge(mergedData, deathData, on=['Country'], how='inner')
mergedData.head()

Out[8]:

	Country	Sex	2015_x	2010	2005	2000	2015_y	2013_x	Year	Population (in thousands) total	2013_y
0	Afghanistan	Both sexes	5.5	5.2	5.4	4.8	32526.6	30682.5	2010	27962.2	7.7
1	Albania	Both sexes	4.3	5.3	6.3	6.0	2896.7	2883.3	2010	2901.9	9.4
2	Algeria	Both sexes	3.1	3.4	3.6	3.0	39666.5	38186.1	2010	36036.2	5.7
3	Angola	Both sexes	20.5	20.7	20.0	18.4	25022.0	23448.2	2010	21220.0	13.9
4	Antigua and Barbuda	Both sexes	0.0	0.2	1.6	2.3	91.8	90.0	2010	87.2	6.8

Now that we have all the tables merged, we can clear the columns that we are not going to use, and rename the other column headers.

In [9]:

del mergedData['Sex'], mergedData['2005'], mergedData['2000'], mergedData['Year']
mergedData = mergedData.rename(columns={'2015_x':'2015_s','2010':'2010_s','2005':'2005_s',
                                        '2000':'2000_s','2015_y':'2015_p','2013_x':'2013_p',
                                        'Population (in thousands) total':'2010_p','2013_y':'2013_d'})
mergedData.head()

Out[9]:

	Country	2015_s	2010_s	2015_p	2013_p	2010_p	2013_d
0	Afghanistan	5.5	5.2	32526.6	30682.5	27962.2	7.7
1	Albania	4.3	5.3	2896.7	2883.3	2901.9	9.4
2	Algeria	3.1	3.4	39666.5	38186.1	36036.2	5.7
3	Angola	20.5	20.7	25022.0	23448.2	21220.0	13.9
4	Antigua and Barbuda	0.0	0.2	91.8	90.0	87.2	6.8

We will need to do some calculations prior working on the plots. The death rates data is for 2013, but the suicide rates data for that year is not available. ‘suiAve’ is an average in between 2010 and 2015 data. ‘suiPerDeath’ is the percentage of the suicides from the total death count. ‘deaPerPop’ is the percentage of death to the total population.

In [10]:

mergedData['suiAve'] = (mergedData['2010_s'] + mergedData['2015_s']) / 2
mergedData['suiPerDeath'] = (mergedData['suiAve'] * 100) / (mergedData['2013_d'] * 100)
mergedData['deaPerPop'] = mergedData['2013_d'] / 10
mergedData.head()

Out[10]:

	Country	2015_s	2010_s	2015_p	2013_p	2010_p	2013_d	suiAve	suiPerDeath	deaPerPop
0	Afghanistan	5.5	5.2	32526.6	30682.5	27962.2	7.7	5.35	0.694805	0.77
1	Albania	4.3	5.3	2896.7	2883.3	2901.9	9.4	4.80	0.510638	0.94
2	Algeria	3.1	3.4	39666.5	38186.1	36036.2	5.7	3.25	0.570175	0.57
3	Angola	20.5	20.7	25022.0	23448.2	21220.0	13.9	20.60	1.482014	1.39
4	Antigua and Barbuda	0.0	0.2	91.8	90.0	87.2	6.8	0.10	0.014706	0.68

Some countries in our pandas object are having long names, and wont be displayed properly on the plots. We can use ‘for’ loop and manually search for countries longer than 25 chars.

In [11]:

for i in mergedData.index:
    if len(mergedData['Country'][i]) > 25:
        print([i], mergedData['Country'][i])

[19] Bolivia (Plurinational State of)
[44] Democratic People's Republic of Korea
[45] Democratic Republic of the Congo
[76] Iran (Islamic Republic of)
[89] Lao People's Democratic Republic
[106] Micronesia (Federated States of)
[136] Saint Vincent and the Grenadines
[161] The former Yugoslav republic of Macedonia
[172] United Kingdom of Great Britain and Northern Ireland
[173] United Republic of Tanzania
[178] Venezuela (Bolivarian Republic of)

Or we can do it through pandas DataFrame.loc function.

In [12]:

mergedData['Country'].loc[((mergedData['Country'].str.len()) > 25)]

Out[12]:

19                      Bolivia (Plurinational State of)
44                 Democratic People's Republic of Korea
45                      Democratic Republic of the Congo
76                            Iran (Islamic Republic of)
89                      Lao People's Democratic Republic
106                     Micronesia (Federated States of)
136                     Saint Vincent and the Grenadines
161            The former Yugoslav republic of Macedonia
172    United Kingdom of Great Britain and Northern I...
173                          United Republic of Tanzania
178                   Venezuela (Bolivarian Republic of)
Name: Country, dtype: object

Once the data is sorted, we can manually replace the needed cells.

In [13]:

mergedData.loc[19, 'Country'] = 'Bolivia'
mergedData.loc[44, 'Country'] = 'North Korea'
mergedData.loc[45, 'Country'] = 'DR Congo'
mergedData.loc[76, 'Country'] = 'Iran'
mergedData.loc[89, 'Country'] = 'Laos'
mergedData.loc[106, 'Country'] = 'Micronesia'
mergedData.loc[136, 'Country'] = 'St. Vincent and Gren.'
mergedData.loc[161, 'Country'] = 'FYR Macedonia'
mergedData.loc[172, 'Country'] = 'UK'
mergedData.loc[173, 'Country'] = 'Tanzania'
mergedData.loc[178, 'Country'] = 'Venezuela'

The data then can be exported to the format of interest (in this case as Excel file), for further processing, and plotting.

In [14]:

from pandas import ExcelWriter

writer = ExcelWriter('mergedData.xlsx')
mergedData.to_excel(writer,'mergedData')
writer.save()

Downloadable content:
Population.csv
Population2010.csv
DeathRates.csv
SuicBoth.csv
mergedData.xlsx

codeWithMax

codeWithMax

Data import with Python, using pandas DataFrame – Part 1

Importing CSV data in Python

Data visualization with Python and Matplotlib – Part 2

Leave a Reply Cancel reply

codeWithMax

Data import with Python, using pandas DataFrame – Part 1

Importing CSV data in Python

Data visualization with Python and Matplotlib – Part 2

Leave a Reply Cancel reply

Related posts

Plotting Error Bars in Python using Matplotlib and Numpy Random

Basic Image Recognition with Built-in Models in Keras

Basic Example of a Neural Network with TensorFlow and Keras

Pandas – Tips and Tricks – df.loc, df.iloc