DATS 6103 - Individual Project 2 - Sabina Azim:

Analysis of the World Happiness Report

Project Purpose:

The World Happiness Report has been published almost every year since 2012, and recent editions have been released on or around March 20th, which the United Nations observes as the International Day of Happiness. The official World Happiness Report website describes the report as "...a landmark survey of the state of global happiness that ranks 156 countries by how happy their citizens perceive themselves to be." Each country is given a 'Happiness Score', which is explained in terms of six factors: GDP per capita, Social Support (Family), Health (Life Expectancy), Freedom to make life choices, Generosity, and Perceptions of Government Corruption (Trust). Each country is compared to a hypothetical "Dystopia" nation, which has the lowest values observed for each variable.

I will be analyzing the happiest and least happy countries from 2016-2020, which factors weigh heaviest in calculating the Happiness Score, and whether a country's Happiness Ranking has any correlation with its suicide mortality rate. I chose this topic because I only recently found out that the World Happiness Report existed, and I thought this would be a great opportunity to look into the data in the report. To be honest, when I heard about the report I was pretty skeptical about how 'happiness' can really be measured, which is why I chose to compare the scores to suicide rates in these countries - I figured this would give me a good basis for judging the validity of the scores.

Data Sources:

For this project I used 8 separate data sets: 5 for the World Happiness Report data (2016-2020), which came from Kaggle, plus one each for suicide mortality rates, GDP per capita, and life expectancy, all three of which came from the World Bank Database. I got the World Happiness Report data sets from Kaggle because I was able to find ones that were already mostly clean, so all I had to do was narrow down a few columns and merge them. For the other datasets, the World Bank was an easy place to find all of them, and I figured it is probably the most accurate data I can find on a global scale.

  1. World Happiness Reports - https://www.kaggle.com/yamaerenay/world-happiness-report-preprocessed
  2. Suicide Mortality Rates - https://data.worldbank.org/indicator/SH.STA.SUIC.P5
  3. GDP Per Capita - https://data.worldbank.org/indicator/NY.GDP.PCAP.CD
  4. Life Expectancy at Birth, total (Years) - https://data.worldbank.org/indicator/SP.DYN.LE00.IN

Importing and Cleaning Data:

In [1]:
#importing the libraries I will be needing throughout this project 
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
#i used plotly offline so that whoever is viewing this will not need any sort of passwords/keys to view it 
from plotly.offline import iplot, init_notebook_mode
init_notebook_mode()
import plotly.graph_objs as go
import plotly.express as px
import seaborn as sns
from IPython.display import display
import warnings
warnings.filterwarnings('ignore')
In [2]:
#Reading in the different csv files using read_csv since they are csv files - skipping the first couple rows on 
#the world bank data sets to fix the formatting 
report_2016 = pd.read_csv('2016_report.csv')
report_2017 = pd.read_csv('2017_report.csv')
report_2018 = pd.read_csv('2018_report.csv')
report_2019 = pd.read_csv('2019_report.csv')
report_2020 = pd.read_csv('2020_report.csv')
suicide_info = pd.read_csv('suicide_data.csv', skiprows = 4)
gdp_per_cap = pd.read_csv('gdp_per_cap.csv', skiprows = 4)
life_expect = pd.read_csv('life_expectancy.csv', skiprows = 4)

#Removes scientific notation where needed 
pd.set_option('display.float_format', lambda x: '%.2f' % x)
In [3]:
#adding a year column to each happiness report dataframe
report_2016['Year'] = 2016
report_2017['Year'] = 2017
report_2018['Year'] = 2018
report_2019['Year'] = 2019
report_2020['Year'] = 2020
In [4]:
#changing columns name in 2016 and 2019 data frames so that all of the columns match up 
report_2016 = report_2016.rename(columns={'family':'social_support'})
report_2019 = report_2019.rename(columns={'family':'social_support'})
In [5]:
#renaming the columns to make sure that the columns across all 5 reports match up exactly and also to add some
#capitalization/spacing where needed 
def rename_columns(df):
    df = df.rename(columns={'country':'Country','happiness_score':'Happiness Score','gdp_per_capita': 'Economy (GDP Per Capita)',
                            'social_support':'Social Support','health': 'Health', 'freedom':'Freedom', 'government_trust': 'Trust', 
                            'generosity':'Generosity', 'dystopia_residual':'Dystopia Residual', 
                           'continent':'Continent'})
    return df
In [6]:
#Calling the function I created above on each of the dataframes 
report_2016 = rename_columns(report_2016)
report_2017 = rename_columns(report_2017)
report_2018 = rename_columns(report_2018)
report_2019 = rename_columns(report_2019)
report_2020 = rename_columns(report_2020)
In [7]:
#appending the 5 datasets together (DataFrame.append was removed in pandas 2.0, so pd.concat is used instead)
whr_final = pd.concat([report_2016, report_2017, report_2018, report_2019, report_2020])
In [8]:
#after appending the datasets together we get something that looks like this - all of the data together in one 
#dataframe from 2016-2020
whr_final
Out[8]:
Country Happiness Score Economy (GDP Per Capita) Social Support Health Freedom Trust Generosity Dystopia Residual Continent Year
0 Switzerland 7.59 1.40 1.35 0.94 0.67 0.42 0.30 2.52 Europe 2016
1 Iceland 7.56 1.30 1.40 0.95 0.63 0.14 0.44 2.70 Europe 2016
2 Denmark 7.53 1.33 1.36 0.87 0.65 0.48 0.34 2.49 Europe 2016
3 Norway 7.52 1.46 1.33 0.89 0.67 0.37 0.35 2.47 Europe 2016
4 Canada 7.43 1.33 1.32 0.91 0.63 0.33 0.46 2.45 North America 2016
... ... ... ... ... ... ... ... ... ... ... ...
133 Botswana 3.48 1.00 1.09 0.49 0.51 0.10 0.03 0.26 Africa 2020
134 Tanzania 3.48 0.46 0.87 0.44 0.51 0.20 0.27 0.72 Africa 2020
135 Rwanda 3.31 0.34 0.52 0.57 0.60 0.49 0.24 0.55 Africa 2020
136 Zimbabwe 3.30 0.43 1.05 0.38 0.38 0.08 0.15 0.84 Africa 2020
137 Afghanistan 2.57 0.30 0.36 0.27 0.00 0.00 0.14 1.51 Asia 2020

690 rows × 11 columns
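
As a quick sanity check on a combined frame like this one, here is a minimal sketch of stacking yearly reports and counting the rows per year. It uses tiny made-up frames rather than the Kaggle files; the names `toy_2016`, `toy_2017`, and `toy_all` and all of the values are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-ins for two yearly reports (hypothetical values, not the real data)
toy_2016 = pd.DataFrame({'Country': ['A', 'B'], 'Happiness Score': [7.5, 3.1]})
toy_2017 = pd.DataFrame({'Country': ['A', 'B', 'C'], 'Happiness Score': [7.4, 3.3, 5.0]})

# Tag each frame with its year, then stack them into one long frame
frames = {2016: toy_2016, 2017: toy_2017}
tagged = [df.assign(Year=year) for year, df in frames.items()]
toy_all = pd.concat(tagged, ignore_index=True)

# Rows per year: a quick way to confirm every report made it into the result
rows_per_year = toy_all.groupby('Year').size()
print(rows_per_year)
```

Passing `ignore_index=True` gives the combined frame a fresh 0..n-1 index; leaving it out keeps each report's original row numbers, which is what the output above shows.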

The World Bank data below is filtered down to 2016 only. My suicide data analysis focuses only on 2016 because it was the most recent year for which I could find reliable suicide mortality rates on a global scale.

In [9]:
#Creating a list to filter out the columns I want to have in my suicide dataframe. 
columns = ['Country Name','2016']
suicide_data = suicide_info[columns] 
suicide_data = suicide_data.rename(columns= {"2016":"Suicide Mortality Rate (per 100,000 population)", 
                                             "Country Name": "Country"})
suicide_data.head()
Out[9]:
Country Suicide Mortality Rate (per 100,000 population)
0 Aruba nan
1 Afghanistan 4.70
2 Angola 4.70
3 Albania 6.30
4 Andorra nan
In [10]:
#Using the same columns list from above to filter out the gdp dataframe as well. 
gdp_pc = gdp_per_cap[columns]
gdp_pc = gdp_pc.rename(columns= {"2016":"GDP Per Capita", "Country Name":"Country"})
gdp_pc.head()
Out[10]:
Country GDP Per Capita
0 Aruba 28281.35
1 Afghanistan 547.23
2 Angola 3506.07
3 Albania 4124.06
4 Andorra 37474.67
In [11]:
#Now lastly just filtering out the life expectancy dataframe as well, the same way I did suicide rates and gdp.
life_expect_df = life_expect[columns]
life_expect_df = life_expect_df.rename(columns= {"2016":"Life Expectancy in Years", "Country Name":"Country"})
life_expect_df.head()
Out[11]:
Country Life Expectancy in Years
0 Aruba 75.87
1 Afghanistan 63.76
2 Angola 59.92
3 Albania 78.19
4 Andorra nan
In [12]:
#Creating a dataframe that consists of the happiness score, suicide rate, gdp per capita, and life expectancy for
#each country in the happiness report in 2016. I am ONLY using the countries in the happiness report for this so
#that everything is consistent and no countries are missing. 

#Renaming country names in the happiness report to match the official country names that the World Bank uses
df_happy1 = report_2016.replace({'Country': {'Venezuela':'Venezuela, RB', 'Hong Kong':'Hong Kong SAR, China',
                                        'South Korea':'Korea, Rep.', 'Egypt':'Egypt, Arab Rep.', 
                                        'Russia':'Russian Federation', 'Yemen':'Yemen, Rep.', 
                                        'Congo (Kinshasa)':'Congo, Dem. Rep.', 'Congo (Brazzaville)': 'Congo, Rep.',
                                        'Iran': 'Iran, Islamic Rep.', 'Laos':'Lao PDR', 'Macedonia':'North Macedonia',
                                        'Kyrgyzstan': 'Kyrgyz Republic', 'Syria':'Syrian Arab Republic',
                                        'Slovakia': 'Slovak Republic', 'Ivory Coast':'Cote d\'Ivoire'}})
In [13]:
#Dropping countries that are in the WHR but not part of the World Bank data. 
df_happy2 = df_happy1.set_index('Country')
df_happy2 = df_happy2.drop(['Palestinian Territories'])
In [14]:
#Creating a list of all of the countries in the World Happiness Report so that this can be used to filter out 
#the countries from the datasets from the World Bank Database. 
df_happy2 = df_happy2.reset_index()
countries_happy_report = df_happy2['Country'].tolist()
In [15]:
#filtering out the Happiness Score and Continent for each country in the WHR
df_happy2 = df_happy2.set_index('Country')
df_hap_score = df_happy2.loc[:,['Happiness Score','Continent']]

#Using the countries_happy_report list I made above to keep only the WHR countries from the World Bank datasets. 
#The World Bank data looks at around 260 countries, but I only wanted data on the countries in the World 
#Happiness Report. 
df_suicide_indexed = suicide_data.set_index('Country')
suicide_df_final = df_suicide_indexed.loc[countries_happy_report]

gdp_pc_indexed = gdp_pc.set_index('Country')
gdp_df_final = gdp_pc_indexed.loc[countries_happy_report]

life_expect_indexed = life_expect_df.set_index('Country')
life_expect_final = life_expect_indexed.loc[countries_happy_report]
In [16]:
#resetting the index for all of the dataframes so that 'Country' becomes a column for me to merge them on 
hap_score_final = df_hap_score.reset_index()
suicide_df_final = suicide_df_final.reset_index()
gdp_df_final = gdp_df_final.reset_index()
life_expect_final = life_expect_final.reset_index()
In [17]:
#merging all of the dataframes together to get df_merged which you can see below 
df_merged = pd.merge(hap_score_final, suicide_df_final)
df_merged = pd.merge(pd.merge(df_merged, gdp_df_final), life_expect_final)
df_merged
Out[17]:
Country Happiness Score Continent Suicide Mortality Rate (per 100,000 population) GDP Per Capita Life Expectancy in Years
0 Switzerland 7.59 Europe 17.20 80172.23 83.60
1 Iceland 7.56 Europe 14.00 61466.80 82.20
2 Denmark 7.53 Europe 12.80 54664.00 80.85
3 Norway 7.52 Europe 12.20 70459.18 82.41
4 Canada 7.43 North America 12.50 42322.48 81.90
... ... ... ... ... ... ...
132 Afghanistan 3.58 Asia 4.70 547.23 63.76
133 Rwanda 3.46 Africa 6.70 748.50 67.93
134 Benin 3.34 Africa 9.90 1087.29 60.88
135 Burundi 2.90 Africa 9.10 282.19 60.53
136 Togo 2.84 Africa 9.60 597.47 60.22

137 rows × 6 columns
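
The renaming step above exists because the two sources spell some country names differently. One way to find such mismatches is sketched below with made-up frames; the names `toy_whr`, `toy_wb`, and `unmatched` and the numeric values are assumptions, not part of the original analysis.

```python
import pandas as pd

# Toy stand-ins (hypothetical values): WHR-style names vs World Bank-style names
toy_whr = pd.DataFrame({'Country': ['Norway', 'Russia', 'Laos'],
                        'Happiness Score': [7.5, 5.9, 4.9]})
toy_wb = pd.DataFrame({'Country': ['Norway', 'Russian Federation', 'Lao PDR'],
                       'Suicide Rate': [12.2, 31.0, 8.6]})

# A left merge with indicator=True adds a '_merge' column that flags
# rows from the left frame that found no match on 'Country'
check = toy_whr.merge(toy_wb, on='Country', how='left', indicator=True)
unmatched = check.loc[check['_merge'] == 'left_only', 'Country'].tolist()
print(unmatched)
```

Any names that show up in `unmatched` are candidates for a rename mapping like the one passed to `.replace` earlier, or for dropping if the other source has no entry at all.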

Let's start by taking a look at the Happiness Scores across the globe:

In [18]:
#I decided to display the happiness scores on a Plotly Choropleth map. I used the country names as the locations
#and each color represents where on the scale the Happiness score of each country falls. I also added a slider
#so that we can see how the happiness scores change year by year 
fig = px.choropleth(whr_final, locations = "Country", locationmode ="country names", 
                    color="Happiness Score",animation_frame="Year", animation_group="Country",
                    title ="Happiness Score by Country", width=900, height=600, 
                    projection="natural earth", #makes the edges a rounded shape like the earth rather than square
                    color_continuous_scale=px.colors.sequential.thermal) #setting a color scale 

fig.update_layout(geo=dict(showocean=True, oceancolor='lightblue')) #updating layout to show oceans 
fig.show()

Happiness Scores by Continent:

In [19]:
#creating vertical box plots to show a breakdown of happiness scores by continent...most plotly express graphs are 
#pretty similar when it comes to setting them up - I just used .box to get the boxplots instead 
fig = px.box(whr_final, x="Continent", y="Happiness Score", animation_frame="Year", 
             animation_group="Country", points="all", color="Continent", title='Happiness Scores by Continent', 
             width = 900, height = 600)
fig.update_layout(showlegend=False) #setting showlegend to false since it's not needed in this case 
fig.show()

The 10 Happiest and 10 Least Happy Countries:

In [20]:
#importing library to make subplots with plotly 
from plotly.subplots import make_subplots
#getting the top 10 for every year by using .head and calling 10 to retrieve 10 values 
head_2016 = report_2016.head(10)
head_2017 = report_2017.head(10)
head_2018 = report_2018.head(10)
head_2019 = report_2019.head(10)
head_2020 = report_2020.head(10)

#just creating very simple line graphs to display the information for top 10 countries using the dataframes I 
#created above 
fig = make_subplots(rows=3, cols=2, vertical_spacing = 0.20) #sets number of columns, rows and spacing b/w plots 
fig.add_trace(go.Scatter(x=head_2016['Country'], y=head_2016['Happiness Score'], name='2016'), row=1, col=1)
fig.add_trace(go.Scatter(x=head_2017['Country'], y=head_2017['Happiness Score'], name='2017'), row=1, col=2)
fig.add_trace(go.Scatter(x=head_2018['Country'], y=head_2018['Happiness Score'],name='2018'), row=2, col=1)
fig.add_trace(go.Scatter(x=head_2019['Country'], y=head_2019['Happiness Score'], name='2019'), row=2, col=2)
fig.add_trace(go.Scatter(x=head_2020['Country'], y=head_2020['Happiness Score'], name='2020'), row=3, col=1)

#setting the measurements of the plots and title 
fig.update_layout(height = 800, width=800, title_text= 'Top 10 Happiest Countries')

fig.show()
In [21]:
#getting the bottom 10 for every year by using .tail and calling 10 to retrieve 10 values 
tail_2016 = report_2016.tail(10)
tail_2017 = report_2017.tail(10)
tail_2018 = report_2018.tail(10)
tail_2019 = report_2019.tail(10)
tail_2020 = report_2020.tail(10)

#doing exactly what I did previously again but this time using my dataframes for the bottom countries per year
fig = make_subplots(rows=3, cols=2, vertical_spacing = 0.20)
fig.add_trace(go.Scatter(x=tail_2016['Country'], y=tail_2016['Happiness Score'], name='2016'), row=1, col=1)
fig.add_trace(go.Scatter(x=tail_2017['Country'], y=tail_2017['Happiness Score'], name='2017'), row=1, col=2)
fig.add_trace(go.Scatter(x=tail_2018['Country'], y=tail_2018['Happiness Score'],name='2018'), row=2, col=1)
fig.add_trace(go.Scatter(x=tail_2019['Country'], y=tail_2019['Happiness Score'], name='2019'), row=2, col=2)
fig.add_trace(go.Scatter(x=tail_2020['Country'], y=tail_2020['Happiness Score'], name='2020'), row=3, col=1)

fig.update_layout(height = 800, width=800, title_text= 'Bottom 10 Countries by Happiness Score')

fig.show()
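
Using `.head(10)` and `.tail(10)` works here on the assumption that each report file is already sorted by Happiness Score. If that were not the case, a minimal alternative sketch is to rank explicitly with `nlargest` and `nsmallest`. The frame `toy` and its values are made up for illustration.

```python
import pandas as pd

# Toy report in no particular order (hypothetical values)
toy = pd.DataFrame({'Country': ['A', 'B', 'C', 'D', 'E'],
                    'Happiness Score': [5.0, 7.2, 3.1, 6.4, 4.4]})

# Pick the highest and lowest scores without relying on the row order
top2 = toy.nlargest(2, 'Happiness Score')
bottom2 = toy.nsmallest(2, 'Happiness Score')

print(top2['Country'].tolist())
print(bottom2['Country'].tolist())
```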

How strong is the correlation between each factor and the Happiness Score?

In [22]:
#defining a function that can be used to call any of the 6 factors and compare them to the happiness score to see
#how high the correlation is between the two. Again I added the slider so that you can go through the correlation
#by factor and by year. 
def plot_happiness_vs_variable(variable):
    df = whr_final.loc[:,["Happiness Score", variable, 'Year']]
    fig = px.scatter(df, x="Happiness Score", y=variable, animation_frame="Year", animation_group= variable,
                 trendline="ols", title="Happiness Score vs." + " " + str(variable), width=800, height=600,
                 color_discrete_sequence=['purple'])
    #adjusting the size of the markers in the scatter plot 
    fig.update_traces(marker=dict(size=8),
                  selector=dict(mode="markers"))
    fig.show()

#calling the function with each of the 6 factors 
plot_happiness_vs_variable('Economy (GDP Per Capita)')
plot_happiness_vs_variable('Social Support')
plot_happiness_vs_variable('Health')
plot_happiness_vs_variable('Freedom')
plot_happiness_vs_variable('Trust')
plot_happiness_vs_variable('Generosity')
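
The trendlines show the relationships visually. To put a number on them, a minimal sketch of computing the Pearson correlation between each factor and the Happiness Score per year could look like the following. It runs on a small made-up frame; `toy`, `factors`, and `corr_by_year` are hypothetical names, and the values are not from the report.

```python
import pandas as pd

# Toy long-format data with two years and two factors (hypothetical values)
toy = pd.DataFrame({
    'Year': [2016] * 4 + [2017] * 4,
    'Happiness Score': [7.0, 6.0, 5.0, 4.0, 7.1, 6.2, 5.1, 4.3],
    'Health': [0.9, 0.8, 0.6, 0.4, 0.9, 0.7, 0.6, 0.5],
    'Generosity': [0.2, 0.4, 0.1, 0.3, 0.3, 0.1, 0.4, 0.2],
})
factors = ['Health', 'Generosity']

# For each year, correlate every factor column with the Happiness Score;
# the result has one row per year and one column per factor
corr_by_year = pd.DataFrame({
    year: grp[factors].corrwith(grp['Happiness Score'])
    for year, grp in toy.groupby('Year')
}).T

print(corr_by_year)
```

Applied to a frame shaped like `whr_final`, the same pattern would give one table that can be read alongside the scatter plots.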

Are any of the factors highly correlated with each other?

In [23]:
#creating dataframes containing the correlation between each of the factors using .corr - I dropped the columns 
#that should not be part of the correlation matrix. numeric_only=True (pandas 1.5+) skips the text columns 
#(Country, Continent), which newer pandas versions no longer drop automatically 
corr_2016 = report_2016.drop(columns=['Year', 'Dystopia Residual']).corr(numeric_only=True)
corr_2017 = report_2017.drop(columns=['Year']).corr(numeric_only=True)
corr_2018 = report_2018.drop(columns=['Year']).corr(numeric_only=True)
corr_2019 = report_2019.drop(columns=['Year', 'Dystopia Residual']).corr(numeric_only=True)
corr_2020 = report_2020.drop(columns=['Year', 'Dystopia Residual']).corr(numeric_only=True)

plt.figure(figsize=(20, 15))

#using seaborn to create heatmaps by year as subplots, cmap sets the color scale I want, subplots are set using
#matplotlib - first number is number of rows, second is number of columns, and third is the position of the plot
#within that grid 
plt.subplot(2,3,1)
plt.title('2016',fontsize=15)
sns.heatmap(corr_2016, square=True, cmap="magma")

plt.subplot(2,3,2)
plt.title('2017', fontsize=15)
sns.heatmap(corr_2017, square=True, cmap="magma")

plt.subplot(2,3,3)
plt.title('2018', fontsize=15)
sns.heatmap(corr_2018, square=True, cmap="magma")

plt.subplot(2,3,4)
plt.title('2019', fontsize=15)
sns.heatmap(corr_2019, square=True, cmap="magma")

plt.subplot(2,3,5)
plt.title('2020', fontsize=15)
sns.heatmap(corr_2020, square=True, cmap="magma")

plt.tight_layout(pad=1.0) #setting the space between each subplot 
plt.show()
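
Heatmaps make it easy to spot bright cells, but ranking the pairs can be easier to read. Below is a minimal sketch of listing factor pairs from a correlation matrix in order of strength. The frame `toy` and its values are made up, and the shortened column names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy factor columns (hypothetical values)
toy = pd.DataFrame({
    'Economy': [1.4, 1.2, 0.9, 0.5, 0.3],
    'Health': [0.9, 0.8, 0.7, 0.4, 0.3],
    'Generosity': [0.3, 0.1, 0.4, 0.2, 0.3],
})
corr = toy.corr()

# Keep only the upper triangle (k=1 drops the diagonal) so each pair appears once,
# then flatten to a Series indexed by (factor, factor) and sort by correlation
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna().sort_values(ascending=False)

print(pairs)
```

Sorting by `pairs.abs()` instead would rank strong negative correlations alongside strong positive ones.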

Comparing Happiness Scores to Suicide Mortality Rates

As I mentioned above, we are ONLY looking at 2016 for the following analysis. Let's start by looking at the global Suicide Mortality Rates.

In [24]:
#did exactly what I did for the previous choropleth map except this time my data is only for one year - 2016, and
#I just used my dataframe containing the suicide data for 2016
fig = px.choropleth(suicide_data, locations="Country",
                    color='Suicide Mortality Rate (per 100,000 population)', locationmode ="country names",
                    labels ={'Suicide Mortality Rate (per 100,000 population)':'Suicide Mortality Rate'},
                    title = "Suicide Mortality Rate (Per 100,000 Population)",
                    projection = "natural earth",
                    color_continuous_scale=px.colors.sequential.YlOrRd)

fig.update_layout(geo=dict(showocean=True, oceancolor='lightblue'))
fig.show()

Looking at Happiness Scores vs. Suicide Rates of the Countries in the WHR:

In [25]:
#creating a scatter plot to show happiness scores vs suicide rates by continent - setting the color will 
#make sure the points are by continent 
fig = px.scatter(df_merged, x="Happiness Score", y="Suicide Mortality Rate (per 100,000 population)", 
                 color="Continent",
                 title = "Happiness Score vs. Suicide Rates For Countries in the WHR",
                 width=900, height=500, hover_name="Country")

#adjusting marker settings - adding a darker outline and making them a bit larger 
fig.update_traces(marker=dict(size=9,
                             line=dict(width=2,
                                       color="DarkSlateGrey")),
                  selector=dict(mode="markers"))                                            
fig.show()
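
To summarize a scatter plot like this with a single number, one option is a correlation coefficient. The sketch below computes Pearson and Spearman correlations on a small made-up frame; the values are hypothetical and are not the 2016 results. Spearman is computed here as the Pearson correlation of the ranks, which avoids needing any library beyond pandas.

```python
import pandas as pd

# Toy data (hypothetical values, NOT the real 2016 figures)
toy = pd.DataFrame({
    'Happiness Score': [7.5, 6.8, 5.9, 4.6, 3.4],
    'Suicide Rate': [12.0, 14.5, 9.0, 6.5, 7.0],
})

# Pearson measures linear association between the raw values
pearson = toy['Happiness Score'].corr(toy['Suicide Rate'])

# Spearman measures monotonic association: Pearson applied to the ranks
spearman = toy['Happiness Score'].rank().corr(toy['Suicide Rate'].rank())

print(round(pearson, 2), round(spearman, 2))
```

On the real data, the same two lines could be run on `df_merged` after dropping rows with missing suicide rates.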

Do the factors that have a large impact on a country's happiness have any correlation to their suicide mortality rates?

In [26]:
#Since GDP per capita and life expectancy seem to be the factors that play into a country's Happiness Score the 
#most, I thought it would be interesting to look at if there is any correlation between suicide rates and 
#GDP per capita and life expectancy. 

#creating a function to be used to call a given variable vs suicide rate, similar to the previous one I used for 
#happiness score vs given variable. 
def suicide_vs_variable(variable):
    fig = px.scatter(df_merged, x="Suicide Mortality Rate (per 100,000 population)", y=variable, 
                 color='Continent', 
                 title = "Suicide Rate vs " + str(variable),
                 width=900, height=500)
    #adjusting marker size and outlines 
    fig.update_traces(marker=dict(size=9,
                             line=dict(width=2,
                                    color="DarkSlateGrey")),
                  selector=dict(mode="markers"))                                         
    fig.show()

#using the function to display suicide rates vs GDP per capita and suicide rates vs life expectancy 
suicide_vs_variable('GDP Per Capita')
suicide_vs_variable('Life Expectancy in Years')

Conclusions:

There are a few different conclusions that can be drawn from my analysis so I will break those into 2 different parts, one viewing the report as a whole and the other as it relates to suicide mortality rates.

  1. When analyzing just the World Happiness Report itself, we can see that Europe and North America seem to have some of the happiest countries, while Africa and Asia have a lot of the least happy countries. This wasn't too shocking to me, but what was surprising was the weight that each of the 6 factors held when it came to calculating the World Happiness Scores. I knew that GDP per capita would probably have the heaviest weight on the happiness score, but I was shocked at how low the correlation between Perceptions of Government Corruption (Trust) and the happiness score was.
  2. The most shocking finding of all was that countries with lower happiness scores, for the most part, had lower suicide rates and vice versa. After conducting my analysis, here are some of the reasons I think that is:

    • First off, the factors and the weight of each factor may just not be effectively measuring the 'happiness' of individuals. When I compared the suicide rates to GDP per capita and life expectancy, we saw that a lot of countries with fairly low GDP per capita had low suicide rates, and we saw a similar outcome when it came to life expectancy and suicide rates. So maybe the notion that wealthier countries are happier simply isn't the case, and measuring something as subjective as happiness using only six factors isn't actually leading to accurate depictions of global happiness.
    • If we look at the countries with lower suicide rates (and lower happiness scores), most of them tend to be the less wealthy countries. These countries just may not have the means to keep track of suicide mortality rates the same way that wealthier countries do. This could be the same reason why it was very difficult for me to find up-to-date suicide information for a lot of the countries on the lower end of the happiness scale.
    • Going along with this, in a lot of Asian and African countries, suicide, and mental health in general for that matter, are still very taboo topics. It's definitely upsetting to think this, but many suicides might be going unreported because suicide is frowned upon in their societies.

Future Predictions:

What I'm really interested to see is next year's World Happiness Report, which will take into account data from AFTER Covid-19's emergence. I think that Covid-19 will have a significant impact on happiness across the globe - almost every factor taken into account in the calculation of the Happiness Score has been affected by Covid-19, especially Economy and Health, which are the largest factors that play into the score.

If you look at the reports by year, European and North American countries consistently come in as the 'happiest' countries, yet these also happen to be among the countries that have been hit hardest by Covid-19. I think that the 2021 report will actually have some significant changes in the happiest and least happy countries, which we saw was not the case in the past couple of years.