The World Happiness Report has been published every year since 2012 on March 20th, which the United Nations observes as the International Day of Happiness. The official World Happiness Report website describes the report as "...a landmark survey of the state of global happiness that ranks 156 countries by how happy their citizens perceive themselves to be." Each country is given a 'Happiness Score', which is essentially based on six different factors: GDP per capita, Social Support (Family), Health (Life Expectancy), Freedom to make life choices, Generosity, and Perceptions of Government Corruption (Trust). Each country is compared to an imaginary "Dystopia" nation, which is assigned the lowest national value observed for each variable.
I will be analyzing the happiest and least happy countries from 2016-2020, which factors weigh heaviest in the Happiness Score, and whether a country's Happiness Ranking has any correlation with its suicide mortality rate. I chose this topic because I only found out very recently that the World Happiness Report existed, and I thought this would be a great opportunity to look into the data in the report. To be honest, when I heard about the report I was pretty skeptical about how they can really measure 'happiness', which is why I chose to compare the scores to suicide rates in these countries - I figured this would give me a good basis for judging the validity of the scores.
For this project I used 8 separate data sets: 5 for the World Happiness Report data (2016-2020), which came from Kaggle, and one each for suicide mortality rates, GDP per capita, and life expectancy, all three of which came from the World Bank Database. I got the World Happiness Report data sets from Kaggle because I was able to find ones that were already pretty much clean, so all I had to do was narrow down a few columns and merge them. For the other datasets, the World Bank was an easy place to find all of them, and I figured it is probably the most accurate data I can find on a global scale.
#importing the libraries I will be needing throughout this project
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
#i used plotly offline so that whoever is viewing this will not need any sort of passwords/keys to view it
from plotly.offline import iplot, init_notebook_mode
init_notebook_mode()
import plotly.graph_objs as go
import plotly.express as px
import seaborn as sns
from IPython.display import display
import warnings
warnings.filterwarnings('ignore')
#Reading in the different csv files using read_csv since they are csv files - skipping the first couple rows on
#the world bank data sets to fix the formatting
report_2016 = pd.read_csv('2016_report.csv')
report_2017 = pd.read_csv('2017_report.csv')
report_2018 = pd.read_csv('2018_report.csv')
report_2019 = pd.read_csv('2019_report.csv')
report_2020 = pd.read_csv('2020_report.csv')
suicide_info = pd.read_csv('suicide_data.csv', skiprows = 4)
gdp_per_cap = pd.read_csv('gdp_per_cap.csv', skiprows = 4)
life_expect = pd.read_csv('life_expectancy.csv', skiprows = 4)
#Removes scientific notation where needed
pd.set_option('display.float_format', lambda x: '%.2f' % x)
#adding a year column to each happiness report dataframe
report_2016['Year'] = 2016
report_2017['Year'] = 2017
report_2018['Year'] = 2018
report_2019['Year'] = 2019
report_2020['Year'] = 2020
#changing columns name in 2016 and 2019 data frames so that all of the columns match up
report_2016 = report_2016.rename(columns={'family':'social_support'})
report_2019 = report_2019.rename(columns={'family':'social_support'})
#renaming the columns to make sure that the columns across all 5 reports match up exactly and also to add some
#capitalization/spacing where needed
def rename_columns(df):
    df = df.rename(columns={'country':'Country','happiness_score':'Happiness Score','gdp_per_capita': 'Economy (GDP Per Capita)',
                            'social_support':'Social Support','health': 'Health', 'freedom':'Freedom', 'government_trust': 'Trust',
                            'generosity':'Generosity', 'dystopia_residual':'Dystopia Residual',
                            'continent':'Continent'})
    return df
#Calling the function I created above on each of the dataframes
report_2016 = rename_columns(report_2016)
report_2017 = rename_columns(report_2017)
report_2018 = rename_columns(report_2018)
report_2019 = rename_columns(report_2019)
report_2020 = rename_columns(report_2020)
#combining the 5 datasets into one dataframe - using pd.concat because DataFrame.append was removed in pandas 2.0
whr_final = pd.concat([report_2016, report_2017, report_2018, report_2019, report_2020])
#after combining the datasets we get something that looks like this - all of the data together in one
#dataframe from 2016-2020
whr_final
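As a side note, here is a minimal sketch of the same "tag each report with its year, then stack them" step done in one pass. The two small dataframes and the names `toy_reports` and `toy_combined` are made up for illustration and are not part of the actual report data.

```python
import pandas as pd

# Toy stand-ins for the yearly report dataframes (values are invented).
toy_reports = {
    2016: pd.DataFrame({"Country": ["A", "B"], "Happiness Score": [7.5, 3.1]}),
    2017: pd.DataFrame({"Country": ["A", "B"], "Happiness Score": [7.4, 3.3]}),
}

# Add a Year column to each frame, then stack them with a single concat call.
# ignore_index=True gives the combined frame a fresh 0..n-1 index instead of
# repeating each report's original row numbers.
toy_combined = pd.concat(
    [df.assign(Year=year) for year, df in toy_reports.items()],
    ignore_index=True,
)

print(toy_combined)
```

Keeping the reports in a dictionary keyed by year means adding another year is one extra entry rather than several extra lines.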
#Creating a list to filter out the columns I want to have in my suicide dataframe.
columns = ['Country Name','2016']
suicide_data = suicide_info[columns]
suicide_data = suicide_data.rename(columns= {"2016":"Suicide Mortality Rate (per 100,000 population)",
"Country Name": "Country"})
suicide_data.head()
#Using the same columns list from above to filter out the gdp dataframe as well.
gdp_pc = gdp_per_cap[columns]
gdp_pc = gdp_pc.rename(columns= {"2016":"GDP Per Capita", "Country Name":"Country"})
gdp_pc.head()
#Now lastly just filtering out the life expectancy dataframe as well, the same way I did suicide rates and gdp.
life_expect_df = life_expect[columns]
life_expect_df = life_expect_df.rename(columns= {"2016":"Life Expectancy in Years", "Country Name":"Country"})
life_expect_df.head()
#Creating a dataframe that consists of the happiness score, suicide rate, gdp per capita, and life expectancy for
#each country in the happiness report in 2016. I am ONLY using the countries in the happiness report for this so
#that everything is consistent and no data is missing.
#Renaming country names in the happiness report to match the official country names that the World Bank uses
df_happy1 = report_2016.replace({'Country': {'Venezuela':'Venezuela, RB', 'Hong Kong':'Hong Kong SAR, China',
'South Korea':'Korea, Rep.', 'Egypt':'Egypt, Arab Rep.',
'Russia':'Russian Federation', 'Yemen':'Yemen, Rep.',
'Congo (Kinshasa)':'Congo, Dem. Rep.', 'Congo (Brazzaville)': 'Congo, Rep.',
'Iran': 'Iran, Islamic Rep.', 'Laos':'Lao PDR', 'Macedonia':'North Macedonia',
'Kyrgyzstan': 'Kyrgyz Republic', 'Syria':'Syrian Arab Republic',
'Slovakia': 'Slovak Republic', 'Ivory Coast':'Cote d\'Ivoire'}})
#Dropping countries that are in the WHR but not part of the World Bank data.
df_happy2 = df_happy1.set_index('Country')
df_happy2 = df_happy2.drop(['Palestinian Territories'])
#Creating a list of all of the countries in the World Happiness Report so that this can be used to filter out
#the countries from the datasets from the World Bank Database.
df_happy2 = df_happy2.reset_index()
countries_happy_report = df_happy2['Country'].tolist()
#filtering out the Happiness Score and Continent for each country in the WHR
df_happy2 = df_happy2.set_index('Country')
df_hap_score = df_happy2.loc[:,['Happiness Score','Continent']]
#Using the countries_happy_report list I made above, I'm using that to filter out the countries from the countries
#in the World Bank datasets. The World Bank data looks at around 260 countries, but I only wanted data on the
#countries in the World Happiness Report.
df_suicide_indexed = suicide_data.set_index('Country')
suicide_df_final = df_suicide_indexed.loc[countries_happy_report]
gdp_pc_indexed = gdp_pc.set_index('Country')
gdp_df_final = gdp_pc_indexed.loc[countries_happy_report]
life_expect_indexed = life_expect_df.set_index('Country')
life_expect_final = life_expect_indexed.loc[countries_happy_report]
#resetting the index for all of the dataframes so that 'Country' becomes a column for me to merge them on
hap_score_final = df_hap_score.reset_index()
suicide_df_final = suicide_df_final.reset_index()
gdp_df_final = gdp_df_final.reset_index()
life_expect_final = life_expect_final.reset_index()
#merging all of the dataframes together to get df_merged which you can see below
df_merged = pd.merge(hap_score_final, suicide_df_final)
df_merged = pd.merge(pd.merge(df_merged, gdp_df_final), life_expect_final)
df_merged
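Most of the work in the merge above is getting the country names to match between the two sources. The following is a minimal sketch of one way to find names that still do not match, using a left merge with `indicator=True`. The toy dataframes and their values are invented for illustration; only the idea carries over to the real data.

```python
import pandas as pd

# Toy data: "South Korea" is spelled differently in the second source.
happy_toy = pd.DataFrame({"Country": ["Finland", "South Korea"],
                          "Happiness Score": [7.4, 5.8]})
bank_toy = pd.DataFrame({"Country": ["Finland", "Korea, Rep."],
                         "GDP Per Capita": [43000.0, 29000.0]})

# A left merge keeps every happiness-report row; indicator=True adds a
# "_merge" column saying whether a match was found in the other frame.
checked = happy_toy.merge(bank_toy, on="Country", how="left", indicator=True)

# Rows marked "left_only" are countries whose names still need to be mapped.
unmatched = checked.loc[checked["_merge"] == "left_only", "Country"].tolist()
print(unmatched)
```

Unlike selecting with `.loc` on a list of names, which raises a `KeyError` on a missing country, this approach lists all of the mismatches at once.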
#I decided to display the happiness scores on a Plotly Choropleth map. I used the country names as the locations
#and each color represents where on the scale the Happiness score of each country falls. I also added a slider
#so that we can see how the happiness scores change year by year
fig = px.choropleth(whr_final, locations = "Country", locationmode ="country names",
color="Happiness Score",animation_frame="Year", animation_group="Country",
title ="Happiness Score by Country", width=900, height=600,
projection="natural earth", #makes the edges a rounded shape like the earth rather than square
color_continuous_scale=px.colors.sequential.thermal) #setting a color scale
fig.update_layout(geo=dict(showocean=True, oceancolor='lightblue')) #updating layout to show oceans
fig.show()
#creating vertical box plots to show a breakdown of happiness scores by continent...most plotly express graphs are
#pretty similar when it comes to setting them up - I just used .box to get the boxplots instead
fig = px.box(whr_final, x="Continent", y="Happiness Score", animation_frame="Year",
animation_group="Country", points="all", color="Continent", title='Happiness Scores by Continent',
width = 900, height = 600)
fig.update_layout(showlegend=False) #setting showlegend to false since it's not needed in this case
fig.show()
#importing library to make subplots with plotly
from plotly.subplots import make_subplots
#getting the top 10 for every year by using .head and calling 10 to retrieve 10 values
head_2016 = report_2016.head(10)
head_2017 = report_2017.head(10)
head_2018 = report_2018.head(10)
head_2019 = report_2019.head(10)
head_2020 = report_2020.head(10)
#just creating very simple line graphs to display the information for top 10 countries using the dataframes I
#created above
fig = make_subplots(rows=3, cols=2, vertical_spacing = 0.20) #sets number of columns, rows and spacing b/w plots
fig.add_trace(go.Scatter(x=head_2016['Country'], y=head_2016['Happiness Score'], name='2016'), row=1, col=1)
fig.add_trace(go.Scatter(x=head_2017['Country'], y=head_2017['Happiness Score'], name='2017'), row=1, col=2)
fig.add_trace(go.Scatter(x=head_2018['Country'], y=head_2018['Happiness Score'],name='2018'), row=2, col=1)
fig.add_trace(go.Scatter(x=head_2019['Country'], y=head_2019['Happiness Score'], name='2019'), row=2, col=2)
fig.add_trace(go.Scatter(x=head_2020['Country'], y=head_2020['Happiness Score'], name='2020'), row=3, col=1)
#setting the measurements of the plots and title
fig.update_layout(height = 800, width=800, title_text= 'Top 10 Happiest Countries')
fig.show()
#getting the bottom 10 for every year by using .tail and calling 10 to retrieve 10 values
tail_2016 = report_2016.tail(10)
tail_2017 = report_2017.tail(10)
tail_2018 = report_2018.tail(10)
tail_2019 = report_2019.tail(10)
tail_2020 = report_2020.tail(10)
#doing exactly what I did previously again but this time using my dataframes for the bottom countries per year
fig = make_subplots(rows=3, cols=2, vertical_spacing = 0.20)
fig.add_trace(go.Scatter(x=tail_2016['Country'], y=tail_2016['Happiness Score'], name='2016'), row=1, col=1)
fig.add_trace(go.Scatter(x=tail_2017['Country'], y=tail_2017['Happiness Score'], name='2017'), row=1, col=2)
fig.add_trace(go.Scatter(x=tail_2018['Country'], y=tail_2018['Happiness Score'],name='2018'), row=2, col=1)
fig.add_trace(go.Scatter(x=tail_2019['Country'], y=tail_2019['Happiness Score'], name='2019'), row=2, col=2)
fig.add_trace(go.Scatter(x=tail_2020['Country'], y=tail_2020['Happiness Score'], name='2020'), row=3, col=1)
fig.update_layout(height = 800, width=800, title_text= '10 Least Happy Countries')
fig.show()
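The `.head(10)` and `.tail(10)` calls above work because each report is already sorted by rank. If that is not guaranteed, a minimal sketch along these lines picks the extremes by score directly. The dataframe `toy_report` and its values are made up for illustration.

```python
import pandas as pd

# Toy report that is deliberately NOT sorted by score (values are invented).
toy_report = pd.DataFrame({
    "Country": ["A", "B", "C", "D"],
    "Happiness Score": [5.2, 7.6, 3.1, 6.4],
})

# nlargest / nsmallest select rows by value, regardless of row order.
top_2 = toy_report.nlargest(2, "Happiness Score")
bottom_2 = toy_report.nsmallest(2, "Happiness Score")

print(top_2["Country"].tolist())
print(bottom_2["Country"].tolist())
```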
#defining a function that can be used to call any of the 6 factors and compare them to the happiness score to see
#how high the correlation is between the two. Again I added the slider so that you can go through the correlation
#by factor and by year.
def plot_happiness_vs_variable(variable):
    df = whr_final.loc[:,["Happiness Score", variable, 'Year']]
    fig = px.scatter(df, x="Happiness Score", y=variable, animation_frame="Year", animation_group= variable,
                     trendline="ols", title="Happiness Score vs." + " " + str(variable), width=800, height=600,
                     color_discrete_sequence=['purple'])
    #adjusting the size of the markers in the scatter plot
    fig.update_traces(marker=dict(size=8),
                      selector=dict(mode="markers"))
    fig.show()
#calling the function with each of the 6 factors
plot_happiness_vs_variable('Economy (GDP Per Capita)')
plot_happiness_vs_variable('Social Support')
plot_happiness_vs_variable('Health')
plot_happiness_vs_variable('Freedom')
plot_happiness_vs_variable('Trust')
plot_happiness_vs_variable('Generosity')
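The scatter plots show the trend visually. To put a number on how strongly each factor tracks the Happiness Score in each year, a minimal sketch like the one below could be used. The dataframe `toy` is shaped like `whr_final`, but it only has two factors and its values are invented.

```python
import pandas as pd

# Toy long-format frame shaped like whr_final (numbers are invented).
toy = pd.DataFrame({
    "Year": [2016, 2016, 2016, 2017, 2017, 2017],
    "Happiness Score": [3.0, 5.0, 7.0, 3.2, 5.1, 7.3],
    "Health": [0.2, 0.5, 0.9, 0.3, 0.6, 0.8],
    "Generosity": [0.3, 0.1, 0.2, 0.2, 0.3, 0.1],
})
factors = ["Health", "Generosity"]

# For each year, compute the Pearson correlation of every factor column
# with the Happiness Score. After the transpose there is one row per year
# and one column per factor.
corr_by_year = pd.DataFrame({
    year: group[factors].corrwith(group["Happiness Score"])
    for year, group in toy.groupby("Year")
}).T

print(corr_by_year.round(2))
```

With the real data, `toy` would be replaced by `whr_final` and `factors` by the six factor column names.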
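#creating dataframes containing the correlation between each of the factors using .corr - I dropped the columns
#that would not have any kind of correlation associated to them, and kept only the numeric columns since newer
#versions of pandas raise an error when .corr is called on text columns such as Country
corr_2016 = report_2016.drop(columns=['Year', 'Dystopia Residual']).select_dtypes('number').corr()
corr_2017 = report_2017.drop(columns=['Year']).select_dtypes('number').corr()
corr_2018 = report_2018.drop(columns=['Year']).select_dtypes('number').corr()
corr_2019 = report_2019.drop(columns=['Year', 'Dystopia Residual']).select_dtypes('number').corr()
corr_2020 = report_2020.drop(columns=['Year', 'Dystopia Residual']).select_dtypes('number').corr()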
plt.figure(figsize=(20, 15))
#using seaborn to create heatmaps by year as subplots, cmap sets the color scale I want, subplots are set using
#matplotlib - first number is number of rows, second is number of columns, and third is the position of the plot
#within that grid
plt.subplot(2,3,1)
plt.title('2016',fontsize=15)
sns.heatmap(corr_2016, square=True, cmap="magma")
plt.subplot(2,3,2)
plt.title('2017', fontsize=15)
sns.heatmap(corr_2017, square=True, cmap="magma")
plt.subplot(2,3,3)
plt.title('2018', fontsize=15)
sns.heatmap(corr_2018, square=True, cmap="magma")
plt.subplot(2,3,4)
plt.title('2019', fontsize=15)
sns.heatmap(corr_2019, square=True, cmap="magma")
plt.subplot(2,3,5)
plt.title('2020', fontsize=15)
sns.heatmap(corr_2020, square=True, cmap="magma")
plt.tight_layout(pad=1.0) #setting the space between each subplot
plt.show()
#did exactly what I did for the previous choropleth map except this time my data is only for one year - 2016, and
#I just used my dataframe containing the suicide data for 2016
fig = px.choropleth(suicide_data, locations="Country",
color='Suicide Mortality Rate (per 100,000 population)', locationmode ="country names",
labels ={'Suicide Mortality Rate (per 100,000 population)':'Suicide Mortality Rate'},
title = "Suicide Mortality Rate (Per 100,000 Population)",
projection = "natural earth",
color_continuous_scale=px.colors.sequential.YlOrRd)
fig.update_layout(geo=dict(showocean=True, oceancolor='lightblue'))
fig.show()
#creating a scatter plot to show happiness scores vs suicide rates by continent - setting the color will
#make sure the points are by continent
fig = px.scatter(df_merged, x="Happiness Score", y="Suicide Mortality Rate (per 100,000 population)",
color="Continent",
title = "Happiness Score vs. Suicide Rates For Countries in the WHR",
width=900, height=500, hover_name="Country")
#adjusting marker settings - adding a darker outline and making them a bit larger
fig.update_traces(marker=dict(size=9,
line=dict(width=2,
color="DarkSlateGrey")),
selector=dict(mode="markers"))
fig.show()
#Since GDP per capita and life expectancy seem to be the factors that play into a country's Happiness Score the
#most, I thought it would be interesting to look at if there is any correlation between suicide rates and
#GDP per capita and life expectancy.
#creating a function to be used to call a given variable vs suicide rate, similar to the previous one I used for
#happiness score vs given variable.
def suicide_vs_variable(variable):
    fig = px.scatter(df_merged, x="Suicide Mortality Rate (per 100,000 population)", y=variable,
                     color='Continent',
                     title = "Suicide Rate vs " + str(variable),
                     width=900, height=500)
    #adjusting marker size and outlines
    fig.update_traces(marker=dict(size=9,
                                  line=dict(width=2,
                                            color="DarkSlateGrey")),
                      selector=dict(mode="markers"))
    fig.show()
#using the function to display suicide rates vs GDP per capita and suicide rates vs life expectancy
suicide_vs_variable('GDP Per Capita')
suicide_vs_variable('Life Expectancy in Years')
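The scatter plots give a visual impression of the relationship. A minimal sketch for putting a number on it is shown below, using both Pearson and Spearman correlation. The dataframe `toy_merged` stands in for `df_merged`, and its values are invented for illustration only, so the printed numbers say nothing about the real data.

```python
import pandas as pd

# Toy stand-in for df_merged (values are invented for illustration only).
toy_merged = pd.DataFrame({
    "Happiness Score": [3.5, 4.2, 5.0, 6.1, 7.3],
    "Suicide Mortality Rate (per 100,000 population)": [4.0, 6.5, 5.5, 11.0, 13.5],
})

x = toy_merged["Happiness Score"]
y = toy_merged["Suicide Mortality Rate (per 100,000 population)"]

# Pearson measures linear association between the raw values.
pearson_r = x.corr(y)

# Spearman is the Pearson correlation of the ranks, so it is less
# sensitive to outliers and skewed distributions.
spearman_r = x.rank().corr(y.rank())

print(f"Pearson r = {pearson_r:.2f}, Spearman rho = {spearman_r:.2f}")
```

Reporting both is a reasonable check: if they disagree a lot, a handful of extreme countries may be driving the apparent relationship.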
There are a few different conclusions that can be drawn from my analysis, so I will break them into two parts: one looking at the report as a whole and the other at how it relates to suicide mortality rates.
The most shocking finding of all was that countries with lower happiness scores, for the most part, had lower suicide rates, and vice versa. After conducting my analysis, here are some of the reasons I think that is:
What I'm really interested to see is next year's World Happiness Report, which will take into account data from AFTER Covid-19's emergence. I think that Covid-19 will have a significant impact on happiness across the globe - almost every factor taken into account in the calculation of the Happiness Score has been affected by Covid-19, especially Economy and Health, which are the largest factors that play into the score.
If you look at the reports by year, European and North American countries consistently come in as the 'happiest' countries, yet these also happen to be the countries that are dealing with Covid-19 the worst. I think that the 2021 report will actually show some significant changes in the happiest and least happy countries, which we saw was not the case in the past couple of years.