Data

World Happiness Report data

The original dataset, WHR.xls, comes primarily from the World Happiness Report published by the United Nations Sustainable Development Solutions Network (UNSDSN). The data were collected by asking respondents to rate their current lives on a scale from 0 to 10, where 0 is the worst possible life and 10 is the best possible life. This lets us compare happiness levels and inequality across different parts of the world.

Variable definition

Outcome and factors that we are interested in:

Life Ladder: happiness score or subjective well-being, measured on a ladder with steps numbered from 0 at the bottom to 10 at the top. (Renamed happiness_score in the dataset.)
Log GDP per capita: the log of GDP per capita in purchasing power parity (PPP). (Renamed gdp in the dataset.)
Social support: the national average of the binary responses (either 0 or 1) to the Gallup World Poll (GWP) question “If you were in trouble, do you have relatives or friends you can count on to help you whenever you need them, or not?” (Named social_support in the dataset.)
Healthy life expectancy at birth: based on data extracted from the World Health Organization’s (WHO) Global Health Observatory data repository. (Renamed life_expectance in the dataset.)
Freedom to make life choices: the national average of the binary responses (either 0 or 1) to the GWP question “Are you satisfied or dissatisfied with your freedom to choose what you do with your life?” (Renamed freedom in the dataset.)
Generosity: the residual from regressing the national average of responses to the GWP question “Have you donated money to a charity in the past month?” on GDP per capita. (Named generosity in the dataset.)
Perceptions of corruption: the national average of the survey responses to two GWP questions: “Is corruption widespread throughout the government or not?” and “Is corruption widespread within businesses or not?” The overall perception is the average of the two 0-or-1 responses. (Renamed corruption in the dataset.)
Positive affect: the average of three positive affect measures in the GWP (happiness, laughter, and enjoyment), taken respectively from the responses to “Did you experience the following feelings during A LOT OF THE DAY yesterday? How about Happiness?”, “Did you smile or laugh a lot yesterday?”, and “Did you experience the following feelings during A LOT OF THE DAY yesterday? How about Enjoyment?” (Named positive_affect in the dataset.)
Negative affect: the average of three negative affect measures in the GWP (worry, sadness, and anger), taken respectively from the responses to “Did you experience the following feelings during A LOT OF THE DAY yesterday? How about Worry?”, “Did you experience the following feelings during A LOT OF THE DAY yesterday? How about Sadness?”, and “Did you experience the following feelings during A LOT OF THE DAY yesterday? How about Anger?” (Named negative_affect in the dataset.)
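
To make two of the definitions above concrete, here is a minimal sketch on made-up numbers. It computes a national average of binary responses and a generosity-style regression residual. The data frames `responses` and `toy`, their values, and their column names are assumptions for illustration only; they are not WHR data, and the report's own regression may differ in detail.

```r
# Toy respondent-level data: 1 = "yes", 0 = "no" (made-up values)
responses <- data.frame(
  country = c("A", "A", "A", "A", "B", "B", "B", "B"),
  support = c(1, 1, 0, 1, 0, 1, 0, 0)
)

# National average of binary responses = share of respondents answering "yes"
social_support <- tapply(responses$support, responses$country, mean)
social_support
# A: 0.75, B: 0.25

# Toy country-level data (made-up values)
toy <- data.frame(
  donated = c(0.20, 0.35, 0.30, 0.55, 0.40),  # share who donated last month
  gdp     = c(7.5, 8.5, 9.0, 10.5, 11.0)      # hypothetical GDP measure
)

# Generosity-style measure: what is left of the donation share
# after the part explained by GDP has been removed
fit <- lm(donated ~ gdp, data = toy)
toy$generosity <- resid(fit)
```

Because the regression includes an intercept, the residuals average to zero, so a positive value means a country donates more than its GDP alone would suggest.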

Data Tidy

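library(tidyverse)
library(readxl)

# Read the raw report data and convert the column names to snake_case
data_all = read_excel("./data/WHR.xls") %>% 
  janitor::clean_names() 

# Count the missing values in a vector
sum_na = function(x) {
  sum(is.na(x))
}

# Number of missing values in each column
missing_number = map(data_all, sum_na) %>% 
  as.data.frame()  

# Keep the years 2007-2018 and the selected range of columns
final_data = 
  data_all %>% 
  filter(year %in% 2007:2018) %>% 
  select(country_name:negative_affect)

write.csv(final_data, file = "data/final_data_all_country.csv", quote = FALSE, row.names = FALSE)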

I kept the variables with fewer than 100 missing values and the observations from 2007 to 2018. The resulting dataset, final_data_all_country.csv, contains 11 variables and 1588 observations from 164 countries. No rows with missing values were removed.
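
The column selection above is written as a fixed range. As a hedged alternative, the sketch below applies a missing-value threshold programmatically. The data frame `toy`, its values, and the threshold `max_missing` are hypothetical stand-ins (the text uses 100 on the real data), so this illustrates the idea rather than reproducing the actual selection.

```r
library(dplyr)

# Toy data standing in for data_all (made-up values)
toy <- tibble(
  country_name = c("A", "B", "C", "D"),
  life_ladder  = c(5.1, NA, 6.3, 7.0),
  mostly_empty = c(NA, NA, NA, 1)
)

max_missing <- 2  # hypothetical threshold for the toy data

# Count the missing values in each column
missing_counts <- colSums(is.na(toy))

# Keep only the columns with fewer than max_missing missing values
keep_cols <- names(missing_counts)[missing_counts < max_missing]
kept <- toy %>% select(all_of(keep_cols))

names(kept)
```

Here `mostly_empty` has three missing values and is dropped, while the other two columns are kept.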

Combining different datasets

library(countrycode)

happy1 = read_csv("data/final_data_all_country.csv") %>% 
  janitor::clean_names() %>% 
  distinct() %>% 
  mutate(label = str_c("<b>Happiness: ", round(life_ladder, 2), 
                       "</b><br>Country : ", country_name,
                       sep = ""),
         # Match each country name to its ISO 3166-1 alpha-3 code
         code = countrycode(country_name, 'country.name', 'iso3c'),
         # Names that cannot be matched fall back to the code "XKX"
         code = replace_na(code, "XKX")) %>% 
  rename("country" = country_name,
         "happiness_score" = life_ladder,
         "gdp" = log_gdp_per_capita,
         "life_expectance" = healthy_life_expectancy_at_birth,
         "freedom" = freedom_to_make_life_choices,
         "corruption" = perceptions_of_corruption) 



location = read_csv("data/concap.csv") %>% 
  janitor::clean_names() %>% 
  select(country_name,capital_latitude,capital_longitude) %>% 
  rename("country" = country_name)

happy_location_raw = left_join(happy1, location, by = "country")

# Find the rows whose capital coordinates could not be matched
missing_rows <- which(is.na(happy_location_raw$capital_latitude))

# Fill in the capital coordinates by hand for the unmatched countries
happy_location_lost =  
  happy_location_raw[missing_rows, ] %>% 
  mutate(capital_latitude = case_when(
    country == "Congo (Brazzaville)" ~ -4.267778,
    country == "Ivory Coast" ~ 6.85,
    country == "Gambia" ~ 13.466667,
    country == "North Cyprus" ~ 35.183333,
    country == "Palestinian Territories" ~ 31.516667,
    country == "Taiwan Province of China" ~ 25.066667,
    country == "Somaliland region" ~ 9.55,
    country == "Congo (Kinshasa)" ~ -4.316667,
    country == "Hong Kong S.A.R. of China" ~ 22.2
    ),
    capital_longitude = case_when(
    country == "Congo (Brazzaville)" ~ 15.291944,
    country == "Ivory Coast" ~ -5.3,
    country == "Gambia" ~ -16.6,
    country == "North Cyprus" ~ 33.366667,
    country == "Palestinian Territories" ~ 34.45,
    country == "Taiwan Province of China" ~ 121.516667,
    country == "Somaliland region" ~ 44.05,
    country == "Congo (Kinshasa)" ~ 15.316667,
    country == "Hong Kong S.A.R. of China" ~ 114.1)
  ) 

# Append the hand-filled rows, then drop the rows that still lack coordinates
happy_location = 
  bind_rows(happy_location_raw, happy_location_lost) %>% 
  drop_na(capital_latitude, capital_longitude)
write.csv(happy_location, file = "data/happy_location.csv", quote = FALSE, row.names = FALSE)

In addition, I combined the happiness data with a dataset of capital latitudes and longitudes and with a suicide dataset, in order to draw an animated global map and analyze the relationship between happiness and suicide rates.
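
Country names often differ between sources, which is why some capitals fail to match in the join. A minimal sketch of how the unmatched names could be listed with `anti_join()` follows. The tibbles `happy_toy` and `location_toy` are hypothetical, with made-up happiness scores and approximate coordinates, and only stand in for the real data.

```r
library(dplyr)

# Toy happiness data (made-up scores)
happy_toy <- tibble(
  country         = c("France", "Ivory Coast", "Gambia"),
  happiness_score = c(6.6, 5.0, 4.5)
)

# Toy capital locations that spell two of the names differently
location_toy <- tibble(
  country           = c("France", "Cote d'Ivoire", "The Gambia"),
  capital_latitude  = c(48.87, 6.82, 13.45),
  capital_longitude = c(2.33, -5.27, -16.57)
)

# Rows of happy_toy with no partner in location_toy
unmatched <- anti_join(happy_toy, location_toy, by = "country") %>%
  distinct(country)

unmatched$country
```

Running a check like this before the join gives the list of countries whose coordinates need to be filled in or whose names need to be harmonized.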

Adding new variables

Development level

I classified the countries as developed or developing using a criterion that some economists prefer: a country with a per capita GDP of at least $25,000 is considered developed.

happy = 
  read_csv("data/final_data_all_country.csv") %>% 
  rename( "country" = country_name,
           "happiness_score" = life_ladder,
           "gdp" = log_gdp_per_capita,
           "life_expectance" = healthy_life_expectancy_at_birth,
           "freedom" = freedom_to_make_life_choices,
           "corruption" = perceptions_of_corruption)  %>% 
  unique() %>% 
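  # gdp is stored on the log scale, so exponentiate to recover GDP per capita
  mutate(o_gdp = exp(gdp)) %>%
  # At least $25,000 per capita counts as developed
  mutate(develop = ifelse(o_gdp >= 25000, "developed", "developing"))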
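
Because the gdp variable is on the log scale, the $25,000 criterion can also be checked directly against the log of the threshold. The sketch below shows that both routes give the same labels. It assumes gdp is a natural log (consistent with the use of `exp()` above) and uses made-up values.

```r
# Log GDP per capita (made-up values)
gdp <- c(9.5, 10.0, 10.2, 11.0)

threshold <- 25000
log(threshold)  # about 10.13

# Route 1: convert back to dollars, then compare
develop_from_level <- ifelse(exp(gdp) >= threshold, "developed", "developing")

# Route 2: compare on the log scale
develop_from_log <- ifelse(gdp >= log(threshold), "developed", "developing")

identical(develop_from_level, develop_from_log)
```

In other words, a country counts as developed under this criterion once its log GDP per capita reaches roughly 10.13.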

Continent

Asia = 
  c("Israel", "United Arab Emirates", "Singapore", "Thailand", "Taiwan Province of China",
    "Qatar", "Saudi Arabia", "Kuwait", "Bahrain", "Malaysia", "Uzbekistan", "Japan",
    "South Korea", "Turkmenistan", "Kazakhstan", "Turkey", "Hong Kong S.A.R. of China", "Philippines",
    "Jordan", "China", "Pakistan", "Indonesia", "Azerbaijan", "Lebanon", "Vietnam",
    "Tajikistan", "Bhutan", "Kyrgyzstan", "Nepal", "Mongolia", "Palestinian Territories",
    "Iran", "Bangladesh", "Myanmar", "Iraq", "Sri Lanka", "Armenia", "India", "Georgia",
    "Cambodia", "Afghanistan", "Yemen", "Syria")

Europe = 
  c("Norway", "Denmark", "Iceland", "Switzerland", "Finland",
    "Netherlands", "Sweden", "Austria", "Ireland", "Germany",
    "Belgium", "Luxembourg", "United Kingdom", "Czech Republic",
    "Malta", "France", "Spain", "Slovakia", "Poland", "Italy",
    "Russia", "Lithuania", "Latvia", "Moldova", "Romania",
    "Slovenia", "North Cyprus", "Cyprus", "Estonia", "Belarus",
    "Serbia", "Hungary", "Croatia", "Kosovo", "Montenegro",
    "Greece", "Portugal", "Bosnia and Herzegovina", "Macedonia",
    "Bulgaria", "Albania", "Ukraine")

North_America = 
  c("Canada", "Costa Rica", "United States", "Mexico",  
    "Panama","Trinidad and Tobago", "El Salvador", "Belize", "Guatemala",
    "Jamaica", "Nicaragua", "Dominican Republic", "Honduras",
    "Haiti")

South_America =  
  c("Chile", "Brazil", "Argentina", "Uruguay",
    "Colombia", "Ecuador", "Bolivia", "Peru",
    "Paraguay", "Venezuela")

Australia = 
  c("New Zealand", "Australia")

happy = happy %>% 
  mutate(continent = case_when(
    country %in% Asia ~ "Asia",
    country %in% Europe ~ "Europe",
    country %in% North_America ~ "North America",
    country %in% South_America ~ "South America",
    country %in% Australia ~ "Australia",
    # Every country not listed above is labelled Africa
    TRUE ~ "Africa"
  ))

write.csv(happy,file="data/happy.csv",quote=F,row.names = F)

I assigned each country to one of six continents: Asia, Europe, North America, South America, Australia, and Africa. Any country that does not appear in one of the five lists above is labelled Africa by default.
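
Since the default label is applied to every unlisted country, a name that is missing from the lists or spelled differently would be labelled Africa without any warning. One possible check, sketched below, is to leave the default as missing first and list the countries that fall through. The vectors `Asia_toy` and `Europe_toy` and the tibble `countries` are hypothetical and much shorter than the real lists.

```r
library(dplyr)

# Shortened, hypothetical continent lists
Asia_toy   <- c("Japan", "India")
Europe_toy <- c("France")

countries <- tibble(country = c("Japan", "France", "Kenya", "Laos"))

# Use NA as the default so that unlisted countries stay visible
labelled <- countries %>%
  mutate(continent = case_when(
    country %in% Asia_toy   ~ "Asia",
    country %in% Europe_toy ~ "Europe",
    TRUE                    ~ NA_character_
  ))

# Countries that would otherwise receive the default label
unassigned <- labelled %>%
  filter(is.na(continent)) %>%
  pull(country)

unassigned
```

In this toy case the check returns Kenya and Laos, which makes it easy to see that Laos would need to be added to the Asian list before the default is applied.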

suicide = read_csv("data/sucide.csv") %>% 
  select(country,year,sucide_rate = `suicides/100k pop`) %>% 
  filter(year >= 2007) %>% 
  arrange(desc(year))
write.csv(suicide, file = "data/suicide_clean.csv")