Rock and roll deaths

Leading causes of death

Dataset

Here we examine data from [wikipedia - Lists of deaths in rock and roll] (https://en.wikipedia.org/wiki/List_of_deaths_in_rock_and_roll) dumped to csv using wikitables. The following cleaning was applied: Age ranges set to lower value, multi deaths split to line each, formatted and corrected date column, location column edited to give consistent country data. The final dataset is here: DEATH/list_of_deaths_in_rock_and_roll_cleaned.csv.gz.

Column	Type	Original	Description
Name	str	y	Name, and affiliation
Age	int?	y	Age at death
Date	str	y	Date string
Location	str	~	Place of death
Cause of death	str	y	Death cause
Clean Age	int	n	int
Clean Date	str	n	YYYY-MM-DD

The word cloud above was generated using the Cause of death column in the dataset.

Setup environment and load data

%matplotlib inline

# from functools import reduce
import datetime
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import tabulate
from IPython.core.display import display, HTML

# Helpers
def disp_df(df):
    """Show html table with dataframe"""
    display(HTML(tabulate.tabulate(df, headers=df.columns, tablefmt='html')))

# Read data
death = pd.read_csv("DEATH/list_of_deaths_in_rock_and_roll_cleaned.csv")
# Overwrite with clean values
death["Age"] = death["CleanAge"]
death["Date"] = pd.to_datetime(death["CleanDate"])
# Drop unused columns
death = death[["Name", "Age", "Date", "Location", "Cause of death"]]
# Add column with year of birth
death["Born"] = death["Date"].dt.year - death["Age"]
# Set index to date of death
death.index = death['Date']
# Add year, month
death["Year"] = death["Date"].dt.year
death["Month"] = death["Date"].dt.month.fillna(-1).astype(int)
death["WeekDay"] = death["Date"].dt.weekday.fillna(-1).astype(int)
# Add age and born counted in decades
def decadize(col):
    return ((col // 10)*10).fillna(-1).astype(int)
death["Decade"] = decadize(death["Date"].dt.year)
death["AgeDecade"] = decadize(death["Age"])
death["BornDecade"] = decadize(death["Born"])
# Drop data with invalid date
death = death[death["Decade"] > 0]

Date of death

ax = death["Year"].plot.hist(
    figsize=(15, 5),
    bins=int(death["Year"].max() - death["Year"].min()),
    edgecolor="none",
    title="Number of deaths per year")
_ = ax.set_xlim(1950, 2016)
_ = ax.set_xlabel("Year")
_ = ax.set_ylabel("Number of deaths")

png

The data very clearly increases fast with time. Still, even an exponential trend would not have predicted the amount seen in 2016. Let’s take a look at how when in the year deaths occur has changed with time. If there is no effect we expect each month to have 1/12 (8.3%) of deaths.

ax = death.groupby(["Year", "Month"]).size().unstack(level=-1).apply(
    lambda x: x * 100. / x.sum(), axis=1).plot.area(
        stacked=True,
        cmap="plasma",
        figsize=(15, 5),
        title="Monthly proportion of deaths versus year of death")
plt.legend(loc='center left', bbox_to_anchor=(1.0, 0.5))
ax.set_ylim(0, 100)
_ = ax.set_ylabel("Proportion [%]")
_ = ax.set_xlabel("Year")

png

In the plot it is clear that early data has low statistics and thus is very noisy. Still in later there are on occasion strong peaks and valleys. The largest jumps seems to occur around the shift of the year. Now, we examine deaths by month and weekday, broken down by decade instead.

ax = death.groupby(["Month", "Decade"]).size().unstack(level=-1).plot.bar(
    stacked=True,
    width=1.0,
    figsize=(15, 5),
    cmap="plasma",
    title="Deaths / month by decade of death")
plt.legend(loc='center left', bbox_to_anchor=(1.0, 0.5))
_ = ax.set_ylabel("Number of deaths")
_ = ax.set_xlim(-0.5, 11.5)

ax = death.groupby(["WeekDay", "Decade"]).size().unstack(level=-1).plot.bar(
    stacked=True,
    width=1.0,
    figsize=(15, 5),
    cmap="plasma",
    title="Deaths / week day by decade of death")
plt.legend(loc='center left', bbox_to_anchor=(1.0, 0.5))
_ = ax.set_ylabel("Number of deaths")
_ = ax.set_xlim(-0.5, 6.5)

png

Clearly deaths occured most in Jan/Dec and July, with very few in May and Sept. For Jan the increase seems extra strong during the 2010s. Most days occur on Fridays, and the least deaths on Saturdays.

Age at death

To get an overview, we break the data down by age in decades at time of death, and then break each of those by birth decade. This gives a simultaneous, indicates most common birth decades as well as most common age.

ax = death[death["Decade"]>0].groupby(["AgeDecade", "BornDecade"]).size().unstack(
    level=-1).plot.bar(stacked=True,
                       cmap="plasma",
                       figsize=(15, 5),
                       width=1.0,
                       title="Number of deaths by age and birth decade")
_ = ax.set_ylabel("Number of deaths")
_ = ax.set_xlabel("Age at death")

png

From the colors we see most people were born in the 1940’s and 50’s, and from the bars that most people died around 60 years of age. We also see quite few teenage and very few centennial deaths. Lets zoom in to death by age under 50.

ax = death[death["Age"] < 51].groupby(["Age", "BornDecade"]).size().unstack(
    level=-1).plot.bar(
        stacked=True,
        cmap="plasma",
        figsize=(15, 5),
        width=1.0,
        title="Number of deaths by age (under 50) and birth decade")
_ = ax.set_ylabel("Number of deaths")
_ = ax.set_xlabel("Age at death")

png

Here we see the famous club 27 peak, with disproportionately many deaths at 27 years of age. By comparing decade of birth to neighboring bars, it is clear that the biggest contribution is from people born in the 1940s. Not until 45 years of age is death this high again.

Location

The last comma separated entry in Location field should be country, it is invalid for many entries. We clean the data and extract the top ten most common countries.

death['Country'] = death["Location"].str.strip().str.lower().str.split(",").str[
    -1].str.strip().replace('', np.nan)
d10 = death['Country'].dropna().value_counts().head(10)
disp_df(pd.DataFrame(d10))

	Country
usa	1309
england	146
canada	30
australia	22
france	17
germany	17
sweden	14
spain	12
japan	8
mexico	7

The data set is highly skewed towards the USA.

Cause of death

death["Cause"] = death["Cause of death"].str.strip().str.lower()
c10 = death["Cause"].dropna().value_counts().head(10)
disp_df(pd.DataFrame(c10))
cancer = death[death["Cause"].str.contains("cancer").fillna(False)]["Cause"].value_counts().head(10)
disp_df(pd.DataFrame(cancer))

	Cause
heart attack	176
cancer	125
lung cancer	79
heart failure	53
car accident	50
natural causes	38
stroke	32
pneumonia	31
suicide	30
heroin overdose	26

	Cause
cancer	125
lung cancer	79
prostate cancer	23
liver cancer	22
pancreatic cancer	14
breast cancer	8
throat cancer	7
stomach cancer	7
esophageal cancer	7
colon cancer	6

The most common causes of deaths are cancers (341) and heart attacks. Most cancers are unspecified, of the specified lung cancer is the dominating cause.