A link to the GitHub repo for this project is located at the end.
When the last season finale of The Walking Dead ended with the death of an unknown character, mass speculation ensued. The fans’ theories about who was killed have been based on everything from details in the graphic novel to the shooting schedules of the actors. In this post, I’ll take a different approach and use data to see if it can help correctly identify the victim.
Getting to the Data
One of the great things about The Walking Dead is its large fan base and the accompanying fan wiki. Fan wikis in general are a great resource for data because they are often detailed, accurate, up to date, and--most of all--easy to scrape. The Walking Dead wiki contains a page with the season each character debuted, their current status (alive, dead, or undead), and several other demographic markers.
Using Python and BeautifulSoup makes scraping the fan wiki easy. For this project, I start by looping through each character’s page and collecting the following information: name, age, gender, ethnicity, and series lifespan. Once I’ve collected this data, I clean up the age (rounding down to the nearest 10) and simplify the ethnicity to the following races: Asian, Black, Hispanic, and White. I also calculate each character’s series age, which is the number of episodes they were alive for (e.g., if they were introduced and killed in the same episode, they would have a series age of 1). Finally, I create a variable titled season_birth, which is the season they were introduced in. A common theory amongst Walking Dead fans is that characters from season 1 tend to live longer than those who joined the cast in later seasons.
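As a rough sketch of the cleanup step described above (not the code from the repo), the helpers below round an age down to the nearest 10 and compute a series age. The field names, the sample record, and the assumption that episodes are numbered consecutively across the whole series are all mine, for illustration only.

```python
def age_bucket(age):
    """Round an age down to the nearest 10 (e.g. 37 -> 30)."""
    return (age // 10) * 10


def series_age(first_episode, last_episode):
    """Number of episodes a character was alive for, counting both ends.

    Assumes episodes are numbered consecutively across the series, so a
    character introduced and killed in the same episode gets 1.
    """
    return last_episode - first_episode + 1


def clean(record):
    """Turn one raw (hypothetical) scraped record into model-ready fields."""
    return {
        "name": record["name"],
        "age": age_bucket(record["age"]),
        "series_age": series_age(record["first_episode"], record["last_episode"]),
        "season_birth": record["first_season"],
    }


# Invented record, not taken from the wiki.
raw = {
    "name": "Example Character",
    "age": 37,
    "first_episode": 12,
    "last_episode": 12,
    "first_season": 2,
}
cleaned = clean(raw)
print(cleaned)
```

Counting both endpoints is what makes the "introduced and killed in the same episode" case come out to 1 rather than 0.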
With this data collected and some variables generated, let's start doing some basic data profiling.
A Quick Look
One of the first things I want to look at is a set of boxplots of the characters’ series lifespan across the different demographics. This will give us an idea of whether age or gender plays a role in survival.
It appears that women in this post-apocalyptic world tend to do better than men. They have a median lifespan that is almost twice as high as their male counterparts (10 episodes to 6). One thing to note about this graph is that there are actually 5 men that are in those two dots at the top.
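The medians quoted here are the center lines of the boxplots. A minimal sketch of computing a median per group follows. The (gender, series age) pairs are invented and are not the wiki data, so the numbers it prints will not match the ones above.

```python
from collections import defaultdict
from statistics import median

# Invented (gender, series_age) pairs for illustration only.
characters = [
    ("F", 4), ("F", 12), ("F", 20),
    ("M", 2), ("M", 5), ("M", 9), ("M", 30),
]


def median_by_group(rows):
    """Group values by their label and return the median of each group."""
    groups = defaultdict(list)
    for group, value in rows:
        groups[group].append(value)
    return {group: median(values) for group, values in groups.items()}


medians = median_by_group(characters)
print(medians)
```

The median is a sensible summary here because a couple of very long-lived characters (the dots at the top of the plot) would pull a mean upward.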
One of the first things that stands out is how few data points there are for Asian and Indian characters. Of the races that do have data, Hispanics have the lowest median series lifespan of 5 episodes.
I think it's interesting that in the 20-year-old group, not one character who has appeared in over 14 episodes has died.
Also, who would have guessed that 70-year-olds have the highest median series lifespan (excluding the 10 and under group, which is only Judith)? Of course, we're only looking at three data points: Hershel Greene, Natalie Miller, and Bob Miller.
Ok, it's time to see what some modeling can do for us.
Making the Model
For this project, I'm going to use a decision tree and train it to predict the series age of a character based on their race, gender, age, and season they premiered in. Essentially, I'm making a model to predict how many episodes a character should last on the show; then I will see who of our 11 possible victims has outlived their predicted value by the longest time. For example, if the model says that Abraham should have a series age of 30 episodes but we know he's been in 50 episodes, then he has an excess life span of 20 episodes. If no one else has an excess lifespan that high, then the model is pointing to him as Negan's victim.
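To make the "excess life" idea concrete, here is a small sketch. Abraham's numbers come from the example above (predicted 30, actual 50). The other two candidates and their values are placeholders I made up so there is something to compare against.

```python
def excess_life(series_age, predicted):
    """Episodes a character has survived beyond the model's prediction."""
    return series_age - predicted


# name -> (actual series age, predicted series age)
candidates = {
    "Abraham": (50, 30),      # from the example in the text
    "Candidate B": (40, 35),  # hypothetical placeholder
    "Candidate C": (22, 25),  # hypothetical placeholder (died "early")
}

scores = {name: excess_life(actual, pred) for name, (actual, pred) in candidates.items()}

# The character furthest past their predicted value is the model's pick.
pick = max(scores, key=scores.get)
print(scores, pick)
```

A negative score simply means the character has not yet reached the series age the model expects.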
For the decision tree, I set the minimum leaf value to 3, meaning that a terminal node has to have three or more characters in it. I'm also only feeding the model data on characters who have already died. For example, Michonne has been in 65 episodes so far, but that could go up to 100 or stop at 66. We don't know. The jury is still out on how long she and other characters will live, so that's why I'm holding that data off to the side for now.
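The sketch below illustrates the two choices in this paragraph: training only on characters who have died, and refusing any split that would leave a leaf with fewer than three characters. It is a deliberately simplified, single-split regression "stump" on one feature (season_birth) written in plain Python, not the actual model or library used for the post, and every row of data is invented.

```python
from statistics import mean

MIN_LEAF = 3  # each terminal node must hold at least this many characters

# Invented rows: (season_birth, series_age, status)
rows = [
    (1, 30, "dead"), (1, 42, "dead"), (2, 25, "dead"), (3, 28, "dead"),
    (4, 6, "dead"), (5, 3, "dead"), (5, 8, "dead"), (6, 2, "dead"),
    (1, 65, "alive"), (4, 20, "alive"),
]

# Train only on characters whose final series age is known.
train = [(season, age) for season, age, status in rows if status == "dead"]


def sse(values):
    """Sum of squared errors around the mean of the values."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def best_split(data, min_leaf=MIN_LEAF):
    """Find the season threshold that minimizes squared error.

    Returns (threshold, left_mean, right_mean), or None if no split
    leaves at least min_leaf characters on both sides.
    """
    best = None
    for threshold in sorted({season for season, _ in data}):
        left = [age for season, age in data if season <= threshold]
        right = [age for season, age in data if season > threshold]
        if len(left) < min_leaf or len(right) < min_leaf:
            continue  # this split would create a leaf that is too small
        cost = sse(left) + sse(right)
        if best is None or cost < best[0]:
            best = (cost, threshold, mean(left), mean(right))
    return None if best is None else best[1:]


def predict(split, season):
    """Predict series age as the mean of the leaf the season falls into."""
    threshold, left_mean, right_mean = split
    return left_mean if season <= threshold else right_mean


split = best_split(train)
print(split, predict(split, 1), predict(split, 6))
```

A full decision tree repeats this search recursively inside each side and across all features. The minimum leaf size is what stops it from carving out a node for a single unusual character.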
Creating the model produces the following tree diagram and table of predictions:
The first variable our model splits on is what season the characters premiered. In this case, characters that started in season 3 or earlier are on the left and the others are on the right. You can see that the left side is much darker, indicating that people who premiered before season 4 live longer.
| Name | Series Age | Tree Prediction | Excess Life |
All right, so we have our model and the data has spoken. According to these results, Rick Grimes is the character who is most likely to have been Negan’s victim.
A Grain of Salt
Before we take these results to Vegas, there are a couple of things we should consider. One of the main problems is that there isn't that much data to go off of. Sure, The Walking Dead has a large cast by most TV series standards; the wiki site has pages on 200+ characters. However, this isn’t that much data when it comes to modeling. If your model is trying to use age and race, that data gets subdivided. In the case of Glenn, you would want to look at other Asian males in their mid-twenties, of which there have only been two: Jiro, who lasted only a single episode, and Kal, who is still alive, so we don't really know how many episodes he will last. This brings up another point: we're trying to predict the life expectancy of characters, but many of them are still alive, so we don't know how long they will live. By only including data on those who have died, we're skewing the prediction down.
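A tiny numeric sketch of that last point, using invented numbers: the survivors tend to be the long-running characters, so leaving them out lowers the average. Even counting them at their current episode totals (which can only grow) gives a higher figure than the dead-only average.

```python
from statistics import mean

# Invented values for illustration only.
dead = [1, 4, 6, 12]     # final series ages of characters who have died
alive_so_far = [40, 65]  # current episode counts; true final values are >= these

dead_only = mean(dead)
lower_bound_all = mean(dead + alive_so_far)

print(dead_only, round(lower_bound_all, 2))
```

The second number is itself only a lower bound on the true average, since the living characters' counts are still rising.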
That being said, what can we salvage? If we go back to the box plot of the characters’ life spans, we can see that no character who has appeared in more than 52 episodes has died. It's as if they enter an “unkillable zone” fueled by fan popularity (e.g., if Daryl dies, we riot). So what if we just look at those who are below this threshold? That leaves Eugene, Rosita, Abraham, and Aaron. In this group, Eugene and Abraham are practically tied with excess life values of 35.33 and 35.29 episodes. The difference of 0.04 episodes amounts to about a minute and a half of screen time.
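The screen-time figure can be checked with quick arithmetic. The two excess life values come from the text. The 42-minute episode length is my assumption (roughly an hour-long slot minus ads), so the result is only approximate.

```python
EPISODE_MINUTES = 42  # assumption: approximate runtime of one episode

# Excess life values quoted in the text.
excess = {"Eugene": 35.33, "Abraham": 35.29}

gap_episodes = excess["Eugene"] - excess["Abraham"]
gap_minutes = gap_episodes * EPISODE_MINUTES

print(round(gap_episodes, 2), round(gap_minutes, 1))
```

Under that assumption the gap works out to a bit over a minute and a half, which is far too small to separate the two characters.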
So there we have it: either Eugene or Abraham. Stepping away from the data for a moment, I would probably guess that Abraham is the one who’s killed. Or maybe all of this is wrong, and it’s Glenn because that’s more true to the comics. Either way, we’ll just have to wait until October 23 to find out.