A year ago, when I turned 30, I decided to start noting down the films and television that I watch.
“Now have I seen that before?”
“Was that film any good?”
“Which season did I get up to in that show?”
No longer would I wrack my brain over these questions! You might think they suggest I’ve got a pretty terrible memory — and you’d be right. However, I’m a millenial: if it’s not written down, I’m not retaining it.
In this past year, I’ve actively watched a staggering 107 films and 372 episodes of TV series (and that’s not even counting the passive Friends-on-in-the-background). Naturally, as an inquisitive data analyst, I wanted to do more with that data once I’d recorded it. With the help of some open IMDb datasets, I’ve visualised the data to try to answer some questions.
In this post I’ll talk a bit about the data collection, the data transformation, the data visualisation and what I’ve found out. Further detail for some of the steps can be found on GitHub, including all of the code should you wish to do something similar yourself.
Recording the Data
I created a Google Sheet with two tabs: one for films and one for TV series. I recorded the data by filling these in shortly after I watched each ‘thing’. The GitHub readme outlines exactly what each column means.
In the films tab, each row represents a film that I’ve watched.
In the series tab, each row represents a season of a TV series that I’ve watched. I didn’t record individual episodes watched (that felt like overkill) and this has some implications later on. Nor did I give the series/seasons a rating; I didn’t feel I could meaningfully rate an entire season!
I’ve since discovered that both Netflix and Prime can provide your viewing history, potentially deeming my own recordings redundant. However, these aren’t provided in the easiest format to work with (for example, Prime’s can’t be exported as a csv) and this still wouldn’t allow for capturing my ratings, or for recording when I watched a DVD.
Transforming the Data
I used R to import my own ‘personal’ watch data, combine it with the IMDb datasets and reshape it into a structure suitable for visualisation. This process primarily used functions from R’s tidyverse libraries.
For each film or show, the IMDb data was used to obtain the release year, the runtime and the public rating. I did explore some of the other data available, such as a film’s director or who starred in a particular episode of a TV series, but I’ve abandoned that for the time being.
The main goal of the ‘reshaping’ was to turn my two Google Sheet tabs (and the relevant IMDb data) into a single table, in which each row represents either a film or an episode that I’ve watched. This combined table was written back to the same Google Sheet as a third tab.
Film and Series Names
When I recorded the data I used shorthand for the names of films and series, providing enough information to be able to find the IMDb ID afterwards but not worrying too much about formatting. Once joined to the IMDb data, I was able to use the ‘proper’ names that IMDb provides: my ‘Harry Potter 7’ becomes ‘Harry Potter and the Deathly Hallows: Part 1’.
My recorded TV series data was at season-level (1 row per season) whereas the output here is at episode-level (1 row per episode). One of the IMDb datasets contains a list of episodes for each series and season, and so a left join (on ‘series name’ and ‘season number’) turns my 1 row for ‘The Office Season 6’ into 26 rows. If I haven’t finished that particular season yet, I use my ‘Latest Episode’ column to remove rows for episodes that I haven’t seen.
TV Episode Watch Dates
Since I only recorded the dates on which I started and finished a given season of a series, I had to estimate the date on which I watched each individual episode using interpolation: given that I know the date on which I watched the first and last episodes, I spread the remaining episodes as equally as possible in between those dates.
This means that my watch dates for (most) TV episodes are inaccurate, and so certain analyses (e.g. on which days of the week do I watch the most TV?) will be invalid. However, some analyses (e.g. in which months do I watch the most TV?) will be broadly unaffected.
TV Episode Release Years
Disappointingly, the IMDb datasets only provide the start year and end year for an entire TV series, and not the release years for individual seasons or episodes. Therefore I had to estimate the year in which each episode (or season) was released.
For this I took a similar approach to the interpolation above: given that I know the years on which the first and last seasons were released, I spread the remaining seasons as equally as possible in between those years.
Visualising the Data
To uncover interesting patterns I wanted to visualise the data. I considered a few different tools for this:
I opted to use Tableau. It’s drag-and-drop, making it incredibly quick and useful for ‘exploring’ data, and it is free for anyone to build their own charts and share them publicly. You can read data in from any Google Sheet but you can’t do much more ‘transformation’ once the data is in Tableau. However, that didn’t cause too many problems. The process of visualising the data in Tableau was straightforward:
- Download and open Tableau Public
- Create a Data Source — effectively a table linked to the relevant Google Sheet tab
- Tweak the data types for some of the columns
- Create measures and define their formats, e.g. Hours Watched = sum([Runtime])/60
- Make some charts and add them to a dashboard
- Publish the dashboard(s) to Tableau Public
Note that using Tableau via Tableau Public is a bit clunky — you don’t save any files locally, and so each time you want to save your work you have to publish it to the Tableau Public server. It would be slicker with a Tableau licence and the standard Tableau desktop app.
However, it’s still reproducible: if somebody wants to download the Tableau Public workbook from the server, open it in Tableau Public, and repoint the data source at their own Google Sheet, then (in theory) it should update with all of their data. Mine isn’t particularly polished, and there’s no code for me to easily show or justify any decisions I’ve made, but if somebody wanted to see how the measures are defined then they could do that once they’ve opened the workbook.
I like a long film
I tended to give longer films higher ratings than shorter films. There’s probably an underlying factor here: I’m less likely to commit to a 3-hour epic unless I’m sure I’ll really like it!
IMDb ratings aren’t a great indicator for me
There was some correlation between my ratings and IMDb ratings — the orange line shows an increasing median IMDb rating as my rating increases — but not a great one. Within each of my own rating categories, there was still a pretty large spread of IMDb ratings.
I won’t necessarily enjoy a film just because it has a good IMDb rating, and conversely I may well enjoy a film even if it has a bad IMDb rating (Eurovision, I’m looking at you). It probably didn’t help that I gave almost every Star Wars film a 5…
I’m conscious that the visualisation is pretty naff if you’re trying to pick out a single film! However, having experimented with a variety of ways of showing the relationship between these two variables — whilst also maintaining the ability to ask “oh which film was that?” — I decided that this did the job well enough. I did jitter the IMDb ratings to mitigate the overlap, which helped a bit.
I’m more likely to watch an old film than an old series
The vast majority of our series watchtime was spent on TV from 2010 onwards. While 2010 was the top decade for film watchtime, it had a more prominent tail through the 00s, 90s and 80s than series did.
The US Office wasn’t as all-consuming as I thought
Having watched all of the US Office in 2020, it still feels as though it’s pretty much all we’ve watched in the past year. However, at the tracking start point — May 2020 — we were mid-way through. It turns out that the US Office only makes up 16% of our total series watchtime in the past year.
Despite watching 21 different series this year, over 50% of that time has been spent on just 4 series.
The 80s produced the best films
I was most likely to rate a film 5 if it was released in the 80s. The 70s did pretty well too, with all of the films getting a rating of 4 or 5.
The 2020s is an interesting one, with over 50% of watched films getting a rating of just 2 or 3. I think we may watch a new film just because it’s new, whereas for “not new” films we’re more likely to be selective over its quality.
A very small sample size here (2 films!) but it appears I’m not a big fan of the 50s…
Saturday night is (the favoured) movie night
This wasn’t a surprise, but Saturday was the most popular day of the week to watch a film.
What did surprise me was the number of films we watched on a Sunday. More, even, than on a Friday. Perhaps trying to fool ourselves into thinking it’s not a school night…
School nights are comfort-viewing nights
Whilst most films were films I was watching for the first time, I was much more likely to choose a trusty safe favourite on a school night than on a weekend night.
There’s probably interesting stuff to be found amongst the writer/director IMDb datasets, as well as in the IMDb genre (which is currently a comma-separated list of genres per film/episode, so needs a bit of work doing).
I plan to continue tracking my watch history so that I can do some year-on-year comparisons next year. Given that the past year has been quite abnormal, I’d certainly hope that my viewing levels drop!
Suggested improvements or other comments are very welcome.