Do male IMDb users dislike Simpsons episodes with more Lisa screentime?

and is it specifically young men?

Max Bo, Jamie Rosenberg

This notebook investigates two hypotheses that my good friend Jamie Rosenberg pitched to me one Friday evening. Fortunately, he had legitimately acquired a large amount of Simpsons ratings data from IMDb before they removed their per-age-and-gender cohort ratings on 2020-08-22.

The Golden Age

Firstly, we must ask: which episodes? The Simpsons has a well-documented “golden age”, commonly regarded as ending at episode 203 (S09E25), Natural Born Kissers:

field value
Production code ``
Episode sequence number ``
Title ``

We can see a fairly obvious partition in quality:

When does Lisa speak?

For every Golden Age episode, let’s take the script and identify every instance where Lisa speaks:

And now as a proportion of all words spoken per episode. Episode (), , has the most Lisa lines (%):

From that, let’s plot the relationship between ratings and the proportion of words spoken by Lisa. Each dot’s y-value is the average rating of an entire cohort.

For example, episode (__), for cohort / has an average rating of . The regression line is NOT fitted against these average cohort ratings, but against “synthetic” data points: one point per vote, at the rating that vote gave. Each dot is scaled by the total vote count for that episode in that cohort.

Stats

model <- lm(rating ~ lisaProportion * age * gender, data = df)

We can conclude that male IMDb users DO like Lisa Simpson. We are, however, unable to conclude that there’s an interaction between age and gender on ratings (i.e., that young men specifically dislike Lisa more).


Implementation

Golden age

Style the episode slider (well, all sliders)

goldenAgeFinalEpisodeIndex = goldenAgeFinalEpisodeSequenceNumber - 1
<style>
  input[type="range"] { accent-color: hsl(0, 100%, 50%); } /* use my red for sliders */
</style>

When did the user specify the “golden age” cutoff?

settledgoldenAgeFinalEpisodeSequenceNumber = settle(viewof goldenAgeFinalEpisodeSequenceNumber)
settledGoldenAgeFinalEpisodeIndex = settledgoldenAgeFinalEpisodeSequenceNumber - 1
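The `settle` helper is imported elsewhere in the notebook, so its implementation isn’t shown here; its job is debouncing, so that downstream cells only recompute once the slider stops moving. A minimal stand-alone debounce (an illustrative sketch, not the notebook’s actual helper) looks like:

```javascript
// Sketch of a debounce helper: `fn` only fires after `delayMs` of quiet.
// Illustration only — not the notebook's actual `settle` implementation.
function debounce(fn, delayMs = 250) {
  let timer;
  return (...args) => {
    clearTimeout(timer); // cancel the pending call from the previous event
    timer = setTimeout(() => fn(...args), delayMs);
  };
}
```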

Lisa proportion

Pull in all the lines:

Drop and rename fields:

lines = metaLines.map((d) => {
  return {
    episode_sequence_number: d.episode_id,
    timestamp_in_ms: d.timestamp_in_ms,
    speaking_line: d.speaking_line,
    character: d.raw_character_text,
    word_count: d.word_count
  };
})

Script episode sequence numbers are 1-indexed:

[...new Set(lines.map((d) => d.episode_sequence_number))].toSorted(
  (a, b) => a - b
)

Filter down to just golden age lines:

goldenAgeLines = lines.filter(
  (d) =>
    d.episode_sequence_number <= settledgoldenAgeFinalEpisodeSequenceNumber &&
    d.speaking_line &&
    !isNaN(d.word_count) &&
    d.word_count < 100_000 // we have 3 weird outliers with massive wordcounts, see below:
)

The outliers:

lines.filter((d) => d.word_count > 100_000)

This is a bit of a hack, but it ensures Lisa always appears first in the Stack transform’s ordering.

sortedByLisaGoldenAgeLines = goldenAgeLines.sort((a, b) => {
  const aIsLisa = a.character === "Lisa Simpson";
  const bIsLisa = b.character === "Lisa Simpson";
  return bIsLisa - aIsLisa;
})

For each episode number, compute the proportion of all words that Lisa speaks.
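The cell that builds `episodeSequenceNumberToLisaProportion` isn’t reproduced above; a sketch of the computation (the function name is mine, and rows are assumed to be shaped like `goldenAgeLines`) is:

```javascript
// Sketch: total words per episode, Lisa's words per episode, then the ratio.
// Assumes rows shaped like the goldenAgeLines cell above.
function lisaProportionsByEpisode(rows) {
  const totalWords = {};
  const lisaWords = {};
  for (const d of rows) {
    const ep = d.episode_sequence_number;
    totalWords[ep] = (totalWords[ep] ?? 0) + d.word_count;
    if (d.character === "Lisa Simpson") {
      lisaWords[ep] = (lisaWords[ep] ?? 0) + d.word_count;
    }
  }
  return Object.fromEntries(
    Object.keys(totalWords).map((ep) => [
      ep,
      (lisaWords[ep] ?? 0) / totalWords[ep]
    ])
  );
}
```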

We need this to sort the y-domain of all our charts:

sortedByLisaProportionLisaEpisodes = Object.entries(
  episodeSequenceNumberToLisaProportion
)
  .toSorted((a, b) => b[1] - a[1])
  .map((d) => Number(d[0]))

Pull in the IMDb ratings:

ratings = FileAttachment("all_ratingsbydemo_20200824_115720.tsv").tsv({
  typed: true
})

Our IMDb data set doesn’t have episode numbers, only production codes. So we need to derive a production code -> episode number mapping:

productionCodeToEpisodeSequenceNumber = {
  const mapping = {};
  [...new Set(ratings.map((d) => d.seNum))].toSorted().forEach((seNum, i) => {
    mapping[seNum] = i + 1;
  });

  return mapping;
}

We also construct a lookup table from episode sequence number to episode metadata (we’re mainly interested in the episode title):

episodeSequenceNumberToMetadata = {
  const acc = {};

  for (const d of episodeMetadata) {
    acc[d.id] = d;
  }

  return acc;
}

Join

Now we join our IMDb ratings, metadata and the information we gathered from the scripts. We also:

  1. Derive a “cohort” which we use to key a bunch of computations later on
  2. Filter the age cohorts to those that we care about

AGE_DOMAIN = new Set(["all", "aged_under_18", "aged_18_29", "aged_30_44", "aged_45_plus"])
GENDER_DOMAIN = new Set(["males", "all", "females"])

tidy = ratings
  .map((d) => {
    return {
      ...d,
      productionCode: d.seNum,
      episodeSequenceNumber: productionCodeToEpisodeSequenceNumber[d.seNum]
    };
  })
  .map((d) => ({
    episodeSequenceNumber: d.episodeSequenceNumber,
    productionCode: d.productionCode,
    ...parseSeasonAndEpisode(d.productionCode),
    lisaProportion:
      episodeSequenceNumberToLisaProportion[
        productionCodeToEpisodeSequenceNumber[d.productionCode]
      ],
    rating: d.rating,
    votes: d.votes,
    title: episodeSequenceNumberToMetadata[d.episodeSequenceNumber]?.title,
    gender: d.cat_b === "gender" ? d.cat_c : d.cat_b,
    age: d.cat_b === "gender" ? "all" : d.cat_c,
    cohort:
      d.episodeSequenceNumber + "|" + d.cat_a + "|" + d.cat_b + "|" + d.cat_c
  }))
  .filter((d) => AGE_DOMAIN.has(d.age) && GENDER_DOMAIN.has(d.gender))
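The gender/age derivation handles two row shapes in the raw TSV: gender-only rows (`cat_b === "gender"`) and gender×age rows. These shapes are inferred from the transform above; a small demonstration:

```javascript
// Fold cat_b/cat_c into gender + age, same logic as the tidy pipeline.
// Row shapes inferred from the transform above.
const deriveCohortFields = (d) => ({
  gender: d.cat_b === "gender" ? d.cat_c : d.cat_b,
  age: d.cat_b === "gender" ? "all" : d.cat_c
});

const genderOnly = deriveCohortFields({ cat_b: "gender", cat_c: "males" });
// → { gender: "males", age: "all" }
const genderByAge = deriveCohortFields({ cat_b: "males", cat_c: "aged_18_29" });
// → { gender: "males", age: "aged_18_29" }
```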

Compute each cohort’s vote-weighted average rating and total vote count:

cohortToAverageRatingAndTotalVotesPerEp = {
  const cohortTotalVotes = {};
  const cohortTotalScore = {};
  const cohortAverageRating = {};

  for (const d of tidy) {
    cohortTotalVotes[d.cohort] = (cohortTotalVotes[d.cohort] ?? 0) + d.votes;
    cohortTotalScore[d.cohort] =
      (cohortTotalScore[d.cohort] ?? 0) + d.votes * d.rating;
  }

  for (const cohort of Object.keys(cohortTotalScore)) {
    cohortAverageRating[cohort] =
      cohortTotalScore[cohort] / cohortTotalVotes[cohort];
  }

  return { cohortAverageRating, cohortTotalVotes };
}
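The average is vote-weighted: each raw row contributes `votes × rating` to the numerator and `votes` to the denominator. A tiny worked example:

```javascript
// Three rating buckets for one cohort:
// (3 × 10) + (1 × 6) + (0 × 1) = 36 over 4 votes → weighted average 9.
const buckets = [
  { rating: 10, votes: 3 },
  { rating: 6, votes: 1 },
  { rating: 1, votes: 0 }
];
const totalVotes = buckets.reduce((sum, d) => sum + d.votes, 0);
const totalScore = buckets.reduce((sum, d) => sum + d.votes * d.rating, 0);
const weightedAverage = totalScore / totalVotes; // 36 / 4 === 9
```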

A helper to parse the season and episode out of a production code:

function parseSeasonAndEpisode(productionCode) {
  const match = productionCode.match(/S(\d+)E(\d+)/i);
  if (match) {
    return {
      season: parseInt(match[1]),
      episode: parseInt(match[2])
    };
  }
  throw new Error("failed parse");
}

Now we join in even more data: the average rating for that episode from that specific cohort:

cohortAverages = {
  const intermediate = tidy.map((d) => ({
    ...d,
    cohortAverageRating:
      cohortToAverageRatingAndTotalVotesPerEp.cohortAverageRating[d.cohort],
    cohortTotalVotes:
      cohortToAverageRatingAndTotalVotesPerEp.cohortTotalVotes[d.cohort]
  }));

  return _.uniqBy(intermediate, "cohort");
}

exploded creates “synthetic” datapoints based on the number of votes a certain cohort gave for a certain rating. We use this to generate our linear regression lines.

exploded = {
  const acc = [];

  for (const d of tidy) {
    for (let i = 0; i < d.votes; ++i) {
      acc.push(d);
    }
  }

  return acc;
}
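A row with `votes: 3` becomes three identical synthetic points, so an ordinary least-squares fit over `exploded` behaves like a vote-weighted fit over `tidy`. The same explode logic on a two-row fixture:

```javascript
// Each row is repeated once per vote.
const rows = [
  { rating: 9, votes: 3 },
  { rating: 6, votes: 1 }
];
const explodedRows = [];
for (const d of rows) {
  for (let i = 0; i < d.votes; ++i) explodedRows.push(d);
}
// explodedRows.length === 4; the 9-rated row appears three times
```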

Finally, keep just the all-ages, all-genders cohort averages:

cohortAllCohortAverages = cohortAverages
  .filter((d) => d.gender === "all" && d.age === "all")
  .filter((d) => d.cohortAverageRating)

Statistical analysis

I built this linear model in R and then asked Claude to replicate its output, character for character, with an equivalent JavaScript implementation.

# Install rjson if not already installed
if (!require("rjson", quietly = TRUE)) {
  install.packages("rjson", repos = "https://cloud.r-project.org")
  library(rjson)
}

# Read the exploded data from JSON using readLines for better performance
cat("Reading JSON file...\n")
json_lines <- readLines("exploded.json")
json_str <- paste(json_lines, collapse = "")
cat("Parsing JSON...\n")
json_data <- fromJSON(json_str)
cat("JSON loaded:", length(json_data), "entries\n\n")

# Convert list to data frame, handling NULLs
lisaProp <- unlist(lapply(json_data, function(x) if(is.null(x$lisaProportion)) NA else x$lisaProportion))
ratings <- unlist(lapply(json_data, function(x) if(is.null(x$rating)) NA else x$rating))
ages <- unlist(lapply(json_data, function(x) if(is.null(x$age)) NA else x$age))
genders <- unlist(lapply(json_data, function(x) if(is.null(x$gender)) NA else x$gender))

df <- data.frame(
  lisaProportion = lisaProp,
  rating = ratings,
  age = ages,
  gender = genders,
  stringsAsFactors = FALSE
)

# Remove any NA rows
df <- df[complete.cases(df), ]

# Coerce column types for modeling
df$rating <- as.numeric(df$rating)
df$lisaProportion <- as.numeric(df$lisaProportion)
df$age <- as.factor(df$age)
df$gender <- as.factor(df$gender)

model <- lm(rating ~ lisaProportion * age * gender, data = df)
anova(model)
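That JavaScript replication isn’t reproduced here. As a sketch of its simplest piece, an ordinary least-squares fit for `rating ~ lisaProportion` alone (ignoring the age/gender factors and interactions; the function name is mine) looks like:

```javascript
// Simple OLS for y ~ x: slope = cov(x, y) / var(x), intercept = ȳ − slope·x̄.
// A sketch of the no-interaction piece only, not the full model replication.
function ols(xs, ys) {
  const n = xs.length;
  const meanX = xs.reduce((sum, v) => sum + v, 0) / n;
  const meanY = ys.reduce((sum, v) => sum + v, 0) / n;
  let cov = 0;
  let varX = 0;
  for (let i = 0; i < n; ++i) {
    cov += (xs[i] - meanX) * (ys[i] - meanY);
    varX += (xs[i] - meanX) ** 2;
  }
  const slope = cov / varX;
  return { slope, intercept: meanY - slope * meanX };
}

ols([0, 1, 2], [1, 3, 5]); // → { slope: 2, intercept: 1 }
```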