The extent of each line represents the degree to which that character is more likely to use that word compared to other characters.
Notice how Cap talks about people (especially Tony). T’Challa (Black Panther)’s speech is marked by noble topics, opposite of Spiderman, who bumbles around like the teenager that he is. Hulk (Bruce Banner) and Clint (Hawkeye) both are notable for referring to Nat (Black Widow), although for different reasons. Vision and Scarlet Witch talk about very similar themes, which might explain why they seem to gravitate toward each other. Thor’s got his mind set on the bigger picture, leading directly into the events to come in Infinity War. Loki, Unsurprisingly, is the character most likely to talk about power. Ultron wants power in an entirely different way, and is more poetic.
All of these patterns were identified by Elle O’Brien, who uses neural networks to generate predictive text for Botnik Studios. The visualization project was initiated during a meetup of Data Viz Jam Sessions, hosted by Nancy Organ.
Here are the R packages that we will use:
library(dplyr)
library(grid)
library(gridExtra)
library(ggplot2)
library(reshape2)
library(cowplot)
library(jpeg)
library(extrafont)
Some people say that it’s bad form to use this “clear everything” line. I do it routinely at the top of a script to make sure that when I run it, it doesn’t depend on any objects that I accidentally left in the workspace.
rm(list = ls())
dir_images <- "C:\\Users\\Matt\\Documents\\R\\Avengers"
setwd(dir_images)
windowsFonts(Franklin=windowsFont("Franklin Gothic Demi"))
character_names <- c("black_panther","black_widow","bucky","captain_america",
"falcon","hawkeye","hulk","iron_man",
"loki","nick_fury","rhodey","scarlet_witch",
"spiderman","thor","ultron","vision")
image_filenames <- paste0(character_names, ".jpg")
read_image <- function(filename){
char_name <- gsub(pattern = "\\.jpg$", "", filename)
img <- jpeg::readJPEG(filename)
return(img)
}
all_images <- lapply(image_filenames, read_image)
names(all_images) <- character_names
Here’s an example of how easy it is, using those names
# clear the plot window
grid.newpage()
# draw to the plot window
grid.draw(rasterGrob(all_images[['vision']]))
This was collected by Elle O’Brien, using some fancy text mining analysis on the movie scripts.
I know that you won’t be able to download it on your own computer using this line (because you don’t have the file), but maybe Elle might share it. If she wants to share it here, I’ll update this page.
load("Avengers_word_data.RData")
capitalize <- Vectorize(function(string){
substr(string,1,1) <- toupper(substr(string,1,1))
return(string)
})
proper_noun_list <- c("clint","hydra","steve","tony",
"sam","stark","strucker","nat","natasha",
"hulk","tesseract", "vision",
"loki","avengers","rogers", "cap", "hill")
# Run the capitalization function
word_data <- word_data %>%
mutate(word = ifelse(word %in% proper_noun_list, capitalize(word), word)) %>%
mutate(word = ifelse(word == "jarvis", "JARVIS", word))
Notice that the simplified character names from before don’t match the nicely formatted character names in the text dataframe
unique(word_data$Speaker)
## [1] "Black Panther" "Black Widow" "Bucky"
## [4] "Captain America" "Falcon" "Hawkeye"
## [7] "Hulk" "Iron Man" "Loki"
## [10] "Nick Fury" "Rhodey" "Scarlet Witch"
## [13] "Spiderman" "Thor" "Ultron"
## [16] "Vision"
character_labeler <- c(`black_panther` = "Black Panther",
`black_widow` = "Black Widow",
`bucky` = "Bucky",
`captain_america` = "Captain America",
`falcon` = "Falcon", `hawkeye` = "Hawkeye",
`hulk` = "Hulk", `iron_man` = "Iron Man",
`loki` = "Loki", `nick_fury` = "Nick Fury",
`rhodey` = "Rhodey",`scarlet_witch` ="Scarlet Witch",
`spiderman`="Spiderman", `thor`="Thor",
`ultron` ="Ultron", `vision` ="Vision")
one for display (pretty) and one for simple organization and referring to image file names (simple)
convert_pretty_to_simple <- Vectorize(function(pretty_name){
# pretty_name = "Vision"
simple_name <- names(character_labeler)[character_labeler==pretty_name]
# simple_name <- as.vector(simple_name)
return(simple_name)
})
# convert_pretty_to_simple(c("Vision","Thor"))
# just for fun, the inverse of that function
convert_simple_to_pretty <- function(simple_name){
# simple_name = "vision"
pretty_name <- character_labeler[simple_name] %>% as.vector()
return(pretty_name)
}
# example
convert_simple_to_pretty(c("vision","black_panther"))
## [1] "Vision" "Black Panther"
word_data$character <- convert_pretty_to_simple(word_data$Speaker)
character_palette <- c(`black_panther` = "#51473E",
`black_widow` = "#89B9CD",
`bucky` = "#6F7279",
`captain_america` = "#475D6A",
`falcon` = "#863C43", `hawkeye` = "#84707F",
`hulk` = "#5F5F3F", `iron_man` = "#9C2728",
`loki` = "#3D5C25", `nick_fury` = "#838E86",
`rhodey` = "#38454E",`scarlet_witch` ="#620E1B",
`spiderman`="#A23A37", `thor`="#323D41",
`ultron` ="#64727D", `vision` ="#81414F" )
avengers_bar_plot <- word_data %>%
group_by(Speaker) %>%
top_n(5, amount) %>%
ungroup() %>%
mutate(word = reorder(word, amount)) %>%
ggplot(aes(x = word, y = amount, fill = character))+
geom_bar(stat = "identity", show.legend = FALSE)+
scale_fill_manual(values = character_palette)+
scale_y_continuous(name ="Log Odds of Word",
breaks = c(0,1,2)) +
theme(text = element_text(family = "Franklin"),
# axis.title.x = element_text(size = rel(1.5)),
panel.grid = element_line(colour = NULL),
panel.grid.major.y = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_rect(fill = "white",
colour = "white"))+
# theme(strip.text.x = element_text(size = rel(1.5)))+
xlab("")+
coord_flip()+
facet_wrap(~Speaker, scales = "free_y")
avengers_bar_plot
But I want to plot something more ambitious. We want the character images to show through the bars.
The idea is to display the image only in the area of the bar, cutting it off at the bar endpoint.
To do this, we will display a transparent bar, and then at the bar endpoint, plot a white bar extending to the plot edge, to cover up the rest of the picture
In the data, frame, we now want to complement the numeric values with remainder of value needed to reach the overall max (plus a bit more), so that when combining value with remainder, everything adds to the same maximum number, forming a combination of lines with uniform length.
max_amount <- max(word_data$amount)
word_data$remainder <- (max_amount - word_data$amount) + 0.2
word_data_top5 <- word_data %>%
group_by(character) %>%
arrange(desc(amount)) %>%
slice(1:5) %>%
ungroup()
This ensures that there are two entries for each character-word pair; one for the real amount (“amount”), and one for picking up where that ends, extending out to a common max value (“remainder”).
This will collapse the “amount” and “remainder” into a single column called “variable”, indicating which value it is, and another column “value” containing the number from each of those values.
word_data_top5_m <- melt(word_data_top5, measure.vars = c("amount","remainder"))
variable
is the marker of whether the value is the real amount, or the compensated amount.Now we put those levels in an ordered factor, reversed from how it was determined in the melt function. Otherwise, the “amount” and “remainder” will show up in reverse order on the plot.
word_data_top5_m$variable2 <- factor(word_data_top5_m$variable,
levels = rev(levels(word_data_top5_m$variable)))
The only argument is character, declared in simple form (“black_panther” instead of “Black Panther”)
plot_char <- function(character_name){
# example: character_name = "black_panther"
# plot details that we might want to fiddle with
# thickness of lines between bars
bar_outline_size <- 0.5
# transparency of lines between bars
bar_outline_alpha <- 0.25
#
# The function takes the simple character name,
# but here, we convert it to the pretty name,
# because we'll want to use that on the plot.
pretty_character_name <- convert_simple_to_pretty(character_name)
# Get the image for this character,
# from the list of all images.
temp_image <- all_images[character_name]
# Make a data frame for only this character
temp_data <- word_data_top5_m %>%
dplyr::filter(character == character_name) %>%
mutate(character = character_name)
# order the words by frequency
# First, make an ordered vector of the most common words
# for this character
ordered_words <- temp_data %>%
mutate(word = as.character(word)) %>%
dplyr::filter(variable == "amount") %>%
arrange(value) %>%
`[[`(., "word")
# order the words in a factor,
# so that they plot in this order,
# rather than alphabetical order
temp_data$word = factor(temp_data$word, levels = ordered_words)
# Get the max value,
# so that the image scales out to the end of the longest bar
max_value <- max(temp_data$value)
fill_colors <- c(`remainder` = "white", `value` = "white")
# Make a grid object out of the character's image
character_image <- rasterGrob(all_images[[character_name]],
width = unit(1,"npc"),
height = unit(1,"npc"))
# make the plot for this character
output_plot <- ggplot(temp_data)+
aes(x = word, y = value, fill = variable2)+
# add image
# draw it completely bottom to top (x),
# and completely from left to the the maximum log-odds value (y)
# note that x and y are flipped here,
# in prep for the coord_flip()
annotation_custom(character_image,
xmin = -Inf, xmax = Inf, ymin = 0, ymax = max_value) +
geom_bar(stat = "identity", color = alpha("white", bar_outline_alpha),
size = bar_outline_size, width = 1)+
scale_fill_manual(values = fill_colors)+
theme_classic()+
coord_flip(expand = FALSE)+
# use a facet strip,
# to serve as a title, but with color
facet_grid(. ~ character, labeller = labeller(character = character_labeler))+
# figure out color swatch for the facet strip fill
# using character name to index the color palette
# color= NA means there's no outline color.
theme(strip.background = element_rect(fill = character_palette[character_name],
color = NA))+
# other theme elements
theme(strip.text.x = element_text(size = rel(1.15), color = "white"),
text = element_text(family = "Franklin"),
legend.position = "none",
panel.grid = element_blank(),
axis.text.x = element_text(size = rel(0.8)))+
# omit the axis title for the individual plot,
# because we'll have one for the entire ensemble
theme(axis.title = element_blank())
return(output_plot)
}
plot_x_axis_text <- paste("Tendency to use this word more than other characters do",
"(units of log odds ratio)", sep = "\n")
sample_plot <- plot_char("black_panther")+
theme(axis.title = element_text())+
# x lab is still declared as y lab
# because of coord_flip()
ylab(plot_x_axis_text)
sample_plot
Because as the numbers get higher, the odds get really high (we won’t get into the math here); transforming them to a log scale constrains the amount of variability that you need to display on the screen.
In case you want to convert these log odds to a simple probability, here is a function:
logit2prob <- function(logit){
odds <- exp(logit)
prob <- odds / (1 + odds)
return(prob)
}
logit2prob(seq(0, 2.5, 0.5))
## [1] 0.5000000 0.6224593 0.7310586 0.8175745 0.8807971 0.9241418
diff(logit2prob(seq(0, 2.5, 0.5)))
## [1] 0.12245933 0.10859925 0.08651590 0.06322260 0.04334474
and explored some details of it,
Let’s apply that function to list of all characters, putting all the plots into one list object.
all_plots <- lapply(character_names, plot_char)
### Function to extract the axis title from the plot
not just the text, but the actual thing drawn.
You can choose which axis (x/y) you want to extract
get_axis_grob <- function(plot_to_pick, which_axis){
# plot_to_pick <- sample_plot
tmp <- ggplot_gtable(ggplot_build(plot_to_pick))
# tmp$grobs
# find the grob that looks like
# it would be the x axis
axis_x_index <- which(sapply(tmp$grobs, function(x){
# for all the grobs,
# return the index of the one
# where you can find the text
# "axis.title.x" or "axis.title.y"
# based on input argument `which_axis`
grepl(paste0("axis.title.",which_axis), x)}
))
axis_grob <- tmp$grobs[[axis_x_index]]
return(axis_grob)
}
px_axis_x <- get_axis_grob(sample_plot, "x")
px_axis_y <- get_axis_grob(sample_plot, "y")
grid.newpage()
grid.draw(px_axis_x)
# grid.draw(px_axis_y)
big_plot <- arrangeGrob(grobs = all_plots)
Take that big ensemble of plots, and insert the x axis underneath, since each plot doesn’t have an x axis, and we want a single axis for all of them.
Note that the big plot is 10x as tall as the axis.
big_plot_w_x_axis_title <- arrangeGrob(big_plot,
px_axis_x,
heights = c(10,1))
grid.newpage()
grid.draw(big_plot_w_x_axis_title)
Oh no! those plots all take up slightly different amount of page space because the words have different lengths.
It looks a little messy.
Normally, we ensure neat and aligned arrangement of plots using facet_grid() or facet_wrap(), but we can’t use those here, because each plot gets its own custom background image that can’t be mapped to a facet as would other columns in a data frame (because the image isn’t actually part of the data frame).
so that plot axes are aligned vertically
big_plot_aligned <- cowplot::plot_grid(plotlist = all_plots, align = 'v', nrow = 4)
big_plot_w_x_axis_title_aligned <- arrangeGrob(big_plot_aligned,
px_axis_x,
heights = c(10,1))
grid.newpage()
grid.draw(big_plot_w_x_axis_title_aligned)
ggsave(big_plot_w_x_axis_title_aligned,
file = "Avengers_Word_Usage.png",
width = 12, height = 6.3)
Matt Winn
I’m a hearing scientist/audiologist who works at the Univ of Washington (soon: the Univ of Minnesota). Nope, that has nothing to do with the Avengers, but if you help to understand challenges and solutions for human communication, you’re a hero to me ;)