Notes and Methodology

This Oscar dataset is pretty massive, and it didn't come out of the source in a clean, machine-readable format. So I thought it might be useful to talk about my sources and explain what I did to clean up the data. For time reasons, the following steps are outlined only schematically, to give a general idea of the challenges, choices and interpretations I made. But you can contact me if you have any questions, and I'd be happy to explain things in more detail.

Source of the data: The Academy Awards Official Database. I scraped all the single nominations announced in each Award Category + Year combination.

First problem: each nomination can have more than one nominee. (For example, the Scientific And Engineering Award of 2014 went "To EMMANUEL PRÉVINAIRE, JAN SPERLING, ETIENNE BRANDT and TONY POSTIAU for their development of the Flying-Cam SARAH 3.0 system.") In my sheet, I wanted a single nominee per line, so I extracted them. As a result, there is one data point called 'EMMANUEL PRÉVINAIRE - Scientific And Engineering Award of 2014', another called 'JAN SPERLING - Scientific And Engineering Award of 2014', and so on. To distinguish nominations with multiple nominees from those with a single nominee, I gave each nominee a weight of 1 divided by the total number of nominees sharing the nomination. So, for example, the data points referring to EMMANUEL PRÉVINAIRE, JAN SPERLING, ETIENNE BRANDT and TONY POSTIAU each have a "Nomination Weight" of 0.25 (1/4).
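The splitting and weighting step can be sketched roughly as follows. This is only an illustration, not necessarily how the sheet was actually built: the helper names (`split_nominees`, `to_rows`) and the regular expression are hypothetical, and the sketch assumes the citation follows the "To <names> for ..." wording shown in the example.

```python
import re

def split_nominees(citation):
    """Split a citation like 'To A, B and C for their ...' into names.

    Hypothetical helper: assumes the 'To <names> for ...' wording.
    """
    match = re.match(r"To (.+?) for ", citation)
    if match is None:
        return []
    # Names are separated by commas and a final "and".
    parts = re.split(r",\s*|\s+and\s+", match.group(1))
    return [p.strip() for p in parts if p.strip()]

def to_rows(citation, award):
    """Return one row per nominee, each weighted 1 / number of nominees."""
    names = split_nominees(citation)
    weight = 1 / len(names) if names else 0.0
    return [
        {"nominee": name, "award": award, "nomination_weight": weight}
        for name in names
    ]

citation = (
    "To EMMANUEL PRÉVINAIRE, JAN SPERLING, ETIENNE BRANDT and TONY POSTIAU "
    "for their development of the Flying-Cam SARAH 3.0 system."
)
rows = to_rows(citation, "Scientific And Engineering Award of 2014")
```

Real citations vary a lot in wording, so a single pattern like this would likely need manual checking on top.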

When calculating the % of female and male nominees per year or award category, I made two calculations: one using the raw numbers (for example: # of female nominees out of the total # of nominees) and a weighted one (for example: the sum of the "Nomination Weight" of female nominees, divided by the total sum of "Nomination Weight" for all nominees).

On this note: obviously, only people were counted when calculating the % of female and male nominees. So if a nomination has, for example, MGM, Marilyn Monroe and James Dean as nominees, each of the three has a "Nomination Weight" of 0.33, but only the two people are added up when calculating male/female ratios (total: 0.66). For this reason, it only makes sense to compare the percentages, rather than the raw numbers of male/female nominees/winners.
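To make the raw versus weighted calculation concrete, here is a minimal sketch using the MGM / Marilyn Monroe / James Dean example. The field names (`type`, `gender`, `weight`) and the function `female_share` are made up for illustration and do not correspond to the actual column names in the dataset.

```python
# One nomination shared by a company and two people, each weighted 1/3.
nominees = [
    {"name": "MGM", "type": "company", "gender": None, "weight": 1 / 3},
    {"name": "Marilyn Monroe", "type": "person", "gender": "Female", "weight": 1 / 3},
    {"name": "James Dean", "type": "person", "gender": "Male", "weight": 1 / 3},
]

def female_share(rows, weighted=False):
    """Share of female nominees among people only (non-people are ignored)."""
    people = [r for r in rows if r["type"] == "person"]
    # Raw: every person counts as 1. Weighted: every person counts as their weight.
    value = (lambda r: r["weight"]) if weighted else (lambda r: 1)
    total = sum(value(r) for r in people)
    female = sum(value(r) for r in people if r["gender"] == "Female")
    return female / total if total else None

raw_share = female_share(nominees)
weighted_share = female_share(nominees, weighted=True)
```

In this small example both shares come out at one half, because the weighted total only covers the two people (about 0.66), not the full nomination. Across many nominations with different numbers of nominees, the raw and weighted shares will generally differ.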

Second problem: Award categories changed name and structure over time. I tried to normalize them as best as I could. (For transparency: each nomination also has a column with the original award category wording.)

Third problem: The scraped list also contains nominations that were withdrawn, revoked... I flag them with a "Yes" in the column "Revoked/Withdrawn", so they are still in the dataset. However, I excluded them from the calculations of male/female ratios and of # of nominations/wins per movie.

Fourth problem: The scraped nominee list includes movies, people, companies, countries... I categorized each nominee accordingly, so that only the people, countries, studios, etc. can be selected when making statistics.

Fifth problem: Determining the gender of each of the 7K+ unique people nominees. The Academy Awards Official Database lets you generate a query listing only female nominees. I scraped this list and used it to mark the relevant nominees as "Female". However, a nominee missing from this list can't automatically be marked as male (they are just not registered as male in the database). I used the Genderize API to see which of these names had a high chance of being male. For the remaining ambiguous ones, whose names could be either female or male, I manually googled the profiles to find out.
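The three-step logic above (female list first, then a name-based guess, then manual lookup) could look something like this. It is a sketch under assumptions: the shape of the `guess` dictionary (a `gender` and a `probability` field) is my assumption about what a name-based gender API returns, and the 0.9 threshold is a hypothetical value, since no exact cut-off is stated here.

```python
def classify_gender(name, female_nominees, guess, threshold=0.9):
    """Return 'Female', 'Male' or 'Check manually' for one nominee.

    female_nominees: set of names scraped from the database's female query.
    guess: name-based prediction, e.g. {"gender": "male", "probability": 0.98}
           (assumed shape, not a confirmed API response).
    threshold: hypothetical cut-off for "a high chance of being male".
    """
    if name in female_nominees:
        return "Female"
    if guess.get("gender") == "male" and guess.get("probability", 0) >= threshold:
        return "Male"
    # Anything else is ambiguous and goes to manual research.
    return "Check manually"

female_nominees = {"Marilyn Monroe"}
result_listed = classify_gender("Marilyn Monroe", female_nominees, {})
result_confident = classify_gender(
    "James Dean", female_nominees, {"gender": "male", "probability": 0.99}
)
result_ambiguous = classify_gender(
    "Some Nominee", female_nominees, {"gender": "male", "probability": 0.6}
)
```

The order matters: the scraped female list is treated as more reliable than any name-based guess, so it is checked first.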

Sixth problem: Pulling in more data about each movie (image, plot, director...). Source: IMDb (and the amazing OMDb API). I scraped from IMDb a list of all the movies that were nominated for an Academy Award, together with their IMDb unique ID (the 'tt123456' in the URL of each movie). I used this unique ID to pull in data from OMDb. (Note: matching movies mentioned in the Academy Awards nominations with their IMDb title isn't always easy: sometimes IMDb uses a different title.)
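As a rough sketch of the ID-based lookup: the snippet below extracts the 'tt...' ID from an IMDb URL and builds an OMDb request URL from it. It only builds the URL and does not call the API. The `i` (IMDb ID) and `apikey` query parameters reflect my understanding of the OMDb API and should be checked against its documentation; the helper names and the placeholder key are hypothetical.

```python
import re
from urllib.parse import urlencode

OMDB_ENDPOINT = "http://www.omdbapi.com/"

def imdb_id_from_url(url):
    """Extract the 'tt...' ID from an IMDb title URL, or None if absent."""
    match = re.search(r"/title/(tt\d+)", url)
    return match.group(1) if match else None

def omdb_request_url(imdb_id, api_key):
    """Build an OMDb lookup URL for one IMDb ID (does not send the request)."""
    return OMDB_ENDPOINT + "?" + urlencode({"i": imdb_id, "apikey": api_key})

movie_id = imdb_id_from_url("https://www.imdb.com/title/tt123456/")
request_url = omdb_request_url(movie_id, "YOUR_API_KEY")
```

Looking movies up by ID rather than by title sidesteps the title-mismatch issue mentioned above, but only once each nominated movie has been matched to its IMDb entry.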
