Global Name Data

This entry was posted by Adam Hyland on June 03, 2013 in Data Visualization and Bocoup.

Open Data, Open Gender Tracker and the Open Web

Open data is a core element of building data visualizations on the Open Web, driving our ability to reproduce, tinker with and expand visualizations. In a tool like Open Gender Tracker, open data becomes invaluable. Open Gender Tracker relies on name and gender mappings in order to ascertain the gender of an author or source. Today, we are releasing Global Name Data, a dataset of birth name-gender mappings which we believe to be the most comprehensive in the world.

Most gender classification tools draw name-gender mappings from sets of data which are highly reduced, such as top-1000 lists or IPUMS census samples. While this data is useful, it is woefully insufficient when attempting to classify gender for uncommon names. Likewise, the geographic limitations of these datasets make them unviable for an international service.

In deploying Open Gender Tracker to a citizen media service with authors from around the globe, we quickly realized our gender byline classifier service needed to handle uncommon names and provide reliable estimates for gender across multiple countries. To that end, we gathered open data from the United States and the United Kingdom to create a machine-readable resource for baby names and gender. Between the US and the UK we have data on gender mappings for 101,749 unique names arising from records on 336,267,178 births—an order of magnitude improvement over the next best name dataset.

Global Name Data

Each government name dataset we used was released under an open content license. But open licensing is only one step toward open data. The Social Security Administration provides data in a common machine-readable format, but the various statistical agencies in the United Kingdom used a variety of formats and presentations. Global Name Data collects each of these datasets and presents them in a common, machine-readable format for easy use and re-use. Additionally, we believe reproducibility is a core component of open data. We have provided tools (written in R) to download and parse the data from each source.
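
As a rough illustration of the reshaping step, the sketch below turns a few records in the `name,sex,count` layout used by the SSA's yearly files into the wide Name/F/M/Year table shown later in this post. It is not the package's own parsing code: `toWide()` and the inline sample are assumptions made for this example.

```r
# Hypothetical sample in a "name,sex,count" layout; the counts mirror
# some of the 1880 rows printed later in this post.
raw <- "Aaron,M,102
Ab,M,5
Abbie,F,71
Abby,F,6"

# toWide() is an illustrative helper, not a function from namedata.
toWide <- function(text, year) {
  # colClasses stops read.csv() from reading a column of "F" as logical FALSE
  long <- read.csv(text = text, header = FALSE,
                   col.names = c("Name", "Sex", "Count"),
                   colClasses = c("character", "character", "integer"))
  # Force both levels so input with only one sex still yields F and M columns
  long$Sex <- factor(long$Sex, levels = c("F", "M"))
  # Cross-tabulate counts: one row per name, one column per sex
  tab <- xtabs(Count ~ Name + Sex, data = long)
  data.frame(Name = rownames(tab),
             F = as.vector(tab[, "F"]),
             M = as.vector(tab[, "M"]),
             Year = year,
             stringsAsFactors = FALSE)
}

wide <- toWide(raw, year = 1880)
print(wide)
```

Names that appear for only one sex get a zero in the other column, which matches the shape of the table the package exposes.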

Databases of birth names are used for a variety of projects. Open Gender Tracker uses name data to track bylines and mentions by gender in news articles. Laura and Martin Wattenberg used baby name data to build one of the earliest popular interactive visualizations on the web, the Baby Name Voyager. Political campaigns use name databases to estimate the likely age and income of voters from voter roll databases. Each of these use cases relies upon accurate and rich datasets for names across a number of dimensions (not just gender).

With that in mind, let’s take a closer look at how we can access the data. If you have R installed, you can follow the instructions in the README to install the package. Once that’s done, it’s simple to load the package into an R session and explore the data.


## Import the library and the data
## If you haven't installed the package
## follow the instructions on the readme
library(namedata)
library(ggplot2) # needed for the plots further down
data(usnames)

head(usnames)
#     Name  F   M Year
# 1  Aaron  0 102 1880
# 2     Ab  0   5 1880
# 3  Abbie 71   0 1880
# 4 Abbott  0   5 1880
# 5   Abby  6   0 1880
# 6    Abe  0  50 1880

We’re looking at the actual birth data for US births from 1880 to 2011. From here we can do more than classify gender: we can track the popularity (and androgyny!) of names, naming conventions and overall demographic changes. We can quickly get a sense of the baby boom using this data [1].


## Count births by gender/year
yearBirths2 <- function(data = usnames) {
  countBy <- function(
    x = c("Male", "Female")) {
    # match.arg() expands "M"/"F" to "Male"/"Female"
    gender <- match.arg(x)
    # Sum the matching count column (M or F) by year
    births <- rowsum(x = data[[substr(gender, 1, 1)]],
                     group = data[["Year"]])
    # rowsum() stores the groups in the row names
    # of its output, so pull them back out
    out <- data.frame(Year = as.numeric(rownames(births)),
                      Births = as.vector(births),
                      Gender = gender)
    return(out)
  }
  data.out <- rbind(countBy("M"), countBy("F"))
  return(data.out)
}
## we can also use yearBirths() from the namedata package

mw.df <- yearBirths2()

# A very basic plot using ggplot2.
# aes() allows us to describe mapping of variables
# similar to if we used
# with(mw.df, plot(x = Year, y = Births))
plotBabies <- function(data) {
  p <- ggplot(data) +
    geom_line(aes(x = Year,
                  y = Births,
                  colour = Gender),
              size = 1.7) +
    xlim(1930, 2010) + xlab("") +
    theme(legend.position = "top")
  return(p)
}

plotBabies(data = mw.df)

Births per year in the United States by gender.

Importantly, we can also use this data to determine likely genders for a name! One of the most common questions I was asked when building this project was “can you tell me how many men or women are named ‘Pat’ in the United States?” Thanks to the wonder of open data and a little code, we can answer that.


## Answering bar bets

# We have a function in namedata, nameMetric()
# which can compute the gender breakdown
# of a name (or names)
pat.df <- nameMetric(data = usnames,
                     names = "Pat",
                     metric = "Neutral") ## It's Pat!

# you can look at the source for nameMetric()
# but the computation for neutrality is simple:
# 1 - abs(0.5 - male.births / total.births) * 2
# scaling isn't strictly necessary but it gives
# a more straightforward interpretation

# A somewhat more complicated plot now
patPlot <- function(data) {
  # hack to make the sizing look slightly better:
  # when births fall, size the point by the
  # previous year's count instead
  births <- data[, "Births"]
  lagged <- c(0, births[-length(births)])
  data[, "Lagged"] <- ifelse(c(0, diff(births)) >= 0,
                             births,
                             lagged)
  ggp <- ggplot(data,
                aes(x = Year,
                    y = Neutral,
                    size = Births,
                    colour = Births))
  # guide_legend() can be used to combine scales
  # of different types (e.g. size, color) into
  # one legend
  b.g <- guide_legend(title = "Births per Year",
                      direction = "horizontal",
                      title.position = "top",
                      title.hjust = 0.5)
  # apply layers
  # note the differential mappings for geom_point()
  p <- ggp +
    geom_path(show.legend = FALSE) +
    geom_point(aes(size = Lagged,
                   colour = Lagged)) +
    xlim(1920, 1970) + ylim(0.1, 1)
  # apply styles
  p <- p +
    scale_colour_continuous(guide = "none") +
    guides(colour = b.g, size = b.g) +
    theme(legend.position = "bottom") +
    xlab("") + ylab("Androgyny")
  return(p)
}

patPlot(data = pat.df)

The 50s and 60s showed the most androgynous birth years for "Pat".
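
To make the neutrality score concrete, here is a minimal sketch that restates the formula from the comments above as a standalone function. The name `neutrality()` and the birth counts are made up for illustration; the source of `nameMetric()` remains the reference for the real implementation.

```r
# neutrality() is an illustrative helper that restates the formula
# 1 - abs(0.5 - male.births / total.births) * 2
# It is not the code inside namedata's nameMetric().
neutrality <- function(male, female) {
  total <- male + female
  1 - abs(0.5 - male / total) * 2
}

even <- neutrality(male = 50, female = 50)    # evenly split name
single <- neutrality(male = 100, female = 0)  # name given only to boys
skewed <- neutrality(male = 25, female = 75)  # three girls for every boy

print(c(even = even, single = single, skewed = skewed))
```

An even split scores 1, a name given to only one gender scores 0, and a three-to-one split lands at 0.5, which is why the scaling makes the metric easier to read.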

Interestingly, the most androgynous “Pat” cohort would’ve been in their late 20s or early 30s when the Saturday Night Live skit premiered. Other than that, the majority of people born with the name “Pat” in the United States are women (40,000 births versus 25,000). This breakdown is actually quite anomalous: the staggering majority of names are associated with only one gender at birth. But we can explore that ourselves with some code:


## Count name occurrence across years.
## see help('byNameCount') for more info

us.prop <- byNameCount(usnames)

#        Name years.appearing counts.female counts.male prop.male
# 1     Aaban               4             0          31         1
# 2     Aabha               1             7           0         0
# 3     Aabid               1             0           5         1
# 4 Aabriella               1             5           0         0
# 5     Aadam              20             0         150         1
# 6     Aadan               6             0          80         1

## Estimate probable gender
## the estimation function can be changed
## with a different function signature

us.est <- nameBinom(us.prop)

#        Name  ... prob.gender est.male    upper
# 1     Aaban  ...        Male        1 1.020655
# 2     Aabha  ...      Female        0 1.050109
# 3     Aabid  ...        Male        1 1.054572
# 4 Aabriella  ...      Female        0 1.054572
# 5     Aadam  ...        Male        1 1.005061
# 6     Aadan  ...        Male        1 1.009116
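
The `upper` values above can be roughly reproduced by hand. The sketch below assumes a 95% Agresti-Coull interval computed on the majority gender's share of births; `agrestiCoull()` is a hypothetical helper written for this example, not the code inside `nameBinom()`.

```r
# agrestiCoull() is an illustrative helper, not namedata's nameBinom().
# x: births of the majority gender; n: total births for the name.
agrestiCoull <- function(x, n, conf.level = 0.95) {
  z <- qnorm(1 - (1 - conf.level) / 2)
  n.adj <- n + z^2                # add z^2 pseudo-observations
  p.adj <- (x + z^2 / 2) / n.adj  # half of them counted as successes
  half <- z * sqrt(p.adj * (1 - p.adj) / n.adj)
  data.frame(estimate = x / n,
             lower = p.adj - half,
             upper = p.adj + half)
}

# Counts taken from the byNameCount() output above: Aaban, Aabha, Aadam
res <- agrestiCoull(x = c(31, 7, 150), n = c(31, 7, 150))
print(res)
```

For names given to only one gender the adjusted interval is centred just below 1, so its upper limit spills slightly past 1, and the overshoot shrinks as the number of births grows.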

The default classifier (Agresti and Coull’s inexact binomial proportion) has better performance in the interior of the distribution than other methods but will report impossible upper or lower confidence limits at the extremes (> 1 or < 0). For our purposes that’s fine. However, we’ve broken up the count and estimation functions to allow flexibility for end users who may be interested in building their own classifiers with more features such as phoneme classification or last-letter analysis. We can also just make something beautiful [2].

Building our gender classifier from open data allows us to not only release the results but also to provide the intermediate data which can be used to track name usage over time, generate n-gram and phoneme frequency information, and discover that in the United States, roughly 11,000 children have been born with the name “Baby” [3]. Global Name Data isn’t a revolution. It isn’t a dramatic project which will change how we deal with data or how quickly governments release data. But it does reduce the friction and uncertainty associated with using name data in your project. We’re excited to see what you do with this name data!

  1. The use of yearBirths2() is a slight misdirect. While it is certainly possible to clobber variables in the global environment in R, functions exported from packages are namespaced. If we had named our user-created function yearBirths, we would not have overwritten the package-supplied function, only inserted our function ahead of it on the search path. We could still reach namedata’s version with namedata::yearBirths.

  2. How this was made is left as an exercise for the reader. One hint: it doesn’t use the default estimation method (which wouldn’t result in truncation at 1, 0).

  3. It’s likely that some of these were children who hadn’t been named by the time they were entered into the SSA’s database, but let’s pretend for a moment that people name their kids crazy things. :)
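
The namespace behaviour described in footnote 1 can be checked with any attached package. The sketch below uses `median()` from the stats package as a stand-in for `yearBirths()`, purely as an assumption for illustration, since it ships with every R installation.

```r
# A user-defined function that shares its name with stats::median().
# This is a deliberately silly stand-in for illustration.
median <- function(x) "user-defined median"

# Our version is found first because the global environment
# sits at the front of the search path
mine <- median(c(1, 2, 3))

# The package's version is untouched and reachable with ::
theirs <- stats::median(c(1, 2, 3))

# Removing our version makes the package's function visible again
rm(median)
restored <- median(c(1, 2, 3))

print(list(mine = mine, theirs = theirs, restored = restored))
```

Nothing inside the package changes at any point; the user-defined function only masks the exported one until it is removed.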
