Gendered names in Freebase

2009 September 10

I recently put together a handy little app using Freebase Acre. It uses data about people in Freebase to tell you whether a particular given name is mostly used by men or women.

For instance, here are the results for the name Evelyn:

[Pie chart: results for the name Evelyn]

Other interesting results include the women named Jeff and the near 50:50 split on the name Andrea.

Anyway, I wanted to talk a little bit about how I made the app, so that anyone who wants can do likewise.

The Gendered Names app is built using Acre, as I mentioned above. To create a new Acre app, just go to acre.freebase.com and then to the “Apps” menu in the top left and choose “New app”. Alternatively, start with my Gendered Names app and clone it.

The core of the app is a query that looks like this:

[{
  "id":   null,
  "type": "/people/gender",
  "!/people/person/gender": [{
    "type":   "/people/person",
    "name~=": "^Evelyn ",
    "return": "count"
  }]
}]

This says, in effect: “for each gender in Freebase, find all the people who have that gender whose name begins with Evelyn, and count them.”
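To try the same query for other names, the name can be turned into a parameter. The following is a minimal Python sketch for illustration, not the app's actual code. The function name `gendered_name_query` is hypothetical, and the sketch only builds the query as JSON; it does not send it to Freebase.

```python
import json


def gendered_name_query(given_name):
    """Build the query shown above for an arbitrary given name.

    Hypothetical helper: it mirrors the structure of the Evelyn query
    and only swaps in a different name.
    """
    return [{
        "id": None,
        "type": "/people/gender",
        "!/people/person/gender": [{
            "type": "/people/person",
            # Same pattern as the Evelyn example: anchored at the start
            # of the name, with the trailing space kept.
            "name~=": "^" + given_name + " ",
            "return": "count",
        }],
    }]


# Serialise to JSON so it can be pasted into the Query Editor.
print(json.dumps(gendered_name_query("Andrea"), indent=2))
```

Note that Python's `None` becomes `null` when serialised, matching the `"id": null` in the original query.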

You can play with Freebase queries in the Query Editor. Here’s a permalink to my gendered name query. A cut-down version of the Query Editor is built into Acre, too.

I then took the results and presented them with the Google Chart API, which is fast becoming a favourite of mine. One problem I had, though, is that it doesn’t much like large numbers. You need to convert them to some other format first, like the extended encoding that allows for numbers up to 4095. Since we have 17,610 people in Freebase named John, even the extended encoding wouldn’t work for this case, so I just passed in percentages rounded to the nearest integer.
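The two options mentioned here can be sketched in Python. This is an illustration under assumptions, not the app's code: the function names are hypothetical, and the sample counts other than the 17,610 figure are made up. The extended encoding represents each value as two characters drawn from a 64-character alphabet, which is where the 4095 limit comes from (64 × 64 − 1).

```python
# The 64-character alphabet used by the Chart API's extended encoding.
EXTENDED_ALPHABET = (
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-."
)


def extended_encode(value):
    """Encode one integer in the range 0..4095 as two characters."""
    if not 0 <= value <= 4095:
        raise ValueError("extended encoding only covers 0..4095")
    high, low = divmod(value, 64)
    return EXTENDED_ALPHABET[high] + EXTENDED_ALPHABET[low]


def to_percentages(counts):
    """Convert raw counts to percentages rounded to the nearest integer."""
    total = sum(counts)
    if total == 0:
        return [0 for _ in counts]
    return [round(100 * count / total) for count in counts]


# A count like 17,610 is out of range for the extended encoding...
try:
    extended_encode(17610)
except ValueError as error:
    print("cannot encode:", error)

# ...so pass percentages using the plain text encoding instead.
hypothetical_counts = [300, 100]  # made-up numbers for illustration
chd = "t:" + ",".join(str(p) for p in to_percentages(hypothetical_counts))
print(chd)
```

Because each percentage is rounded independently, the values will not always add up to exactly 100. That is harmless for a pie chart, since only the proportions matter.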

That’s really the guts of the gendered names app, though there is also a bit of stuff to handle the parameters passed by the form, and to show a handful of male and female examples with links back to Freebase. Hopefully the code for that is reasonably self-explanatory.

A number of people have suggested to me that I should add some kind of time component, since certain names (Evelyn being a great example) have changed their default gender over time. I’ll definitely add that feature! Another thing I’d like to do is give a scaled response to account for the fact that Freebase has an overall preponderance of men. This query shows the problem:

[{
  "id":   null,
  "type": "/people/gender",
  "!/people/person/gender": [{
    "type":   "/people/person",
    "return": "estimate-count"
  }]
}]

(I used “estimate-count” because “count” times out on this query.) The results show that there are about 4.77 times as many men in Freebase as women. If we scaled the results to take this imbalance into account, the chart for Evelyn would look more like this:
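One plausible way to do that scaling is to divide each gender's count for the name by that gender's overall count in Freebase, then normalise the results so they add up to 1. The Python sketch below assumes this approach. The function name `scaled_shares` is hypothetical, and the name counts are made up for illustration; only the 4.77 ratio comes from the query above.

```python
def scaled_shares(name_counts, overall_counts):
    """Return each gender's share of a name, corrected for overall imbalance.

    name_counts:    people with the given name, per gender
    overall_counts: all people per gender (only the ratios matter)
    """
    # Rate at which each gender uses the name, relative to its overall size.
    rates = {
        gender: name_counts[gender] / overall_counts[gender]
        for gender in name_counts
    }
    total = sum(rates.values())
    if total == 0:
        return {gender: 0.0 for gender in rates}
    return {gender: rate / total for gender, rate in rates.items()}


# About 4.77 men for every woman overall, as estimated above.
overall = {"Male": 4.77, "Female": 1.0}

# Made-up counts: a name that looks heavily male in the raw data.
raw = {"Male": 477, "Female": 100}

for gender, share in scaled_shares(raw, overall).items():
    print(gender, round(100 * share), "%")
```

With these made-up numbers, a name that is about 83% male in the raw counts comes out evenly split once the imbalance is taken into account.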

[Pie chart: scaled results for the name Evelyn]

My app also provides a simple little API so you can get the name frequency data in JSON and mash it up with whatever else you want. It’s just one of many tools for guessing gender based on names, such as Genderyzer and CPAN’s Text::GenderFromName, but each has different strengths and weaknesses, so perhaps another tool with a different approach will help in some way.

6 Responses
  1. Bruce Van Allen
    September 10, 2009

    Thanks for your work on this!

    I take a somewhat different approach, which I’ll briefly sketch for added perspective on this challenging problem.

    My need for deriving gender from names is for political organizing, when a strategy could involve targeting voters with gender as one attribute among others such as age, party, and so on. In voter registration records, many people do not provide a gendered title such as Ms, Mr, Mrs, or Miss (American English here). So I needed a way to guess their genders with some reliability.

    I wrote a routine (in Perl, natch) that looks at the names of people who did provide gendered titles, and uses that to guess the gender of those who didn’t. The first thing I saw, of course, was the ambiguous names, so I added a statistical threshold below which the guess wasn’t allowed. Then I noticed the dimension that your method doesn’t account for, besides the limitations you mentioned: running the routine in different geographical areas gave different results. So for example in one county the name Guadalupe would come out as overwhelmingly female usage and in another county it would be too ambiguous to provide a useful guess.

    So my approach differs from yours in that it is based on actual usage within a specific population. Still uses some assumptions, and would only be useful where one has an actual population to look at. But after several years of field testing, this method’s results have been consistently better than others I’ve been shown, measured by fewer people assigned ‘unknown’ gender and by higher degree of accuracy found in the field.

    If I weren’t typing on such a small device, I’d offer some code, and I’d like someday to do some more rigorous statistical analysis of this method. Happy to follow up if anyone is interested.

    • September 10, 2009

      @Bruce Yeah, geographical distribution is a good one. You see it quite clearly with “Andrea”, which is a female name in some languages but a male name in others, leading to around 50% overall, but you could guess pretty strongly one way or the other if you were in e.g. Australia (female) or Italy (male).

      Freebase has “place of birth” but that’s less well filled out than name and gender. It would be something to consider, though, as an option.

  2. James
    September 10, 2009

    The outliers often seem to be misclassifications – e.g. for John and James there are a few pen-names of women, or in the case of Rebecca M Riordan you got her gender wrong in Genderizer, assuming her O’Reilly profile is correct. Not to pick on you, it just happened to be the first one I came across. For some reason I thought Genderizer/Typewriter averaged three people’s opinions before changing a record. Is that not the case?

  3. September 12, 2009

    @James: some of our queues (merge and delete) require three people in consensus before writing, but the genderizer/typewriter don’t. It has to do with how easily an error can be reverted. Anyone can fix Rebecca M Riordan’s gender easily, but it’s much harder to undelete or unmerge. Given that, we went for volume :) As to why I mis-gendered her… I think that just happens sometimes, when you’re blasting through a big set. You get a little dazed ;) I know there was some talk at our last hack day about how to mitigate that effect (ideally in fun ways that don’t slow down data contribution too much).

  4. September 12, 2009

    I am amused by the contrasting results for “Lindsey” and “Lindsay”.
