Ruben Verborgh

Distinguishing between Frank and Nancy

Enriching DBpedia with detected gender information allows for new insights.

Ever looked up a person in an encyclopedia without knowing whether it was a man or a woman? And if you did, was it explicitly mentioned in the article? I’m guessing the answer to both questions is “no”. Gender is of course not that important; we’re interested in people for what they do. Yet at the same time, this particular piece of information is so trivial and obvious that we often just don’t mention it. This means that machines, which require explicit instruction, have no way to determine this elementary fact. Therefore, it’s hard to study even simple statistics in an automated way. This is why the Dutch DBpedia chapter asked me to experiment with gender extraction for people, based on their Wikipedia pages.

Honestly, gender is not that important. Nevertheless, it’s very present in most cultures. It already starts before birth: the one thing people ask expectant parents about their future child is not whether it will have green, blue or brown eyes, or whether it will be extroverted or rather shy. They want to know: will it be a boy or a girl? Granted, it’s one of the easier things to find out in the womb, yet hopefully not as much of a life determinant as anything else.

At the same time, gender seems a basic human characteristic we (have learned to) distinguish very rapidly, starting already at a young age. In most languages, words and sentences are used differently depending on the subject’s or object’s gender. And while gender can have severe consequences in many parts of the world, the distinction is in most cases so obvious that it’s too trivial to mention it explicitly.

DBpedia is the Linked Data version of Wikipedia, offering encyclopedic knowledge in a machine-interpretable form. The vast majority of triples in DBpedia are derived from so-called infoboxes on Wikipedia. For instance, if you look up the Wikipedia articles on Frank and Nancy Sinatra, you’ll notice a list of key–value properties in a dedicated box on the right-hand side. Such (semi-)structured information is a blessing for machines, which cannot deal well with natural language yet. Missing in this column is a gender field that states the obvious. Consequently, the automated conversion process from infobox to triples will not generate a gender property, so if we want it, we have to create it in a different way.

Determining gender is easy for humans, but machines need to know explicitly. ©Giovanna Baldini

Finding the people

The first obvious step is to list all people. That’s something DBpedia can help us with. This query gives us people and their Wikipedia pages:

  PREFIX foaf: <http://xmlns.com/foaf/0.1/>
  SELECT ?person ?article WHERE {
    ?person a foaf:Person; foaf:isPrimaryTopicOf ?article.
  }

It turns out there are more than 80,000 people in the Dutch DBpedia, so that’s not the kind of query you want to execute on a remote SPARQL endpoint in one go. Luckily, there are alternatives:

  • Access parts of the result by repeatedly sending the query to a SPARQL endpoint with OFFSET and LIMIT (hoping the endpoint does not go down in the meantime).
  • Execute the query with the triple pattern fragments client and let the results stream in one by one.
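For the first alternative, here is a minimal Python sketch of how such paging could look. It is not the code I ran: `paged_query`, `fetch_all` and the `run_query` callback are hypothetical names, and the `ORDER BY` clause is my own addition, since paging is only stable if the endpoint returns results in a fixed order. The callback is whatever function sends a query string to your endpoint and returns the result rows.

```python
from typing import Callable, Dict, Iterator, List

# Query from above, extended with paging clauses (ORDER BY is an assumption).
QUERY_TEMPLATE = """\
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?person ?article WHERE {{
  ?person a foaf:Person;
          foaf:isPrimaryTopicOf ?article.
}}
ORDER BY ?person
LIMIT {limit} OFFSET {offset}"""


def paged_query(limit: int, offset: int) -> str:
    """Build the query text for one page of results."""
    return QUERY_TEMPLATE.format(limit=limit, offset=offset)


def fetch_all(run_query: Callable[[str], List[Dict]],
              page_size: int = 1000) -> Iterator[Dict]:
    """Yield all rows by requesting one page after another.

    `run_query` is a hypothetical callback that executes a query string
    against an endpoint and returns the rows of that page.
    """
    offset = 0
    while True:
        rows = run_query(paged_query(page_size, offset))
        yield from rows
        if len(rows) < page_size:  # a short page means we reached the end
            break
        offset += page_size
```

Passing the endpoint access in as a callback keeps the paging logic independent of any particular HTTP or SPARQL library.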

Since I quite liked the streaming thing (and I wanted a long-running query test for triple pattern fragments anyway), I quickly set up a public triple pattern fragments version of DBpedia 2014 NL and executed the above SPARQL query. See the results.

If you’re wondering why I explicitly query and store the Wikipedia URL, while this could be guessed from the DBpedia URL, well… I don’t like manual URL fiddling, as URLs are supposed to be opaque identifiers. Furthermore, the above query can be reused to find people and their pages in different datasets as well.

Detecting gender in an article

Then I had to come up with a strategy to reliably find the gender of a person through his or her article. While I wrote the script in such a way that various detectors can be used simultaneously, the first detector I thought of seems quite reliable already. The idea is that texts contain signal words providing hints about whether their subject is male or female. As I mentioned in the introduction, many languages distinguish between male and female subjects, for instance, by using pronouns such as “he” instead of “she”. The assumption is that, if one category of words occurs more often than the others, the subject of the text is likely to belong to that category.

I empirically created a list of signal pronouns for Dutch:

  "male": {
    "hij": 1.0,
    "hem": 1.0,
    "zijn": 0.5
  },
  "female": {
    "zij": 0.7,
    "ze": 0.5,
    "haar": 0.8
  }

Note that I also assigned scores to words to account for polysemy. The Dutch words “hij” and “hem” unambiguously refer to he and him respectively, whereas the word “zijn” can mean either his or to be / are. Therefore, “zijn” is not as reliable as the other words. The female counterparts are more problematic, as they can all have multiple meanings: “zij” (she or they), “ze” (also she or they), and “haar” (her or hair). We therefore assign a lower score to words with multiple meanings, depending on the suspected likelihood of the alternatives.

For each occurrence of the above signal words in a text about a person, we add the given score to the category male or female and then compare their total scores. The option supported by the highest score is most likely to be correct (if the scores have been chosen well).
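To make the scoring concrete, here is a rough Python sketch of this step. It is an illustration rather than my actual script: the function names `score_text` and `most_likely` are made up for this example, and splitting the text into lowercase words with a simple regular expression is an assumption about tokenization.

```python
import re
from typing import Dict

# Signal words and scores as listed above.
SIGNAL_WORDS: Dict[str, Dict[str, float]] = {
    "male": {"hij": 1.0, "hem": 1.0, "zijn": 0.5},
    "female": {"zij": 0.7, "ze": 0.5, "haar": 0.8},
}


def score_text(text: str,
               signal_words: Dict[str, Dict[str, float]] = SIGNAL_WORDS
               ) -> Dict[str, float]:
    """Sum the score of every signal word occurrence, per category."""
    scores = {category: 0.0 for category in signal_words}
    # Assumed tokenization: lowercase the text and take runs of word characters.
    for token in re.findall(r"\w+", text.lower()):
        for category, words in signal_words.items():
            scores[category] += words.get(token, 0.0)
    return scores


def most_likely(scores: Dict[str, float]) -> str:
    """Return the category with the highest total score."""
    return max(scores, key=scores.get)


example = "Hij zei dat zijn boek af was. Ze gaf hem gelijk."
print(score_text(example))  # {'male': 2.5, 'female': 0.5}
```

In the example sentence, “hij”, “zijn” and “hem” add up to 2.5 for male, while the single “ze” contributes 0.5 for female.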

Evaluating the results

When running the script, the output revealed scores such as:

"Weird Al" Yankovic     { male:    37, female:  2.1 }
A.H. Nijhoff            { male:   3.5, female: 33.3 }
50 Cent                 { male: 107.5, female:  5.7 }
27e regering van Israël { male:   0.5, female:  0.0 }

We spot several interesting things above:

  • First of all, the general cases seem to work. Weird Al is a man, and A.H. Nijhoff (would you have known by the name?) a woman.
  • Funnily enough, the article for the stereotypical alpha male 50 Cent obtains a very high male-to-female ratio (but this probably has more to do with article length).
  • The score of the last item, the 27th government of Israel, is not significantly inclined towards male or female; in fact, the score of 0.5 tells us the only match was the word “zijn”, which might well have been a verb.

This last case is interesting, because it can help us detect mistakes in the current DBpedia dataset. Indeed, the 27th government of Israel is mistakenly classified as a foaf:Person in DBpedia. The implausibility of this is reflected by the absence of gender-specific pronouns in the corresponding article.

To avoid decisions based on limited evidence, I tweaked the script to only output a match if its absolute score is significant, and a certain factor higher than the second-best option’s score.
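Such a decision rule could be sketched in Python as follows. The function name `decide` and the threshold values `min_score` and `min_ratio` are hypothetical; they are picked purely for illustration and are not the values used in my script.

```python
from typing import Dict, Optional


def decide(scores: Dict[str, float],
           min_score: float = 5.0,     # hypothetical absolute threshold
           min_ratio: float = 3.0      # hypothetical factor over the runner-up
           ) -> Optional[str]:
    """Return the winning category, or None if the evidence is too weak."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    best, best_score = ranked[0]
    second_score = ranked[1][1] if len(ranked) > 1 else 0.0
    if best_score < min_score:
        return None  # too little evidence overall
    if best_score < min_ratio * second_score:
        return None  # the winner is not clearly ahead of the second-best option
    return best
```

With these illustrative thresholds, the scores shown earlier for "Weird Al" Yankovic and A.H. Nijhoff yield male and female respectively, whereas the 27th government of Israel yields no prediction at all.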

Using the gender data

The resulting triples contain predictions for 52,686 out of 80,499 topics marked as people. I have manually verified a sample of them and have not found any mistakes (yet), apart from cases where the topic is mistakenly classified as a person in DBpedia. A perhaps surprising discovery is that 44,614 (85%) of them seem to be men, whereas only 8,072 (15%) are women. This imbalance has long been known in the Wikipedia community, but as there was no fully automated way yet to count men and women, this is actual supporting evidence.

Most simply, the gender data can be used to pose new questions to DBpedia, since we can now ask for gender information. More interestingly, we can pose detailed queries to find out why the imbalance exists on Wikipedia. Would the balance improve if we consider more recently born people? We would expect a larger percentage of women in articles about people born in the 1900s than in the 1400s. My generated dataset allows us to back up such hypotheses with factual data.
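As an illustration of such an analysis, the following Python sketch computes the share of women per century of birth. It assumes that you have already combined the detected gender with a birth year for each person; the function name `share_of_women_by_century` and the four input records are invented for this example and are not results from the dataset.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def share_of_women_by_century(people: Iterable[Tuple[int, str]]
                              ) -> Dict[int, float]:
    """Map each century (e.g. 1900) to the fraction of women born in it."""
    totals: Counter = Counter()
    women: Counter = Counter()
    for birth_year, gender in people:
        century = birth_year // 100 * 100
        totals[century] += 1
        if gender == "female":
            women[century] += 1
    return {century: women[century] / totals[century]
            for century in sorted(totals)}


# Toy input, made up purely to show the shape of the data.
toy_people = [(1450, "male"), (1480, "male"), (1920, "male"), (1950, "female")]
print(share_of_women_by_century(toy_people))  # {1400: 0.0, 1900: 0.5}
```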

Now that this first leap has been taken, I will also apply this approach to other languages, such as English and French. If you know signal words in other languages and want to help, don’t hesitate to get in touch.

A final remark: the output uses the FOAF vocabulary, which contains a foaf:gender property that expects a string (e.g., "male" or "female") as object. This might seem strange: why not use a URI for something simple such as gender? Turns out gender is not a simple black-and-white matter for many people, so the FOAF authors deliberately leave options open for those who do not fully identify with either gender. To support this idea, I left options open in my script as well: the number of possible genders can be extended as anybody sees fit. Unfortunately, detecting non-binary genders is likely also much harder for machines than it is for humans.
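To show what such a triple looks like, here is a small Python sketch that serializes a prediction as one N-Triples line with a string-valued foaf:gender. The helper name `gender_triple` is hypothetical, and the person URI in the usage line is only an example of the kind of identifier the Dutch DBpedia uses.

```python
FOAF_GENDER = "http://xmlns.com/foaf/0.1/gender"


def gender_triple(person_uri: str, gender: str) -> str:
    """Serialize one gender prediction as an N-Triples line.

    The gender is written as a plain string literal, so any value
    (not only "male" or "female") can be used.
    """
    # Escape backslashes and double quotes inside the literal.
    escaped = gender.replace("\\", "\\\\").replace('"', '\\"')
    return f'<{person_uri}> <{FOAF_GENDER}> "{escaped}" .'


print(gender_triple("http://nl.dbpedia.org/resource/Frank_Sinatra", "male"))
```

Because the object is a free-form string, extending the script with additional genders requires no change to this output step.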
