www.postcogito.org
          ...sine propero notiones


Male or Female?

Type a typical Brazilian given name and this worksheet will tell you whether it is a male or a female name, getting it right more than 99% of the time.

Screenshot
Click on the image to download the worksheet

About the algorithm

For each of the 26 possible last letters of the name there is a default result and an exception table.

The word endings are stored reversed and sorted, so that Excel's PROCH function (HLOOKUP in English-language Excel) works as a prefix-based partial match. Some entries need an exact match instead; that is what the vertical bars are for.

If the name is in the exception table, the result is 1 minus the default; otherwise, it is the default. Zero means female, one means male.

Examples:

  • "Kiko" ends in "o" and the default for this letter is 1 (male). Taking out the final "o" and reversing, we get "kik". Neither this string nor any prefix of it is in the exception table. Therefore, "Kiko" is a male name.
  • "Babi" ends in "i" and its default is 1 (male). Taking out the final letter and reversing, we get "bab". The prefix "ba" in the exception table matches this string, so the default is flipped and this is a female name. (If you try this in the worksheet, you'll see the "ba" entry turn red to indicate a match.)
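
The two examples above can be sketched in Python as follows. This is only an illustrative port, not the worksheet's formulas: the names `DEFAULTS`, `EXCEPTIONS`, `matches` and `guess_gender` are invented here, the tables hold only the entries mentioned in this page plus one hypothetical entry ("cul|"), and the placement of the vertical bar at the end of an entry is an assumption.

```python
import string

# 1 = male, 0 = female. Per the text, every letter defaults to male
# except "a" and "e".
DEFAULTS = {letter: 1 for letter in string.ascii_lowercase}
DEFAULTS["a"] = 0
DEFAULTS["e"] = 0

# Exception entries are the reversed remainder of the name (final letter removed).
# ASSUMPTION: a trailing "|" marks an entry that must match exactly.
EXCEPTIONS = {
    "i": ["ba"],    # from the "Babi" example in the text
    "a": ["cul|"],  # HYPOTHETICAL entry, only to show an exact match ("Luca")
}


def matches(entry: str, reversed_rest: str) -> bool:
    """Return True if an exception entry applies to the reversed name remainder."""
    if entry.endswith("|"):
        return reversed_rest == entry[:-1]      # exact match only
    return reversed_rest.startswith(entry)      # prefix-based partial match


def guess_gender(name: str) -> int:
    """Return 1 (male) or 0 (female) for a plain a-z given name."""
    name = name.strip().lower()
    if not name or name[-1] not in DEFAULTS:
        raise ValueError("expected a name ending in a letter from a to z")
    last = name[-1]
    reversed_rest = name[:-1][::-1]             # "kiko" -> "kik", "babi" -> "bab"
    default = DEFAULTS[last]
    for entry in EXCEPTIONS.get(last, []):
        if matches(entry, reversed_rest):
            return 1 - default                  # an exception flips the default
    return default
```

With these toy tables, `guess_gender("Kiko")` falls through to the default for "o", while `guess_gender("Babi")` hits the "ba" entry and flips the default for "i".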

The implementation could be a bit simpler if it weren't for what appears to be an Excel bug: the search range used by PROCV (VLOOKUP in English-language Excel) seems to be limited to about 160 cells. The exception table for the suffix "e" has more items than that, so we had to split its search in two.

Excel has some built-in features that make our job easier, such as case-insensitive, locale-aware comparisons. They spare us the work of writing a routine to convert uppercase to lowercase and to replace accented characters with their non-accented counterparts.
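
Outside Excel, that normalization has to be done by hand. A minimal Python sketch is shown below; the function name `normalize_name` is made up for this example, and it only approximates Excel's comparison rules rather than reproducing them exactly.

```python
import unicodedata


def normalize_name(name: str) -> str:
    """Lowercase a name and drop accents, e.g. "João" -> "joao"."""
    # NFD splits each accented character into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", name.strip())
    # Keep only the base letters; combining marks (accents, cedilla) are dropped.
    base_letters = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return base_letters.lower()
```

A name would be passed through this function before its last letter is looked up, so that "JOSÉ" and "jose" end up in the same table.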

The heart of the algorithm is the exception table. It was generated by running a recursive suffix-partitioning algorithm on a list of names and genders taken from race results (like the traditional São Silvestre International Race and the ones documented at Corpore's web site). The algorithm arrived at the minimum number of suffixes needed to implement a decision tree that gives exactly the same results as querying the original list, but in a much more compact form. The use of suffixes also adds the ability to 'extrapolate' the result to names that weren't in the original list.
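
The original generator is not shown on this page, so what follows is only one plausible reconstruction of such a recursive suffix partitioning, under stated assumptions: each name appears once with a single gender, names are already normalized, and a trailing "|" marks an exact-match entry. The names `build_tables`, `_partition` and `classify` are invented for this sketch.

```python
from collections import Counter, defaultdict


def build_tables(names: dict[str, int]):
    """names maps a normalized name to 0 (female) or 1 (male)."""
    by_last = defaultdict(dict)
    for name, gender in names.items():
        by_last[name[-1]][name[:-1][::-1]] = gender  # key: reversed remainder

    defaults, exceptions = {}, {}
    for letter, group in by_last.items():
        counts = Counter(group.values())
        default = 1 if counts[1] >= counts[0] else 0  # majority gender
        defaults[letter] = default
        found = []
        _partition("", group, default, found)
        exceptions[letter] = sorted(found)
    return defaults, exceptions


def _partition(prefix, group, default, out):
    """Emit the shortest reversed suffixes that cover all non-default names."""
    genders = set(group.values())
    if genders == {default}:
        return                      # nothing to flip in this branch
    if genders == {1 - default}:
        out.append(prefix)          # one entry covers the whole branch
        return
    buckets = defaultdict(dict)     # mixed branch: split on the next letter
    for key, gender in group.items():
        if key == prefix:
            if gender != default:
                out.append(prefix + "|")  # exact match only
        else:
            buckets[key[len(prefix)]][key] = gender
    for ch, sub in buckets.items():
        _partition(prefix + ch, sub, default, out)


def classify(name, defaults, exceptions):
    """Look a name up in the generated tables."""
    key, default = name[:-1][::-1], defaults[name[-1]]
    for entry in exceptions[name[-1]]:
        exact = entry.endswith("|")
        if (key == entry[:-1]) if exact else key.startswith(entry):
            return 1 - default
    return default
```

By construction, `classify` returns the recorded gender for every name in the input list, and a branch is only split further when it still contains both genders, which is what keeps the table small.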

Curiosities

The table shows that names ending in "e" are the hardest to decide: they are usually female, but there are no fewer than 168 exceptions that make them male. The next hardest suffixes are "i", "a", "y", "s", "n" and "r", with 96, 72, 54, 51, 33 and 33 exceptions, respectively.

At the other extreme, the table says that the suffixes "f", "j", "q", "v", "w" and "x" are always male. This reflects the fact that the lists we used didn't have any female names ending in those letters, but we can't really say there aren't any.

The default for most suffixes is male; the only exceptions are "e" and "a". This agrees with the well-known rule of thumb that proper names ending in "a" are almost always female. Fanatic feminists may take that as corroboration that the bias towards male domination extends even to proper names.

Implementations in Other Languages

These implementations come with the test data I used to estimate the algorithm's accuracy of 99.66%.

It should be possible to get even more precision by taking the other names (second name and surnames) into account and maybe combining them via Bayes' theorem, much like spam filters do. The hard part is getting a sufficiently large, reliable database: the ones I had access to had about 1% of enrollment errors, requiring tedious manual review.
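
As a rough illustration of that idea (it is not part of the worksheet or of the implementations above), the sketch below combines per-name estimates of P(male) the way simple Bayesian spam filters combine per-word probabilities, assuming the estimates are independent and that both genders are equally likely beforehand. The function name `combine` and the clamping constant are assumptions made for this example.

```python
from math import prod


def combine(probabilities, eps=1e-6):
    """Combine independent P(male) estimates, one per name part, into one value."""
    # Clamp to avoid a division by zero when estimates of exactly 0 and 1 meet.
    clamped = [min(max(p, eps), 1.0 - eps) for p in probabilities]
    male = prod(clamped)                        # evidence that the person is male
    female = prod(1.0 - p for p in clamped)     # evidence that the person is female
    return male / (male + female)
```

Two name parts that each lean male reinforce one another, while a 0.5 estimate (no information) leaves the result unchanged.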

Contribute!

If you have access to a reliable, well-maintained database of more than 500,000 names with gender information (male/female) and you don't mind sharing it, please send it to me in CSV format. I can use this data to refine the tables or try other variations of the algorithm. I will not publish the database unless you explicitly tell me it is OK to do so; I will also give you proper credit.

Databases of names in other languages and from other countries are also welcome. An interesting open question is whether a single, unified routine that is both compact and accurate could handle most languages written in the Latin alphabet.

Creative Commons License   The content of this site is made available under the terms of a Creative Commons License, except where otherwise noted.