So suppose I'm interested in studying (or perhaps more likely, assigning my students to study) popular androgynous names—i.e., names that belong to both a lot of males and a lot of females, for some values of "a lot". Suppose for instance I'm interested in questions like "have androgynous names gotten more popular over time?" or "what are the phonological characteristics that make a name likely to be a popular androgynous name?"
I could look at the variables of popularity and androgyny independently. For instance, I could look at the top n most popular names, and see how many of them seem to be above some threshold for androgyny: e.g., in 2013 data, Avery is the 33rd most popular name at about 11,000 individuals, and it's 82% female and 18% male; Jordan is 59th at 8,000 individuals, and 15% female and 85% male, and so on. Doing this with somewhat arbitrary cutoffs, I find 50 names with more than 1000 individuals of whom no more than 90% are of the same sex.
But I find that a little unsatisfying. First of all, it requires two arbitrary cutoffs. Second, there's no direct intercomparability between names. Should Jordan be regarded as a stronger or weaker contribution to the inventory of "popular androgynous" names than, say, Charlie (given name, not nickname), which is substantially more androgynous (46% female) but substantially less popular in total (2900 individuals)? If an androgynous name becomes more popular over time but less gender-balanced, does that mean the popularity of androgynous naming is increasing or decreasing?
To simplify questions like these, I'd like to have a composite index of some kind—to have a single quantity which measures the extent to which a name is both androgynous and popular.
Lieberson et al. (2000) quantify the prevalence of androgynous naming over the population as a whole by means of the following computation: for each girl, calculate the percentage of people sharing her name who are boys, and then average that figure over all the girls in the population. (Or vice versa.) This calculation for the population as a whole immediately suggests a combined androgyny-cum-popularity index for each individual name: instead of averaging over the population as a whole, you sum over the number of bearers of each name and get fm/(f+m), where f and m are the number of girls and boys with each name. (This probably wants to be scaled to normalize with respect to the number of individuals in the data, so that different years of birth can be directly compared.)
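Spelled out, the step from the population-wide average to the per-name quantity looks roughly like this. The subscript n (indexing names) and the symbol F (total number of girls in the data) are my notation, not Lieberson et al.'s:

```latex
\[
\underbrace{\frac{m}{f+m} + \cdots + \frac{m}{f+m}}_{f\ \text{girls with the name}}
  = \frac{fm}{f+m},
\qquad
\text{population average}
  = \frac{1}{F}\sum_{n}\frac{f_n m_n}{f_n+m_n},
\quad F = \sum_{n} f_n .
\]
```

So each name's fm/(f+m) is just its term in the population-wide sum, before dividing by F.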
This has the obviously desirable properties for a composite index: it's symmetrical with respect to m and f, and it increases as f+m increases (holding f/m constant) and as f/m approaches 1 (holding f+m constant). In the 2013 data, the most androgynous-popular names by this measure are Riley (f = 4900, m=2500), Avery (f=9100, m=2000), and Peyton (f=4500, m=1800), which all seem like reasonably good candidates for both popular and relatively gender-balanced.
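As a quick check of the ranking, here is a minimal sketch that computes fm/(f+m) from the rounded counts quoted above. The function name and the dictionary are hypothetical, and the real data would have exact counts rather than these rounded ones:

```python
def shared_name_index(f: int, m: int) -> float:
    """f*m/(f+m): each of the f girls contributes the boys' share m/(f+m)."""
    total = f + m
    return f * m / total if total else 0.0


# Rounded 2013 counts (girls, boys) as quoted in the post.
counts_2013 = {"Riley": (4900, 2500), "Avery": (9100, 2000), "Peyton": (4500, 1800)}

# Sort names from highest to lowest index.
ranking = sorted(
    counts_2013, key=lambda n: shared_name_index(*counts_2013[n]), reverse=True
)
for name in ranking:
    print(name, round(shared_name_index(*counts_2013[name]), 1))
```

With these rounded counts the order comes out Riley, Avery, Peyton, and swapping f and m leaves every value unchanged.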
Looking over the data as a whole, though, I feel like this formula gives popular a little too much weight relative to androgynous. In 1983 data, seemingly obviously masculine names like Michael, David, Matthew, and Christopher are in the top 25 for fm/(f+m) values—and not because they're much more androgynous than they look, but just because they're so popular as to compensate for the tininess of the fraction of girls with the same names, each well under 1%. I mean, I went to high school with a girl named Michael who was born in 1982, so I know some of these are real people, but I wouldn't be surprised if more of these numbers are due to typos and input errors than actual girls with these names. So this is just a gut feeling, but I'd be happier with a formula that's less susceptible to small fractions of large names.
So the formula I want is probably something like (f+m) exp(-(log(f/m))^2). The exponential just converts the ratio of f to m into a scale of gender-balancedness from 0 to 1 in a smooth way, and then we multiply that by raw popularity. This formula gives Avery a higher index than Riley (even though Avery, while more popular, is less gender-balanced), but at least it kicks Michael, Christopher, David, and Matthew out of the top 25 in 1983 (they're all still in the top 100, though).
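Here is a rough sketch of that index. One thing the formula leaves open is the base of the logarithm, so the sketch takes it as a parameter (the `base` argument and the function name are my assumptions). With the rounded 2013 counts quoted earlier, a base-10 log puts Avery above Riley, while the natural log puts Riley well ahead:

```python
import math


def gaussian_index(f: float, m: float, base: float = math.e) -> float:
    """(f+m) * exp(-(log(f/m))^2), with the log taken in the given base."""
    if f <= 0 or m <= 0:
        # A name with no bearers of one sex has no balance at all.
        return 0.0
    r = math.log(f / m, base)
    return (f + m) * math.exp(-r * r)


riley = (4900, 2500)  # rounded 2013 counts from the post
avery = (9100, 2000)

for label, base in (("natural log", math.e), ("log base 10", 10)):
    print(label, round(gaussian_index(*riley, base=base)),
          round(gaussian_index(*avery, base=base)))
```

Changing the base only rescales the exponent, so it amounts to choosing how sharply imbalance is penalized.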
So. Do you think I'm on the right track for a good composite index to use? Should I be using a composite index at all? Should I suck it up and admit the 266 girls named "John" in 1983 make it an androgynous name? Any other suggestions?
no subject
Date: 2014-10-01 06:16 am (UTC)
I've been working with something similar lately with my binomial expressions, which have (what I call) both an "overall frequency" (how often do you form a binomial with these two words in whatever order) and a "relative frequency" for each order (what proportion of the time a binomial appears in the given order). There's an interesting relationship between overall frequency and relative frequency (namely that more overall-frequent expressions tend to be more polarized), and these two measures also interact non-trivially in their effects on other behavioral measures.
I'd be glad to talk about this in more detail if you want!
no subject
Date: 2014-10-01 10:06 pm (UTC)
The reason I wanted a composite index, though, is kind of because I sort of see androgyny in popular names and in unpopular names as distinct and not-really-related phenomena. Someone naming their kid Riley or Casey is very likely doing so in the knowledge that they're using a name that's relatively frequently used for both boys and girls. One of the 10 parents naming their kid Zyree or Yasha probably has no idea that the other nine even exist, let alone that they're evenly divided between sons and daughters. So that's why I was thinking of "popular androgynous" names in particular as the phenomenon I wanted to study, and measuring the degree to which a name is part of that distinct category.
no subject
Date: 2014-10-01 10:18 pm (UTC)
no subject
Date: 2014-10-01 12:00 pm (UTC)
One of the things that has always fascinated me about this is the way the definition of which phonological characteristics make a name "androgynous" has definitely changed, even though a lot of the semantic-ish categories haven't.
no subject
Date: 2014-10-01 07:48 pm (UTC)
no subject
Date: 2014-10-01 10:25 pm (UTC)
I'd be really interested to see a long-term study of this -- it's clearly been a thing since sometime in the early twentieth century at least.
no subject
Date: 2014-10-03 07:46 pm (UTC)
no subject
Date: 2014-10-01 03:45 pm (UTC)
I think the problem kicks in when you multiply it by the popularity of the name. I'm not sure your search really considers two names equivalent if one is twice as popular but half as androgynous.
One really simple measure that has the desirable properties you mention but seems a bit wonky in other ways is: what names have the largest population of their minority gender? The least desirable property is that if 1000 female Rileys changed their name to Sue, Riley wouldn't lose any of this metric. That said: if 1M Sues changed to Rileys, we probably don't want Riley to go up.
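To make the wonkiness concrete, here is a tiny sketch of the minority-population measure, using the rounded 2013 Riley counts from the post (f = 4900, m = 2500). The function name is hypothetical:

```python
def minority_count(f: int, m: int) -> int:
    """Size of the less common sex among bearers of a name."""
    return min(f, m)


riley_f, riley_m = 4900, 2500  # rounded 2013 counts quoted in the post

print(minority_count(riley_f, riley_m))              # baseline
print(minority_count(riley_f - 1000, riley_m))       # 1000 female Rileys leave
print(minority_count(riley_f + 1_000_000, riley_m))  # 1M Sues become Rileys
```

All three calls return the same value, because only the male count (the minority here) matters until the female count drops below it.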
Other vaguer thoughts: can you calibrate the number of typos in the database using a name that really truly is only given to one gender (e.g., by targeting a demographic unlikely to get inventive about names)? If so, that number can probably adjust the fm/(f+m) metric in a way that kicks Michael out of the top 25.
no subject
Date: 2014-10-01 09:12 pm (UTC)
As for calibrating the number of typos... hmm. I don't know how to find a name where all of the cross-gender manifestations are typos. From the opposite perspective, though, the most popular names that have no cross-gender manifestation are... well, that's interesting. In 2013, these are Leah, Ian, and Savannah, whose frequency is about 5000. In 1983, though, you don't get a no-typo name till Kristy and Alisha, whose frequencies are around 2000; names in the 5000s have a "typo rate" of about 20–30 mostly. That's higher than the typo rate for the very most common names in 2013 (there are 18–25 male Sophias, Emmas, and Olivias). So it seems likely that the rate of data entry error declines over time.
If uncertainty times raw frequency seems to overemphasize frequency (because the frequency differences among the most common names are relatively large—Zipf's law–type thing, I guess), I could just do something ad-hoc like uncertainty times log of frequency.... that knocks Jamie pretty far down the ranking to line up with misspelled Dominque. Hmm. I'll think about it.
no subject
Date: 2014-10-02 02:16 am (UTC)
(f+m) exp(-(log(f/m))^2) seems pretty good: I definitely think you want something of the form (f+m) g(f/m), with g(f/m) = g(m/f) and g decaying rapidly as f/m goes to 0 or infinity. Your choice of g(x) = exp(-(log x)^2) seems more natural than x/(1+x)^2, which was your first suggestion: Gaussians are common everywhere, and the rapid decay as x goes away from 1 seems good to me. Note, though, that (f+m) exp(-a (log(f/m))^2) is just as well motivated for any other positive a.
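A small sketch of that family, to check two of the claims numerically: that g(x) = x/(1+x)^2 reproduces fm/(f+m), and that both choices of g satisfy g(x) = g(1/x). The function names are mine, the natural log is assumed, and f and m must both be positive:

```python
import math


def composite(f: float, m: float, g) -> float:
    """Generic index (f+m) * g(f/m); requires f > 0 and m > 0."""
    return (f + m) * g(f / m)


def g_ratio(x: float) -> float:
    """First suggestion: (f+m) * g_ratio(f/m) equals f*m/(f+m)."""
    return x / (1 + x) ** 2


def g_gauss(x: float, a: float = 1.0) -> float:
    """Gaussian in log(x); larger a penalizes imbalance more sharply."""
    return math.exp(-a * math.log(x) ** 2)


f, m = 4900, 2500  # rounded 2013 Riley counts from the post
print(composite(f, m, g_ratio), f * m / (f + m))
print(composite(f, m, g_gauss), composite(f, m, lambda x: g_gauss(x, a=0.25)))
```

The first line prints the same number twice (up to rounding); the second shows how a smaller a softens the penalty for the same name.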
Here is what I would do instead. Go through the top 1000 (or whatever you have data for) names and find the records for androgyny. (fake data follows)
Name     # of occurrences (f+m)   androgyny ratio (min(f/m, m/f))
Noah     1000000                  0.01
Sue      900000                   0.05
...
Peyton   7300                     0.4
...
Yasha    20                       1
So, as you go down the table, the popularity declines and the androgyny ratio increases. Any name N which is both less popular and less androgynous than some other name N' gets discarded.
You have now found the popularity/androgyny frontier. Make a scatter plot and try to find a curve
F(popularity, androgyny) = constant
that roughly fits the curve, for F(x,y) some fairly simple function. Then F is your composite index.
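The discarding step could be sketched like this. It is only an illustration on the fake table rows, plus one made-up row ("Example") that I added so there is something to discard. It uses a simple all-pairs comparison, which is fine for a thousand names:

```python
# (name, popularity f+m, androgyny ratio min(f/m, m/f)).
# The first four rows are the fake data from the table; "Example" is a
# hypothetical row added here so that one name is dominated.
records = [
    ("Noah", 1_000_000, 0.01),
    ("Sue", 900_000, 0.05),
    ("Peyton", 7_300, 0.4),
    ("Example", 5_000, 0.2),
    ("Yasha", 20, 1.0),
]


def frontier(rows):
    """Keep rows that no other row beats on both popularity and androgyny."""
    return [
        r for r in rows
        if not any(o[1] > r[1] and o[2] > r[2] for o in rows)
    ]


kept = frontier(records)
for name, popularity, androgyny in kept:
    print(name, popularity, androgyny)
```

Here "Example" drops out because Peyton is both more popular and more androgynous; the surviving rows are the points to scatter-plot and fit.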
David Speyer
no subject
Date: 2014-10-02 05:23 am (UTC)
no subject
Date: 2014-10-02 05:18 pm (UTC)
My proposal before seeing data was that I wanted to say these should be all roughly equally popular-androgynous, at least after discarding the ends of the list. I think I feel pretty good about that claim for Logan-Milan.
Charlie is a weird case though. Presumably, what we are seeing is that parents of boys usually name their son Charles and call him Charlie, while parents of girls actually give the name Charlie. In general, nicknames are a tricky issue for you, I'd think (Pat=Patrick=Patricia, Sandy=Alexander=Alexandra, etc.)
I might play with this data while procrastinating.
David
no subject
Date: 2014-10-02 05:59 pm (UTC)
David
no subject
Date: 2014-10-02 06:46 pm (UTC)
no subject
Date: 2014-10-02 08:58 pm (UTC)
How much do you actually want to think about this? It would be fun to write a note that said "Riley is the most androgynous name", and would probably get linked a bit, but I don't know how worthwhile it actually is.
One of my bad habits is that when I have two difficult courses to prepare, an NSF grant application due (turned in today, yay!) and not enough time for either because of the High Holidays -- I start looking for a fun new project to think about. Then I don't actually finish the fun new project because I have so many old projects.
David
no subject
Date: 2014-10-02 09:28 pm (UTC)
no subject
Date: 2014-10-02 09:54 pm (UTC)
no subject
Date: 2014-10-02 05:18 pm (UTC)
no subject
Date: 2014-10-02 05:37 pm (UTC)
http://www.math.lsa.umich.edu/~speyer/NamePlot.png
The line is
androgyny = 0.5 - 0.0000275 frequency
I put the 0.5 into my model by hand but, if I take a best fit line without putting it in, I still get 0.504. Inverting the model*,
frequency = 18000 (1-2*androgyny)
In other words, a completely gendered name, which was maximally popular in all other ways, would get about 18000 recipients. Each 0.1 increase in androgyny loses you about 3600 recipients. (Again, restricting to the names which are most popular at that androgyny level.)
Given this, I like
frequency + 36000 androgyny = (f+m) + 36000 min(f/m,m/f)
as a measure -- I think of it as "how popular the name would be, if there were no androgyny penalty".
Disclaimer: Pure mathematician here, no formal training in statistics.
* These equations are not mathematically inverse, because I am trying to pay token respect to significant digits.
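For what it's worth, here is a minimal sketch of that "popularity without the androgyny penalty" measure, taking the formula exactly as written above. The 36000 coefficient comes from the fitted line and is specific to that data; the function names are hypothetical, and the counts are the rounded 2013 figures from the post:

```python
def androgyny_ratio(f: float, m: float) -> float:
    """min(f/m, m/f); requires f > 0 and m > 0."""
    return min(f / m, m / f)


def adjusted_popularity(f: float, m: float, penalty: float = 36000.0) -> float:
    """(f+m) + penalty * min(f/m, m/f), per the formula in the comment."""
    return (f + m) + penalty * androgyny_ratio(f, m)


# Rounded 2013 counts (girls, boys) quoted in the post.
for name, (f, m) in {"Riley": (4900, 2500), "Avery": (9100, 2000)}.items():
    print(name, round(adjusted_popularity(f, m)))
```

Because the penalty term is additive rather than multiplicative, a very popular but strongly gendered name still gets a score near its raw frequency, so this ranks names quite differently from the (f+m) g(f/m) family.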
no subject
Date: 2014-10-03 07:50 pm (UTC)
no subject
Date: 2014-10-03 09:11 pm (UTC)
no subject
Date: 2014-10-05 10:38 am (UTC)