# Estimate Age from First Name

Today I read a cute post from Flowing Data on the most trendy names in US history. What caught my attention was a link in the article to the source data: yearly lists of baby names registered with the US Social Security Administration since 1880. I thought it might be good to compile and use these lists at work, for two reasons:

(1) I don’t have experience handling file input programmatically in R (i.e., working with a bunch of files in a directory instead of manually loading one or two), and
(2) it could be useful to have age estimates in the donor files that I work with (using the year when each first name was most popular).

I’ve included the R code in this post at the bottom, after the following explanatory text.

I managed to build a dataframe in which each row contains a name, the number of people registered as born with that name in a given year, the year, the total number of registered births for that year, and the proportion of that year’s births that went to that name.
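To make that structure concrete, here is a minimal sketch that uses two tiny made-up tables in place of the real yearly files. The column names (`Name`, `Sex`, `Pop`, `Year`, `YearPop`, `Rel_Pop`), the helper `add_year_columns`, and all of the counts are assumptions for illustration only, not the actual data or the code at the bottom of this post.

```r
# Toy stand-ins for two yearly files: one row per name/sex with a count.
# All numbers are made up for illustration.
yearly <- list(
  "1990" = data.frame(Name = c("Emma", "Liam"), Sex = c("F", "M"),
                      Pop = c(300, 100), stringsAsFactors = FALSE),
  "2010" = data.frame(Name = c("Emma", "Liam"), Sex = c("F", "M"),
                      Pop = c(600, 400), stringsAsFactors = FALSE)
)

# Hypothetical helper: add the year, the yearly total, and each
# name's share of that total to one year's table.
add_year_columns <- function(df, year) {
  df$Year    <- as.integer(year)
  df$YearPop <- sum(df$Pop)
  df$Rel_Pop <- df$Pop / df$YearPop
  df
}

# Apply the helper to every year, then stack the results into one dataframe.
name_data <- do.call(rbind, Map(add_year_columns, yearly, names(yearly)))
rownames(name_data) <- NULL
print(name_data)
```

Each row ends up carrying its own year and yearly total, so the relative proportion can be compared across years even though the number of registered births changes over time.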

Once I had that dataframe, I built a function that queries it for the year when a given name was most popular, an estimated age based on that year, and the relative proportion of people born with that name in that year.
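The query boils down to "find the row with the largest proportion for this name". A rough sketch of that idea follows; the function name `peak_year_for`, its `current_year` argument, and the toy numbers are hypothetical and only meant to show the shape of the lookup.

```r
# Toy data: proportion of each year's births going to each name (made up).
name_data <- data.frame(
  Name    = c("Emma", "Emma", "Liam", "Liam"),
  Year    = c(1990L, 2010L, 1990L, 2010L),
  Rel_Pop = c(0.75, 0.60, 0.25, 0.40),
  stringsAsFactors = FALSE
)

# Hypothetical helper: peak year, implied age, and share for one name.
peak_year_for <- function(name_data, input_name, current_year) {
  rows <- name_data[name_data$Name == input_name, ]
  if (nrow(rows) == 0) return(NULL)          # name not present in the data
  best <- rows[which.max(rows$Rel_Pop), ]    # row with the highest share
  list(year    = best$Year,
       age     = current_year - best$Year,
       rel_pop = best$Rel_Pop)
}

emma <- peak_year_for(name_data, "Emma", current_year = 2020)
```

Note that this scans the whole dataframe once per name, which is part of why calling it repeatedly gets slow on the full data.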

I don’t have any testing data to check the results against, but I did do an informal check around the office, and it seems okay!

However, I’d like to scale this up so that age estimates can be calculated for every element of a vector of first names. As the code stands below, the function I made is too slow to be applied to a long list of names.

I’m wondering what the best approach is. Some ideas I have so far:

• Create a smaller dataframe where each row contains a unique name, the year when it was most popular, and the relative popularity in that year. Make a function to query this new dataframe.
• Possibly convert the above dataframe into a data table and then build a function to query the data table instead.
• If neither of the above two ideas works well enough, load the popularity data into Python and write a function to query it there instead.
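The first idea can be sketched in base R as follows. This is only an illustration under assumptions: the column names, the function name `estimate_ages`, and the toy numbers are made up, and I haven't timed it against the full data.

```r
# Toy data: proportion of each year's births going to each name (made up).
name_data <- data.frame(
  Name    = c("Emma", "Emma", "Liam", "Liam"),
  Year    = c(1990L, 2010L, 1990L, 2010L),
  Rel_Pop = c(0.75, 0.60, 0.25, 0.40),
  stringsAsFactors = FALSE
)

# Build the small table once: sort so each name's most popular year
# comes first, then keep only the first row for each name.
sorted_data <- name_data[order(name_data$Name, -name_data$Rel_Pop), ]
peaks <- sorted_data[!duplicated(sorted_data$Name), ]

# Vectorized lookup: match() finds every input name's row in `peaks`
# in a single call; names missing from the table come back as NA.
estimate_ages <- function(first_names, peaks, current_year) {
  idx <- match(first_names, peaks$Name)
  data.frame(Name     = first_names,
             PeakYear = peaks$Year[idx],
             Age      = current_year - peaks$Year[idx],
             Rel_Pop  = peaks$Rel_Pop[idx],
             stringsAsFactors = FALSE)
}

ages <- estimate_ages(c("Liam", "Emma", "Zzyzx", "Emma"), peaks,
                      current_year = 2020)
print(ages)
```

The point of the design is that the expensive step (finding each name's peak year) happens once up front, and the per-name work shrinks to a single `match()` over the whole input vector instead of one subset of the full dataframe per name.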

Does anyone have any better ideas for me?

I’ll also accept any suggestions for cleaning up my code as is.

## 5 thoughts on “Estimate Age from First Name”

1. Thanks for the interesting post. I modified your import/data.frame creation code so that it runs a bit faster. I have a situation at work that is very similar to this one: lots of files of the same format need to be read in and processed. Your post got me thinking about the problem in a slightly different way.
Anyway, here’s my modified code (I hope the formatting doesn’t get too messed up). The key difference is using assign and get to avoid multiple calls to rbind.
####################################################################
library(stringr)
library(plyr)

importData = function()
{
  # file_listing = list.files()[3:135]
  # Don't rely on exactly the same directory contents every time. Use the
  # file names to select which files to process
  file_listing = dir(pattern = "yob.*")

  # Create a vector to hold object names
  objName = character(length(file_listing))
  i = 1
  for (f in file_listing) {
    year = as.integer(str_extract(f, "[0-9]{4}"))
    objName[i] = paste0("nameData", year)
    # Read the current file (assumes comma-separated values with no header row)
    tempData = read.csv(f, header = FALSE, stringsAsFactors = FALSE)
    colnames(tempData) = c("Name", "Sex", "Pop")
    tempData$Year = year
    tempData$YearPop = sum(tempData$Pop)
    tempData$Rel_Pop = tempData$Pop / tempData$YearPop
    # assign tempData to current objName
    assign(objName[i], tempData)
    i = i + 1
  }
  message("Forming list")
  # 'get' all of the objName objects and stuff them into one list
  # get is the inverse of assign. mget allows us to pass in a vector
  # of object names, looked up here in this function's own environment.
  temp = mget(objName, envir = environment())
  message("Forming data.frame")
  # Make that list a data.frame
  name_data = ldply(temp)

  # return data
  return(name_data)
}

estimate_age = function(name_data, input_name, sex = NA) {
  # 1921 is a year I chose arbitrarily. Change how you like.
  if (is.na(sex)) {
    name_subset = subset(name_data, Name == input_name & Year >= 1921)
  } else {
    name_subset = subset(name_data, Name == input_name & Year >= 1921 & Sex == sex)
  }
  # Keep the row where the name's share of that year's births is largest
  year_and_rel_pop = name_subset[which.max(name_subset$Rel_Pop), c("Year", "Rel_Pop")]
  current_year = as.numeric(substr(Sys.time(), 1, 4))
  estimated_age = current_year - as.numeric(year_and_rel_pop$Year)
  return(list(year_of_birth = as.numeric(year_and_rel_pop$Year),
              age = estimated_age,
              relative_pop = sprintf("%1.2f%%", year_and_rel_pop$Rel_Pop * 100)))
}
####################################################################

2. What a great concept! If you do collect some name and age data I would love to see the results of your model fit.