The Middle Name Guesser
5am, 27th January 2009 - Geek, Interesting, Developer
I have recently made some improvements to the Middle Name Guesser (one of which was to make it actually work again) and I'd like to take this opportunity to invite you to have it guess your middle name... or your friend's middle name, or your favourite celebrity's middle name.
I have also added a couple of statistics graphs and you can clearly see exactly when I fixed that pesky little bug that only showed up when it actually guessed your middle name correctly. (It was a typo I introduced the last time I edited the file - a strong argument for automated testing if ever I heard one.) At that point it was getting about 1 in 20 guesses correct. Since then it has been steadily improving up to a peak of getting 1 in 4 guesses correct. 1 in 4 guesses correct is better than I ever hoped it would achieve. I was originally thinking that 1 in 10 would be a good result. Now I'm wondering if it will get to 1 in 2...
I expect to see the ratio of correct to incorrect guesses remain relatively unstable until the number of new, unique middle names, first names and last names (the red, blue and purple lines) starts flattening out. After that the ratio should only improve as the relationships between the known first and last names and middle names are strengthened.
Comments
I discovered a rather old <a href = "http://www.joelonsoftware.com/articles/fog0000000054.html">blog post from Joel Spolsky</a> today that talks about the chicken and egg problem as it applies to software development. Most software enables a meeting of two different groups of people. Importantly, your software is only interesting to one group if the other group is already using it.
The middle name guesser has a similar problem. It is only interesting if it can guess your middle name correctly. If it isn't interesting, it won't attract new users. It can only guess your middle name correctly if it has learned a lot of names and it learns names by attracting new users.
I attempted to solve this problem early on by trawling through Wikipedia looking for celebrities and inputting their first, middle and last names into the middle name guesser.
You can see this period in the yellow line in the graph on the left in which the X axis represents the number of guesses. After the general spikiness at the start when a single correct guess could improve the ratio by 30% there is a jagged line leading steeply upwards. In the graph on the right, in which the X axis represents time, this line is nearly vertical because I did nearly all of them on a single day. The ratio would even out at about 50% for this period because I was entering a new name and the guesser would get it wrong because it had never heard of the name before, then I would enter the same name again and inevitably get it right.
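The 50% figure and the early spikiness can both be checked with a little arithmetic. The following is only a toy sketch, not the guesser's real code: it assumes a guesser that simply remembers the last middle name it was told for a given first and last name, and the names in the usage list other than James Paul McCartney are made-up placeholders.

```python
def running_ratio(names):
    """Enter every full name twice into a toy guesser and return the hit ratio."""
    known = {}  # hypothetical memory: (first, last) -> middle name learned so far
    correct = 0
    total = 0
    for first, middle, last in names:
        for _ in range(2):  # same name entered twice in a row
            total += 1
            if known.get((first, last)) == middle:
                correct += 1
            else:
                known[(first, last)] = middle  # learn from the wrong guess
    return correct / total


def ratio_jump(correct, total):
    """How far the ratio moves when the next guess is correct."""
    before = correct / total if total else 0.0
    after = (correct + 1) / (total + 1)
    return after - before


# Placeholder data: only the first entry comes from the post itself.
celebrities = [
    ("James", "Paul", "McCartney"),
    ("Alice", "Beth", "Carter"),
    ("Dana", "Erin", "Foster"),
]

print(running_ratio(celebrities))  # first try wrong, second try right, every time
print(ratio_jump(0, 2))            # one hit after two misses early on
```

Each name is missed once and hit once, so the ratio settles at one half, and a single hit after two misses moves the ratio by a third, which is the kind of jump visible at the start of the graph.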
Interestingly, celebrities don't seem to have much in common with the people who actually use the middle name guesser so for a long time after this it was guessing names incorrectly, even though I had plenty of names in the database. The bug I mentioned in the blog post above was in place for about 100 guesses which meant that it could only learn about its mistakes during this time but not its successes. After I fixed it up, the yellow line starts steadily rising.
Recently it has leveled off at around 20% correct guesses but the number of first, last and middle names in the system is still steadily rising. I suspect this means that the system has reached its equilibrium point while it is still learning. Mostly, it is still only guessing people's middle names correctly on their second guess. I predict that there is a threshold somewhere that will cause a sharp rise in the accuracy statistic. I don't know how many names will need to be in the system to achieve this but currently there are about 350 each of first and last names and about 190 middle names. (The middle names look higher than the other names on the graph because I doubled that value before plotting it. Someday I will change that back to raw values and at the same time label the axes and the entire chart.)
So as for the chicken and egg problem, I tried to overcome it by finding one of them (a chicken or an egg) to get the guesser started but I didn't go far enough (only about 100 celebrity names) or maybe I found a lizard egg (celebrities are not representative of the rest of the population).
Since I haven't found a chicken yet and it may be a very long time before I reach the predicted threshold where the guesser is self-perpetuating, I have decided to do something to it to artificially improve its accuracy.
What I plan on doing is giving it more data to work with. The data will be completely optional for a user apart from the first and last names but it will improve the accuracy so it will be desirable to enter it. After the first and last names are entered, you will be able to enter your age, nationality, parents' names and maybe some other things. As always, this data is stored so that you can link any of it to a particular middle name but not to any of the other pieces of data. I like my privacy and I assume that other people do too.
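One way to picture that storage rule is a separate tally per kind of data, where every entry pairs a single value with a middle name and nothing else. This is a sketch of the idea only; the class name `AttributeStore`, its methods and the sample values are all hypothetical and not the site's actual schema.

```python
from collections import defaultdict


class AttributeStore:
    """Hypothetical store: one independent tally per attribute type."""

    def __init__(self):
        # attribute name -> (value, middle name) -> number of times seen
        self.counts = defaultdict(lambda: defaultdict(int))

    def record(self, middle, **attributes):
        """Store each attribute against the middle name, never against each other."""
        for attribute, value in attributes.items():
            self.counts[attribute][(value, middle)] += 1

    def middles_for(self, attribute, value):
        """Return the middle names seen with this value and how often."""
        return {
            m: n for (v, m), n in self.counts[attribute].items() if v == value
        }


store = AttributeStore()
store.record("Lee", age=30, nationality="Australian")  # made-up sample values
print(store.middles_for("age", 30))
```

No entry ever holds an age next to a nationality, so the tallies can say which middle names go with a given age without saying which age goes with which nationality.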
I think I would also like to generate some graphs using <a href = "http://www.graphviz.org/">Graphviz</a> of the names and possibly the other data. I'll display them on the site if anything interesting shows up in them.
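As a rough idea of what feeding Graphviz could look like, here is a small sketch that writes the names out in the DOT language, with an edge from each first and last name to the middle name it was seen with. The function name `names_to_dot` and the graph layout are assumptions for illustration; the real graphs may be built quite differently.

```python
def quote(name):
    """Wrap a name in double quotes, escaping any quotes inside it for DOT."""
    return '"' + name.replace('"', '\\"') + '"'


def names_to_dot(people):
    """Build an undirected DOT graph linking first and last names to middle names."""
    lines = ["graph names {"]
    for first, middle, last in people:
        lines.append(f"    {quote(first)} -- {quote(middle)};")
        lines.append(f"    {quote(last)} -- {quote(middle)};")
    lines.append("}")
    return "\n".join(lines)


print(names_to_dot([("James", "Paul", "McCartney")]))
```

Saving that output to a file such as names.dot and running `dot -Tpng names.dot -o names.png` should produce an image of the graph.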
Hi Dave,
It's been a while since I came here. Good improvement on the name guesser, although it didn't guess my middle name. Is that because I am Indian and your generator probably does not support non-American names?
As far as the self learning mechanism goes, that's a great concept.
Actually, the code has no knowledge of any names at all. It certainly has no knowledge of what makes a name "Indian" or "American". The only names it knows are the ones it has learned by people telling it what their names are after it guesses their names incorrectly. To get things started when the database was empty, I trawled Wikipedia for pages which listed first, last and middle names and I entered those in manually as if Sir James Paul McCartney had actually visited my site himself.
In theory, it should support Japanese and Arabic names with their non-English alphabets just as well as English names.
If you ask it to guess your name but it says it has never seen your middle name before, it will nearly always guess your name correctly the next time you ask it.
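To make that behaviour concrete, here is a minimal sketch of a guesser that starts out knowing nothing and only counts what it is told. It is an illustration under assumptions, not the site's actual code: the class `MiddleNameGuesser` and the simple scoring (add up how often a middle name has been seen with the first name and with the last name) are invented for this example.

```python
from collections import Counter, defaultdict


class MiddleNameGuesser:
    """Toy guesser with no built-in names; it only knows what it has been told."""

    def __init__(self):
        self.by_first = defaultdict(Counter)  # first name -> middle name counts
        self.by_last = defaultdict(Counter)   # last name -> middle name counts

    def guess(self, first, last):
        """Return the middle name seen most often with these names, or None."""
        scores = Counter()
        scores.update(self.by_first.get(first, {}))
        scores.update(self.by_last.get(last, {}))
        if not scores:
            return None  # never seen either name before
        return scores.most_common(1)[0][0]

    def learn(self, first, middle, last):
        """Strengthen the link between the middle name and both other names."""
        self.by_first[first][middle] += 1
        self.by_last[last][middle] += 1


guesser = MiddleNameGuesser()
print(guesser.guess("James", "McCartney"))   # nothing learned yet
guesser.learn("James", "Paul", "McCartney")  # the visitor supplies the answer
print(guesser.guess("James", "McCartney"))   # second attempt
```

Because the names are just strings used as dictionary keys, nothing here depends on which alphabet they are written in.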