| Improving person disambiguation and "smarter" searching |
| Written by Ted Turocy |
| Friday, 22 May 2009 11:53 |
|
One of the challenges we face in the Encyclopedia project is that we have a universe of what will be over 200,000 "notable" people. That universe grows every day as players make their professional debuts, and our knowledge about past players, managers, executives, umpires, and so forth continues to evolve. We chose a wiki-based Encyclopedia in part because it gives us the flexibility to deal with these ever-changing data, and because it allows for collaboration in improving the data. Still, it can be hard to keep this all organized, and to find what you want.

The disambiguation pages listing people with the same or similar names are a good example. When we created the initial set of person pages, we also created simple disambiguation pages, with names and active dates. However, these are entirely static, in that they do not reflect the content on the pages they link to. For example, if I update a player's page with his 2009 teams, the corresponding disambiguation page is not updated to indicate he was active through 2009. If I add another person with the same name, I must also manually edit the disambiguation page to add him. This is all well and good as we start on the project, but it's clear that we need a more robust solution that will serve us well over many years.

On selected disambiguation pages, I am rolling out an initial cut at what I believe may be part of the solution. Have a look at the disambiguation page [[John Smith]] in the Encyclopedia. At the top is the original, manually organized list of all the John Smiths. Below that is an experimental query on the wiki which generates the same list.

As most readers know, we are making heavy use of Semantic MediaWiki (http://www.semantic-mediawiki.org) in the development of the project. This extension allows us to associate properties with each page, and to do queries on those properties.
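To make the idea concrete, here is a rough Python sketch of what "generate the list from the person pages" means. It is an illustration only, not the wiki query used on the site: the `Person` record, the `disambiguation_rows` function, and the page titles and team names are all hypothetical stand-ins for properties stored on individual pages.

```python
from dataclasses import dataclass, field


@dataclass
class Person:
    page: str                                    # hypothetical wiki page title
    name: str                                    # display name used for matching
    seasons: dict = field(default_factory=dict)  # year -> list of team names


def disambiguation_rows(people, name):
    """Build one (page, first_year, last_year, teams) row per matching person."""
    rows = []
    for person in people:
        if person.name != name:
            continue
        years = sorted(person.seasons)
        teams = sorted({team for clubs in person.seasons.values() for team in clubs})
        first = years[0] if years else None
        last = years[-1] if years else None
        rows.append((person.page, first, last, teams))
    # Sort by debut year; people with no recorded seasons go last.
    rows.sort(key=lambda row: (row[1] is None, row[1] or 0, row[0]))
    return rows


# Made-up sample records, standing in for individual person pages.
people = [
    Person("John Smith (pitcher)", "John Smith", {2008: ["Team A"]}),
    Person("John Smith (catcher)", "John Smith", {1901: ["Team B"], 1902: ["Team C"]}),
    Person("Joe Shlabotnik", "Joe Shlabotnik", {1960: ["Team D"]}),
]

# Editing a person's record is enough; the list is rebuilt from the records.
people[0].seasons[2009] = ["Team A"]
rows = disambiguation_rows(people, "John Smith")
```

Because the rows are derived each time from the person records, adding a 2009 season to one record changes that person's "active through" year in the list without anyone touching the list itself.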
You've already seen Semantic MediaWiki queries in action on team pages, where the rosters, managers, and ballparks are automatically generated using such queries. The experimental disambiguation page is similar. I think the potential power of this approach is clear. The list of teams each person played for is generated from the individual person pages, so there is no issue with keeping multiple pages synchronized. If we add a new person with the same name, the disambiguation page will automatically be updated. (Note: For performance, queries like this are cached and only refreshed periodically. You may need to click the "refresh" tab at the top of the screen to see very recent edits reflected on these pages.)

There is still much to do on these disambiguation pages. The query does not currently list non-playing engagements, so managing, umpiring, etc. do not show up. It would also be helpful to be able to list dates of birth and other biographical information. It will take some trial and error to come up with a visually appealing solution that communicates the information a user is looking for, to help find the right guy out of a list of 40+ people with the same name. The good news is that there are no technical barriers to doing so; with time and experience, these queries will improve.

In parallel to this, Peter has been looking into improving the search feature. The default MediaWiki search engine we are using is not very reliable; it often fails to report "near misses." If you search on "Joseph Shlabotnik" and it turns out we have him listed as "Joe Shlabotnik," you may get zero hits. There are better search engines out there, and we'll be deploying one sooner rather than later. In addition, we are looking into ways to exploit the semantic contents of our pages to help make search smarter, so that using Joe vs. Joseph, Mike vs. Michael, and so on won't be an obstacle to helping you find the person you're looking for.
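As a rough illustration of the Joe vs. Joseph problem, the Python sketch below normalizes given names through a small lookup table before matching. This is just one simple way to treat name variants as equal, not a description of the search engine we will deploy; the `NICKNAMES` table, the function names, and the sample page titles are all made up for the example.

```python
# Hypothetical, deliberately tiny table mapping nicknames to a canonical form.
NICKNAMES = {"joe": "joseph", "mike": "michael"}


def name_key(full_name):
    """Normalize a name so common given-name variants compare equal."""
    parts = full_name.lower().split()
    if not parts:
        return ""
    # Only the first word (the given name) is mapped to its canonical form.
    parts[0] = NICKNAMES.get(parts[0], parts[0])
    return " ".join(parts)


def build_index(page_titles):
    """Map each normalized name to the page titles that share it."""
    index = {}
    for title in page_titles:
        index.setdefault(name_key(title), []).append(title)
    return index


def search(index, query):
    """Return page titles whose normalized name matches the query's."""
    return index.get(name_key(query), [])


# Made-up page titles for the example.
index = build_index(["Joe Shlabotnik", "Michael Example"])
hits = search(index, "Joseph Shlabotnik")
```

Here a search for "Joseph Shlabotnik" finds the page titled "Joe Shlabotnik" because both reduce to the same key. A real solution would need a much larger variant table and would have to handle misspellings as well, which a plain lookup like this does not.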
|
| Last Updated on Wednesday, 16 December 2009 12:03 |