Improving person disambiguation and "smarter" searching
Written by Ted Turocy
Friday, 22 May 2009 11:53
One of the challenges we face in the Encyclopedia project is that we have a universe of what will be over 200,000 "notable" people. That universe grows every day as players make their professional debuts, and our knowledge about past players, managers, executives, umpires, and so forth continues to evolve. We chose a wiki-based Encyclopedia in part because it gives us the flexibility to deal with these ever-changing data, and allows for collaboration in improving the data.
Still, it can be hard to keep this all organized, and to find what you want. The disambiguation pages listing people with the same or similar names are a good example. When we created the initial set of person pages, we created simple disambiguation pages, with names and active dates. However, these are entirely static, in that they do not reflect the content on the pages they link to. For example, if I update a player's page with his 2009 teams, it does not update a corresponding disambiguation page to indicate he was active through 2009. If I add another person with the same name, I must also manually edit the disambiguation page to add him.
This is all well and good as we start on the project, but it's clear that we need a more robust solution that will serve us well over many years. On selected disambiguation pages, I am rolling out an initial cut at what I believe may be part of the solution. Have a look at the disambiguation page [[John Smith]] in the Encyclopedia. At the top is the original, manually organized list of all the John Smiths. Below that is an experimental query on the wiki which generates the same list. As most readers of this know, we are making heavy use of Semantic MediaWiki (http://www.semantic-mediawiki.org) in the development of the project. This extension allows us to associate properties with each page, and to run queries on those properties. You've seen this in action already on team pages, where the rosters, managers, and ballparks are automatically generated using such queries. The experimental disambiguation page works the same way.
I think the potential power of this approach is clear. The list of teams each person played for is generated from the individual person pages, so there is no issue with keeping multiple pages synchronized. If we add a new person with the same name, the disambiguation page will automatically be updated. (Note: For performance, queries like this are cached and only refreshed periodically. You may need to click the "refresh" tab at the top of the screen to see very recent edits reflected on these pages.)
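To make the idea concrete, here is a minimal Python sketch of a disambiguation list that is computed from the person records rather than maintained by hand. It is not the Semantic MediaWiki query used on the wiki; the `Person` class, the `disambiguation_rows` function, and the sample names and teams are all made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Person:
    """Hypothetical stand-in for one person page and its properties."""
    page_title: str
    display_name: str
    # (year, team) pairs as they would be recorded on the person's own page
    engagements: list[tuple[int, str]] = field(default_factory=list)


def disambiguation_rows(people, name):
    """Build disambiguation entries for `name` from the person records themselves."""
    rows = []
    for person in people:
        if person.display_name != name:
            continue
        years = [year for year, _ in person.engagements]
        teams = sorted({team for _, team in person.engagements})
        rows.append({
            "page": person.page_title,
            "first_year": min(years) if years else None,
            "last_year": max(years) if years else None,
            "teams": teams,
        })
    # Sort by first active year; people with no engagements go last.
    rows.sort(key=lambda r: (r["first_year"] is None, r["first_year"] or 0, r["page"]))
    return rows


# Made-up sample data, not real Encyclopedia entries.
people = [
    Person("John Smith (1)", "John Smith", [(1905, "Team A"), (1906, "Team B")]),
    Person("John Smith (2)", "John Smith", [(2008, "Team C")]),
    Person("Joe Shlabotnik", "Joe Shlabotnik", [(1960, "Team D")]),
]

# Editing one person's record is enough; the listing is recomputed from it.
people[1].engagements.append((2009, "Team C"))

for row in disambiguation_rows(people, "John Smith"):
    print(row["page"], row["first_year"], row["last_year"], ", ".join(row["teams"]))
```

Because the active dates and teams are derived each time the list is built, there is no second copy of the data to fall out of step with the person pages.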
There is still much to do on these disambiguation pages. The query does not currently list non-playing engagements, so managing, umpiring, and the like do not show up. It would also be helpful to be able to list dates of birth and other biographical information. It will take some trial and error to come up with a visually appealing solution that communicates the information a user is looking for, to help find the right guy out of a list of 40+ people with the same name. The good news is that there are no technical barriers to doing so; with time and experience, these queries will improve.
In parallel, Peter has been looking into improving the search feature. The default MediaWiki search engine we are using is not very reliable; in particular, it often fails to report "near misses." If you search on "Joseph Shlabotnik" and it turns out we have him listed as "Joe Shlabotnik," you may get zero hits. There are better search engines out there, and we'll be deploying one sooner rather than later. In addition, we are looking into ways to exploit the semantic contents of our pages to help make search smarter, so that using Joe vs. Joseph, Mike vs. Michael, and so on won't be an obstacle to helping you find the person you're looking for.
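As a rough illustration of the Joe vs. Joseph problem, the Python sketch below indexes page titles under a canonical form of the given name, so that either spelling finds the same page. This is only one possible approach, not the search engine or method we will deploy; the `NICKNAMES` table and the function names are assumptions made up for the example, and a real table would need to be much larger and carefully curated.

```python
# Hypothetical nickname table mapping short forms to a canonical given name.
NICKNAMES = {
    "joe": "joseph",
    "mike": "michael",
}


def canonical_key(full_name):
    """Lowercase a name and canonicalize only the first (given) name."""
    parts = full_name.lower().split()
    if parts:
        parts[0] = NICKNAMES.get(parts[0], parts[0])
    return " ".join(parts)


def build_index(page_titles):
    """Map each canonical key to the page titles that share it."""
    index = {}
    for title in page_titles:
        index.setdefault(canonical_key(title), []).append(title)
    return index


def search(index, query):
    """Look up a query under the same canonical form used for indexing."""
    return index.get(canonical_key(query), [])


# "Michael Example" is a made-up page title.
index = build_index(["Joe Shlabotnik", "Michael Example"])
print(search(index, "Joseph Shlabotnik"))  # finds the "Joe Shlabotnik" page
print(search(index, "Mike Example"))       # finds the "Michael Example" page
```

Only the first word is canonicalized, so a surname that happens to match a nickname is left alone. Misspellings and other near misses would still need a proper fuzzy search.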

SABR Encyclopedia: First Update
Written by Ted Turocy
Monday, 13 April 2009 12:49
This is the first in a series of occasional updates on the development of the SABR Encyclopedia wiki.
It's been about six weeks since the Board approved the concept of the Encyclopedia and authorized us to begin work. After a month of planning, on April 1, the first automated "bot" went into action, creating a page for each person listed in the Minor Leagues Database, which is the single largest dataset anywhere of people involved with professional baseball. About five days later, the upload process was completed, and a few intrepid souls, Jack Morris, Cliff Blau, Joel Dinda, and John Zajc among them, have begun the task of organizing and expanding biographical knowledge about this set of people. In the meantime, pages have automatically been built out for (most) professional leagues and teams. Pages for each ballpark to host at least one Major League game have also been created.
A major focus of development in the coming weeks will be organizing these pages and "stubbing" out pages for other persons, leagues, and concepts. This breadth-first approach is motivated by the belief that most potential contributors will be more comfortable expanding existing pages rather than creating new ones from scratch. Organization and navigation, through categories, navboxes, and the like, will make it possible for contributors to find the best pages on which to make their contributions.
We have begun making use of the Semantic MediaWiki extension within the wiki. We are very excited about the possibilities this extension offers for autogenerating information within the wiki. We currently generate roster tables for each club using this extension, and have just implemented a similar feature to autopopulate executive roles for leagues. Similar features for club managers and general managers, and umpires for leagues, are planned. We are also using this feature to create an automatically updated necrology for 2009, which we hope will help Rod Nelson and the Emerald Guide crew get a head start on next year's edition. (Even a few days later, it still affects me when I see Nick Adenhart's name at the top of that page.)
A key design feature is the use of templates to record information systematically about entities in the wiki for easy extraction down the road. Some of these templates wrap Semantic MediaWiki (SMW) properties, so contributors don't need to learn how SMW works; the creation of properties happens automatically behind the scenes. Even where templates do not wrap SMW properties, they are easy enough to parse that tools will be able to spider the wiki to extract and cross-check information.
One such spider program being developed now is a program to extract the basic biographical data and update the Persons table in the Minor Leagues Database. The Encyclopedia wiki is now the primary place to update biographical data, both the basic demographics (name, height, weight, date of birth, and so on), and the assignment of playing, managing, and other records to each person. We are hopeful that this will ease the task of processing this information in a timely fashion, as well as minimize the chance of errors. Early experience indicates that this will be a viable solution, if managed properly.
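For a sense of why template-based pages are easy to parse, here is a small Python sketch that pulls the named parameters out of a template call in page text. It is not the actual spider program; the `Person` template name, its field names, and the sample values are made up for illustration, and the parser assumes a flat template with no nested templates or links containing "|".

```python
import re

# Hypothetical wikitext; the template name, fields, and values are invented.
PAGE_TEXT = """
{{Person
|name=Joe Shlabotnik
|height=72
|weight=190
}}
Some free-form biography text.
"""


def extract_template(wikitext, template_name):
    """Return the named parameters of the first flat use of a template, or None."""
    pattern = r"\{\{\s*" + re.escape(template_name) + r"\s*\|(.*?)\}\}"
    match = re.search(pattern, wikitext, flags=re.DOTALL)
    if match is None:
        return None
    fields = {}
    for chunk in match.group(1).split("|"):
        key, sep, value = chunk.partition("=")
        if sep:  # skip positional parameters, which have no "="
            fields[key.strip()] = value.strip()
    return fields


record = extract_template(PAGE_TEXT, "Person")
print(record)
```

A spider along these lines could fetch each person page, extract the record, and compare it field by field against the corresponding database row before updating anything.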
We will continue expanding the breadth of the Encyclopedia in the coming weeks. One of the next datasets to come will be minor league ballparks, based on Gord Brown's register. A few states' worth have been wikified already, and we will soon be seeking volunteers to carry out the rest. Also on the shortlist of major tasks are work on the collegiate summaries Gary Benner has, and updating the wiki with major league managers and umpires, whose records are currently largely missing.
For the sake of posterity, as I write this, the front page of the wiki reports 217,779 pages, including 169,789 people, 4,933 league-seasons, 31,003 team-seasons, and 301 ballparks. There have been 250,509 page edits, and 245,292 pages, which means there have been at least around 5,300 non-bot edits. I don't put too much stock in raw edit statistics; after all, mechanically adding navboxes to all the seasons of a league creates a lot of mindless edits that don't directly do very much yet. Even at that, given the small number of us who are active right now, that's a sizeable number, and I take it as an indication that we're off to a good start.
Last Updated on Monday, 13 April 2009 19:20

Monthly Report: March 2009
Written by Peter Garver, SABR staff
Monday, 13 April 2009 18:02
This report is coming awfully late, about two weeks into April. This is largely because right around the turn of the month I was moving the main sabr.org website to a new server, which started a few days before the end of March and trickled into April. Around the same time, I heard back from several (4, I think) chapters and committees about putting up sites, so I have been working as hard as possible on those, and I'm almost caught up now. The server move dominated the month, pushed back the sabr.org redesign, and also makes this list rather short:
- Developed and implemented a delivery system for the Emerald Guide to Baseball
- Improved our website analytics with some combining of sites and a funnel to track new member signups
- Moved the main sabr.org site and about a dozen others, and eliminated about 6 dead sites
- Scanned the first issue of the National Pastime and found an OCR program that should actually function well
- Set up a testing/development Linux server; its first use is for the aforementioned OCR
- Modified an online text editor to create a tool to speed posting Bioproject bios
- Created one chapter draft site, and one research committee draft site (the committee site turned out very well, and I look forward to announcing it here in a month or two when it goes public)
Last Updated on Tuesday, 14 April 2009 10:07

Monthly Report: February 2009
Written by Peter Garver, SABR staff
Thursday, 26 February 2009 18:08
February was a very busy and eventful month. I made several major announcements, and the month ends with a lot more things on my to-do list than were there at the beginning. Some highlights:
- Announced the Baseball Research Journal Archives.
- Announced to chapter and committee leaders that group web sites are available in-house now.
- Installed time-tracking software to analyze how I spend my day
- Copied the experimental journal archive site to our server, set it up and made it suitable for public access, then announced it to the membership
- Recruited members to help proofread the BRJ archives and set up a workspace for them
- Prepared a plan for a wiki-based encyclopedia, presented it to the board, and they approved it
- Developed a draft technology plan, also presented to the board
- Converted the content of two non-SABR sites to run on our server
- Set up OCR software to experiment with community proofreading in future digitization
Coming up in March:
- Beginning the process of developing the encyclopedia (this won't be visible)
- At least one or two new chapter/committee sites
- A re-organization of the main SABR website
Last Updated on Monday, 02 March 2009 15:11