Working on the Corpus

The beta release of the classroom corpus is just a few short weeks away!  We’re all really excited and working hard to get it ready.  Along the way we’ve already learned some important lessons about preparing a tagged video corpus for classroom use.  These lessons are based on our own experiences and the feedback we’ve already received from instructors.  Here are a few of them for anyone considering a similar project:

The KISS (Keep It Simple, Stupid) principle definitely applies!  You want the corpus to have a lot of information, but even more importantly you want it to be accessible.  This is especially crucial when it comes to designing the interface.  Educators and students don’t want to have to learn how to use your website.  They want to arrive and immediately know how to use it, because it follows commonly accepted web design practices.

Know what your target audience wants.  While creating this corpus we have had lots of wonderful ideas about things we could add to it to make it even more awesome!  We have been surprised multiple times when our target audience, educators, hasn’t responded positively to many of our awesome ideas.  That is because they don’t want a lot of bells and whistles, especially when these make navigating the corpus more difficult.  They want to find the video they need quickly and easily, and they want it to come with a transcript, a customized cloze test, or both.  This doesn’t mean that we’ve been discarding our awesome ideas.  We’ve just put them on the back burner until we figure out how to seamlessly integrate them, or a demand for them surfaces among our users, or we come to our senses and realize that they’re really not all that awesome after all.

Don’t reinvent the wheel.  There are already tools out there for tagging corpora for part of speech (e.g. TreeTagger), for parsing corpora (e.g. FreeLing), for searching corpora (e.g. Corpus Workbench), etc.  This is especially true when dealing with more commonly taught languages such as English and Spanish.  It can be frustrating trying to find the right tool (we’ve found that word of mouth is a good way), and sometimes even more frustrating to learn how to use it.  In general, academic software is not designed to be user-friendly, and it often requires at least minimal competence with scripting in a terminal.  Nevertheless, it is definitely much easier to spend the time learning how to use these tools than to try to make your own from scratch.

You’ll probably need to do some computer programming, or have it done for you.  Regardless of the availability of pre-made tools, you’ll probably want to do something with your corpus that no one else has done before (at least as far as you can find online for free).  For example, one member of our team is writing scripts that identify video segments that are unusually rich in certain pedagogical items, such as the use of the past tense or the side-by-side use of two commonly confused prepositions.  Another team member is writing scripts that combine the data from several different sources (TreeTagger, closed caption files, language identification software, etc.) into files that are usable by SOLR, the search software we’re integrating into our website.  True, these tasks could be performed by hand, but doing so would mean spending much more time preparing the corpus.
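To give a feel for what this kind of script can look like, here is a minimal Python sketch.  It is not our actual code.  It assumes each video segment has already been tagged and saved in TreeTagger’s tab-separated output (word, tag, lemma, one token per line), and it assumes that VVD, VBD and VHD are the past tense tags in the English tagset you are using; check your own tagger’s documentation.  The function names, the threshold and the field names in the search document are all made up for illustration.

```python
from collections import namedtuple

Token = namedtuple("Token", ["word", "pos", "lemma"])

# Assumption: these are the past tense verb tags in your English tagset.
PAST_TENSE_TAGS = {"VVD", "VBD", "VHD"}


def parse_tagged(text):
    """Parse 'word<TAB>tag<TAB>lemma' lines into Token tuples."""
    tokens = []
    for line in text.splitlines():
        parts = line.split("\t")
        if len(parts) == 3:  # skip blank or malformed lines
            tokens.append(Token(*parts))
    return tokens


def past_tense_density(tokens):
    """Return the share of tokens that carry a past tense tag."""
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t.pos in PAST_TENSE_TAGS)
    return hits / len(tokens)


def rank_segments(segments, threshold=0.1):
    """Return (segment_id, density) pairs at or above the threshold, richest first.

    `segments` maps a hypothetical segment id to its tagged text.
    """
    scored = [
        (seg_id, past_tense_density(parse_tagged(text)))
        for seg_id, text in segments.items()
    ]
    rich = [(seg_id, d) for seg_id, d in scored if d >= threshold]
    return sorted(rich, key=lambda pair: pair[1], reverse=True)


def to_search_doc(seg_id, text):
    """Build a flat dictionary for one segment (hypothetical field names)."""
    tokens = parse_tagged(text)
    return {
        "id": seg_id,
        "transcript": " ".join(t.word for t in tokens),
        "past_tense_density": round(past_tense_density(tokens), 3),
    }
```

A list of dictionaries like the one `to_search_doc` returns can be written out as JSON with the standard `json` module, which is one of the formats a search engine such as SOLR can index.  The same counting approach would work for other pedagogical items by swapping in a different set of tags or lemmas.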

Don’t be afraid to farm out work.  For some reason we tend to get into the mindset that we need to do everything ourselves if we want all the credit and glory.  Perhaps that is true, but it also means that you’re going to either kill yourself or not accomplish as much as you could have with a little outside help.  There are people out there who will do a lot of the grunt work for a minimal cost.  For example, we farmed out the job of transcribing our full-length video interviews (30 to 60 minutes each) to Automatic Sync, a company that specializes in video transcription and captioning.  It costs a little money, but we get our rough transcripts back in less than a week!  That means we can dedicate the hundreds of hours we would have spent doing the transcriptions ourselves to other jobs that we can’t farm out.

I’m sure there are plenty of things I am forgetting; we’ll try to add those later as we remember them.  If you have any questions or comments, please post them below!