5 Ways to Open Up Corpora for Language Learning

Note: The following post was originally published on COERLL’s Open Up blog.

Corpora developed by linguists to study languages are a promising source of authentic materials to employ in the development of OER for language learning. Recently, COERLL’s SpinTX Corpus-to-Classroom project launched a new open resource that seeks to make it easy to search and adapt materials from a video corpus.

The SpinTX video archive  provides a pedagogically-friendly web interface to search hundreds of videos from the Spanish in Texas Corpus. Each of the videos is accompanied by synchronized closed captions and a transcript that has been annotated with thematic, grammatical, functional and metalinguistic information. Educators using the site can also tag videos for features that match their interests, and share favorite videos in playlists.

A collaboration among educators, professional linguists, and technologists, the SpinTX project leverages different aspects of the “openness” movement, including open research, open data, open source software, and open education. It is our hope that by opening up this corpus, and by sharing the strategies and tools we used to develop it, others may be able to replicate and build on our work in other contexts.

So, how do we make a corpus open and beneficial across communities? Here are 5 ways:

1. Create an open and accessible search interface

Minimize barriers to your content. Searching the SpinTX video archive requires no registration, passwords or fees. To maximize accessibility, think about your audience’s context and needs. The SpinTX video archive offers a corpus interface specifically for educators, and plans to create a different interface for researchers.

2. Use open content licenses

Add a Creative Commons license to your corpus materials. The SpinTX video archive uses a CC BY-NC-SA license, which requires attribution and allows others to reuse and adapt the materials in different contexts for non-commercial purposes, provided they share their adaptations under the same license.

3. Make your data open and share content

Allow others to easily embed or download your content and data. The SpinTX video archive provides social sharing buttons for each video, as well as providing access to the source data (tagged transcripts) through Google Fusion Tables.

4. Embrace open source development

When possible, use and build upon open source tools. The SpinTX project was developed using a combination of open source software (e.g. TreeTagger, Drupal) and open APIs (e.g. YouTube Captioning API). Custom code developed for the project is openly shared through a GitHub repository.

5. Make project documentation open

Make it easy for others to replicate and build on your work. The SpinTX team is publishing its research protocols, development processes and methodologies, and other project documentation on the SpinTX Corpus-to-Classroom blog.

Openly sharing language corpora may have wide-ranging benefits for diverse communities of researchers, educators, language learners, and the public interest. The SpinTX team is interested in starting a conversation across these communities. Have you ever used a corpus before? What did you use it for? If you have never used a corpus, how do you find and use authentic videos in the classroom?  How can we make video corpora more accessible and useful for teachers and learners?

Working on the Corpus

The beta release of the classroom corpus will be in just a few short weeks!  We’re all really excited and working hard to get it ready.  Along the way we’ve already learned some important lessons about preparing a tagged video corpus for classroom use.  These lessons are based on our own experiences and the feedback we’ve already received from instructors.  Here are a few of the lessons learned for those pondering doing something similar:

The KISS (Keep It Simple Stupid) principle definitely applies!  You want the corpus to have a lot of information, but even more importantly you want it to be accessible.  This is especially crucial when it comes to designing the interface.  Educators and students don’t want to have to learn how to use your website.  They want to be able to come to your website and immediately know how to use it because it is well-designed using commonly accepted webpage design practices.

Know what your target audience wants.  While creating this corpus we have had lots of wonderful ideas about things we could add to it to make it even more awesome!  We have been surprised multiple times when our target audience, educators, haven’t responded positively to many of our awesome ideas.  That is because they don’t want a lot of bells and whistles, especially when these make navigating the corpus more difficult.  They want to be able to find the video they need quickly and easily, and they want to be able to be provided with either a transcript and/or a customized cloze test.  This doesn’t mean that we’ve been discarding our awesome ideas.  We’ve just put them on the back burner until we either figure out how to seamlessly integrate them, or a demand for them surfaces among our users, or we come to our senses and realize that they’re really not all that awesome after all.

Don’t re-invent the wheel.  There are already tools out there for tagging corpora for part of speech (e.g. TreeTagger), for parsing corpora (e.g. FreeLing), for searching corpora (e.g. Corpus Workbench), etc.  This is especially true when dealing with more commonly taught languages such as English and Spanish.  It can be frustrating trying to find the right tool (we’ve found that word-of-mouth is a good way), and sometimes even more frustrating to learn how to use the tools.  In general, academic software is not designed to be user friendly, and often requires at least minimal competence in scripting in a terminal in order to be used.  Nevertheless, it is definitely much easier to spend the time learning how to use these tools than to try to make your own from scratch.

You’ll probably need to do some computer programming, or have it done for you.  Regardless of the availability of pre-made tools, you’ll probably want to do something with your corpus that no one else has done before (at least as far as you can find online for free).  For example, we have one member of our team generating computer programming scripts that identify video segments that are unusually rich in certain pedagogical items, such as the use of the past tense or the side-by-side use of two commonly confused prepositions.  We have another team member generating scripts that combine the data from several different sources (TreeTagger, Closed Caption Files, language identification software, etc.) into files that are usable by SOLR, the search software we’re integrating into our website.  True, these tasks could be performed by hand, but doing so would mean taking much more time in preparing the corpus.
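To give a flavor of the first kind of script, here is a minimal sketch (not our actual code) that ranks segments by how densely they use a set of target lemmas, such as por and para.  It assumes TreeTagger-style output with one token per line and three tab-separated columns (token, tag, lemma); the segment ids, the sample tags, and the threshold are made up for illustration.

```python
def parse_tagged(text):
    """Parse tab-separated token/tag/lemma lines into (token, tag, lemma) tuples."""
    rows = []
    for line in text.splitlines():
        parts = line.split("\t")
        if len(parts) == 3:  # skip blank or malformed lines
            rows.append(tuple(parts))
    return rows


def lemma_density(rows, targets):
    """Return the share of tokens whose lemma is one of the target lemmas."""
    if not rows:
        return 0.0
    hits = sum(1 for _, _, lemma in rows if lemma.lower() in targets)
    return hits / len(rows)


def rich_segments(segments, targets, threshold):
    """Return ids of segments at or above the density threshold, densest first."""
    scored = {
        seg_id: lemma_density(parse_tagged(text), targets)
        for seg_id, text in segments.items()
    }
    keep = [seg_id for seg_id, density in scored.items() if density >= threshold]
    return sorted(keep, key=lambda seg_id: -scored[seg_id])


# Hypothetical tagged segments (the tags shown are illustrative only).
segments = {
    "clip_a": "Trabajo\tVLfin\ttrabajar\npor\tPREP\tpor\nti\tPPX\ttú\n",
    "clip_b": "Vivo\tVLfin\tvivir\nen\tPREP\ten\nTexas\tNP\tTexas\n",
}
rich = rich_segments(segments, {"por", "para"}, threshold=0.3)
```

The same scoring idea could be applied to tags instead of lemmas, provided the tagset you use actually marks the feature you care about.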

Don’t be afraid to farm out.  For some reason we tend to get into the mindset that we need to do everything ourselves if we want all the credit and glory.  Perhaps that is true, but it also means that you’re going to either kill yourself or not accomplish as much as you could have with a little outside help.  There are people out there who will do a lot of the grunt work for a minimal cost.  For example, we farmed out the job of transcribing our full-length video interviews (30 – 60 minutes each) to Automatic Sync, a company that specializes in video transcriptions and captioning.  It costs a little money, but we get our rough transcripts back in less than a week!  Which means that we can dedicate the hundreds of hours we would have spent doing the transcriptions ourselves to doing other jobs that we can’t farm out.

I’m sure there are plenty of things I am forgetting; we’ll try to add those later as we remember them.  If you have any questions or comments, please post them below!

Brainstorming on the search & browse interface

We are thinking of offering teachers a practical and user-friendly way of accessing the video clips in the SPinTX corpus. We are assuming that teachers might sometimes be overwhelmed by what a corpus query interface lets them ask: they did not design the compilation process, and the corpus may be quite small (compare that with Google, which queries the entire web).

Thus we want to offer teachers two clip retrieval modes: the search mode and the browsing mode. The search mode is the usual Google-like key term based search. I would type “banco Medellín” to retrieve documents related to banks (financial institutions) in Medellín (Colombia). However, I would type “banco madera Medellín” if I were looking for documents about carpenters or stores selling wooden benches (banco can also mean a bench to sit on) in Medellín.
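A toy sketch of that key term search, assuming every query term must appear in a transcript for it to match. The two transcripts and their ids are invented, and a real search engine would add ranking, stemming and much more.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into word tokens (accents are kept)."""
    return re.findall(r"\w+", text.lower())


def build_index(docs):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical transcripts, for illustration only.
docs = {
    "v1": "Fui al banco en Medellín a sacar dinero",
    "v2": "Compré un banco de madera en Medellín",
}
index = build_index(docs)
broad = search(index, "banco Medellín")
narrow = search(index, "banco madera Medellín")
```

Here the two-term query matches both transcripts, while adding “madera” narrows the result to the one about the wooden bench.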

The browsing functionality is intended to facilitate the visual exploration of pedagogically relevant information extracted from the corpus. One initial thought is the use of information clouds, as reflected in the figure below. Imagine a blank square with two drop-down menus. On one of them you could select a topic, to determine the lexical goal, the vocabulary. On the other one you could select the linguistic topic, which could range from grammatical categories to functional ones and a range of other classification criteria that could be relevant for language instruction/learning.

Figure 1 shows what this particular strategy would look like if we select Todos (all topics) in the thematic dropdown list and Gram: Prep. régimen (grammar topic, verb and preposition combinations). The size of each verb+prep combination currently reflects its number of occurrences in the corpus, though it could instead reflect the number of documents in the corpus that contain it.


Figure 1. Wireframe of a user interface for browsing the corpus information on the basis of thematic criteria and linguistic criteria.
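The two sizing options mentioned above (number of occurrences versus number of documents) can be compared with a small sketch like the following. The verb+prep lists per document and the font size range are assumptions made up for the example, not data from the corpus.

```python
from collections import Counter


def count_combinations(docs):
    """Return (occurrence counts, document counts) for each combination."""
    occurrences = Counter()
    doc_freq = Counter()
    for combos in docs.values():
        occurrences.update(combos)    # every occurrence counts
        doc_freq.update(set(combos))  # each document counts once
    return occurrences, doc_freq


def font_sizes(counts, min_size=10, max_size=40):
    """Scale counts linearly to font sizes between min_size and max_size."""
    if not counts:
        return {}
    low, high = min(counts.values()), max(counts.values())
    span = high - low
    if span == 0:
        return {key: max_size for key in counts}
    return {
        key: round(min_size + (value - low) * (max_size - min_size) / span)
        for key, value in counts.items()
    }


# Hypothetical verb+prep combinations found in three documents.
docs = {
    "d1": ["pensar en", "pensar en", "soñar con"],
    "d2": ["pensar en"],
    "d3": ["depender de", "soñar con"],
}
occurrences, doc_freq = count_combinations(docs)
sizes_by_occurrence = font_sizes(occurrences)
sizes_by_document = font_sizes(doc_freq)
```

With this toy data, “soñar con” is mid-sized when counting occurrences but as large as “pensar en” when counting documents, which is the kind of difference the choice of measure would make in the cloud.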

From Transcript to Tagged Corpus

In this post I will discuss the steps that we are using to get from our transcripts to our final corpus (as of 01/15/2013).  This is still a messy process, but with this documentation anyone should be able to replicate our output (on a Mac).

Step 1. Download and unzip this folder where you would like to do your work.

Step 2. Install TreeTagger within ProjectFolder/TreeTagger (look inside the folder you just unzipped).

Step 3. Make sure that you have updated, complete versions of PHP and Python installed.

Step 4. Update TranscriptToSrt.py and SrtGatherer.py with your YouTube client id, secret, and developer key.

Step 5. Save your plain-text transcripts in ProjectFolder/transcripts (one for each video).

Step 6. Update MainInput.txt with your information.

Step 7. Log in to your YouTube account.

Step 8. Open Terminal and navigate to ProjectFolder.

Step 9. Run MainBatchMaker.py by typing: python MainBatchMaker.py

Step 10. Run MainProcessor by typing: ./MainProcessor

And you’re done!  You should now have fully tagged files in ProjectFolder/Processing/Tagged and closed caption files in ProjectFolder/Processing/SRT.  And next time you’ll only need to do steps 5 – 10!  😀


A few hints in case you run into trouble:

You may need to install some additional Python libraries as indicated by any relevant errors.

If you have an encoding error with some of the Spanish characters, you may need to edit srtitem.py.  See my comment on StackOverflow.

If the scripts are successful at downloading some srt files from YouTube, but not others, it is probably a timing issue with YouTube’s API.  I am currently trying to build in a work-around, but for now, just wait a few minutes, run MainProcessor again, and cross your fingers.
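Until that work-around exists, the “wait and run it again” advice can be scripted. This is only a sketch: the helper names are mine, the three-minute wait is a guess, and it assumes MainProcessor exits with a non-zero status when something fails, which I have not verified.

```python
import subprocess
import time


def retry(action, attempts=3, wait_seconds=180, sleep=time.sleep):
    """Call action() until it returns True or the attempts run out."""
    for attempt in range(1, attempts + 1):
        if action():
            return True
        if attempt < attempts:
            sleep(wait_seconds)  # give the API some time before trying again
    return False


def run_main_processor():
    """Run ./MainProcessor once; assumes exit status 0 means success."""
    return subprocess.run(["./MainProcessor"]).returncode == 0


# Example (run from ProjectFolder):
# retry(run_main_processor, attempts=3, wait_seconds=180)
```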

Finally, these scripts are not very efficient yet.  When running them with around 30 videos and around 100,000 words, it takes about two hours on my MacBook Pro.  Sorry about that.  We will be working on optimizing these scripts as time permits.

Please contact me with any questions or suggestions!

Automated captioning of Spanish language videos

By the end of the summer, we expect the Spanish in Texas corpus will include 100 videos with a total running time of more than 50 hours. Fortunately, there are a range of services and tools to expedite the process of transcribing and captioning all those hours of video.

YouTube began offering automated captioning for videos a few years ago. Using Google’s voice recognition technology, YouTube automatically generates a transcript for any video in one of the supported languages. As of today those languages include English, Japanese, Korean, Spanish, German, Italian, French, Portuguese, Russian and Dutch. The result of the automated transcription is still very much inferior to human transcription and is not usable for our purposes. However, YouTube also allows the option of uploading your own transcript as the basis for generating the synchronized captions. When a transcript is provided, the syncing process is very effective at creating accurate closed captions synchronized to a video. In addition, YouTube offers a Captioning API, which allows programmers to access the caption syncing service from within other applications.

Automatic Sync Technologies is a commercial provider of human transcription services as well as a technology for automatically syncing transcripts with media to produce closed captions in a variety of formats. Automatic Sync recently expanded their service to include Spanish as well as mixed Spanish/English content. An advantage of using their service is that they have the ability to create custom output formats (requires a one-time fee). For instance, we worked with them to create a custom output file that included the start and end time for each word in the transcript and was formatted as a tab-delimited text file.
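As an illustration of how such a word-level timing file might be consumed, here is a short sketch. The column order (word, start time, end time) and the use of seconds are assumptions made for the example, since the exact layout of the custom format is not described here.

```python
import csv
import io
from typing import NamedTuple


class TimedWord(NamedTuple):
    word: str
    start: float  # assumed to be seconds from the start of the video
    end: float


def read_timed_words(handle):
    """Read tab-delimited rows of word, start, end into TimedWord records."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [
        TimedWord(row[0], float(row[1]), float(row[2]))
        for row in reader
        if len(row) == 3
    ]


def first_occurrence(words, target):
    """Return the start time of the first match for target, or None."""
    for item in words:
        if item.word.lower() == target.lower():
            return item.start
    return None


# Hypothetical file contents, for illustration only.
sample = "Yo\t0.50\t0.72\nnací\t0.72\t1.10\nen\t1.10\t1.25\nTexas\t1.25\t1.90\n"
words = read_timed_words(io.StringIO(sample))
texas_start = first_occurrence(words, "texas")
```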

There are also online platforms for manually transcribing and captioning videos in a user-friendly web interface. DotSub leverages a crowd-sourcing model for creating subtitles and then translating the subtitles into many different languages. Another option in this category is Universal Subtitles, which is the platform used to subtitle and translate the popular TED Video series. These can be a good option if resources aren’t available to hire transcribers and/or translators.

While developing the SPinTX corpus we have used all of the solutions mentioned above, but we have now settled on a standard process that works best for us. First, we pay a transcription service to transcribe the video files in mixed Spanish / English and provide us with a plain text file, at a cost of approximately $70 per hour of video. Then, we use the YouTube API to sync the transcripts with the videos and retrieve a caption file. This process works for us because our transcripts often need a lot of revisions, and we can sync as many times as we need at no cost. The caption file is then integrated into our annotation process, so when users get search results they can jump directly to the place in the video where the match occurs. In a later post, we will go into more detail about how we are implementing the free YouTube API and how you can adapt this process for your own video content!
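To show how a caption file makes that jump possible, here is a minimal sketch that reads SRT-style captions and returns the start time of every caption containing a search term. It is not our annotation code: the sample captions are invented and the matching is a plain substring test.

```python
import re

TIME_RE = re.compile(r"(\d+):(\d{2}):(\d{2})[,.](\d{3})")


def to_seconds(stamp):
    """Convert an SRT timestamp such as 00:01:15,250 to seconds."""
    hours, minutes, seconds, millis = (
        int(part) for part in TIME_RE.match(stamp.strip()).groups()
    )
    return hours * 3600 + minutes * 60 + seconds + millis / 1000


def parse_srt(text):
    """Return a list of (start, end, caption text) tuples from SRT content."""
    cues = []
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.splitlines()
        if len(lines) < 3 or "-->" not in lines[1]:
            continue  # skip anything that is not a complete caption block
        start, end = lines[1].split("-->")
        cues.append((to_seconds(start), to_seconds(end), " ".join(lines[2:])))
    return cues


def find_cue_starts(cues, term):
    """Return the start times of captions whose text contains the term."""
    term = term.lower()
    return [start for start, _, text in cues if term in text.lower()]


# Hypothetical caption file contents, for illustration only.
srt_text = (
    "1\n00:00:01,000 --> 00:00:04,500\nNací en San Antonio.\n\n"
    "2\n00:01:15,250 --> 00:01:18,000\nMi familia vino por trabajo.\n"
)
cues = parse_srt(srt_text)
starts = find_cue_starts(cues, "familia")
```

A video player could then be told to seek to the returned number of seconds.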

What criteria would you use to search for videos?

[N.B. Background information on the corpus mentioned below: previous post and SPinTX site, both in English]

A question for those of you who teach Spanish as a Foreign Language (ELE): when you search the Internet for a video to work on a specific grammatical or lexical objective, what kinds of search criteria do you think would be useful to you? We are trying to add metadata to the Spanish in Texas corpus (SPinTX) and have started putting together a short list (included below). Do you have five minutes to give us your opinion? Please leave us a comment!

List of pedagogical descriptors for SPinTX

  1. Morphological level
    • Verb tenses: present, past, future, conditional, etc.
    • Verb mood: indicative, subjunctive, imperative, infinitive, gerund, etc.
  2. Morphosyntactic level
    • Gender in nouns and combination with determiners.
    • Use of prepositions.
      • Por and para: distinguishing among uses for cause, purpose, destination, and recipient.
  3. Discourse level
    • Discourse markers.
  4. Lexical level
    • Identifying the semantic fields of a text through a list of keywords.
  5. Functional level
    • Expressing likes and preferences.

If you have made it this far, one more question: can you imagine a fact sheet attached to each video in a list of results, with this kind of information, so that you could filter for the videos that are more or less suitable for your class?

State of the Corpus

One of the questions that is most frequently asked is: How big is your corpus?  The answer is: Beats me; it’s constantly changing, and there are several different versions of the corpus available at any one time.  But people usually aren’t satisfied with that answer, so here are the details of where the SPinTX corpus currently stands to the best of my knowledge (as researched this morning):

Total n interviews: 123

Total n transcripts: 74

Total n words: 315,673

Total n transcripts approved and tagged: 32

Total n words for approved and tagged transcripts: 134,737

Total n clips available to public taken from approved videos: 328

Total n words for clips: 102,573 (Note: many of the clips overlap, this is not filtered out in this count.)
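For anyone who wants a clip word count with the overlaps filtered out, a sketch along these lines would do it. It assumes each clip can be described by the positions of its first word and the word just after its last within the same transcript; the three clips below are invented numbers, not our data.

```python
def merged_length(intervals):
    """Total words covered by half-open [start, end) spans, counting overlaps once."""
    total = 0
    current_start = current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            # The previous run of overlapping clips is finished: add it up.
            if current_end is not None:
                total += current_end - current_start
            current_start, current_end = start, end
        else:
            # This clip overlaps (or touches) the current run: extend it.
            current_end = max(current_end, end)
    if current_end is not None:
        total += current_end - current_start
    return total


# Hypothetical clips from one transcript, as (first word, word after last).
clips = [(0, 300), (250, 500), (800, 1000)]
naive_total = sum(end - start for start, end in clips)
unique_total = merged_length(clips)
```

In this example the naive sum is 750 words, while the count with overlaps merged is 700.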

Please let me know if there are any other stats that would be of use/interest and I will append them to this post.

-Cheers, Arthur

Designing a pedagogical interface for a repository of video interviews

One of the goals in the Corpus to Classroom project is to design a pedagogical interface for the repository of video clips that are being generated out of the more than 100 interviews that were collected in the past as part of the Spanish in Texas project. From our interviews with actual teachers and materials developers, we confirmed that teachers are potentially interested in applying the following types of filtering criteria to their searches:

  • Grammar topics: e.g., search for those clips that contain a significant number of occurrences of por and para
  • Functional topics: e.g., search for those clips that contain exponents of the function apologizing
  • Vocabulary: e.g., clips that contain words (in a pre-defined list maybe) that relate to the topic la familia (papá, mamá, padre(s), madre, hermano/a, abuelo/a…)
  • Thematic: e.g., clips talking about food, traditions, reasons for moving to the US (in our case)…

This is not a complete list, but it is a starting one that contains the most common types of criteria (emotion and phonetics are two criteria that were mentioned too).

With this in mind we are considering the use of a standard search engine (such as Apache Solr/Lucene) to allow teachers to search for the clips and use facets (filtering options) to drill down or define finer-grained queries. However, we are also considering the use of typical corpus query tools (such as CWB, SketchEngine or NoSketchEngine). With the two combined we can cover the Information Retrieval part of our task (more appropriate for document retrieval on the basis of word- or term-based queries) and the Information Extraction part of our task (more appropriate for the queries driven by linguistic patterns).
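As a rough idea of what a faceted request to Solr could look like, the sketch below builds a select URL using Solr’s standard q, fq, facet and facet.field parameters. The core name (clips) and the field names (grammar_topic, theme) are hypothetical; we have not settled on a schema.

```python
from urllib.parse import urlencode


def build_select_url(base_url, text, filters=None, facet_fields=()):
    """Build a Solr select URL with optional filter queries and facet fields."""
    params = [("q", text or "*:*"), ("wt", "json")]
    for field, value in (filters or {}).items():
        # Each filter query narrows the results without affecting ranking.
        # Note: values containing double quotes would need escaping first.
        params.append(("fq", f'{field}:"{value}"'))
    if facet_fields:
        params.append(("facet", "true"))
        params.extend(("facet.field", field) for field in facet_fields)
    return f"{base_url}/select?{urlencode(params)}"


# Hypothetical core and field names, for illustration only.
url = build_select_url(
    "http://localhost:8983/solr/clips",
    "familia",
    filters={"grammar_topic": "por_para"},
    facet_fields=["theme"],
)
```

The facet counts returned for a field such as theme are what would feed the filtering options shown next to the search results.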

We will further describe our advances in future posts.

LIFT off!

This blog will chronicle the development of the SPinTX Corpus, and our work to bring a pedagogically useful corpus of authentic Spanish and bilingual Spanish-English speech samples into language classrooms across Texas. The Spanish in Texas (SPinTX) Project was selected to receive funding from the Longhorn Innovation Fund for Technology (LIFT) for the grant period September 1, 2012 – August 31, 2013. Development of the Corpus began in 2010 and is ongoing under the auspices of the Title VI Center for Open Educational Resources and Language Learning (COERLL).

The focus of the project over the next year will be to help educators exploit the SPinTX corpus to customize materials for the teaching of Spanish at all educational levels. The aims of the project are:

  • to develop a pedagogically friendly interface for the corpus;
  • to involve teachers and learners, via crowd-sourcing, social networking, and workshops, in the development of open educational resources (OER); and
  • to develop a model for using open source tools and a pedagogical interface that can be adapted for any language corpus.

In the spirit of openness, we will be sharing and discussing what we learn and create throughout the project. We invite you to join with us as we explore new tools and methods for integrating authentic content and open data into the language classroom!