Using VISL Constraint Grammar to pedagogically annotate oral text

We are using VISLCG3 to annotate the transcripts of our oral video clips with pedagogically relevant information. VISLCG3 is open-source software, released under the GNU General Public License, that implements the Constraint Grammar formalism, allowing for the linguistic analysis of text in so-called local contexts using hand-crafted rules.

VISLCG3 (CG3 for short) is mainly used for three types of operations: replacing information assigned to a word, removing information assigned to a word (by a previous module, e.g., a dictionary look-up module), or adding information to a word. For the purposes of our project we mainly use CG3 to add information, that is, to add pedagogically relevant annotations to texts (oral transcripts) that have previously been tagged with basic linguistic information, such as lemma and part of speech, with TreeTagger (a topic that deserves a separate post). We also use it to post-edit the tagging, since our tagger systematically makes certain decisions with which we do not agree.
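TreeTagger's standard output is one token per line, with tab-separated token, part-of-speech tag, and lemma. A minimal Python sketch of reading such output into tuples (the sample tags ADV and VLfin are from TreeTagger's Spanish tagset; the reading function itself is ours, not part of TreeTagger):

```python
def read_treetagger(lines):
    """Parse TreeTagger output lines of the form 'token<TAB>pos<TAB>lemma'."""
    tokens = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        token, pos, lemma = line.split("\t")
        tokens.append((token, pos, lemma))
    return tokens

sample = [
    "Quizás\tADV\tquizás",
    "llueva\tVLfin\tllover",
]
print(read_treetagger(sample))
```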

The two operators that we use most frequently are ADD and ADDRELATION. They perform similar actions: they both add one piece of information to the reading of a particular word. The only difference is that the former applies to phenomena contained within a single word (a cohort, in CG3's terminology), while the latter applies to phenomena that span two or more cohorts, optionally with words in between. These annotated phenomena are correlated with an internal Pedagogical Typology (still in progress), which we elaborated by extracting linguistic and communicative topics often found in Spanish textbooks.


The following rules exemplify how we use ADD and ADDRELATION to pedagogically annotate our corpus. The first two are fairly straightforward; the third is relatively more complex and gives the reader an idea of the power of formalisms such as CG3.

  • ADD (@Prag:MarcadoresDisc:Probabilidad) Adverbio IF (0 Quizas);

The above rule states that any adverb reading of the word(s) included in the set Quizas (which includes both quizás and quizá and is defined elsewhere in the grammar file) will be assigned the information @Prag:MarcadoresDisc:Probabilidad.
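The effect of this ADD rule can be illustrated outside CG3 with a small Python sketch (the dictionary-based token representation is ours for illustration, not CG3's internal format):

```python
QUIZAS = {"quizás", "quizá"}  # the set Quizas from the grammar file

def add_prag_tag(cohorts):
    """Add the pragmatic tag to adverb readings of words in QUIZAS."""
    for cohort in cohorts:
        if cohort["form"].lower() in QUIZAS and "Adverbio" in cohort["tags"]:
            cohort["tags"].append("@Prag:MarcadoresDisc:Probabilidad")
    return cohorts

cohorts = [
    {"form": "Quizás", "tags": ["Adverbio"]},
    {"form": "llueva", "tags": ["Verbo", "Subjuntivo"]},
]
add_prag_tag(cohorts)
print(cohorts[0]["tags"])  # ['Adverbio', '@Prag:MarcadoresDisc:Probabilidad']
```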

  • ADDRELATION (Gram:SerEstar:EstarAux) VerboEstar IF (0 FiniteVerbForms) TO (1 Gerundio);

The above rule states that any occurrence of the verb estar that is conjugated (that is, a finite verb form) and is followed by a gerund will be tagged as an instance of estar used as an auxiliary, R:Gram:SerEstar:EstarAux. The rule establishes an index-based relation between the form of estar and the corresponding gerund.

  • ADDRELATION (Func:Deseos:Ajenos) Verbo IF (0 VerbosExpresarDeseos) (1 Que) TO (*2 Verbo + Subjuntivo BARRIER GrupoNominal OR LimiteOracionSimple OR FiniteVerbForms);

The above rule states that any instance of a verb included in the list VerbosExpresarDeseos (which is defined elsewhere and includes verbs such as gustar, desear, querer…) that is followed by a que and later by a verb in the subjunctive mood should be annotated as an instance of a way of expressing desire. The rule uses operators such as the Kleene star (*) and the special keyword BARRIER to give more flexibility to the actual location of the verb (not necessarily right after the que), while also ensuring that the rule does not cross over certain linguistic items, such as conjunctions or sentence delimiters, so that it stays within a safe scope. The rule maps the tag R:Func:Deseos:Ajenos to the verb, and an index records the direction of the relation with the subjunctive verb (note the TO).
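The scan-with-barrier logic of this rule can be sketched in Python as follows (a simplified illustration under our own token representation; real CG3 matching is considerably richer):

```python
BARRIER_TAGS = {"GrupoNominal", "LimiteOracionSimple", "FiniteVerbForms"}

def find_subjunctive_target(cohorts, start):
    """Scan rightward from position start + 2 (the *2 of the rule) for a
    subjunctive verb, giving up if a barrier tag is crossed first."""
    for i in range(start + 2, len(cohorts)):
        tags = set(cohorts[i]["tags"])
        if {"Verbo", "Subjuntivo"} <= tags:
            return i      # target found: the relation would point here
        if tags & BARRIER_TAGS:
            return None   # barrier crossed before any match
    return None

# "quiero que siempre vengas": the subjunctive is two cohorts after "quiero"
cohorts = [
    {"form": "quiero", "tags": ["Verbo", "VerbosExpresarDeseos", "FiniteVerbForms"]},
    {"form": "que", "tags": ["Que"]},
    {"form": "siempre", "tags": ["Adverbio"]},
    {"form": "vengas", "tags": ["Verbo", "Subjuntivo"]},
]
print(find_subjunctive_target(cohorts, 0))  # 3
```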

From Transcript to Tagged Corpus

In this post I will discuss the steps that we are using to get from our transcripts to our final corpus (as of 01/15/2013).  This is still a messy process, but with this documentation anyone should be able to replicate our output (on a Mac).

Step 1. Download and unzip this folder where you would like to do your work.

Step 2. Install TreeTagger within ProjectFolder/TreeTagger (look inside the folder you just unzipped).

Step 3. Make sure that you have updated, complete versions of PHP and Python installed.

Step 4. Update and with your YouTube client id, secret, and developer key.

Step 5. Save your plain-text transcripts in Project/transcripts (one for each video).

Step 6. Update MainInput.txt with your information.

Step 7. Log in to your YouTube account.

Step 8. Open Terminal and navigate to ProjectFolder.

Step 9. Run by typing: python

Step 10. Run MainProcessor by typing: ./MainProcessor

And you’re done!  You should now have fully tagged files in ProjectFolder/Processing/Tagged and closed-caption files in ProjectFolder/Processing/SRT.  Next time you’ll only need to do steps 5–10!  😀


A few hints in case you run into trouble:

You may need to install some additional Python libraries as indicated by any relevant errors.

If you have an encoding error with some of the Spanish characters, you may need to edit the relevant script; see my comment on StackOverflow.

If the scripts are successful at downloading some srt files from YouTube, but not others, it is probably a timing issue with YouTube’s API.  I am currently trying to build in a work-around, but for now, just wait a few minutes, run MainProcessor again, and cross your fingers.
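Until a proper work-around is built into MainProcessor, the idea is simply to retry with a growing delay. A generic sketch (the wrapper and all names here are ours, not part of the project scripts):

```python
import time

def retry(func, attempts=3, delay=1.0, backoff=2.0):
    """Call func until it succeeds or attempts run out,
    multiplying the wait between tries by `backoff`."""
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: re-raise the last error
            time.sleep(delay)
            delay *= backoff
```

One could then wrap the download of each caption file, e.g. `retry(lambda: download_srt(video_id), attempts=5)`, where `download_srt` stands in for whatever function actually calls YouTube's API.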

Finally, these scripts are not very efficient yet.  When running them with around 30 videos and around 100,000 words, it takes about two hours on my MacBook Pro.  Sorry about that.  We will be working on optimizing these scripts as time permits.

Please contact me with any questions or suggestions!

Automated captioning of Spanish language videos

By the end of the summer, we expect the Spanish in Texas corpus will include 100 videos with a total running time of more than 50 hours. Fortunately, there is a range of services and tools to expedite the process of transcribing and captioning all those hours of video.

YouTube began offering automated captioning for videos a few years ago. Using Google’s voice recognition technology, a transcript is automatically generated for any video in one of the supported languages. As of today, those languages include English, Japanese, Korean, Spanish, German, Italian, French, Portuguese, Russian, and Dutch. The result of the automated transcription is still very much inferior to human transcription and is not usable for our purposes. However, YouTube also allows the option of uploading your own transcript as the basis for generating the synchronized captions. When a transcript is provided, the syncing process is very effective at creating accurate closed captions synchronized to a video. In addition, YouTube offers a Captioning API, which allows programmers to access the caption syncing service from within other applications.

Automatic Sync Technologies is a commercial provider of human transcription services as well as a technology for automatically syncing transcripts with media to produce closed captions in a variety of formats. Automatic Sync recently expanded their service to include Spanish as well as mixed Spanish/English content. An advantage of using their service is that they have the ability to create custom output formats (requires a one-time fee). For instance, we worked with them to create a custom output file that included the start and end time for each word in the transcript and was formatted as a tab-delimited text file.
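A tab-delimited file like that takes only a few lines of Python to read. The column layout below (word, start time, end time in seconds) is a guess at such a custom format for illustration, not Automatic Sync's documented schema:

```python
def read_word_timings(lines):
    """Parse tab-delimited lines of 'word<TAB>start<TAB>end' into tuples."""
    timings = []
    for line in lines:
        word, start, end = line.rstrip("\n").split("\t")
        timings.append((word, float(start), float(end)))
    return timings

sample = ["hola\t0.32\t0.61", "amigo\t0.61\t1.10"]
print(read_word_timings(sample))
```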

There are also online platforms for manually transcribing and captioning videos in a user-friendly web interface. DotSub leverages a crowd-sourcing model for creating subtitles and then translating the subtitles into many different languages. Another option in this category is Universal Subtitles, which is the platform used to subtitle and translate the popular TED Video series. These can be a good option if resources aren’t available to hire transcribers and/or translators.

While developing the SPinTX corpus we have used all of the solutions mentioned above, but we have now settled on a standard process that works best for us. First, we pay a transcription service to transcribe the video files in mixed Spanish/English and provide us with a plain text file, at a cost of approximately $70 per hour of video. Then, we use the YouTube API to sync the transcripts with the videos and retrieve a caption file. This process works for us because our transcripts often need a lot of revisions, and we can re-sync as many times as we need at no cost. The caption file is then integrated into our annotation process, so when users get search results they can jump directly to the place where each hit occurs in the video. In a later post, we will go into more detail about how we are implementing the free YouTube API and how you can adapt this process for your own video content!
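Jumping to the right place in a video requires turning the caption file's SRT timestamps (HH:MM:SS,mmm) into plain seconds. A minimal sketch of that conversion (the helper is ours, not part of the YouTube API):

```python
def srt_time_to_seconds(ts):
    """Convert an SRT timestamp like '00:01:23,450' to seconds as a float."""
    hms, millis = ts.split(",")
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds + int(millis) / 1000.0

print(srt_time_to_seconds("00:01:23,450"))  # 83.45
```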

What criteria would you use to search for videos?

[N.B. Background information on the corpus mentioned below: previous post and the SPinTX site, both in English]

A question for those of you who teach Spanish as a Foreign Language (ELE): When you search the Internet for a video to work on a specific grammatical or lexical objective, what kinds of search criteria do you think would be useful to you? We are trying to add metadata to the Spanish in Texas (SPinTX) corpus and have started a short list (attached below). Do you have five minutes to give us your opinion? Please leave us a comment!

List of pedagogical descriptors for SPinTX

  1. Morphological level
    • Verb tenses: present, preterite, future, conditional, etc.
    • Verb mood: indicative, subjunctive, imperative, infinitive, gerund, etc.
  2. Morphosyntactic level
    • Gender in nouns and its combination with determiners.
    • Use of prepositions.
      • Por and para: distinguishing causal uses, goals, destinations, recipients.
  3. Discourse level
    • Discourse markers.
  4. Lexical level
    • Identifying the semantic fields of a text through a list of keywords.
  5. Functional level
    • Expressing likes and preferences.

If you have read this far, one more question: could you imagine a fact sheet, attached to each video in a results list, with this type of information, so that you could filter the videos that are more or less suitable for your class?