Using the Content

I’ll be honest: I hate making quizzes and tests. Usually the content is either very artificial or mostly unrelated to the material we’re covering in class. This time I decided to make a quiz using content from SpinTX, and I was very pleased to find content that related to both the grammar and the vocabulary that we are covering in class! The quiz had three parts. First, I had the students watch one of the videos (611) and answer questions about tolerance and the subjunctive. Next, I had them read a modified version of the text from another video (635) and fill in the blanks with the correct vocabulary and verb conjugations. Finally, I had them read a shortened version of the text from yet another video (1575), which contains a lot of the vocabulary we are going over, and react to the text using the subjunctive with emotion/reaction, doubt/negation, and will/wish/desire. The quiz went very well, and I think it was definitely more interesting for both me and the students than a normal textbook-based quiz.

The tricky part was, of course, finding the videos that had high concentrations of the vocabulary and grammar that we are going over. I’ll admit that I cheated a little, since I have access to the text files for all of the video clips and have some scripting knowledge. I wrote a script in Python that counted the number of times each vocabulary item and grammar point that we are covering is used in each video, and then I looked through the top ten hits to pick the three videos that I wound up using. I also modified the content of two of the videos somewhat in order to increase the concentration of relevant content. But I think the little extra effort was definitely worth it to have a quiz of such high caliber.
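For anyone who wants to try something similar, here is a minimal sketch of that kind of counting script. It is not the script used for the quiz: the transcripts, the target patterns, and the function names are all made up for illustration, and it matches plain regular expressions against raw text rather than using any grammatical tagging.

```python
import re
from collections import Counter

def count_targets(text, targets):
    """Count how often each target pattern occurs in one transcript."""
    counts = Counter()
    for label, pattern in targets.items():
        counts[label] = len(re.findall(pattern, text, flags=re.IGNORECASE))
    return counts

def rank_videos(transcripts, targets, top_n=10):
    """Return up to top_n (video_id, total_hits, per_target_counts) tuples, best first."""
    scored = []
    for video_id, text in transcripts.items():
        counts = count_targets(text, targets)
        scored.append((video_id, sum(counts.values()), counts))
    scored.sort(key=lambda row: row[1], reverse=True)
    return scored[:top_n]

# Hypothetical transcripts keyed by video id (the texts are invented).
transcripts = {
    "611": "Es importante que respetemos la tolerancia. Dudo que la tolerancia sea fácil.",
    "635": "Me gusta el fútbol.",
}

# Hypothetical targets: one vocabulary word and one verb stem.
targets = {
    "tolerancia": r"\btolerancia\b",
    "respetar": r"\brespet\w*",
}

for video_id, total, counts in rank_videos(transcripts, targets):
    print(video_id, total, dict(counts))
```

Summing raw hits favors long videos, so dividing each total by the transcript's word count may give a fairer measure of "concentration".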


Our assigned topic for one day of class was immigration. On this day the provided activities mostly consisted of the students discussing black-and-white pictures of immigrants entering the US and various Spanish-speaking countries. Instead I decided to use SpinTX so that the students could talk about immigration in a context that would hopefully be easier for them to relate to. I searched for inmi* in order to find all videos in which the Spanish word inmigración was explicitly mentioned. I then narrowed down the search to only videos that contained the present subjunctive, since that was the grammar topic we were covering in conjunction with immigration. I then selected the six videos that I thought would work best and assigned one to each of the six groups in my class. I prepared a separate Google Form for each group with the link to their specific video.

On the day of class I gave the students the links to the forms for their groups and then gave them half an hour to prepare 5 questions for their classmates to answer while watching the videos. During their preparation time I walked around and heard several very good conversations going on about the videos and immigration. After all the groups were ready we watched the videos as a class with the questions displayed next to them. The students seemed very interested in the videos and in trying to answer the questions. Having taught this class before using the activities provided in the textbook, I can say that this time with SpinTX the class went much better, in the sense that the students were more interested and more active, and there was much better discussion.

“Gustar-Type Verbs” with the Subjunctive

For this activity I had students break into groups of 4 and then search for the subjunctive following the triggers (“gustar-type verbs”) that we were looking at in class. So, for example, they searched for examples of the subjunctive after molestar by doing a keyword search for molest*. I had them search for five examples using five different trigger words, and they had to identify the indirect object, verb and subject of each example.
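To show what a wildcard keyword search like molest* does, here is a small sketch in Python. It is only an illustration, not how the SpinTX search works internally: the sentences are invented and the function name is hypothetical. It uses the standard library's fnmatch module to compare each word with the wildcard pattern.

```python
import fnmatch
import re

def wildcard_search(sentences, query):
    """Return the sentences that contain at least one word matching the wildcard query."""
    pattern = query.lower()
    hits = []
    for sentence in sentences:
        words = re.findall(r"\w+", sentence.lower())
        if any(fnmatch.fnmatchcase(word, pattern) for word in words):
            hits.append(sentence)
    return hits

# Invented example sentences (not taken from the corpus).
sentences = [
    "Me molesta que lleguen tarde.",
    "Nos molestan los ruidos.",
    "Me gusta bailar.",
]

for hit in wildcard_search(sentences, "molest*"):
    print(hit)
```

The pattern matches every conjugated form that starts with molest, which is why a single query finds both molesta and molestan.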

There was a lot of laughter involved with this activity, which I always take as a good sign. Mostly they were laughing at vocabulary that they had never seen before but for which they could still guess the meaning, such as ‘bailarina’. I think that they were excited to see that they could actually understand real Spanish. They also came across some great examples that they thought they had understood but in fact had not fully understood, so this was definitely beneficial. One common example of this was when the subject of a gustar-type verb was a verb phrase that contained multiple singular nouns or a plural noun (this example is not from the corpus: Me molesta recibir cartas). When confronted with these examples they began to ask why molestar was conjugated in the singular instead of the plural, which led to what I believe was a helpful discussion.

Bringing Authentic Spanish Videos into the Classroom

This weekend COERLL attended the Texas Foreign Language Association Fall 2013 Conference. Our presentation on SpinTX featured two excellent Spanish teachers/curriculum developers: Tina Dong (Instructional Coordinator, World Languages at Austin ISD), and Jared Abels (Secondary Spanish Instructor, Round Rock Christian Academy). Each of them shared some great ideas for incorporating SpinTX videos in the classroom. Check out the presentation, below:

SpinTX Project Featured in COERLL Summer Webinar Series

In June 2013, the SpinTX project was the subject of a professional development webinar offered by COERLL.

From the webinar description:

In the final installment of COERLL’s summer webinar series, we’ll unpack one of our most recent projects, SpinTX. SpinTX is a video archive that provides access to selected video clips and transcripts from the Spanish in Texas Corpus, a collection of video interviews with bilingual Spanish speakers in Texas. We will hear from project members who will show you how to use SpinTX to search and tag the videos for features that match your interests, and create and share your favorite playlists.

Using VISL Constraint Grammar to pedagogically annotate oral text

We are using VISLCG3 to annotate the transcripts of our oral video clips with pedagogically relevant information. VISLCG3 is open source software under a GNU General Public License that implements a finite-state machine that allows for the linguistic analysis of text in so-called local contexts using hand-crafted rules.

VISLCG3 (CG3 for short) can be used mainly for three types of operations: replace information related/assigned to a word, remove information assigned to a word (by a previous module, e.g., a dictionary look-up module), or add information to a word. For the purpose of our project we are mainly using CG3 to add information, that is, to add pedagogically-relevant annotations to texts (oral transcripts) that have previously been tagged with basic linguistic information such as lemma and part-of-speech with TreeTagger (this deserves a separate post). We are also using it to do some post-editing of the tagging, since our tagger systematically makes certain decisions with which we do not agree.

The two operators that we most frequently use are ADD and ADDRELATION. They perform similar actions: they both add one piece of information to the reading of a particular word. The only difference is that the former can be applied to phenomena that extend over one word (cohort, using CG3’s terminology), while the latter can be applied to phenomena that extend over two cohorts or more, optionally with words in between. These annotated phenomena are correlated with an internal Pedagogical Typology (still in progress), which we elaborated by extracting linguistic and communicative topics often found in Spanish textbooks.


The following rules exemplify how we used ADD and ADDRELATION to pedagogically annotate our corpus. The first two are pretty straightforward rules; the third one is relatively more complex and gives the reader an idea of the power of formalisms such as CG3.

  • ADD (@Prag:MarcadoresDisc:Probabilidad) Adverbio IF (0 Quizas);

The above rule states that any adverb reading of the word(s) included in the set Quizas (which includes both quizás and quizá and is defined elsewhere in the grammar file) will be assigned the information @Prag:MarcadoresDisc:Probabilidad.

  • ADDRELATION (Gram:SerEstar:EstarAux) VerboEstar IF (0 FiniteVerbForms) TO (1 Gerundio);

The above rule states that any occurrence of the verb estar that is conjugated (that is, a finite verb form) and is followed by a gerund will be tagged as an instance of the verb estar being used as an auxiliary, R:Gram:SerEstar:EstarAux. The rule establishes an index-based relation between the form of estar and the corresponding gerund.

  • ADDRELATION (Func:Deseos:Ajenos) Verbo IF (0 VerbosExpresarDeseos) (1 Que) TO (*2 Verbo + Subjuntivo BARRIER GrupoNominal OR LimiteOracionSimple OR FiniteVerbForms);

The above rule states that any instance of a verb included in the list VerbosExpresarDeseos (which is defined elsewhere and includes verbs such as gustar, desear, querer…) that is followed by a que and also followed by a verb in subjunctive mood should be annotated as an instance of a way of expressing desire. The rule uses operators such as the Kleene star (*) and the special word BARRIER to give more flexibility to the actual location of the verb (not necessarily right after the que), while also ensuring that the scan does not cross certain linguistic items such as conjunctions or sentence delimiters, which guarantees that the rule stays within a safe scope. The rule maps the tag R:Func:Deseos:Ajenos to the actual verb, and an index records the direction of the relation with the subjunctive verb (note the TO).
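For readers who do not know CG3, the logic of this third rule can be approximated in plain Python. The sketch below is only an analogue, not CG3 and not part of our grammar file. It rests on several assumptions: each token is a dictionary with a lemma and a set of tags, the set names from the rule are used directly as tag names, the verb list contains only the three verbs mentioned above, and a token is tested as a target before it is tested as a barrier.

```python
# Sample members of VerbosExpresarDeseos mentioned in the post (the real list is longer).
WISH_VERBS = {"gustar", "desear", "querer"}

# Set names from the rule, treated here as plain tags on tokens (an assumption).
BARRIER_TAGS = {"GrupoNominal", "LimiteOracionSimple", "FiniteVerbForms"}

def find_wish_relations(tokens):
    """Return (trigger_index, subjunctive_index) pairs, mimicking the third rule."""
    relations = []
    for i, tok in enumerate(tokens):
        # Condition (0 VerbosExpresarDeseos) on a verb reading.
        if tok["lemma"] not in WISH_VERBS or "Verbo" not in tok["tags"]:
            continue
        # Condition (1 Que): the next token must be que.
        if i + 1 >= len(tokens) or tokens[i + 1]["lemma"] != "que":
            continue
        # Scan rightwards from position +2 until a target or a barrier is met.
        for j in range(i + 2, len(tokens)):
            tags = tokens[j]["tags"]
            if "Verbo" in tags and "Subjuntivo" in tags:
                relations.append((i, j))
                break
            if tags & BARRIER_TAGS:
                break
    return relations

# "Quiero que vengas mañana": the subjunctive is found at position 2.
linked = [
    {"lemma": "querer", "tags": {"Verbo", "FiniteVerbForms"}},
    {"lemma": "que", "tags": {"Que"}},
    {"lemma": "venir", "tags": {"Verbo", "Subjuntivo", "FiniteVerbForms"}},
    {"lemma": "mañana", "tags": {"Adverbio"}},
]

# "Quiero que la fiesta empiece": the noun phrase acts as a barrier.
blocked = [
    {"lemma": "querer", "tags": {"Verbo", "FiniteVerbForms"}},
    {"lemma": "que", "tags": {"Que"}},
    {"lemma": "el", "tags": {"GrupoNominal"}},
    {"lemma": "fiesta", "tags": {"GrupoNominal"}},
    {"lemma": "empezar", "tags": {"Verbo", "Subjuntivo", "FiniteVerbForms"}},
]

print(find_wish_relations(linked))
print(find_wish_relations(blocked))
```

The second sentence shows the cost of a conservative barrier: a perfectly good subjunctive is missed because a noun phrase sits between que and the verb.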

Working on the Corpus

The beta release of the classroom corpus will be in just a few short weeks!  We’re all really excited and working hard to get it ready.  Along the way we’ve already learned some important lessons about preparing a tagged video corpus for classroom use.  These lessons are based on our own experiences and the feedback we’ve already received from instructors.  Here are a few of the lessons learned for those pondering doing something similar:

The KISS (Keep It Simple Stupid) principle definitely applies!  You want the corpus to have a lot of information, but even more importantly you want it to be accessible.  This is especially crucial when it comes to designing the interface.  Educators and students don’t want to have to learn how to use your website.  They want to be able to come to your website and immediately know how to use it because it is well-designed using commonly accepted webpage design practices.

Know what your target audience wants.  While creating this corpus we have had lots of wonderful ideas about things we could add to it to make it even more awesome!  We have been surprised multiple times when our target audience, educators, haven’t responded positively to many of our awesome ideas.  That is because they don’t want a lot of bells and whistles, especially when these make navigating the corpus more difficult.  They want to be able to find the video they need quickly and easily, and they want to get a transcript and/or a customized cloze test to go with it.  This doesn’t mean that we’ve been discarding our awesome ideas.  We’ve just put them on the back burner until we either figure out how to seamlessly integrate them, or a demand for them surfaces among our users, or we come to our senses and realize that they’re really not all that awesome after all.

Don’t re-invent the wheel.  There are already tools out there for tagging corpora for part of speech (e.g. TreeTagger), for parsing corpora (e.g. FreeLing), for searching corpora (e.g. Corpus Workbench), etc.  This is especially true when dealing with more commonly taught languages such as English and Spanish.  It can be frustrating trying to find the right tool (we’ve found that word-of-mouth is a good way), and sometimes even more frustrating to learn how to use the tools.  In general, academic software is not designed to be user friendly, and often requires at least minimal competence in scripting in a terminal in order to be used.  Nevertheless, it is definitely much easier to spend the time learning how to use these tools than to try to make your own from scratch.

You’ll probably need to do some computer programming, or have it done for you.  Regardless of the availability of pre-made tools, you’ll probably want to do something with your corpus that no one else has done before (at least as far as you can find online for free).  For example, we have one member of our team generating computer programming scripts that identify video segments that are unusually rich in certain pedagogical items, such as the use of the past tense or the side-by-side use of two commonly confused prepositions.  We have another team member generating scripts that combine the data from several different sources (TreeTagger, Closed Caption Files, language identification software, etc.) into files that are usable by SOLR, the search software we’re integrating into our website.  True, these tasks could be performed by hand, but doing so would mean taking much more time in preparing the corpus.

Don’t be afraid to farm out.  For some reason we tend to get into the mindset that we need to do everything ourselves if we want all the credit and glory.  Perhaps that is true, but it also means that you’re going to either kill yourself or not accomplish as much as you could have with a little outside help.  There are people out there who will do a lot of the grunt work for a minimal cost.  For example, we farmed out the job of transcribing our full-length video interviews (30 – 60 minutes each) to Automatic Sync, a company that specializes in video transcriptions and captioning.  It costs a little money, but we get our rough transcripts back in less than a week!  This means that we can dedicate the hundreds of hours we would have spent doing the transcriptions ourselves to doing other jobs that we can’t farm out.

I’m sure there are plenty of things I am forgetting; we’ll try to add those later as we remember them.  If you have any questions or comments, please post them below!

Brainstorming on the search & browse interface

We are thinking of offering teachers a practical and user-friendly way of accessing the video clips in the SPinTX corpus. We are assuming that teachers might sometimes be overwhelmed by what can be asked of a corpus query interface (they did not design the compilation process, and it may be just a small corpus, unlike Google, which queries the entire web).

Thus we want to offer teachers two clip retrieval modes: the search mode and the browsing mode. The search mode is the usual Google-like key term based search. I would type “banco Medellín” to retrieve documents related to banks (financial institutions) in Medellín (Colombia). However, I would type “banco madera Medellín” if I were looking for documents about carpenters or stores selling wooden benches (to sit on) in Medellín.

The browsing functionality is intended to facilitate the visual exploration of pedagogically relevant information extracted from the corpus. One initial thought is the use of information clouds, as reflected in the figure below. Imagine a blank square with two drop-down menus. On one of them you could select a topic, to determine the lexical goal, the vocabulary. On the other one you could select the linguistic topic, which could range from grammatical categories to functional ones and a range of other classification criteria that could be relevant for language instruction/learning.

Figure 1 shows what this particular strategy would look like if we select Todos (all topics) in the thematic dropdown list and Gram: Prep. régimen (grammar topic, verb and preposition combinations) in the linguistic one. The size of each verb+prep combination is related to the number of occurrences it currently has in the corpus, though it could instead be related to the number of documents in the corpus that contain it.
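The two sizing options can give quite different clouds. Here is a minimal sketch in Python that computes both weights, assuming each clip has already been reduced to a list of the verb+prep combinations found in it. The clip names, the combinations, and the function name are invented for the example.

```python
from collections import Counter

def cloud_weights(clips):
    """Return (occurrence_counts, document_counts) for verb+prep combinations."""
    occurrences = Counter()
    documents = Counter()
    for combos in clips.values():
        occurrences.update(combos)       # every occurrence counts
        documents.update(set(combos))    # each clip counts at most once per combination
    return occurrences, documents

# Hypothetical clips, each reduced to the combinations found in it.
clips = {
    "clip_a": ["pensar en", "pensar en", "pensar en", "soñar con"],
    "clip_b": ["soñar con"],
    "clip_c": ["soñar con", "depender de"],
}

occurrences, documents = cloud_weights(clips)
for combo in occurrences:
    print(combo, occurrences[combo], documents[combo])
```

In this toy data pensar en and soñar con occur three times each, but pensar en comes from a single clip while soñar con appears in three. Sizing by documents would tell a teacher which combination offers more clips to choose from.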

Figure 1. Wireframe of a user interface for browsing the corpus information on the basis of thematic criteria and linguistic criteria.

State of the Corpus

One of the questions that is most frequently asked is: How big is your corpus?  The answer is: Beats me, it’s constantly changing and there are several different versions of the corpus available at any one time.  But people usually aren’t satisfied with that answer, so here are the details of where the SPinTX corpus currently stands to the best of my knowledge (as researched this morning):

Total n interviews: 123

Total n transcripts: 74

Total n words: 315,673

Total n transcripts approved and tagged: 32

Total n words for approved and tagged transcripts: 134,737

Total n clips available to public taken from approved videos: 328

Total n words for clips: 102,573 (Note: many of the clips overlap, and the overlap is not filtered out of this count.)
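For the curious, filtering out the overlap would not be hard. Here is a minimal sketch in Python, assuming each clip from one interview is stored as a (start, end) pair of word positions in the transcript, with the end position excluded. The numbers are invented and are not taken from the corpus.

```python
def merged_word_count(spans):
    """Count the words covered by (start, end) spans, counting overlapping words once."""
    total = 0
    current_start = current_end = None
    for start, end in sorted(spans):
        if current_end is None or start > current_end:
            # The previous run of overlapping spans is finished: add its length.
            if current_end is not None:
                total += current_end - current_start
            current_start, current_end = start, end
        else:
            # This span overlaps or touches the current run: extend it.
            current_end = max(current_end, end)
    if current_end is not None:
        total += current_end - current_start
    return total

# Hypothetical clips from one interview; the first two overlap by 50 words.
spans = [(0, 300), (250, 500), (800, 900)]

raw_count = sum(end - start for start, end in spans)
print(raw_count, merged_word_count(spans))
```

With these made-up spans the raw count is 650 words and the filtered count is 600.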

Please let me know if there are any other stats that would be of use/interest and I will append them to this post.

-Cheers, Arthur