Using the Content

I’ll be honest, I hate making quizzes and tests. Usually the content is either very artificial or mostly unrelated to the material we’re covering in class. So this time I decided to make a quiz using content from SpinTX, and I was very pleased to find content that related to both the grammar and the vocabulary that we are covering in class! The quiz had three parts. First, I had the students watch one of the videos (611) and answer questions about tolerance and the subjunctive. Second, I had them read a modified version of the text from another video (635) and fill in the blanks with the correct vocabulary and verb conjugations. Third, I had them read a shortened version of the text from yet another video (1575), which contains a lot of the vocabulary we are going over, and react to the text using the subjunctive with emotion/reaction, doubt/negation, and will/wish/desire. The quiz went very well, and I think that it was definitely more interesting for both me and the students than a normal textbook-based quiz.

The tricky part was, of course, finding the videos that had high concentrations of the vocabulary and grammar that we are going over. I’ll admit that I cheated a little since I have access to the text files for all of the video clips and have some scripting knowledge. I wrote a script in Python that counted the number of times each vocabulary item and grammar point that we are covering is used in each video, and then I looked through the top ten hits to pick the three videos that I wound up using. I also modified the content of two of the videos somewhat in order to increase the concentration of relevant content. But I think that it was definitely worth the little extra effort to have a quiz of such high caliber.
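For readers who want to try something similar, here is a minimal sketch of what a counting script of this kind might look like. It is not the actual script described above. The function names, the sample transcripts, and the target list are all hypothetical, and it only counts surface strings, so a grammar point like the subjunctive would need tagged text or a list of trigger phrases.

```python
import re
from collections import Counter


def count_targets(text, targets):
    """Count how often each target word or phrase appears in text (case-insensitive)."""
    counts = Counter()
    lowered = text.lower()
    for target in targets:
        # \b keeps "ha" from matching inside "había", for example
        pattern = r"\b" + re.escape(target.lower()) + r"\b"
        counts[target] = len(re.findall(pattern, lowered))
    return counts


def rank_videos(transcripts, targets, top_n=10):
    """Return (video_id, total_hits) pairs for the top_n videos, highest first."""
    totals = {
        video_id: sum(count_targets(text, targets).values())
        for video_id, text in transcripts.items()
    }
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)[:top_n]


# Hypothetical data: video ids mapped to plain-text transcripts
sample_transcripts = {
    "a": "La tolerancia es importante. Ojalá que haya tolerancia.",
    "b": "Hola.",
    "c": "Ojalá que vengas.",
}
sample_targets = ["tolerancia", "ojalá que"]
ranking = rank_videos(sample_transcripts, sample_targets)
```

The ranking is only a shortlist. Someone still has to look through the top hits to decide which videos actually suit the class.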

Immigration

Our assigned topic for one day of class was immigration. On this day the provided activities mostly consisted of the students discussing black-and-white pictures of immigrants entering the US and various Spanish-speaking countries. Instead I decided to use SpinTX so that the students could talk about immigration that would hopefully be easier for them to relate to. I searched for inmi* in order to find all videos in which the Spanish word inmigración was explicitly mentioned. I then narrowed down the search to only videos that contained the present subjunctive since that was the grammar topic we were covering in conjunction with immigration. I then selected the six videos that I thought would work best and assigned one to each of the six groups in my class. I prepared a separate Google Form for each group with the link to their specific video.

On the day of class I gave the students the links to the forms for their groups and then gave them half an hour to prepare 5 questions for their classmates to answer while watching the videos. During their preparation time I walked around and heard several very good conversations going on about the videos and immigration. After all the groups were ready we watched the videos as a class with the questions displayed next to them. The students seemed very interested in the videos and in trying to answer the questions. Having taught this class before using the activities provided in the textbook, I can say that this time with SpinTX the class went much better, in the sense that the students were more interested and more active, and there was much better discussion.

“Gustar-Type Verbs” with the Subjunctive

For this activity I had students break into groups of 4 and then search for the subjunctive following the triggers (“gustar-type verbs”) that we were looking at in class. So, for example, they searched for examples of the subjunctive after molestar by doing a keyword search for molest*. I had them search for five examples using five different trigger words, and they had to identify the indirect object, verb, and subject of each example.

There was a lot of laughter involved with this activity, which I always take as a good sign. Mostly they were laughing at vocabulary that they had never seen before but whose meaning they could still guess, such as ‘bailarina’. I think that they were excited to see that they could actually understand real Spanish. They also came across some great examples that they thought they had understood but had not fully understood, so this was definitely beneficial. One common example of this was when the subject of a gustar-type verb was a verb phrase that contained multiple singular nouns or a plural noun (This example is not from the corpus: Me molesta recibir cartas). When confronted with these examples they began to ask why molestar was conjugated in the singular instead of the plural, which led to what I believe was a helpful discussion.

SpinTX to the Rescue

During my intermediate Spanish class last Friday we got through the material somewhat faster than I was anticipating.  We were going back over the subjunctive and its different uses (our textbook discusses 6).  Anyway, we were finished with the material with about 15 minutes left in our 2-hour class, so I had to think fast.  I told my students to go to SpinTX and look up two real-life examples of the subjunctive that illustrate two of the six uses that we had discussed, and that they could leave as soon as their group showed me their examples and told me which functions they illustrated.  Needless to say, my students jumped right in!  While I watched them work I was very pleased to see them having good discussions about how the subjunctive was being used on the site.  They had absolutely no difficulty in using the site, and each of their identified uses was spot on.  The first group finished after about eight minutes and the last group was almost finished when class officially ended.  I felt that this was a very successful activity for exposing them to the subjunctive, and they were happy to have the chance to leave early, so it was a nice win-win.  I will very probably do this again if I’m ever stuck with 15 or so minutes left at the end of a class.

SpinTX in use in an intermediate Spanish class

I have used the SpinTX pedagogical video archive in my intermediate Spanish class three times so far this semester and I thought that I’d share a little about what I did and how it worked out.

The first time we were talking about stereotypes in class.  I selected a video ahead of time that I felt was applicable (clip from the interview with Nancy T.)  and then showed it to my students twice in class.  The first time they watched it without captions and the second time they watched it with captions.  I had them listen for all of the stereotypes mentioned in the video.  They seemed very interested in watching the video and we had a good discussion afterwards.

The next time we used SpinTX the class was divided into groups and each group had to look for a video that illustrated real use of adjectives that change meaning depending on whether they are preceded by ser or estar, which we had just covered in class.  Then they had to explain why ser or estar was used with each example that they found.  Every group was able to find something to share within 5 minutes; sharing took another 5 minutes.  They really seemed to like the fact that they were looking at real-life examples.

The third time I had my students search for and explain examples of the pluperfect and the present perfect on SpinTX.  These aren’t tagged yet so I had them search for había/habías/habían/habíamos or he/has/ha/hemos/han respectively and then skim the results for hits.  They had about 10 minutes to find and explain two examples of each compound tense.  These are not the most exciting verb tenses, so up to that point that day class had been pretty lethargic.  But their interest was obviously piqued as they searched SpinTX for the examples, and there was even sporadic laughter as they came across certain examples.  They worked in groups of 4 and wrote down their sentences and explanations on pieces of paper that they turned in after finishing.  I was impressed with how well and how quickly they were able to apply what we had covered in class that day to recognize and correctly explain examples of the pluperfect and present perfect!
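The same keyword approach can be tried offline on a plain-text transcript. The sketch below is only an illustration and has nothing to do with how the SpinTX site searches. The function name and the sample text are hypothetical. It returns candidate sentences only, so the skimming step is still needed (había on its own, for instance, can also mean “there was”).

```python
import re

# The word forms the students searched for
PLUPERFECT_FORMS = ["había", "habías", "habían", "habíamos"]
PRESENT_PERFECT_FORMS = ["he", "has", "ha", "hemos", "han"]


def find_candidates(text, forms):
    """Return the sentences in text that contain any of the given word forms."""
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(form) for form in forms) + r")\b",
        re.IGNORECASE,
    )
    # Naive sentence split: break after ., ! or ? followed by whitespace
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [sentence for sentence in sentences if pattern.search(sentence)]


# Hypothetical transcript text
sample = "Ya había comido cuando llegué. He visto esa película. Me gusta bailar."
pluperfect_hits = find_candidates(sample, PLUPERFECT_FORMS)
present_perfect_hits = find_candidates(sample, PRESENT_PERFECT_FORMS)
```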

One final anecdote.  At one point during one of these activities one of my students noticed that the speaker was using “educado” to mean “educated” rather than “mannered”.  She pointed this out and we had a very good, brief discussion about how there is a “standard” Spanish that we teach in class and many different dialects that vary from this standard in different ways.  The entire class was very interested, I think especially since it was one of their own who had noticed the discrepancy.

As these anecdotes illustrate, the use of SpinTX in my intermediate Spanish class has been very easy and successful so far this semester!  I have already planned several more activities involving SpinTX throughout the remainder of the semester.

Working on the Corpus

The beta release of the classroom corpus will be in just a few short weeks!  We’re all really excited and working hard to get it ready.  Along the way we’ve already learned some important lessons about preparing a tagged video corpus for classroom use.  These lessons are based on our own experiences and the feedback we’ve already received from instructors.  Here are a few of the lessons learned for those pondering doing something similar:

The KISS (Keep It Simple Stupid) principle definitely applies!  You want the corpus to have a lot of information, but even more importantly you want it to be accessible.  This is especially crucial when it comes to designing the interface.  Educators and students don’t want to have to learn how to use your website.  They want to be able to come to your website and immediately know how to use it because it is well-designed using commonly accepted webpage design practices.

Know what your target audience wants.  While creating this corpus we have had lots of wonderful ideas about things we could add to it to make it even more awesome!  We have been surprised multiple times when our target audience, educators, hasn’t responded positively to many of our awesome ideas.  That is because they don’t want a lot of bells and whistles, especially when these make navigating the corpus more difficult.  They want to be able to find the video they need quickly and easily, and they want to be provided with a transcript and/or a customized cloze test.  This doesn’t mean that we’ve been discarding our awesome ideas.  We’ve just put them on the back burner until we either figure out how to seamlessly integrate them, or a demand for them surfaces among our users, or we come to our senses and realize that they’re really not all that awesome after all.

Don’t re-invent the wheel.  There are already tools out there for tagging corpora for part of speech (e.g. TreeTagger), for parsing corpora (e.g. FreeLing), for searching corpora (e.g. Corpus Workbench), etc.  This is especially true when dealing with more commonly taught languages such as English and Spanish.  It can be frustrating trying to find the right tool (we’ve found that word-of-mouth is a good way), and sometimes even more frustrating to learn how to use the tools.  In general, academic software is not designed to be user friendly, and often requires at least minimal competence in scripting in the terminal in order to be used.  Nevertheless, it is definitely much easier to spend the time learning how to use these tools than to try to make your own from scratch.

You’ll probably need to do some computer programming, or have it done for you.  Regardless of the availability of pre-made tools, you’ll probably want to do something with your corpus that no one else has done before (at least as far as you can find online for free).  For example, we have one member of our team writing scripts that identify video segments that are unusually rich in certain pedagogical items, such as the use of the past tense or the side-by-side use of two commonly confused prepositions.  We have another team member writing scripts that combine the data from several different sources (TreeTagger, closed caption files, language identification software, etc.) into files that are usable by Solr, the search software we’re integrating into our website.  True, these tasks could be performed by hand, but doing so would mean taking much more time in preparing the corpus.
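To give a feel for the second kind of script, here is a rough sketch that merges tagger output and caption cues into one JSON-serializable record per video. It is an illustration only, not our actual code. It assumes tagger output with one token per line in the form token, tab, tag, tab, lemma. The field names are hypothetical and are not our real Solr schema, and the tag values in the example are only illustrative.

```python
import json


def parse_tagged_lines(lines):
    """Parse lines of the form token<TAB>tag<TAB>lemma into a list of dicts."""
    tokens = []
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        token, tag, lemma = parts
        tokens.append({"token": token, "pos": tag, "lemma": lemma})
    return tokens


def build_document(video_id, language, tagged_lines, captions):
    """Combine tagged tokens and (start, end, text) caption cues into one record."""
    tokens = parse_tagged_lines(tagged_lines)
    return {
        "id": video_id,                      # hypothetical field names throughout
        "language": language,
        "text": " ".join(t["token"] for t in tokens),
        "lemmas": [t["lemma"] for t in tokens],
        "pos_tags": [t["pos"] for t in tokens],
        "caption_starts": [start for start, _end, _text in captions],
        "caption_texts": [text for _start, _end, text in captions],
    }


# Hypothetical inputs for a single video
example_doc = build_document(
    "v1",
    "es",
    ["Los\tART\tel\n", "niños\tNC\tniño\n", "\n"],
    [("00:00:01,000", "00:00:02,000", "Los niños")],
)
example_json = json.dumps(example_doc, ensure_ascii=False)
```

Keeping every field as a flat string or a list of strings makes the record easy to inspect by eye and easy to load into most search tools.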

Don’t be afraid to farm out.  For some reason we tend to get into the mindset that we need to do everything ourselves if we want all the credit and glory.  Perhaps that is true, but it also means that you’re going to either kill yourself or not accomplish as much as you could have with a little outside help.  There are people out there who will do a lot of the grunt work for a minimal cost.  For example, we farmed out the job of transcribing our full-length video interviews (30 – 60 minutes each) to Automatic Sync, a company that specializes in video transcriptions and captioning.  It costs a little money, but we get our rough transcripts back in less than a week, which means that we can dedicate the hundreds of hours we would have spent doing the transcriptions ourselves to doing other jobs that we can’t farm out.

I’m sure there are plenty of things I am forgetting.  We’ll try to add those later as we remember them.  If you have any questions or comments, please post them below!

From Transcript to Tagged Corpus

In this post I will discuss the steps that we are using to get from our transcripts to our final corpus (as of 01/15/2013).  This is still a messy process, but with this documentation anyone should be able to replicate our output (on a Mac).

Step 1. Download and unzip this folder where you would like to do your work.

Step 2. Install TreeTagger within ProjectFolder/TreeTagger (look inside the folder you just unzipped).

Step 3. Make sure that you have updated, complete versions of PHP and Python installed.

Step 4. Update TranscriptToSrt.py and SrtGatherer.py with your YouTube client id, secret, and developer key.

Step 5. Save your plain-text transcripts in ProjectFolder/transcripts (one for each video).

Step 6. Update MainInput.txt with your information.

Step 7. Log in to your YouTube account.

Step 8. Open Terminal and navigate to ProjectFolder.

Step 9. Run MainBatchMaker.py by typing: python MainBatchMaker.py

Step 10. Run MainProcessor by typing: ./MainProcessor

And you’re done!  You should now have fully tagged files in ProjectFolder/Processing/Tagged and closed caption files in ProjectFolder/Processing/SRT.  And next time you’ll only need to do steps 5 – 10!  😀

 

A few hints in case you run into trouble:

You may need to install some additional Python libraries as indicated by any relevant errors.

If you have an encoding error with some of the Spanish characters, you may need to edit srtitem.py.  See my comment on StackOverflow.

If the scripts are successful at downloading some srt files from YouTube, but not others, it is probably a timing issue with YouTube’s API.  I am currently trying to build in a work-around, but for now, just wait a few minutes, run MainProcessor again, and cross your fingers.
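Until a proper work-around is built in, the “wait and try again” advice can be automated with a small retry helper. The sketch below is a generic pattern and is not part of the scripts described here. The name retry and its parameters are hypothetical, and fetch stands in for whatever function actually downloads a caption file.

```python
import time


def retry(fetch, attempts=3, wait_seconds=120, sleep=time.sleep):
    """Call fetch() until it succeeds or attempts run out, pausing between tries."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except Exception as error:  # broad on purpose: any failure triggers a retry
            last_error = error
            if attempt < attempts:
                sleep(wait_seconds)  # give the remote service time to catch up
    raise last_error
```

Passing sleep as a parameter lets you swap in a no-op while testing, so you don’t have to sit through the real waiting time.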

Finally, these scripts are not very efficient yet.  When running them with around 30 videos and around 100,000 words, it takes about two hours on my MacBook Pro.  Sorry about that.  We will be working on optimizing these scripts as time permits.

Please contact me with any questions or suggestions!

State of the Corpus

One of the questions that is most frequently asked is: How big is your corpus?  The answer is: Beats me, it’s constantly changing and there are several different versions of the corpus available at any one time.  But people usually aren’t satisfied with that answer, so here are the details of where the SpinTX corpus currently stands to the best of my knowledge (as researched this morning):

Total n interviews: 123

Total n transcripts: 74

Total n words: 315,673

Total n transcripts approved and tagged: 32

Total n words for approved and tagged transcripts: 134,737

Total n clips available to public taken from approved videos: 328

Total n words for clips: 102,573 (Note: many of the clips overlap, and this overlap is not filtered out of this count.)

Please let me know if there are any other stats that would be of use/interest and I will append them to this post.

-Cheers, Arthur