About Martí Q.

I am a graduate student at Universitat Pompeu Fabra in Barcelona and at Eberhard Karls Universität in Tübingen. My PhD dissertation studies the methodological and technical aspects that can help produce ICALL activities that are both pedagogically meaningful and computationally feasible. I have worked on the design and implementation of several software solutions involving Natural Language Processing techniques, among them web interfaces to parallel and monolingual corpora, morphosyntactic taggers and parsers, and end-user spelling, grammar and style checking tools.

Using VISL Constraint Grammar to pedagogically annotate oral text

We are using VISLCG3 to annotate the transcripts of our oral video clips with pedagogically relevant information. VISLCG3 is open source software, released under a GNU General Public License, that implements a finite-state machine for the linguistic analysis of text in so-called local contexts using hand-crafted rules.

VISLCG3 (CG3 for short) is mainly used for three types of operations: replacing information assigned to a word, removing information assigned to a word (by a previous module, e.g., a dictionary look-up module), or adding information to a word. In our project we mainly use CG3 to add information, that is, to add pedagogically relevant annotations to texts (oral transcripts) that have previously been tagged with TreeTagger with basic linguistic information such as lemma and part of speech (this deserves a separate post). We also use it to post-edit the tagging, since our tagger systematically makes certain decisions with which we do not agree.

The two operators that we use most frequently are ADD and ADDRELATION. They perform similar actions: both add one piece of information to the reading of a particular word. The only difference is that the former applies to phenomena that extend over one word (a cohort, in CG3’s terminology), while the latter applies to phenomena that extend over two or more cohorts, optionally with words in between. These annotated phenomena are correlated with an internal Pedagogical Typology (still in progress), which we built by extracting linguistic and communicative topics often found in Spanish textbooks.


The following rules exemplify how we use ADD and ADDRELATION to pedagogically annotate our corpus. The first two are fairly straightforward; the third is more complex and gives the reader an idea of the power of formalisms such as CG3.

  • ADD (@Prag:MarcadoresDisc:Probabilidad) Adverbio IF (0 Quizas);

The above rule states that any adverb reading of the word(s) included in the set Quizas (which includes both quizás and quizá and is defined elsewhere in the grammar file) will be assigned the information @Prag:MarcadoresDisc:Probabilidad.
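For readers who do not know CG3, the effect of this rule can be approximated in a few lines of Python. This is only a rough sketch, not how CG3 works internally: the cohort layout, the ADV and SUBJ tag names, and the choice to match the Quizas set against lemmas are all assumptions made for illustration.

```python
# Toy model of CG3 cohorts: each word form carries one or more readings.
# The data layout and the tag names (ADV, V, SUBJ) are illustrative assumptions.
QUIZAS = {"quizás", "quizá"}  # stands in for the set Quizas defined in the grammar
TAG = "@Prag:MarcadoresDisc:Probabilidad"


def add_probability_marker(cohorts):
    """Approximate: ADD (@Prag:MarcadoresDisc:Probabilidad) Adverbio IF (0 Quizas);"""
    for cohort in cohorts:
        # Context test (0 Quizas): some reading of this very cohort is in the set.
        if not any(r["lemma"] in QUIZAS for r in cohort["readings"]):
            continue
        # Target Adverbio: only adverb readings receive the new tag.
        for reading in cohort["readings"]:
            if "ADV" in reading["tags"] and TAG not in reading["tags"]:
                reading["tags"].append(TAG)
    return cohorts


sentence = [
    {"form": "Quizás", "readings": [{"lemma": "quizás", "tags": ["ADV"]}]},
    {"form": "venga", "readings": [{"lemma": "venir", "tags": ["V", "SUBJ"]}]},
]
annotated = add_probability_marker(sentence)
print(annotated[0]["readings"][0]["tags"])
```

Only the adverb reading of quizás receives the pedagogical tag; the verb cohort is left untouched.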

  • ADDRELATION (Gram:SerEstar:EstarAux) VerboEstar IF (0 FiniteVerbForms) TO (1 Gerundio);

The above rule states that any occurrence of the verb estar that is conjugated (that is, a finite verb form) and is followed by a gerund will be tagged as an instance of estar used as an auxiliary, with the tag R:Gram:SerEstar:EstarAux. The rule establishes an index-based relation between the form of estar and the corresponding gerund.
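The logic of this relation rule can be sketched in Python as follows. This is a simplified illustration: each cohort is reduced to a single reading, the tag names FINITE and GER are assumptions, and relations are returned as (name, source index, target index) tuples rather than in CG3's own output format.

```python
RELATION = "Gram:SerEstar:EstarAux"


def add_estar_aux_relations(cohorts):
    """Approximate: ADDRELATION (Gram:SerEstar:EstarAux) VerboEstar
    IF (0 FiniteVerbForms) TO (1 Gerundio);"""
    relations = []
    for i in range(len(cohorts) - 1):
        current, following = cohorts[i], cohorts[i + 1]
        # Target VerboEstar plus context (0 FiniteVerbForms).
        is_finite_estar = current["lemma"] == "estar" and "FINITE" in current["tags"]
        # TO (1 Gerundio): the very next cohort must be a gerund.
        if is_finite_estar and "GER" in following["tags"]:
            relations.append((RELATION, i, i + 1))
    return relations


# "Estoy leyendo un libro": estar followed by a gerund.
progressive = [
    {"form": "Estoy", "lemma": "estar", "tags": ["V", "FINITE"]},
    {"form": "leyendo", "lemma": "leer", "tags": ["V", "GER"]},
    {"form": "un", "lemma": "un", "tags": ["DET"]},
    {"form": "libro", "lemma": "libro", "tags": ["N"]},
]
# "Está en casa": estar with no gerund after it.
locative = [
    {"form": "Está", "lemma": "estar", "tags": ["V", "FINITE"]},
    {"form": "en", "lemma": "en", "tags": ["PREP"]},
    {"form": "casa", "lemma": "casa", "tags": ["N"]},
]
print(add_estar_aux_relations(progressive))
print(add_estar_aux_relations(locative))
```

The first sentence yields one relation from position 0 to position 1; the second yields none.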

  • ADDRELATION (Func:Deseos:Ajenos) Verbo IF (0 VerbosExpresarDeseos) (1 Que) TO (*2 Verbo + Subjuntivo BARRIER GrupoNominal OR LimiteOracionSimple OR FiniteVerbForms);

The above rule states that any instance of a verb included in the list VerbosExpresarDeseos (which is defined elsewhere and includes verbs such as gustar, desear, querer…) that is followed by que and then by a verb in the subjunctive mood should be annotated as an instance of a way of expressing desire. The rule uses operators such as the Kleene star (*) and the special word BARRIER to allow flexibility in the actual location of the verb (not necessarily right after the que), while ensuring that the scan does not cross certain linguistic items, such as conjunctions or sentence delimiters, so that the rule stays within a safe scope. The rule maps the tag R:Func:Deseos:Ajenos onto the actual verb, and an index records the direction of the relation with the subjunctive verb (note the TO).
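The scanning behaviour of the third rule (look rightwards for a target, but give up when a barrier is met) can be approximated as below. This is a hedged sketch: the tag names NP, CLAUSE_BOUNDARY and FINITE are hypothetical stand-ins for the sets GrupoNominal, LimiteOracionSimple and FiniteVerbForms, the verb list contains only the three verbs mentioned above, and checking the target before the barrier is an assumption about how ties are resolved.

```python
RELATION = "Func:Deseos:Ajenos"
WISH_VERBS = {"gustar", "desear", "querer"}  # partial stand-in for VerbosExpresarDeseos
BARRIER_TAGS = {"NP", "CLAUSE_BOUNDARY", "FINITE"}  # hypothetical barrier tags


def find_wish_relations(cohorts):
    """Approximate: ADDRELATION (Func:Deseos:Ajenos) Verbo
    IF (0 VerbosExpresarDeseos) (1 Que)
    TO (*2 Verbo + Subjuntivo BARRIER ...);"""
    relations = []
    for i, cohort in enumerate(cohorts):
        if "V" not in cohort["tags"] or cohort["lemma"] not in WISH_VERBS:
            continue
        # Context (1 Que): the next cohort must be que.
        if i + 1 >= len(cohorts) or cohorts[i + 1]["lemma"] != "que":
            continue
        # Context (*2 ...): scan rightwards starting two positions away.
        for j in range(i + 2, len(cohorts)):
            tags = set(cohorts[j]["tags"])
            if {"V", "SUBJ"} <= tags:  # target found: record the relation
                relations.append((RELATION, i, j))
                break
            if tags & BARRIER_TAGS:  # barrier met first: abandon the scan
                break
    return relations


# "Quiero que mañana vengas": an adverb sits between que and the subjunctive.
with_adverb = [
    {"form": "Quiero", "lemma": "querer", "tags": ["V", "FINITE"]},
    {"form": "que", "lemma": "que", "tags": ["CONJ"]},
    {"form": "mañana", "lemma": "mañana", "tags": ["ADV"]},
    {"form": "vengas", "lemma": "venir", "tags": ["V", "FINITE", "SUBJ"]},
]
# "Quiero que el niño venga": a noun group blocks the scan in this toy tagging.
with_noun_group = [
    {"form": "Quiero", "lemma": "querer", "tags": ["V", "FINITE"]},
    {"form": "que", "lemma": "que", "tags": ["CONJ"]},
    {"form": "el", "lemma": "el", "tags": ["DET", "NP"]},
    {"form": "niño", "lemma": "niño", "tags": ["N", "NP"]},
    {"form": "venga", "lemma": "venir", "tags": ["V", "FINITE", "SUBJ"]},
]
print(find_wish_relations(with_adverb))
print(find_wish_relations(with_noun_group))
```

In the first sentence the scan skips the adverb and links the wish verb to the subjunctive; in the second, the noun group acts as a barrier and no relation is produced.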

Brainstorming on the search & browse interface

We are thinking of offering teachers a practical and user-friendly way of accessing the video clips in the SPinTX corpus. We assume that teachers might sometimes be overwhelmed by what can be asked of a corpus query interface: they did not design the compilation process, and the corpus may be quite small (compare it with Google, which queries the entire web).

Thus we want to offer teachers two clip retrieval modes: the search mode and the browsing mode. The search mode is the usual Google-like search based on key terms. I would type “banco Medellín” to retrieve documents related to banks (financial institutions) in Medellín (Colombia). However, I would type “banco madera Medellín” if I were looking for documents about carpenters or stores selling wooden benches (banco also means a bench to sit on) in Medellín.
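A minimal sketch of the search mode, assuming a plain "all terms must appear" match over transcripts. The clip identifiers and transcripts are invented for illustration, and a real search engine would rank results rather than apply a strict match.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def search(query, clips):
    """Return the ids of the clips whose transcript contains every query term."""
    terms = tokenize(query)
    return [clip_id for clip_id, transcript in clips.items()
            if terms <= tokenize(transcript)]


# Hypothetical transcripts, made up for this example.
clips = {
    "clip1": "Trabajo en un banco en Medellín",
    "clip2": "Mi abuelo hizo un banco de madera en Medellín",
    "clip3": "Vivimos en Austin",
}
print(search("banco Medellín", clips))
print(search("banco madera Medellín", clips))
```

Adding madera to the query narrows the results from both banco clips to the one about the wooden bench.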

The browsing functionality is intended to facilitate the visual exploration of pedagogically relevant information extracted from the corpus. One initial thought is the use of information clouds, as reflected in the figure below. Imagine a blank square with two drop-down menus. In one of them you could select a topic, to determine the lexical goal, that is, the vocabulary. In the other you could select the linguistic topic, which could range from grammatical categories to functional ones and a range of other classification criteria that could be relevant for language instruction and learning.

Figure 1 shows what this strategy would look like if we select Todos (all topics) in the thematic drop-down list and Gram: Prep. régimen (a grammar topic: verb and preposition combinations) in the linguistic one. The size of each verb+prep combination currently reflects its number of occurrences in the corpus, though it could instead reflect the number of documents in which it appears.


Figure 1. Wireframe of a user interface for browsing the corpus information on the basis of thematic criteria and linguistic criteria.
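The two sizing options for the cloud (total occurrences versus number of documents) can be compared with a small sketch. The clips and the verb+prep combinations below are invented; extracting such combinations from transcripts is assumed to have happened already.

```python
from collections import Counter


def cloud_weights(clips):
    """Return (occurrence counts, document counts) for each verb+prep combination."""
    occurrences = Counter()
    doc_freq = Counter()
    for combos in clips.values():
        occurrences.update(combos)    # every occurrence counts
        doc_freq.update(set(combos))  # each clip counts at most once
    return occurrences, doc_freq


# Hypothetical verb+prep combinations found in two clips.
clips = {
    "clip1": ["pensar en", "pensar en", "soñar con"],
    "clip2": ["pensar en", "depender de"],
}
occurrences, doc_freq = cloud_weights(clips)
print(occurrences["pensar en"], doc_freq["pensar en"])
```

Here pensar en occurs three times but in only two clips, so the two options would draw it at different sizes.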

What criteria would you use to search for videos?

[N.B. Background information on the corpus mentioned below: the previous post and the SPinTX site, both in English]

A question for those of you who teach Spanish as a Foreign Language (ELE): when you search the Internet for a video to work on a specific grammatical or lexical objective, what kind of search criteria do you think would be useful to you? We are trying to add metadata to the Spanish in Texas corpus (SPinTX) and have started a short list (included below). Do you have five minutes to give us your opinion? Please leave us a comment!

List of pedagogical descriptors for SPinTX

  1. Morphological level
    • Verb tenses: present, past, future, conditional, etc.
    • Verb mood: indicative, subjunctive, imperative, infinitive, gerund, etc.
  2. Morphosyntactic level
    • Gender in nouns and agreement with determiners.
    • Use of prepositions.
      • Por and para: distinction between uses expressing cause, purpose, destination, and recipient.
  3. Discourse level
    • Discourse markers.
  4. Lexical level
    • Identifying the semantic fields of a text through a list of keywords.
  5. Functional level
    • Expressing likes and preferences.

If you have made it this far, one more question: can you imagine a fact sheet attached to each video in a list of results, containing this kind of information, so that you could filter for the videos that are more or less suitable for your class?

Designing a pedagogical interface for a repository of video interviews

One of the goals of the Corpus to Classroom project is to design a pedagogical interface for the repository of video clips that are being generated from the more than 100 interviews collected as part of the Spanish in Texas project. Our interviews with teachers and materials developers confirmed that teachers are potentially interested in applying the following types of filtering criteria to their searches:

  • Grammar topics: e.g., search for those clips that contain a significant number of occurrences of por and para
  • Functional topics: e.g., search for those clips that contain exponents of the function apologizing
  • Vocabulary: e.g., clips that contain words (in a pre-defined list maybe) that relate to the topic la familia (papá, mamá, padre(s), madre, hermano/a, abuelo/a…)
  • Thematic: e.g., clips talking about food, traditions, reasons for moving to the US (in our case)…

This is not a complete list, but it is a starting point that contains the most common types of criteria (emotion and phonetics were also mentioned).

With this in mind, we are considering a standard search engine (such as Apache Solr/Lucene) that lets teachers search for clips and use facets (filtering options) to drill down or define finer-grained queries. However, we are also considering typical corpus query tools (such as CWB, SketchEngine or NoSketchEngine). Together, these would cover both the Information Retrieval part of our task (more appropriate for document retrieval on the basis of word- or term-based queries) and the Information Extraction part (more appropriate for queries driven by linguistic patterns).
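To make the idea of facets concrete, here is a small sketch in plain Python. It does not use the Solr API; the clip records, the field names (grammar, function, theme) and their values are hypothetical and only mirror the criteria listed above.

```python
from collections import Counter

# Hypothetical clip metadata, invented for this example.
CLIPS = [
    {"id": "clip1", "grammar": ["por/para"], "function": ["apologizing"], "theme": ["food"]},
    {"id": "clip2", "grammar": ["por/para", "ser/estar"], "function": [], "theme": ["traditions"]},
    {"id": "clip3", "grammar": ["ser/estar"], "function": ["apologizing"], "theme": ["food"]},
]


def filter_clips(clips, **selected):
    """Keep the clips that carry every selected facet value, e.g. grammar='por/para'."""
    return [clip["id"] for clip in clips
            if all(value in clip.get(facet, []) for facet, value in selected.items())]


def facet_counts(clips, facet):
    """Count how many clips carry each value of a facet (the numbers shown next to filters)."""
    counts = Counter()
    for clip in clips:
        counts.update(clip.get(facet, []))
    return counts


print(filter_clips(CLIPS, grammar="por/para"))
print(filter_clips(CLIPS, grammar="ser/estar", theme="food"))
print(facet_counts(CLIPS, "theme"))
```

Each additional facet narrows the result list, which is the drill-down behaviour a faceted search engine would offer teachers.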

We will further describe our advances in future posts.