State of the Corpus

One of the questions that is most frequently asked is: How big is your corpus?  The answer is: Beats me, its constantly changing and there are several different versions of the corpus available at any one time.  But people usually aren’t satisfied with that answer, so here are the details of where the SPinTX corpus currently stands to the best of my knowledge (as researched this morning):

Total n interviews: 123

Total n transcripts: 74

Total n words: 315,673

Total n transcripts approved and tagged: 32

Total n words for approved and tagged transcripts: 134,737

Total n clips available to public taken from approved videos: 328

Total n words for clips: 102,573 (Note: many of the clips overlap, this is not filtered out in this count.)

Please let me know if there are any other stats that would be of use/interest and I will append them to this post.

-Cheers, Arthur

One thought on “State of the Corpus

  1. Pingback: State of the Corpus | From Corpus to Classroom | EL ESPAÑOL DE AMERICA | Scoop.it

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>