2009-06-12

Presenting: PyAnnotation

Yesterday I wrote about Corpus linguistics and user interfaces; today I present my own contribution to all of you out there: PyAnnotation, a linguistic annotation library for Python. It currently supports only Elan's .eaf files, and I am quite sure that not every .eaf file is supported right now. The library is a work in progress, so those of you who know Python, please try it out and tell me what you like and don't like about it.
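For readers who have never looked inside an .eaf file: it is plain XML. The following is only a minimal sketch using Python's standard library, not PyAnnotation's API. The element names (TIER, TIER_ID, ANNOTATION_VALUE) follow the EAF format as I understand it, the sample document is hand-written for illustration, and `read_tiers` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the shape of an ELAN .eaf file (illustrative, not real data).
SAMPLE_EAF = """
<ANNOTATION_DOCUMENT>
  <TIME_ORDER>
    <TIME_SLOT TIME_SLOT_ID="ts1" TIME_VALUE="0"/>
    <TIME_SLOT TIME_SLOT_ID="ts2" TIME_VALUE="1500"/>
  </TIME_ORDER>
  <TIER TIER_ID="utterance" LINGUISTIC_TYPE_REF="default">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a1"
                            TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
        <ANNOTATION_VALUE>hello world</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
</ANNOTATION_DOCUMENT>
"""


def read_tiers(eaf_text):
    """Return a dict mapping each tier ID to the list of its annotation texts."""
    root = ET.fromstring(eaf_text)
    tiers = {}
    for tier in root.iter("TIER"):
        # Collect the text of every annotation value below this tier.
        values = [value.text or "" for value in tier.iter("ANNOTATION_VALUE")]
        tiers[tier.get("TIER_ID")] = values
    return tiers


if __name__ == "__main__":
    for tier_id, values in read_tiers(SAMPLE_EAF).items():
        print(tier_id, values)
```

Real .eaf files contain more than this (time slots, referring annotations, linguistic types), which is exactly the part a library should hide from you.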

2009-06-11

Corpus linguistics and user interfaces


What user interface does a corpus linguist need? Recent publications seem to suggest that the user interface need not be "graphical"; a new approach is to use a programming language, more specifically an interpreted language like Python or R. The first time I saw someone make this idea explicit was when I read the manual of ChucK, an "audio programming language" designed especially for live coding:



"In the history of computing, many interfaces have been designed to instruct computers, but none have been as fundamental (or perhaps as enduring) as programming languages. Unlike most other classes of human-computer interfaces, programming languages don’t directly perform any specific “end-use” task (such as word processing or video editing), but instead allow us to build software that might perform almost any custom function. The programming language acts as a mediator between human intention and the corresponding bits and instructions that make sense to a computer. It is the most general and yet the most intimate and precise tool for instructing computers." (The ChucK Audio Programming Language, p. 2)



In linguistics, there are two protagonists who follow that idea: one is Steven Bird, the other is Stefan Th. Gries. Steven Bird is the "head" of the "Natural Language Toolkit", a Python library that contains, among other things, algorithms for statistical analysis and automated tagging of language data. A very good entry point for learning about the possibilities of the toolkit is his Natural Language Processing with Python (together with Ewan Klein and Edward Loper). I find the book quite understandable and suitable for beginners, although you might have to consult additional sources on programming in Python while reading through it.

In contrast to Bird, Gries proposes R as the programming language for the corpus linguist, as R was designed for statistical tasks from the beginning. His book, Quantitative Corpus Linguistics with R: A Practical Introduction, like Bird's NLTK book, tries to introduce the reader to the programming language as well as to the field of corpus linguistics. Several linguistic problems are presented, and a corpus-linguistic approach to each of them is described. In my view, this last part is presented a bit better than in the Python book. But Gries' book clearly suffers from the difficulties a beginner will have in understanding the programming part. I have been programming in scripting languages for more than 10 years now, so maybe R's data structures were harder for me to understand precisely because of my knowledge of Perl, Ruby and Python, whose data structures differ a lot from R's. But I even had problems understanding the regular expressions Gries uses throughout his book, although I thought I knew quite a lot about them from Perl. So, in summary, I would recommend his book to people who already know about programming, or who are willing to consult additional books and tutorials about programming in R and regular expressions. Or to people who have read Bird's, Klein's, and Loper's book first. ;-)
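To give a feeling for what "a programming language as user interface" means in practice, here is a minimal sketch of two typical corpus queries, a frequency list and a keyword-in-context concordance. It is not taken from either book and uses only Python's standard library rather than NLTK; the function names and the very naive tokenizer are my own assumptions for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (a very naive tokenizer)."""
    return re.findall(r"\w+", text.lower())


def frequency_list(tokens):
    """Count how often each token occurs."""
    return Counter(tokens)


def concordance(tokens, keyword, width=2):
    """Return (left context, keyword, right context) for every hit of keyword."""
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines


if __name__ == "__main__":
    tokens = tokenize("The cat sat. The dog sat too.")
    print(frequency_list(tokens).most_common(2))
    for left, hit, right in concordance(tokens, "sat", width=1):
        print(left, "[" + hit + "]", right)
```

Even this toy version shows both sides of the argument: the query is completely flexible, but you already have to understand regular expressions, slicing and data structures to write it.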



Finally, although I personally favor the idea of a programming language as the most flexible way to access your corpus data, I doubt that this approach is suitable as a general technique. The learning curve seems steep, and linguists should normally be concerned with linguistics, not with programming and the complexity of regular expressions. I think that interested students should be trained in those methods, and the books presented here could be used as a basis for such courses. But I also see it as necessary to create user interfaces "for human beings", so that every linguist is able to test her linguistic hypotheses on language data, which is in my view one of the most important approaches in general linguistics today.

Cars as the solution to all problems

The debates about insolvencies in the car industry and other sectors of the economy will probably go on for a while yet. With this post I want to draw attention to an interesting contribution to that debate, one that questions the basic orientation of industrial subsidies in Germany: "Der Motor stottert: Abbruch oder Umbau" by Stephan Krull. The subsidizing of industries without a future has been criticized in Germany for as long as I have been alive, but instead of simply railing against all subsidies, the article shows nicely and constructively what alternatives there could be. Above all, a better distribution of work absolutely belongs in the current political debate, in my opinion. So: read it!