Abstracts


Teemu Roos - Probabilistic models for phylogenetics and stemmatology: Theory and practice
It has long been known that textual traditions produced by repeated copying with modification, as well as many other cultural objects, evolve in ways that can be likened to biological evolution. Hence, it is not surprising that many techniques initially developed for building evolutionary trees (phylogenetics) can be applied to the analysis of such cultural objects. I will discuss recent advances in the theory and practice of phylogenetics applied to the study of cultural evolution. In particular, I will describe a new method based on probabilistic models such as Bayesian networks. In experiments with artificially created textual traditions, the new method outperforms the current state of the art in the specific task of reconstructing copying histories, both in terms of a numerical score and in terms of interpretability.

No prior knowledge of phylogenetics or algorithmics is assumed.

Short bio:
Teemu Roos is a Senior Researcher at the Helsinki Institute for Information Technology HIIT and an Adjunct Professor in Computer Science at the University of Helsinki, Finland. His research interests include machine learning, probabilistic modelling, information theory, and their cross-disciplinary applications.


____________________________ 
Michael Cysouw - Back to the roots: using regular sound correspondences for linguistic phylogeny (as one should) 
Traditional historical linguistics stresses the importance of looking for regular sound correspondences for the phylogenetic reconstruction of languages. In recent computational phylogenetic work this old truism is mostly disregarded. This is unfortunate and unnecessary. I will argue that it is possible to approach the regularity of sound correspondences statistically in a straightforward way, even without perfectly detected cognacy.



____________________________  
Sergej Saj - Two-place verb classes: towards measuring (dis)similarity between the languages of Europe  
This study is part of a project devoted to valency classes in the languages of Europe. It is based on a questionnaire consisting of 130 polyvalent predicates. These predicates were chosen with the help of a pilot study, so that most of the predicates chosen are not uniformly transitive across languages but rather often fall into one of the smaller two-place valency classes. I will concentrate on the problem of measuring (dis)similarity between languages based on the data obtained for a small (currently 15 languages) but ever-growing sample of genetically diverse languages of Europe.

In some respects the data obtained for these languages can be compared directly. The simplest kind of typology is based on binary features, such as whether or not a particular meaning, e.g. ‘wait’, is expressed by a transitive verb. The legitimacy of this operation rests upon typological assumptions about the cross-linguistic validity of the notion of transitivity. Generalizing over these binary results, one may arrive at a very simple measure allowing one to calculate an overall transitivity profile for various languages (and for the verbs chosen the transitivity ratio shows a high degree of variability, with SAE languages being much more transitive than peripheral European languages).
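
As a rough illustration of such a transitivity profile (the codings below are invented for exposition, not taken from the questionnaire data), the ratio could be computed as follows:

# Illustrative sketch: transitivity ratio from binary codings of predicates.
# 1 = the meaning is expressed by a transitive verb, 0 = it is not.
# The values are placeholders, not actual questionnaire results.

codings = {
    "LanguageA": {"wait": 0, "eat": 1, "help": 0, "see": 1},
    "LanguageB": {"wait": 1, "eat": 1, "help": 1, "see": 1},
}

def transitivity_ratio(language):
    """Share of predicates coded as transitive in the given language."""
    values = codings[language].values()
    return sum(values) / len(values)

for lang in codings:
    print(lang, transitivity_ratio(lang))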

Likewise, for comparing the sets of transitive and intransitive verbs (not just their sizes), the usual techniques (Hamming distance etc.) are appropriate. This approach allows one to build NeighborNets and similar visualizations that capture both genetic (e.g. Lithuanian is very similar to Latvian) and areal (e.g. Basque is quite close to French) similarities.
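
A minimal sketch of such a pairwise distance computation, again with invented codings, might look like this:

# Sketch: pairwise Hamming distances between languages over binary codings,
# which can then be fed to NeighborNet-style visualization tools.
# The feature vectors are invented, not the project's actual data.

from itertools import combinations

codings = {
    "Lithuanian": [1, 1, 0, 1, 0],
    "Latvian":    [1, 1, 0, 1, 1],
    "Basque":     [0, 1, 1, 0, 0],
    "French":     [0, 1, 1, 0, 1],
}

def hamming(a, b):
    """Proportion of features on which two languages disagree."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

for l1, l2 in combinations(codings, 2):
    print(l1, l2, hamming(codings[l1], codings[l2]))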

However, for finding similarities in the structure of more peripheral classes, direct cross-linguistic identification is illegitimate. For example, it is not possible to decide unequivocally on structural grounds whether German schauen (auf + ACC) and Lithuanian žiūrėti (į + ACC), both meaning ‘look at’, should be treated as belonging to similar classes in the two languages (for genetically closely related languages a possible approach would be to check whether the verbs use coding devices that are cognate, but this is not a very useful approach even for large genetic groupings, let alone for genetically unrelated languages). Yet, there is a need to quantitatively capture the intuitive idea that some pairs of languages are closer to each other in terms of their systems of verb classes than others. The basic claim in this study is that this assessment can be based on properties of the groups that the verbs fall into (e.g. one can check to what extent the group of verbal meanings that require auf + ACC in German overlaps with the group of verbs that take į + ACC in Lithuanian). In my talk, I am going to discuss entropy-based measures (mutual information and predictability) that can be used to measure (dis)similarities between languages in this respect. It will be shown that the results obtained with the help of these techniques differ in some respects from results arrived at with simpler methods based on transitivity alone.
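
A minimal sketch of the mutual-information computation over parallel class assignments (the class labels and data below are placeholders used only to show the calculation):

# Sketch: mutual information between the valency-class assignments of the
# same verb meanings in two languages. Labels such as "auf+ACC" and "į+ACC"
# are illustrative placeholders, not actual project data.

from collections import Counter
from math import log2

german     = ["ACC", "auf+ACC", "ACC", "DAT", "auf+ACC", "ACC"]
lithuanian = ["ACC", "į+ACC",   "ACC", "DAT", "ACC",     "ACC"]

def mutual_information(xs, ys):
    """MI = sum over observed pairs of p(x,y) * log2( p(x,y) / (p(x)p(y)) )."""
    n = len(xs)
    px, py = Counter(xs), Counter(ys)
    pxy = Counter(zip(xs, ys))
    return sum((c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

print(mutual_information(german, lithuanian))
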
 ____________________________
Jamie Tehrani - Phylomemetics in Anthropology
Anthropologists have become increasingly interested in the application of phylogenetic methods to study “descent with modification” in cultural traditions. Folk tales, weaving styles, pottery techniques, etc. are handed down from one generation to another and gradually evolve into new forms through the accumulation of copying errors and innovations. These processes have clear parallels in other fields where phylogenetics has been successfully applied, namely evolutionary biology, historical linguistics and stemmatology. However, it is important to note that traditional textiles, folktales, etc. are rarely copied from a single model, but are compiled from many sources. Moreover, the objects of these traditions are not “copied” in a literal sense, but are reconstructed from observation and memory, which can make learning very different from replication. These points have important implications for how we approach, code and analyse folk tradition data, which I discuss using examples from my own work and that of other anthropologists.
 ____________________________

Gerold Schneider - Syntactic parsing as a phylogenetic task
The sequence-based character of natural language has been described by Sinclair's idiom principle (Sinclair 1991), Hunston and Francis' pattern grammar (Hunston and Francis 2000), and Hoey's lexical priming theory (Hoey 2005). Syntactic rules and sequence preferences (often called collocations) work in close cooperation with each other. The blind application of syntactic rules typically leads to dozens of syntactically correct analyses for real-world sentences, although typically all except one are semantically implausible. Bi-lexically or tri-lexically conditioned collocational preference statistics (Collins 1999, Nivre 2006, Schneider 2008) are needed to rank and prune these analyses, calculating the probability of a certain syntactic relation (such as object) given the lemma of the governor and the lemma of the dependent in a dependency representation. For example, the sequence verb-noun is licensed to attach as an object or adjunct relation according to the grammar. Given the governor lemma "eat" and the dependent lemma "pizza", the probability of an object relation is very high, while for the governor lemma "eat" and the dependent lemma "Friday", the probability of an adjunct relation is high.
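
A toy sketch of such a bi-lexical preference statistic, with invented counts standing in for treebank frequencies:

# Toy sketch: maximum-likelihood estimate of
# P(relation | governor lemma, dependent lemma) from counts.
# The counts are invented for illustration, not taken from a treebank.

from collections import Counter

counts = Counter({
    ("eat", "pizza",  "object"):  97,
    ("eat", "pizza",  "adjunct"):  3,
    ("eat", "Friday", "object"):   1,
    ("eat", "Friday", "adjunct"): 49,
})

def p_relation(relation, governor, dependent):
    total = sum(c for (g, d, r), c in counts.items()
                if g == governor and d == dependent)
    if total == 0:
        return None  # sparse data: would back off to coarser statistics
    return counts[(governor, dependent, relation)] / total

print(p_relation("object", "eat", "pizza"))    # high
print(p_relation("adjunct", "eat", "Friday"))  # high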

More recent approaches such as data-oriented parsing (Bod et al. 2003) condition on a larger context than just the governor and the dependent. Parsing can be seen as subtree mapping: the most closely related gold-standard subtree from the manually annotated training resource is used to deliver the analysis of the candidate sequence (be it sentence, clause, or chunk) at parse time, as far as sparse data allows. While word sequences and genetic sequences may not have much in common, mapping candidate sequences to training sequences could be seen as a phylogenetic task. The candidate sequence is seen as a genetically mutated version of the gold-standard sequence. Finding the most closely related sequence efficiently at parse time is vital.
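
A crude sketch of the "find the closest training sequence" step, using plain edit distance over invented tag sequences as a stand-in for proper subtree matching:

# Sketch: nearest training sequence by Levenshtein distance.
# The POS-tag sequences are invented examples, and edit distance is only a
# crude stand-in for the actual subtree-mapping used in such parsers.

def edit_distance(a, b):
    """Standard dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]

training = [("DET", "NOUN", "VERB", "DET", "NOUN"),
            ("DET", "NOUN", "VERB", "PREP", "NOUN")]
candidate = ("DET", "ADJ", "NOUN", "VERB", "DET", "NOUN")

best = min(training, key=lambda seq: edit_distance(candidate, seq))
print(best)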

Even if sequences are short, the sparse data problem is enormous, due to Zipf's law, which says that most word types are rare, and that the combination of rare events is exponentially rarer. Even in the case of a local bi- or tri-lexical preference for a dependency relation, the majority of the counts needed for disambiguating between the various candidates at parse time are zero counts, and we need to back off to semantic classes or less lexical information. In terms of genetic mutations, we can only compare sequences with enormous distances between them.
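
A sketch of such a backoff, with invented counts and a placeholder class mapping (a real system might use WordNet classes):

# Sketch: back off from lexical counts to coarser semantic-class counts
# when the bi-lexical count is zero. All counts and the class mapping are
# invented placeholders.

lexical_counts = {("eat", "pizza", "object"): 97}
class_counts   = {("eat", "FOOD", "object"): 1203,
                  ("eat", "TIME", "adjunct"): 845}
semantic_class = {"pizza": "FOOD", "lasagne": "FOOD", "Friday": "TIME"}

def count_with_backoff(governor, dependent, relation):
    key = (governor, dependent, relation)
    if key in lexical_counts:
        return lexical_counts[key]
    # back off: replace the unseen dependent lemma by its semantic class
    return class_counts.get((governor, semantic_class.get(dependent), relation), 0)

print(count_with_backoff("eat", "lasagne", "object"))  # unseen lemma, backs off to FOOD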

We present, on the one hand, the various methods that we have used to alleviate data sparseness, such as semantic classes from WordNet (Miller 1990). On the other hand, we use methods that are typically used in phylogenetic analysis, for example non-negative matrix factorization (Lee and Seung 2001, Murrell et al. 2011). We evaluate and compare the approaches to each other, and critically discuss whether the phylogenetic metaphor is appropriate.
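
A sketch of how non-negative matrix factorization can smooth a sparse governor-by-dependent count matrix (illustrative counts only; this uses scikit-learn's NMF, not necessarily the implementation used in the talk):

# Sketch: NMF on a sparse governor x dependent count matrix. The low-rank
# reconstruction W @ H yields smoothed scores for unseen pairs.
# Counts are invented; a real matrix would be built from a treebank.

import numpy as np
from sklearn.decomposition import NMF

governors  = ["eat", "drink", "schedule"]
dependents = ["pizza", "beer", "Friday"]
counts = np.array([[40.0,  2.0,  0.0],   # eat
                   [ 1.0, 35.0,  0.0],   # drink
                   [ 0.0,  0.0, 20.0]])  # schedule

model = NMF(n_components=2, init="random", random_state=0, max_iter=500)
W = model.fit_transform(counts)   # governor factors
H = model.components_             # dependent factors
smoothed = W @ H                  # low-rank reconstruction

print(smoothed.round(2))
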
 ____________________________


Steven Moran & Johann-Mattis List - A Python Toolkit for Quantitative Tasks in Historical-Comparative Linguistics
The use of numerous quantitative methods in historical linguistics, often inspired by algorithms from information theory and evolutionary biology, has led to a situation in which there are many different tools for the comparative analysis of linguistic data. These tools, however, are typically incompatible with each other. For example, the STARLING software package (cf. Starostin 2000) is database software that provides lexicostatistical and glottochronological analyses, the calculation of family trees, and some rudimentary routines for automatic cognate detection. However, there are also software packages and programs for phonetic alignment analyses, such as the Rug/L04 software (Kleiweg 2009) and the ALINE algorithm (Kondrak 2000). Other software tools are described in the literature but not made publicly available (Downey et al. 2008). Additionally, there are algorithms that are not yet implemented but show promise (Covington 1996). Finally, there are various software packages for evolutionary biology that are often used in linguistic applications, such as MrBayes (Ronquist & Huelsenbeck 2003), Phylip (Felsenstein 2005) and SplitsTree (Huson 1998), but they have not yet been satisfactorily updated to handle the intricacies of linguistic data. This myriad of software puts a particular burden on linguists: those who want to analyze their data in more than just one way and compare their results have to convert their data into many different formats and have to be familiar with many different kinds of software (and, of course, the software packages have their own respective idiosyncrasies as well). As a result, errors can accumulate during the process of data transformation, and the comparability of the output of different tools is reduced, since not only the data formats of the various tools vary, but also the content of the data they require (some of the tools have only a limited application range). The expenditure of time required for such research can be enormous.

Our goal is to overcome these problems by developing a Python library for quantitative tasks in historical-comparative linguistics that unifies existing methods within a single open source framework, offers easy routines to convert linguistic data into the formats needed for third-party software, and provides a forum to publish new and innovative quantitative methods in historical linguistics.

We will discuss this collaborative endeavour by presenting a completely automatic workflow that imitates the major steps of the comparative method (establishment of sound correspondences, cognate detection, phylogenetic reconstruction, etc.), based on methods recently published in the literature. We will illustrate this workflow with several different data sets.
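
As a deliberately simplified stand-in for such a workflow (this is not the toolkit's actual API), crude cognate distances over toy word lists can be fed into an average-linkage (UPGMA-style) clustering:

# Simplified sketch of an automatic workflow: string-similarity-based
# distances between meaning-aligned word lists, then hierarchical
# clustering of the languages. Word lists are invented, and the string
# similarity is only a stand-in for proper alignment and cognate detection.

from difflib import SequenceMatcher
from itertools import combinations
from scipy.cluster.hierarchy import average

wordlists = {
    "German":  ["hand", "wasser", "hund"],
    "English": ["hand", "water",  "hound"],
    "French":  ["main", "eau",    "chien"],
}

def word_distance(a, b):
    """1 - string similarity; a stand-in for proper alignment scores."""
    return 1.0 - SequenceMatcher(None, a, b).ratio()

def language_distance(l1, l2):
    """Mean pairwise distance over the meaning-aligned word lists."""
    dists = [word_distance(a, b) for a, b in zip(wordlists[l1], wordlists[l2])]
    return sum(dists) / len(dists)

languages = list(wordlists)
condensed = [language_distance(a, b) for a, b in combinations(languages, 2)]
tree = average(condensed)   # UPGMA-style hierarchical clustering
print(tree)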


____________________________
Balthasar Bickel - Exploring similarities: phylogenetic methods beyond phylogeny 
(will follow)


____________________________   
Harald Hammarström - An Algorithm for Isogloss-Compatible Historical Reconstruction (Poster Session) 
Orthodox theory in linguistics (Ross 1997 inter alia) holds that the only valid criterion for positing a subgroup is exclusive shared innovations. Yet modern phylogenetic inference algorithms provide a tree output without any explicit reference to exclusive shared innovations in their calculations, and it remains unclear to what extent this is implicitly modelled. In fact, empirically, modern phylogenetic methods tend to find series of binary branchings where linguists, based on exclusive shared innovations, posit higher-order branchings (e.g., the Indo-European tree of Bouckaert et al. 2012). We will present an algorithm that takes as input a matrix of languages × features and infers a tree-model subgrouping based on shared innovations. The algorithm closely models the intuitions of orthodox comparative linguistics and is similar to, but not identical to, Maximum Parsimony. Tested on Chapacuran, Quechuan and Indo-European datasets, it does indeed find numbers of non-binary branchings in line with traditional linguistic analysis.
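
A toy sketch of the general idea (this is not the presented algorithm): each feature whose innovative value picks out a non-trivial set of languages defines a candidate subgroup, and mutually compatible (nested or disjoint) candidates can be read off as a possibly non-binary tree:

# Toy sketch: subgroups from shared innovations in an invented
# languages x features matrix, where 1 marks the innovative state.

data = {
    "A": [1, 1, 0, 0],
    "B": [1, 1, 0, 0],
    "C": [1, 0, 1, 0],
    "D": [0, 0, 0, 1],
}

n_features = len(next(iter(data.values())))
candidates = []
for j in range(n_features):
    group = frozenset(l for l, row in data.items() if row[j] == 1)
    if 1 < len(group) < len(data):      # proper, non-trivial subgroups only
        candidates.append(group)

def compatible(g, h):
    """Two groups are tree-compatible if nested or disjoint."""
    return g <= h or h <= g or not (g & h)

kept = []
for g in candidates:
    if all(compatible(g, h) for h in kept):
        kept.append(g)

print(sorted(kept, key=len, reverse=True))   # nested subgroups, possibly non-binary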