Laboratoire de Langues & Civilisations à Tradition Orale
Funder
6 Projects, page 1 of 2
Project, from 2023
Partners: University of Paris, Laboratoire Interdisciplinaire des Sciences du Numérique, INSHS, François Rabelais University, UORL, Laboratoire Ligérien de Linguistique, Laboratoire de Langues & Civilisations à Tradition Orale, LLF, CNRS
Funder: French National Research Agency (ANR)
Project Code: ANR-23-CE38-0003
Funder Contribution: 460,009 EUR
In the last few years, neural models have allowed spectacular progress in natural language processing (NLP). The DeepTypo project proposes to use multilingual models of speech to design methods for automatically extracting, from audio recordings, typological information useful for language documentation and research (phonological and morphosyntactic complexity indices, similarities between languages…). Based on a collaboration between linguists and NLP researchers, the DeepTypo project sits squarely within the digital humanities and addresses fundamental questions for both communities. It will help linguists in their work of documenting and analyzing languages, especially “rare” or “poorly endowed” languages, by providing them with new tools and methods that will allow them, for example, to bring out new information on similarities between languages. Beyond the “tool development” aspect, the DeepTypo project aims, above all, to show that the representations at the heart of neural networks can be used to answer fundamental questions in linguistics, taking as examples current issues in creolistics (the study of creoles) and in the dialectology of Sino-Tibetan languages.
Extracting typological information, the core of the DeepTypo project, will also help identify the limits of fine-tuning. This approach has made it possible to develop, at low cost, NLP systems for several languages and many tasks, and is often presented today as "THE" solution to all NLP problems. Identifying the linguistic features captured by neural networks will allow us to verify whether this is indeed the case: if a model is, for example, not able to detect and represent the tones of a language, it is more than likely that it cannot be used to develop a system for tonal languages. To achieve this ambitious goal, we will use neural representation analysis methods to interpret and understand the decisions of neural networks, and will develop them along four original axes:
1. Based on the collaboration with the different partners of the project, we will try to identify richer features than those considered in the state of the art: whereas existing work has focused on “simple” features (speaker gender, language of the utterance, ...), we will also consider information related to the diversity of the languages and to their linguistic characteristics (phonemic inventory, identification of tonal languages, ...).
2. In addition to existing analysis methods (e.g. linguistic probes), we will develop new methods to measure similarity between languages. Again, close collaboration between linguists and NLP researchers will be essential to define a linguistically relevant similarity (or similarities).
3. We will apply our methods to the 230 languages of the Pangloss collection (an archive of rare languages managed by LACITO) and to 15 creoles (collected mainly by LLL). These large-scale experiments will allow us to test state-of-the-art pre-trained models on languages with a wide variety of linguistic features rarely considered in NLP work.
4. We will apply these methods to language documentation support tasks, an application that has, until now, never been considered.
Project, from 2024
Partners: Université de Tampere; Université Laval; Institut Français de Recherche sur l’Asie de l’Est; Centre d'Etudes en Sciences Sociales sur les Mondes Africains, Américains et Asiatiques; INALCO; Laboratoire de Langues & Civilisations à Tradition Orale; University of Paris; CNRS
Funder: French National Research Agency (ANR)
Project Code: ANR-23-CE41-0017
Funder Contribution: 494,623 EUR
DiasCo-Tib proposes to analyse various patterns of linguistic, spatial and social convergence at a “diasporic moment,” i.e. a critical juncture of reactivation and reconfiguration of a diaspora, as it is unfolding. The research will be based on the case of Tibetan refugees, who are currently undergoing such a “diasporic moment” with the anticipated demise of their spiritual leader, the Dalai Lama (b. 1935). Recent and fast-growing on-migratory trends, from South Asia towards Europe and North America, are already leading to a large-scale spatial reconfiguration, with France becoming a major hub in the multipolar Tibetan diasporic network. The project’s central hypothesis is that, in the context of a diasporic moment, increased spatial dispersion paradoxically triggers enhanced processes of social convergence.
In order to produce a comprehensive analysis of diasporic convergence processes, DiasCo-Tib will mobilise an interdisciplinary team to study concomitant social phenomena and evaluate their degree of interrelatedness in the domains of language(s) and linguistic practices; social and economic translocal networks; forms of collective representation (in political, civic or artistic spheres); changing gender roles; and religious practices. Multi-sited research will account for the circulation of norms and social practices, taking into account local and cross-border forms of integration and differentiation as well as ongoing shifts in Tibetan refugees’ inscriptions in host societies. Alongside expected convergences, lines of segmentation will be observed and analysed as they crystallise to reconfigure the common yet plural linguistic and social practices of the Tibetan diaspora. The chosen case study will thus shed light on multi-dimensional processes of diasporisation as they are experienced and enacted by individuals and communities in their everyday lives and particular biographical trajectories.
Project, 2024 - 2025
Partners: Laboratoire de Langues & Civilisations à Tradition Orale
Funder: Swiss National Science Foundation
Project Code: 217641
Funder Contribution: 116,600
Project, from 2020
Partners: Laboratoire de Langues & Civilisations à Tradition Orale, CNRS, INSHS, LIMSI, Laboratoire d'Informatique pour la Mécanique et les Sciences de l'Ingénieur, New Sorbonne University, Laurent Besacier, LPP, Universität Frankfurt / Institut für Empirische Sprachwissenschaft, Karlsruher Institut für Technologie (KIT) / Institut für Anthropomatik (IFA)
Funder: French National Research Agency (ANR)
Project Code: ANR-19-CE38-0015
Funder Contribution: 464,668 EUR
The main objective of the CLD2025 project is to facilitate the urgent task of documenting endangered languages by leveraging the potential of computational methods. A breakthrough is now possible: machine learning tools (such as artificial neural networks and Bayesian models) have improved to a point where they can effectively help to perform linguistic annotation tasks such as automatic transcription of audio recordings, automatic glossing of texts, and automatic word discovery. Thorough documentation of the world’s dwindling linguistic diversity is much more feasible with these tools than under a manual workflow. For instance, manual transcription of 50 hours of speech (a sizeable fieldwork corpus) can take hundreds of hours’ work, creating a bottleneck in the language documentation workflow. Another key task, referred to in linguistics as interlinear glossing (in a nutshell: word-by-word translation/annotation), is even more time-consuming, and is moreover difficult to perform manually with the required level of consistency. Models created through machine learning have the potential to aid in these time-consuming and difficult tasks.
But Natural Language Processing (NLP) remains little-used in language documentation, for a variety of reasons: the technology is still new (and evolving rapidly), user-friendly interfaces are still under development, and there are few case studies demonstrating practical usefulness in a low-resource setting. Field linguists typically rely on manual methods throughout the documentation process. The objective of the CLD2025 project is therefore to enable the implementation of these techniques in the mid term (by 2025) through the co-construction of models and tools by field linguists and computational linguists, and through the development of interfaces and systems that allow real use by field linguists.
We are building on the achievements of the BULB project in terms of corpora and modes of acquisition, as well as the development of models for transcription and segmentation. We are not developing corpora here, but rather focusing on how to exploit existing corpora. We address automatic processing problems (phoneme and tone transcription, unit discovery, automatic glossing), some of which are original (tonal transcription, automatic glossing), by validating them on endangered languages of very varied natures: Mboshi (Bantu C25), Kakabe (Mande), Yongning Na (Mosuo), a Sino-Tibetan language, and three Nakh-Daghestanian languages: Khinalug, Kryz (Kryts) and Budugh. We will also work to carry the results of the improved automatic processing over to the level of linguistic analysis: the automatic speech and language processing mechanisms and results will be used to explore phonetic-phonological issues on the segmental, supra-segmental and tonal levels of the languages addressed in the project.
Finally, from the beginning of the project, the focus will be on the usability of the tools and models developed. This point highlights the fundamentally interdisciplinary aspect of the work carried out here by computational scientists and field linguists. To this end, a recognized field linguist will work full-time on the project and will contribute her experience and expertise to the definition, development and evaluation of the different systems developed in the project.
Project, from 2022
Partners: Russian Academy of Sciences / Institute for Linguistic Research, INALCO, Laboratoire de Langues & Civilisations à Tradition Orale, Structure et Dynamique des Langues, CNRS
Funder: French National Research Agency (ANR)
Project Code: ANR-21-CE27-0020
Funder Contribution: 216,608 EUR
The Atlas of the Balkan Linguistic Area project will build an online database of language contact phenomena as attested in the Balkan languages and contribute to theoretical discussions in areal linguistics. ABLA will consist of 100+ phonological, morpho-syntactic, semantic, and lexical features, drawing on a linguistic questionnaire to be designed based on the Russian team’s expertise with atlases. Each feature will be matched to a map covering 70+ localities across all Balkan countries. Each map will be accompanied by a chapter co-authored by the project contributors. The online database will be developed by the French team and hosted by HumaNum via the Pangloss Collection with a mirror site in Russia. ABLA will also be published by an international publisher. ABLA will not only be the first online database for the Balkans, an area shaped by multilingualism in forms that are rapidly disappearing, but will further serve as an example for other linguistic areas in the world.
