Call for papers
Corpus building in long diachrony, between philological tradition and quantitative analysis
ConCorDiaL 2024, Lyon
The exponential growth of digital corpora has led to an overabundance, even a "deluge", of data (Habert 2005: 41). Although this trend is less marked for historical language corpora, since access to primary data is not as immediate as for contemporary language data, these corpora continue to grow in size and diversity. At the same time, tools for processing, annotating and querying texts have considerably enriched textual corpora and their digital exploitation. Ever more data-intensive (cf. recent advances in AI and conversational agents), Natural Language Processing tools amplify the demand for growth and, at the same time, foster the development of statistical methods in corpus linguistics and textual data analysis (Lebart, Pincemin and Poudat 2019). In this context, and following on from the first ConCorDiaL conference (Grenoble 2022, https://concordial2022.sciencesconf.org), this second edition aims to deepen our reflection on digital corpora in long diachrony by linking corpus creation and analysis and by continuing exchanges between creators and users of language data.
Topic 1: Processing diachronic digital corpora
The accumulation of digital data makes it necessary to confront the challenge of their internal heterogeneity. This heterogeneity derives from the diversity of sources, which may have different origins before being brought together in a particular corpus. It may concern the quality of the digitized texts, their digital format (XML or other), the metadata used to describe them and, of course, their linguistic annotations. In addition to these general factors, the earliest periods show graphic and morphological variation that complicates form recognition and the work of NLP tools. Contributions could address different ways of dealing with this heterogeneity, depending on both the intended use of the corpus and the constraints (technical, financial, etc.) imposed.
These issues can also be addressed from the point of view of compatibility and interoperability between different corpora. Common standards (for markup tags, metadata, word segmentation, lemmas, morphosyntactic tag sets, syntactic or semantic annotations, etc.) are one way of meeting this objective, which is becoming increasingly necessary as corpora multiply. In this context, we also need to take into account the issues of data durability and backup. For example, how can the need for standardization be reconciled with respect for the diversity and richness of the original data? Can a multilingual tag set even be used without compromising the granularity of the tags needed for a particular language?
The long-term historical dimension may be the subject of specific reflection: diachronic variation is observable at all levels of processing, and it becomes all the more significant as corpora cover vast timespans. How can the evolution of textual genres (appearances/disappearances, changes within a given genre, genres being historically situated and evolving over time; cf. Winter-Froemel 2023) properly be taken into account? Should the same lemmas be used whatever the period, or should dictionaries specific to each language state be preferred? How should we deal with changes in the segmentation of lexical units and the emergence of grammaticalized phrases? The questions raised here are not exhaustive, and all proposals for papers dealing with the constitution and processing of diachronic corpora will be considered.
Topic 2: Quantitative and qualitative methods for exploiting diachronic corpora
As quantitative methods are increasingly used in all areas of linguistic analysis (lexicology, phonology, morphology, syntax, etc.), and are spreading into the field of stylistic studies (stylemes, phrasemes) and literary studies (topics, narrative patterns, etc.), their impact on diachronic digital corpora can be examined.
How are these practices taken into account in the selection, curation, description and organization of the data? What methods and tools should be used to identify and quantitatively interpret the data? Particular attention could be paid to quantitative methodologies specifically adapted to diachronic analysis. In particular, contributions may address the different types of variation, the specificities of the diachronic factor, and the ways of targeting this particular factor or, on the contrary, of describing the way it interacts with others (Hilpert and Gries 2016). Similarly, contributions considering the new possibilities offered by automatic periodization tools (Gries and Hilpert 2008, Diwersy et al. 2017) or by methods for measuring and interpreting trends (Hilpert and Gries 2009) are particularly welcome. The link between quantitative methods and qualitative analysis will also be taken into account, as will the philological dimension of data constructed for linguistic or literary research.
Keynote Speakers
Format
Presentations will last 30 minutes, followed by a 10-minute discussion. The conference will be held in hybrid mode (in-person attendance preferred for speakers). Papers may be presented in French or English.
Registration Fee
Registration fees will be confirmed when registration opens (between €40 and €60).
Exemption:
Registration is free for all, but it is required in order to attend the conference.
Calendar
References
Barré Jean, Camps Jean-Baptiste and Poibeau Thierry (2023) "Operationalizing Canonicity: A Quantitative Study of French 19th and 20th Century Literature", Journal of Cultural Analytics, vol. 8, no. 3. DOI: 10.22148/001c.88113.
Bernard Michel and Bohet Baptiste (2017) Littérométrie. Outils numériques pour l’analyse des textes littéraires, Paris, Presses Sorbonne nouvelle.
Diwersy Sascha et al. (2021) "La phraséologie du roman contemporain dans les corpus et les applications de la PhraseoBase", Corpus, no. 22. DOI: 10.4000/corpus.6101.
Diwersy Sascha, Falaise Achille, Lay Marie-Hélène and Souvay Gilles (2017) "Ressources et méthodes pour l’analyse diachronique", Langages, vol. 206, no. 2, p. 21‑44. DOI: 10.3917/lang.206.0021.
Gries Stefan and Hilpert Martin (2008) "The identification of stages in diachronic data: variability-based neighbour clustering", Corpora, vol. 3, p. 59‑81. DOI: 10.3366/E1749503208000075.
Habert Benoît (2005) "Face à la disette dans la profusion", Scolia : Sciences Cognitives, Linguistiques et Intelligence Artificielle, vol. 19, no. 1, p. 41‑61. DOI: 10.3406/scoli.2005.1065.
Hilpert Martin and Gries Stefan (2016) "Quantitative approaches to diachronic corpus linguistics", in M. Kytö and P. Pahta (eds.), The Cambridge Handbook of English Historical Linguistics, Cambridge University Press, p. 36‑53. DOI: 10.1017/CBO9781139600231.
Hilpert Martin and Gries Stefan (2009) "Assessing frequency changes in multistage diachronic corpora: Applications for historical corpus linguistics and the study of language acquisition", Literary and Linguistic Computing, vol. 24, no. 4, p. 385‑401. DOI: 10.1093/llc/fqn012.
Lebart Ludovic, Pincemin Bénédicte and Poudat Céline (2019) Analyse des données textuelles, Québec, Presses de l’Université du Québec.
Marchello-Nizia Christiane (2004) "Linguistique historique, linguistique outillée : les fruits d’une tradition", Le français moderne, no. 1, p. 58‑70.
Prévost Sophie (2020) "Une grammaire fondée sur un corpus numérique", in C. Marchello-Nizia, B. Combettes, S. Prévost and T. Scheer (eds.), Grande grammaire historique du français, Berlin, Mouton de Gruyter, p. 37‑53.
Winter-Froemel Esme (2023) "Discourse traditions research: foundations, theoretical issues and implications", in E. Winter-Froemel and Á.S. Octavio de Toledo y Huerta (eds.), Manual of Discourse Traditions in Romance, De Gruyter, p. 25‑58. DOI: 10.1515/9783110668636-002.