Call for papers

Corpus building in long diachrony, between philological tradition and quantitative analysis

ConCorDiaL 2024, Lyon


November 7-8, 2024

Since its origins, diachronic linguistics has maintained close links with corpus linguistics, because diachronicists cannot, by definition, call on their competence as speakers and must rely on attested and authentic data to work (Marchello-Nizia 2004, Prévost 2020). There has been significant development of digital diachronic corpora and historical texts in corpus linguistics and, as far as France is concerned, the Frantext database and the BFM Old French Corpus have played a leading role in this movement, which began in the 1980s. These corpora have generally been built on printed editions, and have developed separately from native digital editions, which are more focused on the digital transposition of critical apparatus and the representation of primary sources, often handwritten, from a philological perspective. In France, this dissociation, which partly reflects the disciplinary boundaries between linguists and literary scholars, has resulted in the creation of two distinct consortia within the Huma-Num national research infrastructure: one for linguistic corpora (now CORLI, CORpus, Langues et Interactions) and the other for text editions and literary and stylistic analysis (now ARIANE, Analyses, Recherches, Intelligence Artificielle et Nouvelles Éditions numériques). In the light of the current dynamic of quantitative approaches to literature (Bernard and Bohet 2017, Diwersy et al. 2021, Barré, Camps and Poibeau 2023) and the creation of new digitally native linguistic data, the relevance of this partition can be called into question.

The exponential growth of digital corpora has also led to an overabundance, even a "deluge" of data (Habert 2005: 41). Although this trend is less marked in the case of historical language corpora - since access to primary data is not as immediate as for contemporary language data - these corpora continue to grow in size and diversity. At the same time, tools for processing, annotating and querying texts have considerably enriched textual corpora and their digital exploitation. Ever more data-intensive (cf. recent advances in AI and conversational agents), the tools of Natural Language Processing amplify the demand for growth, and at the same time foster the development of statistical methods in corpus linguistics and textual data analysis (Lebart, Pincemin and Poudat 2019).

In this context, and following on from the first ConCorDiaL conference (Grenoble 2022, https://concordial2022.sciencesconf.org), this second edition aims to deepen our reflection on digital corpora in long diachrony, by linking corpus creation and analysis, and by continuing exchanges between creators and users of language data.

Topic 1: Processing diachronic digital corpora

The accumulation of digital data makes it necessary to face up to the challenge of their internal heterogeneity. This heterogeneity derives from the diversity of sources, which may have different origins before being brought together in a particular corpus. It may concern the quality of the digitized texts, their digital format (XML or other), the metadata used to describe them and, of course, their linguistic annotations. In addition to these general factors, for the earliest periods we can add graphic and morphological variations that complicate form recognition and the work of NLP tools. Contributions could address different ways of dealing with this heterogeneity, depending on both the intended use of the corpus and the constraints (technical, financial, etc.) imposed.

These issues can also be addressed from the point of view of compatibility and interoperability between different corpora. Common standards (for markup tags, metadata, word segmentation, lemmas, morphosyntactic tag sets, syntactic or semantic annotations, etc.) are one way of meeting this objective, which is becoming increasingly necessary as corpora multiply. In this context, we also need to take into account the issues of data durability and backup. For example, how can the need for standardization be reconciled with respect for the diversity and richness of the original data? Can a multilingual tag set even be used without compromising the granularity of the tags necessary for a particular language?

The long-term historical dimension may be the subject of specific reflection, diachronic variations being all the more important as the corpora cover vast timespans and are observable at all levels of processing. How can the evolution of textual genres (such as appearances/disappearances and changes within a given genre, genres being historically situated and evolving over time; cf. Winter-Froemel 2023) properly be taken into account? Should the same lemmas be used whatever the period, or should dictionaries specific to each language state be preferred? How should we deal with changes in the segmentation of lexical units and the emergence of grammaticalized phrases?

The questions raised here are not exhaustive, and all proposals for papers dealing with the constitution and processing of diachronic corpora will be considered.

Topic 2: Quantitative and qualitative methods for exploiting diachronic corpora

As quantitative methods are increasingly used in all areas of linguistic analysis (lexicology, phonology, morphology, syntax, etc.), and are spreading into the field of stylistic studies (stylemes, phrasemes) and literary studies (topics, narrative patterns, etc.), their impact on diachronic digital corpora can be examined. How are these practices taken into account in the selection, curation, description and organisation of the data? What methods and tools should be used to identify and quantitatively interpret the data?

In this context, we seek contributions analysing the added value as well as the limitations of linguistic annotation, and asking what types of enrichment should be favoured to facilitate diachronic research, what level of granularity should be adopted, and what balance should be struck between the quantity and quality of annotations.

Particular attention could be paid to quantitative methodologies specifically adapted to diachronic analysis. In particular, it will be possible to address the different types of variation, the specificities of the diachronic factor or the ways of targeting this particular factor or, on the contrary, describing the way it interacts with others (Hilpert and Gries 2016). Similarly, contributions considering the new possibilities offered by automatic periodization tools (Gries and Hilpert 2008, Diwersy et al. 2017) or methods for measuring and interpreting trends (Hilpert and Gries 2009) etc. are particularly welcome.

The link between quantitative methods and qualitative analysis will also be taken into account, as will the philological dimension of data constructed for linguistic or literary research.

Keynote Speakers

  • Sascha Diwersy (Montpellier University, UMR Praxiling)
  • Thierry Poibeau (CNRS, Lattice Research Lab)
  • Céline Poudat (Université Côte d'Azur, BLC Research Lab)

Format

Presentations will last 30 minutes, followed by a 10-minute discussion. The conference will be held in hybrid mode (in-person attendance preferred for speakers). The languages accepted for communication are French and English.

Abstracts should be between 300 and 500 words in length (not including bibliographical references) and should be written in the language of the paper. Abstracts must be submitted in two versions on the conference website (https://concordial.sciencesconf.org): an anonymized version to copy and paste into the submission form, and a version specifying the author's name and affiliation in a Word or PDF document. Please use the provided document template.

Registration Fee

Registration fees will be confirmed when registration opens (between €40 and €60).

Exemption:

  • online participants
  • members of the organizing laboratories
  • doctoral students

For all of these categories, registration is free but still required in order to attend the conference.

Calendar

  • Abstract submission deadline: June 10, 2024
  • Confirmation of acceptance: July 10, 2024
  • Submission of final abstracts: October 1, 2024
  • Conference registration: September 1 to October 1, 2024
  • Conference: November 7-8, 2024

References

Barré Jean, Camps Jean-Baptiste and Poibeau Thierry (2023) "Operationalizing Canonicity: A Quantitative Study of French 19th and 20th Century Literature", Journal of Cultural Analytics, vol. 8, no. 3. DOI: 10.22148/001c.88113.

Bernard Michel and Bohet Baptiste (2017) Littérométrie. Outils numériques pour l’analyse des textes littéraires, Paris, Presses Sorbonne nouvelle.

Diwersy Sascha et al. (2021) "La phraséologie du roman contemporain dans les corpus et les applications de la PhraseoBase", Corpus, no. 22. DOI: 10.4000/corpus.6101.

Diwersy Sascha, Falaise Achille, Lay Marie-Hélène and Souvay Gilles (2017) "Ressources et méthodes pour l’analyse diachronique", Langages, vol. 206, no. 2, p. 21-44. DOI: 10.3917/lang.206.0021.

Gries Stefan and Hilpert Martin (2008) "The identification of stages in diachronic data: variability-based neighbour clustering", Corpora, vol. 3, p. 59-81. DOI: 10.3366/E1749503208000075.

Habert Benoît (2005) "Face à la disette dans la profusion", Scolia : Sciences Cognitives, Linguistiques et Intelligence Artificielle, vol. 19, no. 1, p. 41-61. DOI: 10.3406/scoli.2005.1065.

Hilpert Martin and Gries Stefan (2016) "Quantitative approaches to diachronic corpus linguistics", in M. Kytö and P. Pahta (eds.), The Cambridge Handbook of English Historical Linguistics, Cambridge University Press, p. 36-53. DOI: 10.1017/CBO9781139600231.

Hilpert Martin and Gries Stefan (2009) "Assessing frequency changes in multistage diachronic corpora: Applications for historical corpus linguistics and the study of language acquisition", Literary and Linguistic Computing, vol. 24, no. 4, p. 385-401. DOI: 10.1093/llc/fqn012.

Lebart Ludovic, Pincemin Bénédicte and Poudat Céline (2019) Analyse des données textuelles, Québec, Presses de l’Université du Québec.

Marchello-Nizia Christiane (2004) "Linguistique historique, linguistique outillée : les fruits d’une tradition", Le français moderne, no. 1, p. 58-70.

Prévost Sophie (2020) "Une grammaire fondée sur un corpus numérique", in C. Marchello-Nizia, B. Combettes, S. Prévost and T. Scheer (eds.), Grande grammaire historique du français, Berlin, Mouton de Gruyter, p. 37-53.

Winter-Froemel Esme (2023) "Discourse traditions research: foundations, theoretical issues and implications", in E. Winter-Froemel and Á.S. Octavio de Toledo y Huerta (eds.), Manual of Discourse Traditions in Romance, De Gruyter, p. 25-58. DOI: 10.1515/9783110668636-002.
