Harmonizing the "ConDÉ" Corpus. From the image to the linguistic resource
PDF (Français (France))
HTML (Français (France))

Keywords

digital humanities
corpus linguistics
encoding
diachrony
TEI
Python
Transkribus
AnaLog

Abstract

The corpus compiled for the RIN ConDÉ project consists of twelve reference sources on Norman customary law, dating from the 13th to the 19th century. Although they deal with the same subject, the texts are very heterogeneous in format and structure. They were processed with the HTR tool Transkribus; automated transformations were written in Python and XSLT; lemmatization was performed with AnaLog; and the data were encoded in TEI. Processing the data called for preliminary reflection on how best to restore the structures and reference systems of the sources, and on how to devise a set of lemma and part-of-speech tags suitable for texts spanning six centuries of linguistic evolution. To make the texts as comparable as possible, a three-level structure (part > chapter > section) was eventually adopted.
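
As an illustration of the three-level structure (part > chapter > section), the following minimal Python sketch nests TEI `<div>` elements three levels deep. It is not the project's actual code: the function name `build_body`, the `type` values `part`, `chapter` and `section`, the use of the `n` attribute for numbering, and the sample records are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
ET.register_namespace("", TEI_NS)

# Assumed values for div/@type; the project may use other labels.
LEVELS = ("part", "chapter", "section")


def build_body(records):
    """Nest (part, chapter, section, text) records into three levels of TEI divs."""
    body = ET.Element(f"{{{TEI_NS}}}body")
    known_divs = {}  # path prefix, e.g. ("1", "2"), -> the div already created for it
    for part, chapter, section, text in records:
        path = (part, chapter, section)
        parent = body
        for depth, level in enumerate(LEVELS, start=1):
            key = path[:depth]
            div = known_divs.get(key)
            if div is None:
                # First time this part/chapter/section is seen: create its div.
                div = ET.SubElement(
                    parent,
                    f"{{{TEI_NS}}}div",
                    {"type": level, "n": str(path[depth - 1])},
                )
                known_divs[key] = div
            parent = div
        paragraph = ET.SubElement(parent, f"{{{TEI_NS}}}p")
        paragraph.text = text
    return body


# Placeholder records (hypothetical, not taken from the corpus).
sample = [
    ("1", "1", "1", "placeholder text a"),
    ("1", "1", "2", "placeholder text b"),
    ("1", "2", "1", "placeholder text c"),
]
body = build_body(sample)
print(ET.tostring(body, encoding="unicode"))
```

Giving every source the same three levels, whatever its original layout, means a passage can be addressed by the same kind of path (part, chapter, section) in each text, which is what makes the texts comparable.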
