CoNLL 2017 Irish data
There's junk, in case I forget
CoNLL 2017 Shared Task - Automatically Annotated Raw Texts and Word Embeddings
There was a web page with the raw text, and some of the Irish data looks weird. Some items look like they were poorly split, but there are also items from Logos Poetry like this:
41] Ná tréig neamh ar ní nach lat;
where the line number and bracket are intentional. That's not to say there are no odd splits caused by poor sentence splitting. The sentence at line 4467 of ga-common_crawl-000.conllu.xz is:
do giallaibh) .i. tech lán do ghiallaibh aigi.
which comes from here:
Nó Eochaid Domplén .i. domus (.i. tech) plena (.i. do giallaibh) .i. tech lán do ghiallaibh aigi. Is de rohainmniged Eochaid Domplén de.
(i.e., it’s not even modern Irish).
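To look at an entry like this again later, something along these lines might help. It is only a minimal sketch: it assumes that "line 4467" means a physical, 1-based line number in the decompressed file, that the file follows the usual CoNLL-U layout (tab-separated columns, blank line between sentences, FORM in the second column), and the helper name `sentence_at_line` is made up for illustration.

```python
import lzma


def sentence_at_line(path, line_no):
    """Return the word forms of the sentence block that contains the
    1-based physical line `line_no` of an xz-compressed CoNLL-U file,
    or None if the file has fewer lines than that."""
    block = []      # lines of the sentence block currently being read
    found = False   # set once the requested line number has been reached
    with lzma.open(path, "rt", encoding="utf-8") as fh:
        for n, line in enumerate(fh, start=1):
            line = line.rstrip("\n")
            if n == line_no:
                found = True
            if line:
                block.append(line)
                continue
            # A blank line ends the current sentence block.
            if found:
                break
            block = []
    if not found:
        return None
    forms = []
    for row in block:
        if row.startswith("#"):
            continue  # skip comment lines such as "# text = ..."
        cols = row.split("\t")
        # Keep plain integer IDs only (assumption: multiword-token ranges
        # like "1-2" and empty nodes like "1.1" are not wanted here).
        if len(cols) > 1 and cols[0].isdigit():
            forms.append(cols[1])
    return forms
```

Calling `sentence_at_line("ga-common_crawl-000.conllu.xz", 4467)` and joining the result with spaces should give a rough view of the sentence stored there, assuming the file is in the current directory. The spacing around punctuation will not match the original text.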