CoNLL 2017 Shared Task - Automatically Annotated Raw Texts and Word Embeddings

The raw texts were published on a web page, and some of the Irish data looks odd. Some items look as though they were poorly split, but others only appear that way, such as this item from Logos Poetry:

41] Ná tréig neamh ar ní nach lat;

where the line number and bracket are intentional. That is not to say there are no odd splits caused by poor sentence splitting. The sentence at line 4467 of ga-common_crawl-000.conllu.xz is:

do giallaibh) .i. tech lán do ghiallaibh aigi.

which comes from here:

Eochaid Domplén .i. domus (.i. tech) plena (.i. do giallaibh) .i. tech lán do ghiallaibh aigi. Is de rohainmniged Eochaid Domplén de.

(i.e., it’s not even modern Irish).
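To check a fragment like this yourself, you can pull a single line out of the compressed file without unpacking it first. The following is a minimal sketch, not the method used above: the helper name read_line is hypothetical, and it assumes that "line 4467" is a 1-based count of raw lines in the decompressed file and that the file is UTF-8.

```python
import lzma
from itertools import islice


def read_line(path, lineno, encoding="utf-8"):
    """Return the 1-based line `lineno` of an xz-compressed text file.

    Returns None if the file has fewer than `lineno` lines.
    Assumption: lines are counted in the decompressed text, starting at 1.
    """
    # lzma.open in text mode decompresses on the fly, so the file is
    # streamed rather than loaded into memory.
    with lzma.open(path, "rt", encoding=encoding) as handle:
        # islice skips the first lineno - 1 lines and yields at most one.
        line = next(islice(handle, lineno - 1, lineno), None)
    return None if line is None else line.rstrip("\n")
```

Calling read_line("ga-common_crawl-000.conllu.xz", 4467) would then return whatever is stored at that position, which can be compared against the original source text.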