Resources

Syllabification for Croatian based on the maximal onset principle

The resources are published under the CC BY-NC-SA 4.0 license.

Croatian lemmatized dictionary syllabified according to the maximal onset principle (62 387 lexemes).
Croatian dictionary with all inflected forms syllabified according to the maximal onset principle (377 143 lexemes).
The list of Croatian words with “JAT” (adjusted for syllabification).
Algorithm source code in Python.

Please cite the following paper:
A. Meštrović, S. Martinčić-Ipšić, M. Matešić. “Syllabification based on maximal onset principle for Croatian / Postupak automatskoga slogovanja temeljem načela najvećega pristupa i statistika slogova za hrvatski jezik”. Govor/Speech, vol. 32, No. 1, pp. 3-35, 2015.
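
As a rough illustration of the maximal onset principle named above, the following minimal sketch assigns to each syllable the longest consonant cluster that counts as a legal onset and leaves the remaining consonants in the preceding coda. It is not the published algorithm: the function name `syllabify`, the tiny `TOY_ONSETS` inventory, and the plain five-vowel set are assumptions made for this example, and the sketch ignores syllabic "r", digraphs such as "lj", "nj" and "dž", and the special handling of "JAT" words.

```python
VOWELS = set("aeiou")

# Hypothetical, deliberately tiny onset inventory used only for illustration;
# a real system would need a complete list of legal Croatian onsets.
TOY_ONSETS = {"pr", "tr", "st", "str", "sk", "sv", "kr", "gr",
              "br", "dr", "pl", "kl", "sl", "zv"}


def syllabify(word, onsets=TOY_ONSETS):
    """Split a word into syllables using the maximal onset principle."""
    word = word.lower()
    # Every vowel is treated as a syllable nucleus (a simplifying assumption).
    nuclei = [i for i, ch in enumerate(word) if ch in VOWELS]
    if len(nuclei) < 2:
        return [word]

    boundaries = []
    for left, right in zip(nuclei, nuclei[1:]):
        cluster = word[left + 1:right]  # consonants between two nuclei
        # Try the whole cluster first, then ever shorter suffixes, so the
        # first legal candidate found is the longest possible onset.
        for k in range(len(cluster) + 1):
            candidate = cluster[k:]
            # An empty onset or a single consonant is always accepted.
            if len(candidate) <= 1 or candidate in onsets:
                boundaries.append(left + 1 + k)
                break

    syllables, start = [], 0
    for b in boundaries:
        syllables.append(word[start:b])
        start = b
    syllables.append(word[start:])
    return syllables


print("-".join(syllabify("sestra")))  # se-stra
print("-".join(syllabify("majka")))   # maj-ka
print("-".join(syllabify("voda")))    # vo-da
```

In "sestra" the whole cluster "str" is in the toy inventory, so it moves to the second syllable; in "majka" the cluster "jk" is not, so only "k" becomes the onset and "j" stays in the coda.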

Datasets related to COVID-19 communication in Social Media

The resources are published under the CC BY-NC-SA 4.0 license.

Senti-Cro-CoV-Tweets: dataset of 10,000 annotated COVID-19 related tweets in the Croatian language (each tweet ID is annotated with one of the labels: positive, negative, neutral).

Senti-Cro-CoV-Reddit: dataset of 6,000 annotated messages in the Croatian language (labels: positive, negative, neutral).

Cro-CoV-Tweets: dataset of COVID-19 related tweets in the Croatian language posted between 1 January 2020 and 31 May 2021.

Cro-CoV-Texts: dataset with samples of pre-processed COVID-19 related news articles in the Croatian language published on online portals.

Cro-CoV-Texts-Comments: dataset with samples of pre-processed COVID-19 related user comments in the Croatian language published on online portals.

Cro-CoV-Texts-Unigrams: dataset with a collection of unigrams extracted from the COVID-19 related news articles.

Cro-CoV-Texts-Links: dataset of links to COVID-19 related news articles.

Cro-CoV-Nets: datasets of networks and multilayer networks collected from Twitter.

Please cite the following papers:
Slobodan Beliga, Sanda Martinčić-Ipšić, Mihaela Matešić, Irena Petrijevčanin Vuksanović, Ana Meštrović: “Infoveillance of the Croatian Online Media During the COVID-19 Pandemic: One-Year Longitudinal Study Using Natural Language Processing”, JMIR Public Health and Surveillance, vol. 7, no. 12, pages e31540, Dec. 2021. http://dx.doi.org/10.2196/31540

Karlo Babić, Milan Petrović, Slobodan Beliga, Sanda Martinčić-Ipšić, Mihaela Matešić, Ana Meštrović: “Characterisation of COVID-19-Related Tweets in the Croatian Language: Framework Based on the Cro-CoV-cseBERT Model”, Applied Sciences, vol. 11, no. 21, p. 10442, Nov. 2021. http://dx.doi.org/10.3390/app112110442

Language and classification models

The resources are published under the CC BY-NC-SA 4.0 license.

Cro-CoV-cseBERT: general language model (based on cseBERT) trained on large corpora of COVID-19 related texts written in the Croatian language.

Cro-CoV-BERTić: general language model (based on BERTić) trained on large corpora of COVID-19 related texts written in the Croatian language.

Senti-CoV-cseBERT: model trained for sentiment classification in the domain of COVID-19.

Multi-Cro-CoV-cseBERT: model trained for prediction of spreading in the domain of COVID-19.

Please cite the following paper:
Karlo Babić, Milan Petrović, Slobodan Beliga, Sanda Martinčić-Ipšić, Mihaela Matešić, Ana Meštrović: “Characterisation of COVID-19-Related Tweets in the Croatian Language: Framework Based on the Cro-CoV-cseBERT Model”, Applied Sciences, vol. 11, no. 21, p. 10442, Nov. 2021. http://dx.doi.org/10.3390/app112110442