The Slovenian Press Agency (STA, https://www.sta.si) is the leading Slovenian provider of media content for domestic and foreign audiences. A standard product of press agencies, and of news outlets in general, is packaging articles by topic; these topic clusters are usually created for the most important long-term themes. This lets our customers and end users quickly find all articles about a given topic. A solution that classifies articles into topic clusters automatically or semi-automatically would save time, improve consistency, and make it possible to maintain more ongoing topic clusters.
The objective of the contest is to correctly classify the articles and assign them to topics they should belong to.
The files provided for download contain the articles published by STA in 2015 (in Slovenian only), plus supplementary data. See the Data Files section of the challenge.
The dataset consists of two parts: a learning dataset, in which each article is assigned to a topic, and a testing dataset, in which the topic is not given.
Your task is to correctly classify the articles from the testing dataset into topic groups. To do so, you are required to train a classifier on the learning data and use it to classify the articles from the testing dataset.
Please note that each article belongs to exactly one topic, and that the articles in the testing dataset use only topics that are assigned to at least one article in the training dataset.
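As an illustration of the train-then-classify workflow described above, here is a minimal sketch of a single-label text classifier: a multinomial Naive Bayes model with add-one smoothing, written with the Python standard library only. It is not the organizers' method or a recommended solution. The function names, the tokenizer, and the toy sentences are assumptions made up for this example, and the real data files would have to be loaded separately.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    # Lowercased word tokens; \w is Unicode-aware, so č, š and ž are kept.
    return re.findall(r"\w+", text.lower())


def train(articles):
    """articles: iterable of (text, topic_id) pairs from the learning dataset."""
    doc_counts = Counter()              # number of articles per topic
    word_counts = defaultdict(Counter)  # token counts per topic
    vocab = set()
    for text, topic in articles:
        tokens = tokenize(text)
        doc_counts[topic] += 1
        word_counts[topic].update(tokens)
        vocab.update(tokens)
    return doc_counts, word_counts, vocab


def predict(model, text):
    """Return the single most likely topic; only topics seen in training are possible."""
    doc_counts, word_counts, vocab = model
    total_docs = sum(doc_counts.values())
    tokens = [t for t in tokenize(text) if t in vocab]  # ignore unseen words
    best_topic, best_score = None, -math.inf
    for topic, n_docs in doc_counts.items():
        total_words = sum(word_counts[topic].values())
        score = math.log(n_docs / total_docs)  # log prior
        for t in tokens:
            # Add-one (Laplace) smoothing avoids log(0) for unseen pairs.
            score += math.log((word_counts[topic][t] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_topic, best_score = topic, score
    return best_topic


# Toy data (hypothetical, not taken from the STA dataset).
learning = [
    ("vlada je sprejela proračun", 1),
    ("parlament razpravlja o proračunu", 1),
    ("nogometna tekma se je končala", 2),
    ("reprezentanca je zmagala tekmo", 2),
]
model = train(learning)
print(predict(model, "vlada in proračun"))
```

Because the prediction is an argmax over the topics observed in training, every test article receives exactly one topic, and never one that is absent from the learning data, which matches the two constraints stated above.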
Your submission will be evaluated using the F1 score (https://en.wikipedia.org/wiki/F1_score). You must submit predictions for the whole testing dataset (i.e. every article has to be assigned to a topic).
The submission has to follow the format of the file sta-special-articles-2015-submission.csv: one header row, followed by rows that each contain an article id and a topic id, separated by a semicolon.
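For local validation on a held-out part of the learning data, the F1 score can be computed along the following lines. This is only a sketch: the contest text does not say which averaging is applied across topics, so the example reports per-topic, macro-averaged, and micro-averaged values, and the function name `f1_scores` is hypothetical.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (per-topic F1, macro F1, micro F1) for single-label predictions."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, guess in zip(y_true, y_pred):
        if truth == guess:
            tp[truth] += 1
        else:
            fp[guess] += 1  # guessed topic gains a false positive
            fn[truth] += 1  # true topic gains a false negative
    per_topic = {}
    for label in set(y_true) | set(y_pred):
        denom = 2 * tp[label] + fp[label] + fn[label]
        per_topic[label] = 2 * tp[label] / denom if denom else 0.0
    macro = sum(per_topic.values()) / len(per_topic)
    total_tp = sum(tp.values())
    micro = 2 * total_tp / (2 * total_tp + sum(fp.values()) + sum(fn.values()))
    return per_topic, macro, micro


per_topic, macro, micro = f1_scores([1, 1, 2, 2, 3], [1, 2, 2, 2, 3])
print(per_topic, round(macro, 4), micro)
```

When every article has exactly one topic, the micro-averaged F1 equals plain accuracy, while the macro average weights small and large topics equally, so the two can rank the same classifier quite differently.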
The accepted line breaks are “\n” (Unix style) and “\r\n” (Windows style).
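The format rules above can be sketched as a small writer and a sanity check. The header names `id` and `topic_id`, the function names, and the sample ids are assumptions for illustration; the actual header should be copied from sta-special-articles-2015-submission.csv.

```python
import csv
import io


def write_submission(predictions, out, header=("id", "topic_id")):
    """Write (article_id, topic_id) pairs as semicolon-separated rows.

    The header names are placeholders, not taken from the sample file.
    """
    writer = csv.writer(out, delimiter=";", lineterminator="\n")  # Unix line breaks
    writer.writerow(header)
    for article_id, topic_id in predictions:
        writer.writerow([article_id, topic_id])


def check_submission(text, expected_ids):
    """Return a list of problems; an empty list means the basic checks passed."""
    problems = []
    seen = []
    rows = text.splitlines()[1:]  # skip the header row
    for number, row in enumerate(rows, start=2):
        fields = row.split(";")
        if len(fields) != 2 or not all(fields):
            problems.append(f"line {number}: expected 'id;topic'")
            continue
        seen.append(fields[0])
    missing = {str(i) for i in expected_ids} - set(seen)
    if missing:
        problems.append(f"missing ids: {sorted(missing)}")
    if len(seen) != len(set(seen)):
        problems.append("duplicate ids")
    return problems


buffer = io.StringIO()
write_submission([(101, 7), (102, 3)], buffer)
print(buffer.getvalue())
print(check_submission(buffer.getvalue(), [101, 102]))
```

When writing to a real file rather than a `StringIO`, opening it with `open(path, "w", newline="")` lets the `csv` module control the line endings, so the output keeps the "\n" line breaks on every platform.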
Please note that the winners (i.e. participants who qualify to receive a prize) are obliged to provide us with a working proof of concept that reproduces their submission. Random submission generators, or similar approaches that exploit the output of the evaluation, will not be rewarded.
Submissions are evaluated every 10 minutes: the evaluation program runs every ten minutes, and your most recent submission is evaluated and becomes your official result. The contest ends on December 8th at 23:59:59 UTC. On December 9th we will contact you to provide working source code that generates your solution (you will have to provide it to us on that day), and the official results will be announced on December 10th. Only accounts that are active both at the end of the contest and on the date of the official results will be taken into account.
The prizes are based on the position in the leaderboard and are the following:
- First place – $ 150
- Second place – $ 100
- Third place – $ 50
Apart from that, all three leaders will receive a letter of recommendation from the Slovenian Press Agency attesting to their skills in solving the problem.
If several participants share the same leaderboard position, the prizes for that place and the following place(s) will be pooled and split among all participants holding that position, and the positions of the subsequent participants will be lowered accordingly.
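One possible reading of the tie rule, shown as arithmetic: if two participants share first place, the first and second prizes are pooled ($150 + $100 = $250), each receives $125, and the next participant is ranked third and receives $50. The sketch below encodes that reading. It assumes an equal split, that a higher score is better, and that a tie means exactly equal scores; the function name `split_prizes` and the participant names are hypothetical.

```python
def split_prizes(scores, prizes=(150, 100, 50)):
    """scores: dict mapping participant -> leaderboard score (higher is better).

    Returns a dict of payouts for participants who land on a prize position.
    """
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    payouts = {}
    position = 0  # zero-based position of the current tie group
    while position < len(ranked) and position < len(prizes):
        score = ranked[position][1]
        group = [name for name, s in ranked if s == score]
        # Pool the prizes for the places this group occupies (as many as exist).
        pool = sum(prizes[position:position + len(group)])
        for name in group:
            payouts[name] = pool / len(group)  # assumed equal split
        position += len(group)  # later participants move down accordingly
    return payouts


print(split_prizes({"ana": 0.9, "bor": 0.9, "cene": 0.8, "dana": 0.7}))
```

In this example the two tied leaders receive 125.0 each, the third-ranked participant receives 50.0, and the fourth receives nothing.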
After the challenge ends, the participants holding positions 1-3 will be asked to provide their code for evaluation. If the code does not meet the criteria of this contest or the terms and conditions of the SciCup platform, their submissions will be discarded and the participants holding the next positions will be considered.