The Slovenian Press Agency (STA, https://www.sta.si) is the leading Slovenian provider of media content for domestic and foreign audiences, and this is its second contest organized on the sci·cup platform. The goal of this challenge is to find topical clusters in a set of articles gathered via RSS feeds from Slovenian news providers. Each topical cluster should contain the articles that refer to the same topic and should not contain articles that fall outside the scope of the cluster.

The file provided for download is the list of RSS sources from which to obtain articles. See the Data Files section of the challenge.

You are provided with a list of Slovenian news providers (as a file containing links to their RSS feeds). Your goal is to develop a script that analyzes the articles these sources published on a single day (from midnight to the time the script is executed) and, based on these articles, generates a list of the topics covered by those providers, expressed as clusters. Each cluster should be described by at most ten keywords (or a sentence of up to ten words) and should contain the list of articles (title, link, and description fields from the RSS feed) that belong to it. Each article can belong to only one cluster. The number of clusters is not limited, and their size (the number of articles they contain) can vary.
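The task described above can be illustrated with a minimal sketch. It is not the organizers' reference solution: the function names, the tiny stopword list, the similarity threshold, and the greedy Jaccard-overlap clustering are all assumptions chosen to keep the example short, and a real submission would likely use a stronger text representation. Downloading the feeds is left out; the sketch takes already-fetched RSS 2.0 text and treats "midnight" as midnight UTC unless a different `now` is passed in.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

# Hypothetical, deliberately tiny stopword list; a real one would be much longer.
STOPWORDS = {"in", "na", "za", "je", "se", "so", "ki", "the", "and"}

def parse_feed(xml_text):
    """Extract title, link, description and publication time from RSS 2.0 text."""
    articles = []
    for item in ET.fromstring(xml_text).iter("item"):
        raw = item.findtext("pubDate")
        try:
            published = parsedate_to_datetime(raw) if raw else None
        except (TypeError, ValueError):
            published = None  # unparseable date: the article is skipped later
        articles.append({
            "title": (item.findtext("title") or "").strip(),
            "link": (item.findtext("link") or "").strip(),
            "description": (item.findtext("description") or "").strip(),
            "published": published,
        })
    return articles

def published_today(article, now):
    """True if the article was published between midnight and `now`."""
    published = article["published"]
    if published is None:
        return False
    published = published.astimezone(now.tzinfo)
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    return midnight <= published <= now

def tokens(article):
    """Lowercased word set of title + description, minus short words and stopwords."""
    text = (article["title"] + " " + article["description"]).lower()
    return {w for w in re.findall(r"\w+", text) if len(w) > 2 and w not in STOPWORDS}

def cluster(articles, threshold=0.2):
    """Greedy pass: join the most similar existing cluster, else start a new one."""
    clusters = []
    for article in articles:
        words = tokens(article)
        best, best_score = None, threshold
        for candidate in clusters:
            union = words | candidate["vocab"]
            score = len(words & candidate["vocab"]) / len(union) if union else 0.0
            if score >= best_score:
                best, best_score = candidate, score
        if best is None:
            best = {"vocab": set(), "counts": Counter(), "articles": []}
            clusters.append(best)
        best["vocab"] |= words
        best["counts"].update(words)
        # Each article lands in exactly one cluster, with the three required fields.
        best["articles"].append({k: article[k] for k in ("title", "link", "description")})
    # Describe each cluster by its (at most) ten most frequent words.
    return [{"keywords": [w for w, _ in c["counts"].most_common(10)],
             "articles": c["articles"]} for c in clusters]

def todays_clusters(feed_texts, now=None):
    """Cluster all of today's articles from a list of RSS documents."""
    now = now or datetime.now(timezone.utc)  # assumption: UTC day boundaries
    articles = [a for text in feed_texts for a in parse_feed(text)
                if published_today(a, now)]
    return cluster(articles)
```

Calling `todays_clusters(list_of_feed_texts)` returns a list of dictionaries, each with a `keywords` list and an `articles` list. Because the clustering is order-dependent and purely lexical, it is best read as a baseline to compare better approaches against.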

Evaluation criteria

Automated evaluation is not possible for this challenge, since there is no ground truth. Instead, the submissions will be evaluated by an STA expert.

To make the evaluation feasible, the challenge will be organized in two phases. The first phase will end on February 14th, 2018 at 16:00:00 UTC. The last submission made before that deadline (one per account) will be run the same day, and the contestants will receive feedback on their solutions two days later, on February 16th, 2018 (as a quality score from 0 to 1, supplemented by comments from an STA expert). Only the 10 best individuals, based on the quality score from the first phase, will qualify for the second phase.

The second phase will start on February 16th, 2018 at 16:00:00 UTC and will last until February 28th, 2018 at 16:00:00 UTC. The results will be evaluated in the same way as in the first phase (last submission per contestant), and the final results will be announced on March 6th, 2018.

The prizes are based on the position on the leaderboard after the second phase and are as follows:

First place – 150 EUR

Second place – 100 EUR

Third place – 50 EUR

Apart from that, all three leaders will receive a letter of recommendation from the Slovenian Press Agency attesting to their skills in solving the problem.

If several participants share the same leaderboard position, the prizes for that place and the following place(s) will be merged and split among all participants holding that position, and the positions of subsequent participants will be lowered accordingly.
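One possible reading of this tie rule can be sketched in code. The sketch assumes that the merged prizes are split equally, that a tie of n participants consumes n consecutive places, and that the leaderboard is ordered by a numeric score; the function name `split_prizes` and the example scores are hypothetical and not part of the contest rules.

```python
def split_prizes(scores, prizes=(150, 100, 50)):
    """Return each participant's payout in EUR under the assumed tie rule."""
    ranked = sorted(scores.items(), key=lambda pair: pair[1], reverse=True)
    payouts = {}
    position = 0  # zero-based place of the first member of the current tie group
    while position < len(ranked):
        top_score = ranked[position][1]
        group = [name for name, score in ranked if score == top_score]
        # Merge the prizes for every place the tied group occupies, then split equally.
        pool = sum(prizes[position:position + len(group)])
        for name in group:
            payouts[name] = pool / len(group)
        # Subsequent participants move down by the size of the tie group.
        position += len(group)
    return payouts

# Two participants tied for first place share 150 + 100 EUR;
# the next participant is then treated as third.
example = split_prizes({"anna": 0.9, "bor": 0.9, "cene": 0.8, "dana": 0.7})
```

Under these assumptions, `example` assigns 125 EUR each to the two tied leaders, 50 EUR to the next participant, and nothing to the fourth.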

After the challenge ends, the participants in positions 1-3 will be asked to provide their final code for evaluation, and this code will be shared on the sci·cup GitHub profile. If the code does not meet the criteria of this contest or the terms and conditions of the sci·cup platform, the submission will be discarded and the participants in the next positions will be considered.