ArtLabs team took 4th place in the Telegram Data Clustering Contest

Introduction

The Telegram Data Clustering contest was held in 2021 by the Telegram team. The task of the contest was to create a C/C++ library that can determine the language and topic of a Telegram channel.

The ArtLabs team consisted of 2 people in this contest.

We were able to achieve 4th place in the contest with our solution.

The main challenges of the contest were:

  • No labeled data was provided
  • Submissions had to be strictly in C/C++
  • Speed was crucial, which ruled out transformers and other heavy deep learning models
  • The deadline was very strict: only 2 weeks for the solution

The task was divided into 2 parts: determining the channel language and, for channels in English and Russian, determining the channel topic. The data set was a text file in which each line described one channel in JSON format:

{
  title:        "Channel title",
  description:  "Channel description",
  recent_posts: [
    "text #1 of message or caption of media or content of poll etc.",
    "text #2 of message or caption of media or content of poll etc.",
    ...
  ]
}
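As a minimal sketch of how one line of such a file could be turned into a single text for a classifier, the Python snippet below parses the line and joins the fields. It is illustrative only: the contest submission itself had to be C/C++, the helper name `channel_text` is hypothetical, and it assumes that every line is valid JSON with string values under the `title`, `description` and `recent_posts` keys.

```python
import json


def channel_text(line: str) -> str:
    """Join the title, description and recent posts of one channel into one string.

    Assumes `line` is a valid JSON object with the keys shown in the sample above.
    """
    channel = json.loads(line)
    parts = [channel.get("title", ""), channel.get("description", "")]
    parts.extend(channel.get("recent_posts", []))
    # Collapse newlines and repeated spaces so the result is a single clean line.
    return " ".join(" ".join(parts).split())
```

Calling `channel_text` on each line of the data file yields one text per channel, which can then be passed to a language or topic model.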

Our library had to work in 2 modes: category and language.

For mode=language the library predicted the language of the given channel.

{
  "lang_code": "en"
}

For mode=category the library predicted both the language and the categories of the given channel.

{
  "lang_code": "en",
  "category": {
    "Art & Design": 0.9,
    "Other": 0.1
  }
}
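The two output shapes can be produced by one small function. The Python sketch below is only an illustration of the response format shown above, not the contest code (which was C/C++). The function name `build_response` and its parameters are hypothetical.

```python
import json
from typing import Mapping, Optional


def build_response(mode: str, lang_code: str,
                   category_scores: Optional[Mapping[str, float]] = None) -> str:
    """Serialize a prediction in the shape expected for the given mode."""
    if mode not in ("language", "category"):
        raise ValueError(f"unknown mode: {mode}")
    response = {"lang_code": lang_code}
    if mode == "category":
        # Category names map to scores, as in the sample output above.
        response["category"] = dict(category_scores or {})
    return json.dumps(response, ensure_ascii=False)
```

For example, `build_response("category", "en", {"Art & Design": 0.9, "Other": 0.1})` returns a JSON string with both the `lang_code` and `category` fields.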

Channel Language Detection

In the real world, there is little reason to reinvent the wheel for language detection: many existing solutions handle the task well. We used the pre-trained language detection model from the fastText package. fastText is a C++ library, so it was easy to embed it directly into our solution. Our language detection solution received 92/100 points from the judges.
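A rough Python sketch of the post-processing around such a model is shown below. fastText returns labels with a `__label__` prefix (for example `__label__en`) and predicts one line at a time, so the text is flattened first and the prefix is stripped afterwards. The function name `detect_language`, the injected `predict` callable and the `fallback` value are assumptions of this sketch, not details of our C++ submission.

```python
from typing import Callable, Sequence, Tuple

LABEL_PREFIX = "__label__"  # prefix fastText puts in front of every label

# A predictor takes one line of text and returns (labels, probabilities).
Predictor = Callable[[str], Tuple[Sequence[str], Sequence[float]]]


def detect_language(predict: Predictor, text: str, fallback: str = "other") -> str:
    """Return the top language code for `text`, or `fallback` if nothing is predicted."""
    # fastText predicts a single line, so newlines have to be removed first.
    cleaned = " ".join(text.split())
    if not cleaned:
        return fallback
    labels, _probabilities = predict(cleaned)
    if not labels:
        return fallback
    top = labels[0]
    return top[len(LABEL_PREFIX):] if top.startswith(LABEL_PREFIX) else top
```

With the fastText Python package installed, `predict` could be the `predict` method of a model loaded with `fasttext.load_model`, for instance from the pre-trained `lid.176.bin` file. Which exact model file we shipped is not stated here, so treat that name as an assumption.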

Channel Topic Detection

Detecting channel topics was harder, since we had to choose among 42 classes, such as Culture & Events, Celebrities & Lifestyle, and Religion & Spirituality. To get labeled data we had two options: label the data Telegram provided ourselves, or scrape labeled data from another source. Since the timeline was very strict, we chose the second option. Fortunately, we found the tgstat website. It contains statistics for many Telegram channels along with the 10-20 latest posts from each channel and, more importantly, its channels are already sorted into categories. Those categories are very similar to the ones Telegram asked us to predict, so we were able to collect around 20000 labeled channels across the different categories.
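To train a fastText classifier, scraped channels have to be written in fastText's supervised format: one example per line, starting with a `__label__` token. The Python sketch below shows one possible conversion. The helper name `to_fasttext_line` is hypothetical, and joining multi-word category names with underscores is an assumption of this sketch, chosen because a fastText label must be a single whitespace-free token.

```python
def to_fasttext_line(category: str, text: str) -> str:
    """Format one labeled channel as a line of fastText supervised training data."""
    # "Art & Design" -> "__label__Art_&_Design" (labels cannot contain spaces).
    label = "__label__" + "_".join(category.split())
    # Each training example must fit on one line, so newlines are collapsed.
    body = " ".join(text.split())
    return f"{label} {body}"
```

Writing one such line per scraped channel gives a training file that fastText's supervised mode can read directly.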

Once we had downloaded and formatted the data, we had to decide which model to use. In C++ we were pretty much limited to fastText, XGBoost, and LightGBM. After a few experiments, we found that fastText was both the fastest and the most accurate on the validation set.

To determine the channel topic we followed a very old approach to text classification. Since we had no deep learning methods at hand, we built a vectorized text representation with the Bag of Words method, extended with bigrams and trigrams. We then trained a fastText model to classify the text into 42 classes. Our solution received 30 out of 100 points on the Russian dataset and 35 out of 100 points on the English dataset.
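The idea of a Bag of Words with bigrams and trigrams can be illustrated with a few lines of Python. This is a simplified sketch, not our contest implementation: the function names are hypothetical, and the lowercase `\w+` tokenizer is an assumption made for brevity.

```python
import re
from collections import Counter
from typing import List


def word_ngrams(tokens: List[str], max_n: int = 3) -> List[str]:
    """Return all word n-grams of length 1..max_n, in order of n."""
    grams = []
    for n in range(1, max_n + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams


def bag_of_ngrams(text: str, max_n: int = 3) -> Counter:
    """Count unigrams, bigrams and trigrams; word order beyond max_n is ignored."""
    # \w+ matches Unicode word characters, so Cyrillic text is tokenized as well.
    tokens = re.findall(r"\w+", text.lower())
    return Counter(word_ngrams(tokens, max_n))
```

For the text "New art, new design" this counts the unigram "new" twice and also records bigrams such as "new design" and trigrams such as "art new design", which keeps some local word order that plain unigram counts lose.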

Conclusion

All in all, this contest was a real challenge for us. Adding a Machine Learning solution to a C++ library in just 2 weeks was very difficult. Moreover, we had to be creative to find a way to get training data for our models. Still, we enjoyed the challenge very much and are planning to participate in other Machine Learning & Software development competitions soon!