Datasets for classification tasks related to social media text

(an incomplete list)

If you use any of the provided datasets, please make sure to cite the corresponding papers! Links are given without warranty of any kind.

You can notify me of expired links or other datasets I should add here.

Sentiment

Germeval Task 2017 - Shared Task on Aspect-based Sentiment in Social Media Customer Feedback (also includes a task on "relevance" to a topic, which may be interesting)

Sarcasm

SARC: http://nlp.cs.princeton.edu/SARC/ Paper: https://arxiv.org/pdf/1704.05579.pdf

Geolocation

W-NUT geolocation Shared Task

User classification

TWISTY: a Multilingual Twitter Stylometry Corpus for Gender and Personality Profiling http://www.clips.ua.ac.be/datasets/twisty-corpus Paper: http://www.clips.ua.ac.be/~walter/papers/2016/vdp16.pdf
Author Profiling Shared Task http://pan.webis.de/clef17/pan17-web/author-profiling.html (further down that page there are other datasets, including those from previous years, 2013-2016)

Hate Speech

Twitter hate speech (English; by Zeerak Waseem)
German hate speech & cyberbullying datasets (by Uwe Bretschneider et al.)

Others

Clickbait Challenge (identify clickbait)
Dialog act annotation for German Twitter conversations (contact me)

Last modified: Mon Jul 3 15:49:32 CEST 2017