Datasets for Concept Drift

  1. SEA Concepts
    (SEA Concepts Dataset)
    Synthetic dataset (proposed by Street and Kim, 2001) with 60,000 examples, 3 attributes, and 2 classes. The attributes are numeric, with values between 0 and 10, and only two of them are relevant. There are four concepts of 15,000 examples each, which differ in the threshold of the concept function: if relevant_feature1 + relevant_feature2 > threshold, then class = 0; otherwise class = 1. The threshold values are 8, 9, 7, and 9.5. About 10 % of the examples have noisy class labels.
  2. Usenet
    (Usenet Dataset)
    Text dataset, inspired by Katakis et al. (2010), that simulates news filtering with concept drift caused by a user's interests changing over time. We use the data from 20 Newsgroups (Rennie, 2008) and handle it as follows. Six topics are chosen; in each concept the simulated user is subscribed to the mailing lists of four of them but is interested in only two. Over time, the virtual user unsubscribes from the groups he was not interested in and subscribes to two new ones that he becomes interested in, while the previously interesting topics fall out of his main interest. Table 1 summarizes the concepts. Note that the topics of interest are repeated to simulate recurring concepts. The original dataset is divided into train and test parts: data from the train part appears in the first three concepts, whereas data from the test part appears in the last three (recurring) concepts. The data is preprocessed with the tm package for R (Feinerer, 2010), keeping only attributes (words) that are longer than three letters and have a document frequency greater than three. Of the remaining attributes, only the informative ones are kept (entropy > 75 × 10^-5). Attribute values are binary, indicating the presence or absence of the respective word. The final set has 659 attributes and 5,931 examples.
  3. Intrusion Detection
    (KDD Cup 10 Percent Dataset)
    This dataset was used in the KDD Cup 1999 competition (Frank and Asuncion, 2010). The full dataset has about five million connection records; this set contains only 10 % of them. The original task has 24 training attack types. In our experiments, the original attack-type labels are all changed to the label abnormal, and the label normal is kept for normal connections. This simplifies the set to a two-class problem.
  4. Spam Detection
    (Spam Dataset)
    Real-world textual dataset that uses the SpamAssassin data collection (Katakis et al., 2010). It consists of 9,324 examples with 40,000 attributes and exhibits gradual concept drift. There are two classes, legitimate and spam, with a spam ratio of around 20 %.
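
The SEA description in item 1 is concrete enough to sketch a generator for a comparable stream. The Python sketch below is an illustration only, not the code used to build the dataset. It takes the thresholds and the concept length from the description and makes three assumptions: attributes are drawn uniformly from [0, 10], the label is binary (class 1 whenever the class-0 condition fails), and "noise" means flipping the class label with probability 0.10. The names generate_sea, THRESHOLDS, EXAMPLES_PER_CONCEPT and NOISE_RATE are hypothetical.

```python
import random

# Values taken from the SEA description in item 1.
THRESHOLDS = (8.0, 9.0, 7.0, 9.5)
EXAMPLES_PER_CONCEPT = 15_000
# Assumption: "10 % of noise" is modelled as flipping the label with this probability.
NOISE_RATE = 0.10


def generate_sea(thresholds=THRESHOLDS, per_concept=EXAMPLES_PER_CONCEPT,
                 noise_rate=NOISE_RATE, seed=0):
    """Return a list of ((f1, f2, f3), label) pairs with abrupt drift between concepts."""
    rng = random.Random(seed)
    stream = []
    for threshold in thresholds:
        for _ in range(per_concept):
            # Assumption: attributes are uniform on [0, 10]; f3 never affects the class.
            f1 = rng.uniform(0.0, 10.0)
            f2 = rng.uniform(0.0, 10.0)
            f3 = rng.uniform(0.0, 10.0)
            # Concept function: sum of the relevant features above the threshold gives class 0.
            label = 0 if f1 + f2 > threshold else 1
            if rng.random() < noise_rate:
                label = 1 - label  # inject class noise
            stream.append(((f1, f2, f3), label))
    return stream
```

Calling generate_sea() with the defaults yields 60,000 examples, with the threshold changing after every 15,000. Passing noise_rate=0.0 gives a noise-free stream, which is convenient for checking a drift detector against the known change points.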
  • Street, W. N. and Y. Kim (2001). A streaming ensemble algorithm (SEA) for large-scale classification. pp. 377-382. ACM Press.
  • Katakis, I., G. Tsoumakas, and I. Vlahavas (2010). Tracking recurring contexts using ensemble classifiers: an application to email filtering. Knowledge and Information Systems 22, 371-391.
  • Katakis, I., G. Tsoumakas, and I. Vlahavas (2008). An ensemble of classifiers for coping with recurring contexts in data streams. In Proceedings of the 2008 Conference on ECAI 2008, Amsterdam, The Netherlands, pp. 763-764. IOS Press.
  • Frank, A. and A. Asuncion (2010). UCI machine learning repository.