  • BlogCatalog: a social media dataset with rich link, content, and label information
    • Description: a dataset well suited to learning tasks that use rich social networking information, especially prediction and community detection tasks, since ground truth is available to verify your hypotheses. It has link information (i.e., friends), content information (e.g., tags, posts), and label information (i.e., user interests).

    • The dataset contains around 90,000 users, together with their undirected social network, their blogs, tags for each blog, categories (which can be treated as user interests or class labels), and 6 snippets for each blog. The social network is the largest component on BlogCatalog (http://www.blogcatalog.com) as of September 2009.

    • Features

      • Connections or Links: undirected social network
      • User Generated Content:
        • Tag: user-generated keywords describing each blog a user owns
        • Post: the 6 most recent posts in each blog
      • Class Label: selected from a predefined list of categories, indicating the interests of each user
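
      The three kinds of information listed above (links, content, labels) can be held in simple in-memory structures. The following is a minimal sketch, not the dataset's official loader: the comma-separated `user,user` and `user,category` layouts, the sample ids, and the helper names `load_undirected_edges` and `load_labels` are all assumptions made for illustration. Check the readme shipped with the download for the actual file layout.

```python
import csv
import io
from collections import defaultdict


def load_undirected_edges(lines):
    """Build an adjacency map from 'u,v' rows, storing each edge in both directions."""
    adjacency = defaultdict(set)
    for row in csv.reader(lines):
        if len(row) != 2:
            continue  # skip blank or malformed rows
        u, v = int(row[0]), int(row[1])
        adjacency[u].add(v)
        adjacency[v].add(u)
    return adjacency


def load_labels(lines):
    """Map each user id to a set of category ids (a user may have several)."""
    labels = defaultdict(set)
    for row in csv.reader(lines):
        if len(row) != 2:
            continue  # skip blank or malformed rows
        labels[int(row[0])].add(int(row[1]))
    return labels


# Hypothetical sample data, not taken from the real dataset.
edge_text = io.StringIO("1,2\n2,3\n1,3\n3,4\n")
label_text = io.StringIO("1,10\n1,11\n2,10\n4,12\n")

adjacency = load_undirected_edges(edge_text)
labels = load_labels(label_text)

print(sorted(adjacency[3]))  # neighbours of user 3
print(sorted(labels[1]))     # categories of user 1
```

      Storing every edge in both directions reflects the undirected network, and keeping labels as sets allows a user to carry more than one category.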
    • Citations and Acknowledgements

    • The dataset is shared in two versions: [the preprocessed version in Matlab format] and [the original version with no edits]. We would highly appreciate it if you cite or acknowledge our ICDM 2010 work "Discovering Overlapping Groups in Social Media". [Bibtex]

      @CONFERENCE{Wang-etal10,
      author = {Xufei Wang and Lei Tang and Huiji Gao and Huan Liu},
      title = {Discovering Overlapping Groups in Social Media},
      booktitle = {the 10th IEEE International Conference on Data Mining series (ICDM2010)},
      year = {2010},
      address = {Sydney, Australia},
      month = {December 14 - 17}
      }

    • The dataset should not be distributed without permission from the authors (and also from the BlogCatalog website, if its privacy statement changes accordingly). If you have any questions, please do not hesitate to contact me at xufei.wang@asu.edu; we are willing to help.



  • Flickr: a photo sharing dataset
    • Description: the dataset includes more than 35,000 users, along with the groups they joined and their tags. It also includes the friendship and commentship (i.e., who comments on whose photos) relations among these users. The joined groups can be treated as class labels in classification tasks or as ground truth for community detection tasks.

    • Features

      • Connections or Links: undirected social network and commentship
      • User Generated Content: tags aggregated over each user's photos
      • Class Label: groups that users joined
    • Citations and Acknowledgements

    • The dataset is shared in two versions: [the preprocessed version in Matlab format] and [the original version with no edits]. We would highly appreciate it if you cite or acknowledge our recent journal paper "Learning with Multi-Resolution Overlapping Communities". [Bibtex]

      @ARTICLE{Wang-etal12,
      author = {Xufei Wang and Lei Tang and Huan Liu and Lei Wang},
      title = {Learning with Multi-Resolution Overlapping Communities},
      journal = {Knowledge and Information Systems (KAIS)},
      year = {2012},
      DOI = {10.1007/s10115-012-0555-0}
      }

    • The dataset should not be distributed without permission from the authors (and also from the Flickr website, if its privacy statement changes accordingly). If you have any questions, please do not hesitate to contact me at xufei.wang@asu.edu; we are willing to help.



  • Delicious: a social bookmarking dataset
    • Descriptions

    • Coming soon after I clean the data
    • Readme


  • StumbleUpon: a social bookmarking dataset
    • Descriptions

    • Coming soon after I clean the data
    • Readme


  • Twitter: an information diffusion network with rich content information
    • Description: well suited to studying (among other things) information diffusion, such as the spread of hashtags and knowledge through the Twitter follower network.

    • Features

      • User Id: Integer
      • Follower Id: Integer
      • Friend Id: Integer
      • Hashtag: String
      • Tweet Id: Integer
      • Tweet Timestamp: Date
      • Tweet: String
      • URL: String
      • Retweet
      • Reply
      • Mention
      • User Profile: JSON
      • Geo-Location: Double
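
      To make the feature list above concrete, here is one possible in-memory record for a single tweet. This is a sketch under stated assumptions, not the dataset's actual schema: the class name `TweetRecord`, the field names, the reading of Geo-Location as a (latitude, longitude) pair, and the treatment of the user profile as a JSON string are all illustrative. Retweet, Reply, and Mention are left out because the list does not give their types.

```python
import json
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, List, Optional, Tuple


@dataclass
class TweetRecord:
    """Hypothetical layout for one tweet; field names are illustrative only."""
    user_id: int
    tweet_id: int
    timestamp: datetime
    text: str
    hashtags: List[str] = field(default_factory=list)
    urls: List[str] = field(default_factory=list)
    # Assumed to be a (latitude, longitude) pair of doubles; None when absent.
    geo: Optional[Tuple[float, float]] = None
    # User profile, assumed to arrive as a JSON string and parsed into a dict.
    profile: Dict[str, object] = field(default_factory=dict)


# Hypothetical values, not taken from the real dataset.
record = TweetRecord(
    user_id=42,
    tweet_id=1001,
    timestamp=datetime(2011, 2, 1, 12, 30),
    text="Hello #icdm",
    hashtags=["icdm"],
    profile=json.loads('{"screen_name": "example_user"}'),
)

print(record.timestamp.isoformat())
print(record.geo is None)
```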
    • Readme

    • We (Shamanth Kumar and I) crawled the Twitter dataset between 02/01/11 and 08/31/11. It carries rich information, including user profiles, tweets, retweets, hashtags, replies, mentions, geo-locations, and the timestamp associated with each tweet. The dataset contains more than 660,000 users and 12 million tweets among those users. Currently the social network and the Tweet ids are available for download and sharing, but the Tweets themselves may not be shared with any third party, in compliance with the Twitter Terms of Use. For more details, you can reach me at xufei.wang@asu.edu or visit our TwitterTracker project website, which is supported by the Office of Naval Research (ONR). We are willing to provide any assistance we can to make your work more productive and successful.

      To obtain the full dataset, first download the network file (with a readme file) [from this link] and the Tweet ids [from this link]. Then choose one of the [open source libraries] to download all tweets using the provided Tweet ids. More information about this dataset can be found in my recent technical report "Identifying Information Spreaders in Twitter Follower Networks". Citations and acknowledgements would be highly appreciated if you find the dataset useful. [Bibtex]

      @TECHREPORT{Wang-etal12,
      author = {Xufei Wang and Huan Liu and Peng Zhang and Baoxin Li},
      title = {Identifying Information Spreaders in Twitter Follower Networks},
      institution = {School of Computing, Informatics, and Decision Systems Engineering, Arizona State University},
      year = {2012},
      number = {TR-12-001},
      address = {Tempe, AZ 85287, USA}
      }
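
      When downloading tweets from the provided Tweet ids, it is usually convenient to group the ids into batches before handing them to a library. The sketch below only reads ids and batches them; it makes no network requests and does not use any particular Twitter library. The one-id-per-line layout, the batch size of 100, and the helper names `read_tweet_ids` and `batched` are assumptions; use whatever batch size your chosen library and the Twitter API actually allow.

```python
import io
from itertools import islice


def read_tweet_ids(lines):
    """Yield integer tweet ids, one per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:
            yield int(line)


def batched(ids, size):
    """Yield lists of at most `size` ids, preserving input order."""
    iterator = iter(ids)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# Hypothetical ids standing in for the downloaded Tweet id file.
id_text = io.StringIO("\n".join(str(i) for i in range(1, 251)))
batches = list(batched(read_tweet_ids(id_text), 100))

print([len(b) for b in batches])
```

      Each batch can then be passed to whichever download routine the chosen library provides.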


  • If you want more data, send me an email or visit this link: Social Media Data Repository