Over the last few years, collaborative tagging systems like Flickr, Del.icio.us or Bibsonomy have become increasingly popular because they allow users to easily upload resources such as photos, bookmarked URLs and BibTeX entries and to share them with other users. Additionally, users can organize their resources by assigning tags or keywords to them. Over time, one can observe the emergence of a loose categorization system, frequently called a folksonomy, which can be used for retrieving specific resources and for navigating through the large set of resources.
Thus, folksonomies constitute intriguing dynamic systems constructed by the collaboration and interaction of their users. They offer new possibilities for finding resources. At the same time, they pose a challenge for existing models of categorization and retrieval of resources: the usage of tags at the micro-level of the individual user and at the macro-level of groups of users and of the complete user community is not yet understood, nor have the two levels been related to each other.
Recent research has brought forward an interesting temporal perspective on folksonomies by viewing them as dynamic stochastic systems with memory. However, this perspective abstracts away the background knowledge common to folksonomy users and puts too much emphasis on the imitation of other users and the random generation of vocabulary. We advocate the hypothesis that both components, i.e. background knowledge and imitation, are needed to explain and understand the tagging behavior of users. We describe our proposal in the technical report below. It better approximates the behavior found in actual tagging systems and thus gives more meaningful insights into the tagging process. For example, it helps to distinguish between effects in the tagging system caused by the natural language behavior of users and effects that are specific to the user interface of tagging systems.
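To make the interplay of the two components more concrete, the following Python sketch generates an artificial tag stream in which each tag assignment is either drawn from a background vocabulary or copied from an earlier assignment. It is a toy illustration only and not the model from the technical report: the function name `simulate_stream`, the parameter `p_background`, the uniform choice among earlier assignments and the sample vocabulary are all assumptions made for this example.

```python
import random
from typing import Dict, List, Optional


def simulate_stream(background: Dict[str, float],
                    n_assignments: int,
                    p_background: float,
                    seed: Optional[int] = None) -> List[str]:
    """Generate a toy tag stream mixing background knowledge and imitation.

    background    -- hypothetical word -> occurrence probability mapping
    p_background  -- assumed probability of using background knowledge
                     instead of imitating an earlier tag assignment
    """
    rng = random.Random(seed)
    words = list(background)
    weights = [background[w] for w in words]
    stream: List[str] = []
    for _ in range(n_assignments):
        if not stream or rng.random() < p_background:
            # Background knowledge: draw a word by its occurrence probability.
            tag = rng.choices(words, weights=weights, k=1)[0]
        else:
            # Imitation: copy a uniformly chosen earlier assignment, so tags
            # that were used often are more likely to be used again.
            tag = rng.choice(stream)
        stream.append(tag)
    return stream


if __name__ == "__main__":
    # Invented vocabulary, for illustration only.
    vocabulary = {"web": 0.6, "design": 0.3, "css": 0.1}
    print(simulate_stream(vocabulary, n_assignments=20, p_background=0.3, seed=1))
```

With `p_background` set to 1.0 the sketch ignores earlier assignments completely, while values close to 0 make the stream depend almost entirely on the first few tags. Comparing such extremes gives a rough feeling for why neither component alone is likely to explain observed tagging behavior.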
Data Set and Software Simulator
In the following, we provide three files for each of the co-occurrence streams from the technical report:
- Stream: In the files with the streams, each row corresponds to a single tag assignment. The first column contains an artificial tag ID and the second column an artificial resource ID. The order of the rows corresponds to the order in which the tag assignments were made by the users.
- URLs: Each row contains a single URL that was crawled for the web corpus of the corresponding stream. The URLs are alphabetically ordered.
- Web Corpus: These files contain the word occurrence probabilities in the web corpora. Each row contains three columns: The first column contains an artificial tag ID that is the same as in the tag streams. If the word doesn't exist in the stream, a negative integer ID is used. The second column contains the word and the third column its relative occurrence probability. The rows are ordered by descending occurrence probability and the sum of all occurrence probabilities is 1.
| Tag | Tag Assignments | Users | Tags | Resources | Stream | URLs | Web Corpus |
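The stream and web corpus files described above can be read with a few lines of code. The following Python sketch assumes that the columns are separated by whitespace and that words contain no embedded whitespace; neither is stated explicitly here, so the splitting may need to be adapted to the actual files. The function names and the sample rows are hypothetical and not taken from the data set.

```python
from collections import Counter
from typing import Iterable, List, NamedTuple, Tuple


class CorpusEntry(NamedTuple):
    tag_id: int         # negative if the word does not occur in the stream
    word: str
    probability: float  # relative occurrence probability in the web corpus


def read_stream(lines: Iterable[str]) -> List[Tuple[int, int]]:
    """Parse (tag ID, resource ID) pairs, keeping the original row order."""
    assignments = []
    for line in lines:
        parts = line.split()  # assumption: whitespace-separated columns
        if len(parts) < 2:
            continue  # skip blank or malformed rows
        assignments.append((int(parts[0]), int(parts[1])))
    return assignments


def read_web_corpus(lines: Iterable[str]) -> List[CorpusEntry]:
    """Parse (tag ID, word, probability) rows of a web corpus file."""
    entries = []
    for line in lines:
        parts = line.split()
        if len(parts) < 3:
            continue
        entries.append(CorpusEntry(int(parts[0]), parts[1], float(parts[2])))
    return entries


def is_descending(entries: List[CorpusEntry]) -> bool:
    """Check that rows are ordered by descending occurrence probability."""
    return all(a.probability >= b.probability
               for a, b in zip(entries, entries[1:]))


if __name__ == "__main__":
    # Invented sample rows, for illustration only.
    stream = read_stream(["1 10", "2 10", "1 11"])
    corpus = read_web_corpus(["1 web 0.5", "-1 the 0.3", "2 design 0.2"])

    tag_counts = Counter(tag_id for tag_id, _ in stream)
    print("tag frequencies:", dict(tag_counts))
    print("ordered by probability:", is_descending(corpus))
    print("probabilities sum to 1:",
          abs(sum(e.probability for e in corpus) - 1.0) < 1e-6)
```

To read a downloaded file, pass an open file object instead of the sample lists, for example `read_stream(open(path, encoding="utf-8"))`, where `path` points to the stream file and the encoding is again an assumption.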
Finally, we provide the Java software that was used to run the simulations described in the technical report, together with the generated artificial tag streams:
- TaggingModels.jar: Software simulator of the TopN-Model and the Yule-Simon Process with Memory. The source code of the software is also contained in the jar file and can be extracted, for example, with any zip utility. See the README for more details on how to start the simulator.
- TaggingSimulation.tgz: Archive containing all generated tag streams, the software simulator and the technical report.
- README: File describing how to start the software simulator and which files are contained in the archive with the artificial tag streams.
Dr. Klaas Dellschaft
Prof. Dr. Steffen Staab