Statistics is a broad mathematical discipline which studies ways to collect, summarize and draw conclusions from data. [Wikipedia]
Statistics help us draw conclusions from data. In a way, this whole tagging thing just popped up, and now we are trying to figure out what is really happening. I think statistics can help us understand tags.
When I set up my performance test system I wanted to know the metrics of delicious, so I tried to extrapolate from some hand-collected data, but that didn’t turn out well.
After that I started collecting post data from del.icio.us, and I am happy to announce that I’ve set up a site with delicious statistics that is fully automated (my hands can rest now..). It shows trends for the number of posts per day as well as the number of tags per post.

The stats are based on data I extract from the most recent posts feed, which I grab 6 times an hour (I’m trying not to be evil: no screen scraping, no grabbing every minute). I miss a big portion of the posts (I actually record only about 10% of the data), but I guess the stats are precise enough to draw some conclusions.
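To make the sampling idea concrete, here is a minimal sketch of how feed snapshots could be merged and turned into a tags-per-post histogram. It is not the code behind the stats site: the post layout (a dict with `id` and `tags`), the function names, and the sample values are assumptions made for illustration, and the 10% coverage is simply the rough figure mentioned above.

```python
from collections import Counter

def merge_snapshots(snapshots):
    """Merge several feed snapshots into one dict keyed by post id.

    Consecutive snapshots of a "most recent posts" feed can overlap,
    so a post that shows up twice is only kept once.
    """
    seen = {}
    for snapshot in snapshots:
        for post in snapshot:
            seen.setdefault(post["id"], post)
    return seen

def tags_per_post_histogram(posts):
    """Count how many posts carry 0, 1, 2, ... tags."""
    return Counter(len(post["tags"]) for post in posts)

def estimate_total(recorded, coverage=0.10):
    """Scale a recorded count up to an estimated total.

    coverage is the assumed fraction of all posts that the sampling
    catches (about 10% in this post); it is an estimate, not a measurement.
    """
    return round(recorded / coverage)

# Hypothetical snapshots, only to show the call sequence.
snapshots = [
    [{"id": "a", "tags": ["web", "tools"]}, {"id": "b", "tags": []}],
    [{"id": "b", "tags": []}, {"id": "c", "tags": ["python"]}],
]
posts = merge_snapshots(snapshots)
histogram = tags_per_post_histogram(posts.values())
```

Because every post is sampled with roughly the same chance, the shape of the histogram should survive the sampling even though the absolute counts have to be scaled up.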
I’m fond of del.icio.us (as you may know), and when I’m fond of a website I feel the urge to know how many people are using it and whether the service is attracting people or scaring them away. I feel a need to know what’s up. Especially now that delicious has been acquired by Yahoo, you may ask “do people stay?”.
Anyway, that’s not the only reason for the stats. When I set up the performance tests I wanted to have real numbers. I also asked on the delicious mailing list. The same question had been asked there a few times, but there were no answers..
Now, my stats don’t answer every question. If you’re asking yourself “how many inserts does my tag system have to cope with if it gets really big?”, they will help you. But I cannot provide any query stats. Maybe Alexa can give you some query trends (maybe you can subtract my numbers from Alexa’s to get the query stats?).
From the stats you can see the two downtimes of delicious since August.


You can also see that the recent growth of del.icio.us only started in December. I think it has to do with the more polished look and feel (changed in the middle of November) as well as with the new Firefox plugin, which gives the service a more professional touch. This growth is a thank-you to Joshua and his team.
Then, take a look at the “tag hump” at 10 tags per post:

My first quick investigations show that this is caused by (you guessed it) tag spammers.
I found two spammers that constantly post bookmarks with 10 tags (look out, the first link has Chinese characters in it; my Firefox slowed down big time). This shows that stats can help find anomalies such as spam.
I also thought that the lazy sheep bookmarklet might cause such humps, but by default lazy sheep’s posts have a maximum of 6 tags. There’s no irregularity at “6”, so I guess lazy sheep doesn’t have a big influence (a fact I’m quite happy with).
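Spotting such a hump does not have to be done by eye. The sketch below shows one possible check, assuming the tags-per-post counts fall off smoothly: each bin is compared with the geometric mean of its two neighbours and flagged if it sticks out by more than a chosen factor. The function name, the factor of 1.5, and the histogram values are all made up for illustration; they are not taken from the real del.icio.us data.

```python
import math

def find_humps(histogram, factor=1.5):
    """Return the tag counts whose bin is much larger than its neighbours suggest.

    histogram maps "number of tags per post" to "number of posts".
    For a smoothly decaying curve, each bin is close to the geometric
    mean of the bins on either side, so a bin far above that is suspicious.
    """
    humps = []
    for k in sorted(histogram):
        left = histogram.get(k - 1)
        right = histogram.get(k + 1)
        if not left or not right:
            continue  # no usable neighbours at the edges or around gaps
        expected = math.sqrt(left * right)
        if histogram[k] > factor * expected:
            humps.append(k)
    return humps

# Invented counts: a smooth decay with an artificial bump at 10 tags.
example = {1: 1000, 2: 800, 3: 640, 4: 512, 5: 410, 6: 328,
           7: 262, 8: 210, 9: 168, 10: 400, 11: 107, 12: 86}
suspicious = find_humps(example)
```

With these invented numbers only the bin at 10 tags is flagged, while the bin at 6 (the lazy sheep default maximum) stays unremarkable. A flagged bin is only a hint; finding out who is behind it still means looking at the actual posts.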
I think it will be interesting to observe these tag graphs when the bookmark posting interface changes. I believe the interface plays a big role in how people tag, and this sort of graph could prove that.
I may give statistics about the estimated number of users (currently tracked: 100k) and the number of bookmarks (currently tracked: 500k), but I’m not yet sure how to compute numbers that are accurate enough.
I plan to come up with a few other del.icio.us services such as tag clusters, but I’m not yet sure whether that project will ever be finished, so I’ve decided to put up the stats so you’ll have at least this.. :-)
Uh, all this talk about del.icio.us is too much [Otis]
Yeah, you are right. The point is that these stats can be computed for any tagging-powered web service that serves a “most recent posts” feed. If you’re interested in stats for a different service, or you want to do del.icio.us stats on your own, just leave a comment. If there are enough requests, I’ll comment and refactor the code and publish it under the LGPL.
Dorrian Porter has tracked the number of posts of Yahoo’s MyWeb2.0:
Newly saved pages have averaged between 10,000 to 20,000 per week
These numbers are per week. Del.icio.us averages about 55,000 posts per day! This means that right now the database at del.icio.us grows about 20 times as fast as that of Yahoo’s MyWeb2.0, even measured against the upper figure of 20,000 per week. That leaves no question as to why they have acquired delicious.
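The comparison is easy to check by putting both services on the same weekly basis. The short calculation below uses only the figures quoted above; the variable names are mine.

```python
# Figures quoted in the post.
delicious_posts_per_day = 55_000
myweb_pages_per_week_low = 10_000
myweb_pages_per_week_high = 20_000

# Put both services on a weekly basis.
delicious_posts_per_week = delicious_posts_per_day * 7  # 385000

# Growth ratio against the upper and lower MyWeb2.0 figures.
ratio_vs_high = delicious_posts_per_week / myweb_pages_per_week_high  # 19.25
ratio_vs_low = delicious_posts_per_week / myweb_pages_per_week_low    # 38.5
```

So "about 20 times" is the conservative end of the range; against the lower MyWeb2.0 figure the factor is closer to 40.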
Some very interesting stats, indeed! As far as Lazy Sheep goes, I imagine that the maximum of six tags only kicks in very rarely - in fact it would only happen on super-popular posts where people are providing a large spread of tags to choose from. I’d provide some more statistics to back up my numbers, but I’d feel guilty tracking my users’ personal linking habits.
Comment by John Resig — December 28, 2005 8:12 am #comment-1987
Heh.. comment from the lazy sheep programmer himself!
Yeah, you are right. Lazy sheep does not always add 6 tags, but I suppose that it reaches 6 when about 10 people tagged. Stats would be interesting! You could do stats of your own bookmarking..?
Comment by phred — December 28, 2005 8:36 am #comment-1988
Can you do counts of the number of times tags are used per day / month?
I think we could really generate and write about the implications of an analysis of what delicious people are interested in. I have done a top level analysis on Supertaggers (http://www.supertaggers.blog.co.uk) of the leading Technorati tags to give a flavour of what I mean.
Perhaps we can work in conjunction with each other somehow.
Cheers, Jan
Comment by J Wyllie — December 28, 2005 1:30 pm #comment-1990
[...] I am currently analyzing Wikipedia categories and Social tagging systems. Philipp Keller made some wonderful statistics of del.icio.us and provided the public with his raw data. I compared the number of tags per post in del.icio.us with the number of categories per page in Wikipedia - they both fit an exponential distribution with mean (1/λ) between 1.66 and 2. The similarities show that there are structural similarities between collaborative tagging in del.icio.us and Wikipedia - I assume that such exponential distributions occur in all collaborative tagging systems. By the way, the long tail (articles with > 10 tags) follows a power law, but the exponential distribution covers more than 99% of all cases. I’d like to compare the popularity of tags/categories but I don’t have data of del.icio.us. Other tagging systems would also be interesting. [...]
Pingback by Wikimetrics » How many tags do you assign? — January 7, 2006 4:05 am #comment-2045
Thanks for the data! Could you please collect some data about popularity of tags? Especially I’m interested in less used tags - the most popular tags of del.icio.us are shown here (sadly without absolute numbers) but what about the other tags? I bet it’s also an exponential distribution (mean=10, λ=0.1) with a power law tail because that’s the case for Wikipedia categories. A sample of some thousand posts should be enough?
Comment by Jakob — January 7, 2006 4:20 am #comment-2046
I stumbled upon different people talking about the power-law distribution of the frequency of tags (for instance Clay Shirky) but found no detailed statistics or numbers. Unfortunately this is offline. This can give you a hint but some raw data would be better. I revised my data and it’s a power law, but what are the parameters?
Comment by Jakob — January 7, 2006 3:35 pm #comment-2051
Ok, this is the last time for today :-) I analysed:
The Top 25 most used Dewey Decimal Classes
The Top 25 most used Wikipedia Categories
The Top 25 most used Flickr Tags
DDC and Wikipedia are distributed by a power law with almost the same parameter but popular flickr tags are more logarithmic/exponential distributed. I’d like to know if this is a difference between tagging and classification or between broad and narrow folksonomies. Do you have the Top 20 most used del.icio.us tags with relative numbers?
Comment by Jakob — January 7, 2006 7:40 pm #comment-2052
Eh, cool article you wrote.. As J Wyllie has already asked for it, I think I’ll track the number of posts per tag
1) in general
2) per month
3) per week
So maybe at the end of 2006 I could offer a delicious Zeitgeist for the whole year (I could send the data to Jon Udell, as he probably wants to do a visualization out of it.. :-)
Jakob, would this fulfill your need for data?
What format should the data be in? Is XML ok? Or CSV so you could import it into Excel?
Comment by phred — January 8, 2006 3:28 pm #comment-2058
At the moment I just need the 25 most used tags and the total numbers in general - but keep collecting data, you never know when you’ll need it ;-) I think the numbers you collected in a week will be enough (or at least a dozen posts for the 25th most popular tag) because I have a deadline. The numbers of Millionsofgames also fit so it’s probably a law. You have my mail address, don’t you? A Zeitgeist would be cool! What work of Jon Udell are you talking about? I use the data in CSV but XML is also ok.
Comment by Jakob — January 10, 2006 11:38 pm #comment-2077
Work of Jon Udell: Sorry, the link was wrong, now I corrected it.
Comment by phred — January 13, 2006 3:14 pm #comment-2095
check out those long tails…
Comment by james governor — January 19, 2006 5:04 pm #comment-2132
James: What do you mean by “check out those long tails”?
Comment by phred — January 19, 2006 5:16 pm #comment-2133
Excellent tool. Found you through this huge list of del.icio.us tools.
Comment by sam — March 28, 2006 8:41 am #comment-2493
sorry - singular. tags per bookmark in december.
Comment by James Governor — July 11, 2006 1:05 pm #comment-4799
Here is some additional data about the del.icio.us database contrasted against another mature bookmark service, BookmarkSync.com. The data is striking.
http://www.bookmarksync.com/press/041207_1
Your comments are appreciated.
Comment by Jack — September 19, 2007 2:58 pm #comment-85721