<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Then each went to his own home &#187; Del.icio.us</title>
	<atom:link href="http://www.pui.ch/phred/archives/category/delicious/feed" rel="self" type="application/rss+xml" />
	<link>http://www.pui.ch/phred</link>
	<description>Philipp Keller's weblog</description>
	<lastBuildDate>Wed, 15 Dec 2010 12:37:04 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.2</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Remembering on the web &#8211; 5 reasons why online bookmarking is the wrong tool</title>
		<link>http://www.pui.ch/phred/archives/2007/10/remembering-on-the-web-5-reasons-why-social-bookmarking-doesnt-work.html</link>
		<comments>http://www.pui.ch/phred/archives/2007/10/remembering-on-the-web-5-reasons-why-social-bookmarking-doesnt-work.html#comments</comments>
		<pubDate>Tue, 23 Oct 2007 14:28:38 +0000</pubDate>
		<dc:creator>Philipp Keller</dc:creator>
				<category><![CDATA[Bookmarking]]></category>
		<category><![CDATA[Del.icio.us]]></category>
		<category><![CDATA[Tags]]></category>

		<guid isPermaLink="false">http://www.pui.ch/phred/archives/2007/10/remembering-on-the-web-5-reasons-why-social-bookmarking-doesnt-work.html</guid>
		<description><![CDATA[One common task while browsing the web is making sure you will be able to recall a valuable piece of information you are looking at right now. This article aims to prove that social bookmarking as in delicious, simpy, magnolia et al. is the wrong tool for that task.
Clarification
According to comments here and on reddit, it was obvious [...]]]></description>
			<content:encoded><![CDATA[<p>One common task while browsing the web is making sure you will be able to recall a valuable piece of information you are looking at right now. This article aims to prove that social bookmarking as in <a href="http://www.delicious.com">delicious</a>, <a href="http://www.simpy.com">simpy</a>, <a href="http://ma.gnolia.com/">magnolia</a> et al. is the wrong tool for that task.</p>
<h2>Clarification</h2>
<p>According to comments here and on <a href="http://programming.reddit.com/info/5yy1h/comments/">reddit</a>, the intention of this post was obviously misunderstood &#8211; partly because of the original, misleading title (was: &#8220;.. &#8211; 5 reasons why social bookmarking doesn&#8217;t work&#8221;). Maybe these adaptations of <a href="http://xkcd.com/187/">an xkcd comic</a> clarify it:</p>
<h3>Right tool: Use bookmarks to get things done</h3>
<p><img id="image61" src="http://www.pui.ch/phred/wp-content/uploads/2007/10/clarification_gtd.png" alt="clarification_gtd.png" style="float: none" /><br />
I think <a href="http://programming.reddit.com/info/5yy1h/comments/c02axbo">derefr sums this up very nicely</a>:</p>
<blockquote><p>I find a GTD approach works well: what next action are you going to apply to this bookmark? If it&#8217;s just &#8220;well, it was neat!&#8221; you have no reason to save it (perhaps share it, but not save it), and can throw it away.</p></blockquote>
<p>The same goes for using the tag &#8220;mycomment&#8221; to follow up on discussions you&#8217;ve taken part in, or &#8220;toread&#8221; to know what to read once you&#8217;ve got some free time. These bookmarks all serve a purpose that is clear to you while bookmarking, which also helps you pick an appropriate tag. No criticism there.</p>
<h3>Right tool: Sharing links</h3>
<p><img id="image62" src="http://www.pui.ch/phred/wp-content/uploads/2007/10/clarification_sharing.png" alt="clarification_sharing.png" style="float: none" /><br />
It is clear that link sharing sites such as <a href="http://reddit.com">reddit</a>, <a href="http://www.digg.com/">Digg</a>, or <a href="http://www.stumbleupon.com/">Stumbleupon</a> have proven that this concept works. Delicious, Simpy, Magnolia et al. all have features to help you share your bookmarks. No criticism there either.</p>
<h3>Wrong tool: Remembering potentially interesting links</h3>
<p><img id="image60" src="http://www.pui.ch/phred/wp-content/uploads/2007/10/clarifiction_interesting.png" alt="clarifiction_interesting.png" style="float: none; margin-left: 0" /><br />
This is what this article is about: saving bookmarks that are not useful to you now, but that you save anyway because they might be interesting in the future, without yet knowing what you&#8217;ll use them for. I think that doesn&#8217;t work, and the following 5 points should prove it.</p>
<p><span id="more-50"></span></p>
<h2>Reason 1: You can&#8217;t foresee the future</h2>
<p>Deciding which web site will be valuable in the future is a very hard task, and I&#8217;m not too good at it. I pile up tons of bookmarks I never look at again, and on the other hand I decide not to bookmark sites which I need afterwards. In fact I&#8217;m so unsure about my ability to bookmark the right pages that I often don&#8217;t even search my pile of bookmarks but google first, because I expect that to be faster. Too often I have searched my bookmarks, varying tags and search terms, and still not found the bookmark in the end.</p>
<p>Additionally: even if I knew which links will be of interest in the future, I couldn&#8217;t decide how I should tag (categorize) them. When I tag an article, I have normally just skimmed it, and while categorizing I look at its title. When I tag, I&#8217;m in a completely different situation &#8211; information-wise &#8211; from when I search for the link.</p>
<div class="caption"><img id="image53" src="http://www.pui.ch/phred/wp-content/uploads/2007/10/ipod.png" alt="ipod.png" /><br />Your categories may change when you get<br />familiar with a product or topic</div>
<div class="caption"><img id="image54" src="http://www.pui.ch/phred/wp-content/uploads/2007/10/strategy.png" alt="strategy.png" /><br />Your information level when looking at a document<br />differs from when trying to recall that document</div>
<h2>Reason 2: You tear links out of their context</h2>
<div class="caption"><a href="http://www.flickr.com/photos/ilikespoons/84355382/"><img id="image59" src="http://www.pui.ch/phred/wp-content/uploads/2007/10/dissect_small.jpg" alt="dissect_small.jpg" /></a><br />Bookmarking is like cutting passages<br />from books: you remove information<br />from the context you originally found it</div>
<p>The word &#8220;bookmark&#8221; relates to the pretty cardboard markers you use when reading books. Although the way it is used on the web is far from what it means for books, let&#8217;s delve into that comparison a bit:<br />
To make sure you will be able to find an important passage once you have finished a book, you underline it or write a few words in the margin to summarize the paragraph. Then, when you recall that great sentence, you almost certainly know which book it was in (unless that book is a collection of quotes). You can often remember how the statement was used in the argument and what topic it was embedded in. And finally, amazingly, your brain often tells you where on the page (e.g. bottom left) the sentence is written. So you normally have quite a bit of context to guide your search, and you will find the sentence within a short amount of time, even if it wasn&#8217;t underlined. And even if you don&#8217;t find it, you often have a good time reading through the other amazing statements and end up quoting something you didn&#8217;t intend to.</p>
<p>Applied to books, the way bookmarks are handled on the web would mean that you tear the sentence out of the book, stick a few colored post-its on it and throw the snippet onto the pile with the 1325 other quotes. Bookmarking means taking information out of the context in which you originally found it. On the web, context means how you found that link: Was it on Google or in your feed aggregator? Was it a blog post by one of your colleagues? Was it in an email? I often remember these things. I&#8217;m not a psychologist and have no training in these things, but I guess our brain is pretty good at remembering context. So why don&#8217;t we use techniques that help our brain instead of trying to replace it?</p>
<h2>Reason 3: It takes too much time</h2>
<p>Bookmarking should save you time &#8211; and frustration. Leaving the frustration aside: does it really save you time?<br />
Let&#8217;s say it takes 10 seconds to categorize a bookmark, and let&#8217;s say you&#8217;ll later use one in every 20 of your saved bookmarks (both rather optimistic guesses). Then each time you recall a URL from your bookmarking service, you need to be 200 seconds faster than if you hadn&#8217;t bookmarked anything at all, because you spent 20 &#215; 10 = 200 seconds bookmarking the 20 pages of which you used one.</p>
<p>I&#8217;m pretty sure you won&#8217;t save over 3 minutes on average by searching your pile of bookmarks, compared to thinking for half a minute about where you found that link and then going down that trail. So: why the hassle?</p>
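<p>The break-even arithmetic above can be written down as a small calculation. This is only a sketch: the function name <code>break_even_seconds</code> and its parameters are made up for illustration, and the inputs are the guesses from the text, not measurements.</p>

```python
def break_even_seconds(seconds_per_bookmark: int, bookmarks_per_hit: int) -> int:
    """Time one successful lookup must save to repay the tagging effort.

    If only one in `bookmarks_per_hit` saved bookmarks is ever used again,
    that one lookup has to pay for tagging all of them.
    """
    return seconds_per_bookmark * bookmarks_per_hit


# Guesses from the text: 10 seconds per bookmark, 1 in 20 bookmarks reused.
tagging_cost = break_even_seconds(seconds_per_bookmark=10, bookmarks_per_hit=20)
minutes, seconds = divmod(tagging_cost, 60)

print(tagging_cost)                     # 200
print(f"{minutes} min {seconds} s")     # 3 min 20 s
```

<p>With less optimistic guesses (say 15 seconds per bookmark and one reuse in 50) the required saving grows quickly, which only strengthens the point.</p>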
<h2>Reason 4: It didn&#8217;t work for me</h2>
<p>I tried it. I gathered 3444 bookmarks in 2 years using 3034 tags. I asked myself how I could change my tagging practices to improve the recall. I failed. <a href="http://www.pui.ch/phred/archives/2007/09/the-delicious-lesson-revisited.html">I gave up</a>. I cannot believe there&#8217;s no one out there feeling the same.</p>
<p>I stopped bookmarking nearly two months ago. At first, when reading articles that felt interesting, it was hard not to bookmark them. Then it was kind of liberating not having to think &#8220;will this page be valuable in the future?&#8221; and &#8220;what tags should I use?&#8221;.</p>
<p>I never missed it. I always found that link. I don&#8217;t regret it.</p>
<h2>Reason 5: Social bookmarking won&#8217;t improve that soon</h2>
<p>You may argue that there soon will be techniques to overcome the problems I just mentioned. But my claim is that social bookmarking sites won&#8217;t improve that soon.</p>
<p>In my last post I asked: &#8220;Why is tagging stuck?&#8221;. Gene Smith <a href="http://www.atomiq.org/archives/2007/09/is_tagging_stuck_hardly.html">argues correctly that tagging isn&#8217;t stuck</a>. He continues:</p>
<blockquote><p>
Want to know what <em>is</em> stuck? Del.icio.us
</p></blockquote>
<p>The same is true for all the other social bookmarking sites. RawSugar took a <a href="http://vanderwal.net/random/entrysel.php?blog=1945#futurepromise">brilliant next step</a> (before it went offline), but the social bookmarking market has been quiet ever since. I couldn&#8217;t find fresh ideas in <a href="http://blog.delicious.com/blog/2007/09/taste-test.html">delicious&#8217; current redesign</a>. It seems like they moved buttons from here to there. I had hoped they wouldn&#8217;t just redesign the appearance but would also change the way users interact with their data.</p>
<p>So, I guess these services are as good as they are going to get. There are no improvements to wait for. That means it&#8217;s our turn, as users, to change our habits and find the right tool for the job.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.pui.ch/phred/archives/2007/10/remembering-on-the-web-5-reasons-why-social-bookmarking-doesnt-work.html/feed</wfw:commentRss>
		<slash:comments>23</slash:comments>
		</item>
		<item>
		<title>The delicious lesson &#8211; revisited</title>
		<link>http://www.pui.ch/phred/archives/2007/09/the-delicious-lesson-revisited.html</link>
		<comments>http://www.pui.ch/phred/archives/2007/09/the-delicious-lesson-revisited.html#comments</comments>
		<pubDate>Mon, 03 Sep 2007 15:31:27 +0000</pubDate>
		<dc:creator>Philipp Keller</dc:creator>
				<category><![CDATA[Del.icio.us]]></category>
		<category><![CDATA[History]]></category>
		<category><![CDATA[Tags]]></category>

		<guid isPermaLink="false">http://www.pui.ch/phred/archives/2007/09/the-delicious-lesson-revisited.html</guid>
		<description><![CDATA[I&#8217;m very happy that a recent post titled «Tag history and gartners hype cycles» stirred up a discussion in the
folksonomy-blog-space that got some people musing about the state of tagging:
Paolo Valdemarin:

4 years later I&#8217;m still wondering when will we get some truly advanced tagging tools.
Where are all these tools to manage all my tags (on [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;m very happy that a recent post titled «<a href="http://www.pui.ch/phred/archives/2007/05/tag-history-and-gartners-hype-cycles.html">Tag history and gartners hype cycles</a>» stirred up a discussion in the<br />
folksonomy-blog-space that got some people musing about the state of tagging:</p>
<p><a href="http://paolo.evectors.it/2007/08/28.html">Paolo Valdemarin</a>:</p>
<blockquote><p>
4 years later I&#8217;m still wondering when will we get some truly advanced tagging tools.<br />
Where are all these tools to manage all my tags (on Flickr, on del.icio.us, on technorati, in my RSS reader, on my blog, etc), to help me organizing them, to allow me to gain more advantages from tagging? (maybe they are somewhere and I simply have not found them yet&#8230;)
</p></blockquote>
<p><a href="http://matt.blogs.it/entries/00002618.html">Matt Mower</a>:</p>
<blockquote><p>
I have been surprised, that [...] the state of the art in tagging seems firmly wedged in 2003. Surprised because there seemed [...] to be a momentum building in the use of tagging
</p></blockquote>
<p><a href="http://www.everythingismiscellaneous.com/2007/08/28/tagging-like-it-was-2002/">David Weinberger</a>:</p>
<blockquote><p>
Tagging like it was 2002
</p></blockquote>
<p><a href="http://vanderwal.net/random/entrysel.php?blog=1945">Thomas Vander Wal</a>:</p>
<blockquote><p>
In the consumer space thing have been stagnant for a while, but in the enterprise space there is some good forward movement and some innovation taking place<br />
[...]<br />
While there are examples that tagging services have moved forward, there is so much more room to advance and improve. As people&#8217;s own collection of tagged pages and objects have grown the tools are needed to better refind them.
</p></blockquote>
<p>Vander Wal&#8217;s post is very insightful and worth a read: he sums up the history of tagging and offers a few brilliant ideas on how to proceed.</p>
<p><span id="more-49"></span></p>
<h3>The delicious lesson &#8211; revisited</h3>
<p>The big question remains: Why is tagging stuck?</p>
<p>My suggestion is that we rethink <a href="http://bokardo.com/archives/the-delicious-lesson/">the delicious lesson</a>: not in terms of “is it true that personal value precedes network value?” but in terms of “what is the real benefit for the users?” or in other words: “How can we design the itch that causes users to generate valuable metadata?”</p>
<p>Recently I talked with <a href="http://www.keepthebyte.ch/blog.html">Cédric Huesler, a coworker of mine</a>, about <a href="http://del.icio.us/keepthebyte">his use of del.icio.us</a>: instead of using delicious to store his bookmarks for later retrieval, he stores them to exchange links with strangers. Indeed he has <a href="http://del.icio.us/network/keepthebyte">19 regular consumers of his bookmarks</a>, and he is himself a consumer of 7 of those users&#8217; bookmarks.</p>
<p>He doesn&#8217;t store his personal bookmarks at all. He can recall from memory where or how he found a certain website and goes back to his <a href="http://www.google.com/history/">google history</a>.</p>
<p>There are just a few entry points into new information on the web: there is Google, <a href="http://beta.bloglines.com/">feed aggregators</a> or <a href="http://programming.reddit.org">frontpage sites</a>. When those tools have good search utilities, who needs bookmarks? I must confess that searching at those entry points feels more natural to me than remembering the exact tag I used.</p>
<p>Let&#8217;s be straight: using tags to find my bookmarks later just doesn&#8217;t work. I give up. And no, it&#8217;s not just the lack of good tools that would help me go through my bookmarks and reorganize them; I won&#8217;t do that for all my 3444 bookmarks. And no, this won&#8217;t be solved with better tools to refind my items. What do you want to throw into the mix? Full-text search and time-based drill-down? That has nothing to do with tags.</p>
<p>So, we might have to rephrase the user&#8217;s motivation to tag, as I don&#8217;t think <a href="http://bokardo.com/archives/the-delicious-lesson/">Joshua Porter was right when he wrote</a>:</p>
<blockquote><p>
in order to gain more personal value, <i>they use tags to be able to find their bookmarks later</i>
</p></blockquote>
<p>I&#8217;m not yet at the point where I could correctly rephrase that statement, but I think Cédric&#8217;s approach of using tags not for personal recall but for publishing is worth a thought. The value therein is close to the value of blogging: you get attention and you communicate. And that&#8217;s what the web is about, isn&#8217;t it?</p>
]]></content:encoded>
			<wfw:commentRss>http://www.pui.ch/phred/archives/2007/09/the-delicious-lesson-revisited.html/feed</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Tagsessibility</title>
		<link>http://www.pui.ch/phred/archives/2006/10/tagsessibility.html</link>
		<comments>http://www.pui.ch/phred/archives/2006/10/tagsessibility.html#comments</comments>
		<pubDate>Tue, 31 Oct 2006 12:38:18 +0000</pubDate>
		<dc:creator>Philipp Keller</dc:creator>
				<category><![CDATA[Del.icio.us]]></category>
		<category><![CDATA[RawSugar]]></category>
		<category><![CDATA[Simpy]]></category>
		<category><![CDATA[Web Search]]></category>

		<guid isPermaLink="false">http://www.pui.ch/phred/archives/2006/10/tagsessibility.html</guid>
		<description><![CDATA[Imagine you have a closet where you store all your documents. Each time you want to archive an important document you tell your closet: &#8220;put that under important&#8221;, a magic hand, coming out of the closet, takes your document and puts it into its immense pile of documents. The other day you are in a [...]]]></description>
			<content:encoded><![CDATA[<p>Imagine you have a closet where you store all your documents. Each time you want to archive an important document you tell your closet: &#8220;put that under important&#8221;. A magic hand comes out of the closet, takes your document and puts it onto its immense pile of documents. The next day you are in a hurry to be on time for your next meeting. You need that important document from yesterday, so you ask your closet: &#8220;please show me all important documents&#8221;. Then you hear printing and rustling in the closet until, after half a minute, 20 magic hands stick out of your closet, each one holding a document. Some of the documents have sticky notes on them saying: &#8220;this is not filed under “important” but it is similar to a document you filed under important&#8221;. Another sticky note says: &#8220;these are all the other categories you put your documents in&#8221;. And every document you really filed under “important” has a sticky note with the number of coworkers who also filed this document under &#8220;important&#8221;. You say to yourself: tomorrow I&#8217;ll revive the pile of &#8220;important documents&#8221; that was on my desk before they put this silly closet into my office.</p>
<p>That&#8217;s the feeling that arises when I think of all the bookmark services out there that are supposed to file URLs I find noteworthy so that I can quickly recall them afterwards: there&#8217;s too much clutter and the services are just too slow. So I have again begun to save my bookmarks in other places: in Firefox or in some text documents lying somewhere on my hard drive (maybe I should tag them?).<span id="more-43"></span></p>
<h3>«I want it now»</h3>
<p>As for myself, if I don&#8217;t get my data within a fraction of a second, I don&#8217;t feel that I have good access to it. I think one of the best things about Google Web search is that it is incredibly fast. Tag services are not. That&#8217;s a pity, because within tag services I search only my small number of bookmarks (around 2000 at the moment), whereas with Google I search the whole net. I often find myself searching for a URL on Google rather than looking it up on my bookmark service. Even if I have to do 3 Google searches, I feel in control. If I have to wait 3 seconds for an answer, that&#8217;s simply too much.</p>
<p>An experiment: I ran <code>for ((i=0;i&lt;10;i++)) do wget -q --user-agent="Mozille Firefox 1.9" $url; done;</code> for the three big players in tagging services: <a id="__CONK_181" href="http://del.icio.us">del.icio.us</a>, <a id="__CONK_182" href="http://www.simpy.com">Simpy</a> and <a id="__CONK_183" href="http://www.rawsugar.com">Rawsugar</a>. For <code>$url</code> I chose a page displaying all bookmarks saved with a certain tag. I also ran a query on Google for that &#8220;tag&#8221;. The numbers are seconds per query. The queries were run from different hosts with different internet connections, all in Switzerland.</p>
<table>
<thead>
<tr>
<th>Service</th>
<th>Sun 18:00 CET</th>
<th>Mon 18:50 CET</th>
<th>Tue 09:00 CET</th>
<th>Tue 12:40 CET</th>
<th><strong>Average</strong></th>
</tr>
</thead>
<tbody>
<tr>
<td>Delicious</td>
<td>2.1s</td>
<td>2.3s</td>
<td>2.6s</td>
<td>2.4s</td>
<td><strong>2.4s</strong></td>
</tr>
<tr>
<td>Simpy</td>
<td>3s</td>
<td>6s</td>
<td>2.7s</td>
<td>3.3s</td>
<td><strong>3.8s</strong></td>
</tr>
<tr>
<td>Rawsugar</td>
<td>1s</td>
<td>1.5s</td>
<td>1s</td>
<td>1.1s</td>
<td><strong>1.2s</strong></td>
</tr>
<tr>
<td>Google</td>
<td>0.2s</td>
<td>0.5s</td>
<td>0.3s</td>
<td>0.3s</td>
<td><strong>0.3s</strong></td>
</tr>
</tbody>
</table>
<p>I&#8217;m happy to see that with Rawsugar I&#8217;m using the fastest bookmark service (2 times faster than Delicious, 3 times faster than Simpy). On the other hand, all the bookmark services are at least 4 times slower than Google. I know that Google has set a high bar &#8211; but emotionally that&#8217;s the response time I&#8217;d like to have when querying my data.</p>
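<p>A measurement like the one above could also be scripted. The following is a rough sketch and not the script used for the numbers in the table: the names <code>fetch</code> and <code>seconds_per_query</code> and the user agent string are assumptions made for illustration. The second half simply recomputes the unrounded averages from the four measurement rounds in the table.</p>

```python
import time
import urllib.request
from statistics import mean

# Hypothetical user agent string, chosen only for this sketch.
USER_AGENT = "Mozilla/5.0 (bookmark timing sketch)"


def fetch(url: str) -> bytes:
    """Download one page and return its body."""
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read()


def seconds_per_query(url: str, runs: int = 10, fetcher=fetch) -> float:
    """Average wall-clock seconds per request over `runs` sequential requests."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fetcher(url)
        timings.append(time.perf_counter() - start)
    return mean(timings)


# Values copied from the table above (seconds per query in each round).
measured = {
    "Delicious": [2.1, 2.3, 2.6, 2.4],
    "Simpy": [3.0, 6.0, 2.7, 3.3],
    "Rawsugar": [1.0, 1.5, 1.0, 1.1],
    "Google": [0.2, 0.5, 0.3, 0.3],
}
averages = {service: mean(values) for service, values in measured.items()}

for service, average in averages.items():
    print(f"{service}: {average:.2f}s")
```

<p>Calling <code>seconds_per_query(url)</code> with a tag page URL would repeat the experiment for one service. Results depend heavily on the network connection and time of day, so a handful of runs gives only a rough impression.</p>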
<h3>«I just want my bookmarks»</h3>
<p>Have a look at the &#8220;result areas&#8221; (highlighted in green) of the three bookmark services.</p>
<div class="caption"><img alt="Delicious results" id="__CONK_186" src="/phred/images/delicious_results.png" /><br />
<strong>Delicious: Result area in green</strong></div>
<div class="caption"><img alt="Simpy results" id="__CONK_187" src="/phred/images/simpy_results.png" /><br />
<strong>Simpy: Result area in green</strong></div>
<div class="caption"><img alt="Rawsugar results" id="__CONK_188" src="/phred/images/rawsugar_results.png" /><br />
<strong>Rawsugar: Result area in green</strong></div>
<p>To come back to the comparison with the closet: even if there are 20 magic hands with documents sticking out of the closet, it is crucial that the most important document is the one closest to my face. That means that when I use those bookmark services, my eye should first notice the most important link. In my opinion, the service that solves this best is Delicious: the results appear on the leftmost part of my screen and the result area covers 64% of the screen. In this respect Rawsugar is far worse: my eye has to search for the first result. I find it natural to start reading at the left of the screen, but at Rawsugar there are many links in the head of the page, and the left column then helps me refine the result &#8211; a great feature but in the wrong place. Furthermore, if I just want to get a link I <strong>know</strong> I filed under &#8220;important&#8221;, I don&#8217;t need this at all. Even Delicious&#8217; &#8220;saved by xxx other people&#8221; &#8211; shown in different shades &#8211; is too much clutter for me. Maybe I&#8217;m a puritan but &#8211; what the heck &#8211; I just want my 9 bookmarks!</p>
<table>
<thead>
<tr>
<th>Service</th>
<th>Result area ratio</th>
</tr>
</thead>
<tbody>
<tr>
<td>Delicious</td>
<td>63.6%</td>
</tr>
<tr>
<td>Simpy</td>
<td>60.2%</td>
</tr>
<tr>
<td>Rawsugar</td>
<td>51.2%</td>
</tr>
</tbody>
</table>
<h3>Beyond the criticism</h3>
<p>Sorry to only criticize. I like those services a lot. They help organize links and they are free. To turn this post into a constructive comment, here are two possible solutions to the problems mentioned.</p>
<h4>Bookmark services offer a «Minimal mode»</h4>
<p>Yes, tagging is about collaboration. But I fear the personal value of tagging is too small &#8211; people might leave tagging services because they don&#8217;t feel their bookmark problem is solved. There&#8217;s too much network and too many features. I propose a &#8220;minimal mode&#8221;, a result page that just shows <code>select * from bookmarks where user="phred" and one_of_its_tags="important"</code>. In Firefox there are <a id="__CONK_184" href="http://www.mozilla.org/products/firefox/smart-keywords.html">Smart Keywords</a>, also known as &#8220;quick search&#8221;. I set up &#8220;myr&#8221; to search my rawsugar bookmarks. When I type &#8220;myr important&#8221; into my location bar I end up on www.rawsugar.com/links/phred/important. If there were a page on rawsugar that displayed the same results in this minimal mode, I would point my smart keyword at that page and be happy.</p>
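<p>The pseudo-query above could look roughly like this against a real database. This is a sketch under assumptions: the two-table schema (<code>bookmarks</code> and <code>bookmark_tags</code>), the column names and the sample rows are invented for illustration and say nothing about how any bookmark service actually stores its data.</p>

```python
import sqlite3


def minimal_mode(conn: sqlite3.Connection, owner: str, tag: str) -> list:
    """Return (title, url) for the owner's bookmarks carrying the tag, nothing else."""
    cursor = conn.execute(
        """
        SELECT b.title, b.url
        FROM bookmarks AS b
        JOIN bookmark_tags AS t ON t.bookmark_id = b.id
        WHERE b.owner = ? AND t.tag = ?
        ORDER BY b.id
        """,
        (owner, tag),
    )
    return cursor.fetchall()


# Hypothetical schema and sample data, kept in memory for the example.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE bookmarks (id INTEGER PRIMARY KEY, owner TEXT, title TEXT, url TEXT);
    CREATE TABLE bookmark_tags (bookmark_id INTEGER REFERENCES bookmarks(id), tag TEXT);
    """
)
conn.executemany(
    "INSERT INTO bookmarks VALUES (?, ?, ?, ?)",
    [
        (1, "phred", "Tax form", "http://example.org/tax"),
        (2, "phred", "Holiday photos", "http://example.org/photos"),
        (3, "someone", "Other page", "http://example.org/other"),
    ],
)
conn.executemany(
    "INSERT INTO bookmark_tags VALUES (?, ?)",
    [(1, "important"), (1, "finance"), (2, "fun"), (3, "important")],
)

results = minimal_mode(conn, "phred", "important")
print(results)  # [('Tax form', 'http://example.org/tax')]
```

<p>Because tags live in their own table, a bookmark with several tags is still found by any one of them, which is what <code>one_of_its_tags</code> in the pseudo-query is getting at.</p>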
<h4>Someone codes a «Tag agent»</h4>
<p>I imagine a &#8220;tag agent&#8221; that has an incredible response time and no clutter. It could either get the results via RSS/API from the bookmark service or it could hold all my tagged bookmarks in its own database. Such an agent could be installed on a password-protected part of my website so I can access it from wherever I am. The problem is: I have thought about such an application and I&#8217;ve got plenty of ideas about how it should look, but I&#8217;m afraid I won&#8217;t find the time to code it myself. I repeatedly get emails from people writing yet another thesis on collaborative categorisation. Instead of writing about insights into the mental processes taking place while tagging &#8211; do the tagging world a favour and write such an agent. If there&#8217;s such a tag agent already, please leave a comment.</p>
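<p>To make the idea a bit more concrete, here is a minimal sketch of the &#8220;holds all my tagged bookmarks itself&#8221; variant. The class name <code>TagAgent</code>, its methods and the sample URLs are purely hypothetical; importing from a bookmark service and the password protection are left out. It keeps an in-memory index from tag to URLs, so a lookup does not touch the network at all.</p>

```python
from collections import defaultdict


class TagAgent:
    """Tiny local index: tag -> set of URLs (hypothetical design for illustration)."""

    def __init__(self) -> None:
        self._urls_by_tag = defaultdict(set)

    def add(self, url: str, tags: list) -> None:
        """Store a bookmark under each of its tags (tags are case-insensitive)."""
        for tag in tags:
            self._urls_by_tag[tag.lower()].add(url)

    def find(self, *tags: str) -> list:
        """Return the URLs that carry all of the given tags, sorted alphabetically."""
        if not tags:
            return []
        url_sets = [self._urls_by_tag.get(tag.lower(), set()) for tag in tags]
        return sorted(set.intersection(*url_sets))


agent = TagAgent()
agent.add("http://example.org/tax", ["important", "finance"])
agent.add("http://example.org/lease", ["Important", "home"])
agent.add("http://example.org/photos", ["fun"])

print(agent.find("important"))
# ['http://example.org/lease', 'http://example.org/tax']
print(agent.find("important", "finance"))
# ['http://example.org/tax']
```

<p>For a couple of thousand bookmarks such an index fits comfortably in memory, so response time would be dominated by the web server in front of it rather than by the lookup.</p>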
]]></content:encoded>
			<wfw:commentRss>http://www.pui.ch/phred/archives/2006/10/tagsessibility.html/feed</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>Automated tag clustering</title>
		<link>http://www.pui.ch/phred/archives/2006/07/automated-tag-clustering.html</link>
		<comments>http://www.pui.ch/phred/archives/2006/07/automated-tag-clustering.html#comments</comments>
		<pubDate>Tue, 11 Jul 2006 06:03:37 +0000</pubDate>
		<dc:creator>Philipp Keller</dc:creator>
				<category><![CDATA[Clustering]]></category>
		<category><![CDATA[Del.icio.us]]></category>
		<category><![CDATA[RawSugar]]></category>
		<category><![CDATA[Research]]></category>
		<category><![CDATA[Tags]]></category>

		<guid isPermaLink="false">http://www.pui.ch/phred/archives/2006/07/automated_tag_clustering.html</guid>
		<description><![CDATA[Grigory Begelman (Technion &#8211; Israel Institute of Technology Computer Science Dpt), Frank Smadja (RawSugar) and I did a paper for www2006 called &#8220;automated tag clustering&#8221;. It deals with why clustering the tag space makes sense and how this could be done.
After the presentation at the tagging workshop at www2006 we felt the need to give [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.cs.technion.ac.il/%7Egbeg/">Grigory Begelman</a> (<a href="http://www.cs.technion.ac.il/">Technion &#8211; Israel Institute of Technology, Computer Science Dept.</a>), <a href="http://smadja.us/">Frank Smadja</a> (<a href="http://www.rawsugar.com/">RawSugar</a>) and I wrote a paper for <a href="http://www2006.org">www2006</a> called &#8220;automated tag clustering&#8221;. It explains why clustering the tag space makes sense and how this could be done.</p>
<p>After the presentation at the <a href="http://blog.rawsugar.com/wikka/wikka.php?wakka=HomePage">tagging workshop</a> at www2006 we felt the need to give our paper a more www-friendly, I-don&#8217;t-want-to-read-through-those-theoretical-equation-flooded-papers face.</p>
<p>So, here you go: <a href="http://www.pui.ch/phred/automated_tag_clustering/">Automated Tag Clustering: Improving search and exploration in the tag space</a>. To read this document you should have an idea of what tags are about, and you should know some tag services such as <a href="http://del.icio.us">delicious</a> or <a href="http://www.flickr.com">flickr</a> so that you can understand the limitations these services currently have. <span id="more-41"></span><a href="http://www.pui.ch/phred/automated_tag_clustering/#cluster"><img title="clustering the tag space" alt="clustering the tag space" id="image42" src="http://www.pui.ch/phred/wp-content/uploads/2006/07/clusters.png" /></a>If you don&#8217;t want to read through the whole paper, the numerous figures give you a good summary. Finally, to whet your appetite, here are a few excerpts from the document:</p>
<blockquote><p>Currently tagging services still provide a relatively marginal value for information discovery and we claim that with the use of clustering techniques this can be greatly improved [from <a href="http://www.pui.ch/phred/automated_tag_clustering/#p_motivation">introduction</a>]</p></blockquote>
<blockquote><p>The whole promise of collaborative tagging is that by exploring the tag space you can discover a lot of useful information you would not find with traditional search engines.  When your information need is not well defined, the idea that you can explore and see what other people tagged with certain tags is very attractive. We believe that tagging will be able to reach a very wide audience only when exploration techniques will be effective. [from <a href="http://www.pui.ch/phred/automated_tag_clustering/#p_exploration">limited exploration</a>]</p></blockquote>
<blockquote><p>Although a great visualization paradigm, we believe that with today&#8217;s tagclouds it is hard to find more than one or two tags to click on. Tags are not grouped, there is too much information, so that you find lot of related tags scattered on the tag cloud.  One or two popular topics and all their related tags tend to dominate the whole cloud.  For example, looking at the del.icio.us tagcloud, one would mostly see tags related to web design and technologies. This is because these topics are overwhelmingly more frequent than anything else. [from <a href="http://www.pui.ch/phred/automated_tag_clustering/#p_exploration">limited exploration</a>]</p></blockquote>
<blockquote><p>Tag <em>web2.0</em> nowadays is so popular and is combined wildly with anything. In fact this tag is so overused that if you look at <a href="http://del.icio.us/tag/bookmarks">tag <em>bookmarks</em> in the del.icio.us dataset</a>, the most used cotag is <em>web2.0</em>[...]. Basing tag similarity on these numbers often doesn&#8217;t make sense at all. The similarity measure should be chosen so the popularity of a tag doesn&#8217;t affect the set of a tags related tags. Don&#8217;t cut the <a href="http://en.wikipedia.org/wiki/Long_tail">long tail</a>. The success of blogs is driven by the importance of the long tail. We all know that it is crucial to support the niches. Tagging applications should empower the long tail too. If you just sort by popularity, you&#8217;d loose all those niches. [from <a href="http://www.pui.ch/phred/automated_tag_clustering/#p_similarity">choosing a similarity measure</a>]</p></blockquote>
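<p>To illustrate the point of the last excerpt, here is a minimal sketch. It is not necessarily the measure used in the paper: it assumes cosine-normalized co-occurrence as one possible popularity-independent similarity, and the function name <code>related_tags</code> as well as the sample posts are made up for this example.</p>

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def related_tags(posts, tag, top=3):
    """Rank the cotags of `tag` by cosine-normalized co-occurrence.

    `posts` is a list of tag lists, one list per bookmark (hypothetical data layout).
    """
    tag_count = Counter()   # number of posts each tag appears in
    pair_count = Counter()  # number of posts each tag pair appears in together
    for tags in posts:
        unique = sorted(set(tags))
        tag_count.update(unique)
        pair_count.update(combinations(unique, 2))

    scores = {}
    for (a, b), together in pair_count.items():
        if tag in (a, b):
            other = b if a == tag else a
            # Dividing by the tag frequencies keeps a popular tag from winning by volume alone.
            scores[other] = together / sqrt(tag_count[a] * tag_count[b])
    return sorted(scores, key=lambda t: (-scores[t], t))[:top]


# Made-up sample: web2.0 co-occurs with bookmarks more often, but it is used everywhere.
posts = (
    [["bookmarks", "web2.0"]] * 3
    + [["bookmarks", "delicious"]] * 2
    + [["web2.0", "ajax"]] * 10
)
print(related_tags(posts, "bookmarks"))
```

<p>With raw counts <em>web2.0</em> would come first for <em>bookmarks</em> in this sample; with the normalized score the niche tag <em>delicious</em> ranks higher.</p>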
<p>We&#8217;d be happy to get any kind of feedback on the article. Just post a comment to this blog post.</p>
<p><strong>Edit (4 years later!)</strong>: A few guys asked me about the source code: <a href="http://pastie.org/1098455">Source code with syntax highlighting</a>, <a href="http://www.pui.ch/phred/archives/cluster.py">download</a>.<br />
You need <a href="http://people.sc.fsu.edu/~jburkardt/c_src/kmetis/kmetis.html">kmetis</a> to make this run, see <code>usage()</code> to see how it should be used.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.pui.ch/phred/archives/2006/07/automated-tag-clustering.html/feed</wfw:commentRss>
		<slash:comments>17</slash:comments>
		</item>
		<item>
		<title>Is there a world beyond delicious?</title>
		<link>http://www.pui.ch/phred/archives/2006/02/delicious-vs-rawsugar-vs-simpy.html</link>
		<comments>http://www.pui.ch/phred/archives/2006/02/delicious-vs-rawsugar-vs-simpy.html#comments</comments>
		<pubDate>Tue, 28 Feb 2006 06:46:43 +0000</pubDate>
		<dc:creator>Philipp Keller</dc:creator>
				<category><![CDATA[Del.icio.us]]></category>

		<guid isPermaLink="false">http://www.pui.ch/phred/?p=37</guid>
		<description><![CDATA[There is an enormous diversity in the landscape of social bookmark managers. Nonetheless most of the bloggers tend to stick to del.icio.us.
I rarely hear bloggers writing about another service than delicious. With this post I present two alternatives (RawSugar and Simpy) and want to prove that a spreading of users to alternative services could improve [...]]]></description>
			<content:encoded><![CDATA[<p>There is an enormous diversity in the landscape of social bookmark managers. Nonetheless most of the bloggers tend to stick to <a href="http://del.icio.us/">del.icio.us</a>.<br />
I rarely see bloggers writing about any service other than delicious. In this post I present two alternatives (<a href="http://www.rawsugar.com/">RawSugar</a> and <a href="http://www.simpy.com/">Simpy</a>) and want to show that spreading users across alternative services could improve the world of tagging in general.<br />
<span id="more-37"></span></p>
<h3>Locked in?</h3>
<p>I recently tried out <a href="http://www.simpy.com/">Simpy</a> and <a href="http://www.rawsugar.com/">RawSugar</a> and thought about switching to one of them.<br />
It is true that you are not really locked in to del.icio.us. You can export all of your bookmarks by just typing<br />
<code>curl --user username:password -o backup/delicious_backup.xml 'http://del.icio.us/api/posts/all'</code><br />
But in a way you still feel like you&#8217;re locked in, don&#8217;t you?<br />
Why so?</p>
<ul>
<li>del.icio.us has the most people using it, and it feels good to be with the masses (the masses can&#8217;t be wrong, can they?)</li>
<li>there&#8217;s more feedback while tagging</li>
<li>could I switch back to delicious when this other service turns out not to meet my needs? Well, I would have to find out&#8230;</li>
<li>heck, I think I am addicted to del.icio.us. Sure I <strong>could</strong> change but will my life go on? Will my &#8220;last bookmark list&#8221; still work? And ah, I&#8217;ve got references to del.icio.us at 13 points of my blog and then I would have to tell my friends that I switched and then my to:mybestfriend wouldn&#8217;t work any more and.. ah, <strong>why should I</strong>?</li>
</ul>
<p>Well, guess what? You are locked in! :-)<br />
I think the point is that the advantages and disadvantages of the bookmark services aren&#8217;t known. I haven&#8217;t read much about it, have you?<br />
I recently switched to <a href="http://www.rawsugar.com/">RawSugar</a>, after using del.icio.us for about a year. I also have a little experience with <a href="http://www.simpy.com/">Simpy</a> from using it for a small project. I will therefore confine myself to describing just these three services. If you have experience with other services I would appreciate your comment.</p>
<h2 id="delicious"><a href="http://del.icio.us/">del.icio.us</a></h2>
<p>Let&#8217;s start with the one you&#8217;re most probably using.<br />
<strong>Advantages:</strong></p>
<ul>
<li>Clean, simple user interface</li>
<li>Most data, therefore
<ul>
<li>best popular page</li>
<li>you see how popular an url is while tagging</li>
<li>much feedback from other users while tagging</li>
</ul>
</li>
</ul>
<p><strong>Disadvantages:</strong></p>
<ul>
<li>slow roll out of new features</li>
<li>bad performance</li>
<li>spam</li>
<li>I guess my dad couldn&#8217;t use it</li>
</ul>
<p>Joshua and his team are trying hard to limit del.icio.us to what it should do: Saving bookmarks and finding them:</p>
<blockquote><p>&#8220;delicious is a tool, not a community. One reason i donut want it to be a community. some community behaviors are not good.&#8221; the &#8220;you suck&#8221; effect, and flame wars [<a href="http://www.redmonk.com/jgovernor/archives/001262.html">Joshua Schachter at Carson Summit</a>]</p></blockquote>
<p>In fact del.icio.us has got the cleanest ui of all bookmark managers I&#8217;ve seen. On the other hand, I can&#8217;t imagine my dad using it. The whole terminology and the design are not the type of web page my dad would use. But then again this is an advantage. The fact that del.icio.us doesn&#8217;t want to please all users makes it simple for the ones it is made for.<br />
The other disadvantages heavily depend on each other: when Joshua started his service he most probably took a few mysql tables and optimized his queries. When demand started to grow he put in some caching here and there, but then the big wave hit: <a href="http://deli.ckoma.net/stats">in the last half year, the number of bookmark posts per day tripled</a>. I guess he is now working on improving performance rather than rolling out new features, as</p>
<blockquote><p>&#8220;tags doesn&#8217;t map to sql at all. so use partial indexing.&#8221;[<a href="http://www.redmonk.com/jgovernor/archives/001262.html">Joshua Schachter at Carson Summit</a>]</p></blockquote>
<p>The programmers of later bookmarking services already knew the problems of tagging applications before they started, whereas Joshua and his team have to improve a running service. And then they moved to Yahoo&#8217;s location, which probably also slows down the roll out of new features. The &#8220;private bookmark feature&#8221; has been planned for about a year already, yet it is still to be released.</p>
<h2 id="rawsugar"><a href="http://www.rawsugar.com/">RawSugar</a></h2>
<p>RawSugar is a bookmarking service run by a small company with a handful of programmers. It went into beta in September 2005.<br />
<strong>Advantages</strong></p>
<ul>
<li>Hierarchic tags</li>
<li>multi-word tags separated by comma</li>
<li>fast service</li>
<li>fast roll out of new features</li>
<li>private bookmarks</li>
</ul>
<p><strong>Disadvantages</strong>:</p>
<ul>
<li>not that much data</li>
<li>different terminology than what you&#8217;re probably used to</li>
</ul>
<p>The fact that RawSugar started their service after the success of del.icio.us helped them to design their engine for large amounts of data. They definitely have a good engine. As they have never published information about it, I&#8217;m not sure if I&#8217;m allowed to go into details here, but they don&#8217;t just have a 3-table-mysql-schema :-) In fact I believe their engine would handle a wave of new users without stopping the roll out of new features.</p>
<p>They have got <em>hierarchic tags</em>, and this is a very clever idea. At first I wasn&#8217;t totally convinced about this feature, but it has proved to be very helpful. I don&#8217;t like del.icio.us&#8217; tag bundles. Tag bundles may be a good thing if you have 100 to 200 tags, but if you have 500 or 1000 tags (I currently have over 2000), they aren&#8217;t manageable any more. With RawSugar you can put a hierarchy in your tags while tagging an item. If you type &#8220;article&gt;rant&#8221; you say that &#8220;rant&#8221; is a child tag of &#8220;article&#8221;. If you later tag a site with &#8220;rant&#8221;, the tag &#8220;article&#8221; will automatically be added.</p>
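<p>The behaviour described above can be sketched in a few lines. This is only an illustration of the idea, not RawSugar&#8217;s implementation (which is not public); the function names and the in-memory <code>parents</code> dictionary are assumptions made for this example.</p>

```python
def expand(tags, parents):
    """Return the tags plus all of their ancestors recorded in `parents`."""
    result = set()
    stack = list(tags)
    while stack:
        tag = stack.pop()
        if tag in result:
            continue
        result.add(tag)
        stack.extend(parents.get(tag, ()))
    return result


def parse_tag_input(text, parents):
    """Parse comma-separated tags; 'parent>child' also records the hierarchy.

    `parents` maps a child tag to the set of its parent tags and is updated in place.
    """
    tags = set()
    for raw in text.split(","):
        chain = [part.strip() for part in raw.split(">") if part.strip()]
        for parent, child in zip(chain, chain[1:]):
            parents.setdefault(child, set()).add(parent)
        if chain:
            tags.add(chain[-1])
    return expand(tags, parents)


parents = {}
first = parse_tag_input("article>rant", parents)    # defines the hierarchy
later = parse_tag_input("rant, web design", parents)  # "article" is added automatically
print(sorted(first), sorted(later))
```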
<p>A word about their <em>terminology</em>: At first it was difficult to map between the &#8220;delicious terms&#8221; and &#8220;rawsugar terms&#8221;. Sure, tags are still tags but somehow I was (too much) used to the whole terminology of delicious and took it for granted. On the other hand it&#8217;s good to see new viewpoints and that&#8217;s why I think new players in the tag market are a good thing. RawSugar aims to be not only a service for bloggers and geeks. Their UI looks more &#8220;traditional&#8221;, like Amazon&#8217;s, for instance: after you have searched your bookmarks for a certain tag you see a search field where you can type in the name of another tag and then hit &#8220;refine&#8221;. I think this is more &#8220;natural&#8221; than clicking the &#8220;+&#8221; signs at del.icio.us.</p>
<p>Finally they have a bunch of features: </p>
<ul>
<li>Nice integration into your blog (tag your articles and then simply add their automatically generated navigation to your blog)
</li>
<li>you can switch users while tagging (means: you can have different &#8220;directories&#8221;: one for your blog, one for your company, etc..)</li>
<li>they are playing with automated clustering techniques (<a href="http://www.rawsugar.com/similarTags">similar tags</a> and <a href="http://www.rawsugar.com/clusters">clusters</a>), I&#8217;m looking forward to the integration of these algorithms!
</li>
</ul>
<h2><a href="http://www.simpy.com">Simpy</a></h2>
<p>Simpy is a one-man project by Otis Gospodnetic. As he is/was a developer of <a href="http://lucene.apache.org/java/docs/">Lucene</a> he has a good knowledge of database systems and scaling. He has got a good engine too, so I guess his service would still stand after a big wave of new users. Simpy seems to have been online since early 2004 (according to the date of Otis&#8217; earliest links).<br />
<strong>Advantages</strong>:</p>
<ul>
<li>group tagging</li>
<li>search results can be sorted by date, popularity or relevance</li>
<li>it&#8217;s possible to search by site (e.g. www.nytimes.com) and/or extension (e.g. mp3), optionally combined with tags</li>
<li>multi-word tags separated by comma</li>
<li>not commercial (free time project)</li>
<li>private bookmarks</li>
</ul>
<p><strong>Disadvantages</strong>:</p>
<ul>
<li>one man project, therefore not many resources to put in</li>
<li>UI is a bit cluttered</li>
</ul>
<p>What I particularly like about Simpy is its spirit: Otis is &#8220;one of us&#8221;, giving away his spare time for this service and demanding nothing in return. What he does get, on Simpy&#8217;s <a href="http://groups.yahoo.com/group/simpy-dev/">developer mailing list</a>, is a lot of feedback from people who want to contribute, that is: write little applications that use Simpy&#8217;s API. I am in contact with Otis: he is always very friendly and open to feedback. That said, he knows that his UI lacks some polish :-)<br />
I&#8217;d say the killer feature of Simpy is group tagging. While tagging you can click the checkbox of one of the groups you&#8217;re in. There&#8217;s an RSS feed for each group, so this is a very useful feature. Ever tried to do that with del.icio.us? You have to log out and log in to the group account..</p>
<h3>Bottom line</h3>
<p>There are many many more bookmark manager services that I have never actually used (I have probably signed up to about 50% of them just to get my favorite username in case they turn out to be &#8220;teh r0x0r&#8221;). There&#8217;s <a href="http://www.listible.com/list/social-bookmarking-sites">a good list at listible</a>.</p>
<p>In the end, there&#8217;s no real winner. And I like it that way. I suppose in the long term each bookmark service will have its own community.<br />
The popular page on del.icio.us is ruled by entries about design and web2.0. I guess other services will attract other communities.<br />
That said, I envisage two things:</p>
<ol>
<li>a service that asks you about your hobbies and topics of interest or lets you submit an <a href="http://www.attentiontrust.org/">attention.xml</a> file and then tells you which service best fits your needs.</li>
<li>an &#8220;interchange format&#8221; that helps you switch from one bookmark service to another, or more generally: more collaboration between the bookmark manager services. <ins datetime="2006-03-30T05:37:04+00:00"><strong>Update:</strong> Faber of <a href="http://www.smarking.com/">smarking</a> just <a href="http://blog.smarking.com/2006/03/bookmarks_inter.html">started a project</a> to design an interchange format based on microformats, and he&#8217;s set up a <a href="http://mailman-mail1.python-hosting.com/listinfo/bif">mailing list</a>. Great effort!</ins></li>
</ol>
<p>Well, I highly encourage you to look beyond your own nose because there&#8217;s a big world beyond delicious. :-)</p>
<h3>Further reading</h3>
<ul>
<li><a href="http://www.irox.de/roxomatic/1050">social bookmarks review</a>: most complete social bookmark manager comparison chart I&#8217;ve seen</li>
<li><a href="http://3spots.blogspot.com/2006/01/all-social-that-can-bookmark.html">all social that can bookmark</a>: pretty complete list of bookmark services with their features explained at a glance. Pretty neat.</li>
<li><a href="http://blog.simpy.com/blojsom/blog/2005/08/15/Why_Simpy_over_del_icio_us.html">Why simpy over delicious</a></li>
<li><a href="http://about.blinklist.com/category/general/blinklist-vs-delicious/">blinklist vs delicious</a></li>
<li><a href="http://webosphere.wordpress.com/2006/02/11/magnolia-delicious-killer/">Magnolia: delicious killer?</a></li>
<li><a href="http://quimble.com/poll/view/224">poll: what is your preferred bookmark manager?</a></li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.pui.ch/phred/archives/2006/02/delicious-vs-rawsugar-vs-simpy.html/feed</wfw:commentRss>
		<slash:comments>15</slash:comments>
		</item>
		<item>
		<title>Delicious statistics</title>
		<link>http://www.pui.ch/phred/archives/2005/12/delicious-statistics.html</link>
		<comments>http://www.pui.ch/phred/archives/2005/12/delicious-statistics.html#comments</comments>
		<pubDate>Fri, 23 Dec 2005 21:09:08 +0000</pubDate>
		<dc:creator>Philipp Keller</dc:creator>
				<category><![CDATA[Del.icio.us]]></category>
		<category><![CDATA[Statistics]]></category>

		<guid isPermaLink="false">http://www.pui.ch/phred/?p=36</guid>
		<description><![CDATA[Statistics is a broad mathematical discipline which studies ways to collect, summarize and draw conclusions from data. [Wikipedia]
Statistics help us to draw conclusions from data. In a way this whole tagging thing just popped up and now we are trying to figure out what really is happening. I think statistics can help us to understand [...]]]></description>
			<content:encoded><![CDATA[<blockquote><p>Statistics is a broad mathematical discipline which studies ways to collect, summarize and draw conclusions from data. [<a href="http://en.wikipedia.org/wiki/Stats">Wikipedia</a>]</p></blockquote>
<p>Statistics help us to draw conclusions from data. In a way this whole tagging thing just popped up and now we are trying to figure out what really is happening. I think statistics can help us to understand tags.</p>
<p>When I set up my <a href="http://www.pui.ch/phred/archives/2005/06/tagsystems-performance-tests.html">performance test</a> system I wanted to know the metrics of <a href="http://del.icio.us/">delicious</a>, so I <a href="http://www.pui.ch/phred/archives/2005/05/delicious-statistics-that-is-extrapolation.html">tried to extrapolate some hand-collected data</a> but it didn&#8217;t turn out that well.</p>
<p>After that I started collecting post data from del.icio.us and am happy to announce that I&#8217;ve set up <a href="http://deli.ckoma.net/stats">a site with delicious statistics</a> that is fully automated (my hands can rest now..). There are trends about number of posts per day as well as numbers of tags per post.<br />
<span id="more-36"></span></p>
<div class="caption"><img src="/phred/modules/overall.png" alt="overall post trend" title="overall post trend"/><br />
Overall bookmark post trend</div>
<p>The stats are based on data I extract <a href="http://del.icio.us/rss/">from the most recent posts feed</a>, which I&#8217;m grabbing 6 times an hour (I&#8217;m trying not to be evil: no screen scraping, no grabbing every minute). I miss a big portion of the posts (actually I record just about 10% of the data) but I guess the stats are precise enough to draw some conclusions.</p>
<h2>Why statistics?</h2>
<p>I&#8217;m fond of del.icio.us (as you may know) and when I&#8217;m fond of a website I feel the urge to know how many people are using it and whether the service is attracting or scaring away folk; I need to know what&#8217;s up. Especially now that delicious has been acquired by Yahoo, you may ask &#8220;do people stay?&#8221;.</p>
<p>Anyway, that&#8217;s not the only reason for stats. When I set up the <a href="http://www.pui.ch/phred/archives/2005/06/tagsystems-performance-tests.html">performance tests</a> I wanted to have real numbers. <a href="http://lists.del.icio.us/pipermail/discuss/2005-April/003002.html">I also asked</a> on the <a href="http://lists.del.icio.us/mailman/listinfo/discuss">delicious mailing list</a>. That same question was <a href="http://lists.del.icio.us/pipermail/discuss/2005-May/003225.html">asked</a> <a href="http://lists.del.icio.us/pipermail/discuss/2005-September/004012.html">a few times</a>, but got no answers..<br />
Now my stats don&#8217;t answer every question. If you&#8217;re asking yourself &#8220;how many inserts does my tag system have to cope with if it gets really big&#8221;, these will help you. But I cannot provide any query stats; maybe <a href="http://www.alexa.com/data/details/traffic_details?&#038;compare_sites=&#038;y=r&#038;q=&#038;url=del.icio.us">alexa may give you some query trends</a> (maybe you can subtract my numbers from Alexa&#8217;s to get the query stats?).</p>
<h2>First impressions</h2>
<p>From the stats you can see the two downtimes of delicious since August.</p>
<div class="caption"><img src="/phred/modules/downtime2.png" alt="del.icio.us downtime august" title="del.icio.us downtime august" /><br />
del.icio.us downtime august
</div>
<div class="caption"><img src="/phred/modules/downtime1.png" alt="del.icio.us downtime december" title="del.icio.us downtime december" /><br />
del.icio.us downtime december
</div>
<p>You also see that the recent growth of del.icio.us only started in December. I think it has to do with the more elaborate look and feel (changed in the middle of November) as well as with the new Firefox plugin, which gives the service a more professional touch. This growth is a thank you to Joshua and his team.</p>
<p>Then, take a look at the &#8220;tag hump&#8221; at 10 tags per posts:</p>
<div class="caption"><img src="/phred/modules/tags_december.png" alt="tag distribution december" title="tag distribution december"/><br />
del.icio.us tag distribution december</div>
<p>My first quick investigations show that this is caused by &#8211; you guessed it &#8211; tag spammers.<br />
I found <a href="http://del.icio.us/software.download">two</a> <a href="http://del.icio.us/dave77">spammers</a> that constantly post bookmarks with 10 tags (look out, the first link has got Chinese characters in it, my Firefox slowed down big time). This shows that stats can help to find anomalies such as spam.</p>
<p>I also thought that maybe the <a href="http://ejohn.org/apps/sheep/">lazy sheep bookmarklet</a> could cause such humps but, by default, lazy sheep&#8217;s posts have a maximum of 6 tags. There&#8217;s no irregularity at &#8220;6&#8221; so I guess lazy sheep doesn&#8217;t have a big influence (which is a fact I&#8217;m quite happy with).</p>
<p>I think it will be interesting to observe these tag graphs when the bookmark post user interface changes. I believe the interface plays a big role in how people tag and this sort of graph could prove that.</p>
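<p>Spotting a hump like the one at 10 tags can be automated. The following is a rough sketch, not the code behind the stats site: the function names, the neighbour comparison and the threshold <code>factor</code> are assumptions chosen for illustration, and the sample numbers are made up.</p>

```python
from collections import Counter


def tags_per_post_histogram(posts):
    """Count how many posts carry 1, 2, 3, ... tags; `posts` is a list of tag lists."""
    return Counter(len(tags) for tags in posts)


def find_humps(histogram, factor=2.0):
    """Return tag counts whose frequency exceeds `factor` times the mean of both neighbours."""
    humps = []
    for n in sorted(histogram):
        neighbours = (histogram.get(n - 1, 0) + histogram.get(n + 1, 0)) / 2
        if neighbours and histogram[n] > factor * neighbours:
            humps.append(n)
    return humps


# Made-up distribution that falls off smoothly, except for a bump at 10 tags per post.
sample = Counter({1: 300, 2: 400, 3: 350, 4: 200, 5: 120, 6: 80,
                  7: 50, 8: 30, 9: 20, 10: 90, 11: 8})
print(find_humps(sample))
```

<p>The tag counts reported this way are only candidates; whether a hump is spam or a bookmarklet default still has to be checked by looking at the posts themselves.</p>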
<h2>Further improvements</h2>
<p>I may give statistics about the estimated number of users (currently tracked: 100k) and number of bookmarks (currently tracked: 500k) but I&#8217;m not yet sure how I can compute numbers that seem accurate.<br />
I plan to come up with a few other del.icio.us services such as tag clusters but I&#8217;m not yet sure whether that project will ever be finished, so I&#8217;ve decided to put up the stats so you&#8217;ll have at least this.. :-)</p>
<h2>Hold on, that&#8217;s too much del.icio.us for me</h2>
<blockquote><p>Uh, all this talk about del.icio.us is too much [<a href="http://blog.simpy.com/blojsom/blog/2005/12/14/Del-icio-us-Kaput.html">Otis</a>]</p></blockquote>
<p>Yeah, you are right. The point is that these stats can be computed for all tagging-powered web services that serve a &#8220;most recent posts&#8221; feed. If you&#8217;re interested in having stats on a different service or you want to do del.icio.us stats on your own, just leave a comment. If there is enough demand, I&#8217;ll comment and refactor the code and publish it under the LGPL.</p>
<h2>Comparing to other services</h2>
<h3>Del.icio.us vs. Yahoo MyWeb 2.0</h3>
<p>Dorrian Porter has <a href="http://dorrianporter.typepad.com/silicon_valley_himalayan_/2005/10/lackluster_grow.html">tracked the number of posts of Yahoo&#8217;s MyWeb2.0</a>:</p>
<div class="caption"><a href="http://dorrianporter.typepad.com/silicon_valley_himalayan_/2005/10/lackluster_grow.html"><img src="/phred/modules/yahoo_posts_per_week.jpg"/></a><br />posts per week on Yahoo&#8217;s MyWeb2.0 (graphic by <a href="http://dorrianporter.typepad.com/silicon_valley_himalayan_/2005/10/lackluster_grow.html">Dorrian Porter</a>)</div>
<blockquote><p>Newly saved pages have averaged between 10,000 to 20,000 per week</p></blockquote>
<p>These numbers are <strong>per week</strong>. Del.icio.us has got an average of about 55,000 posts per day! This means that right now the database at del.icio.us grows roughly 20 to 40 times as fast as that of Yahoo&#8217;s MyWeb2.0. That leaves no question as to why they have acquired delicious.</p>
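<p>The comparison is easy to check with a few lines of arithmetic. The figures are the ones quoted above (about 55,000 posts per day for del.icio.us, 10,000 to 20,000 saved pages per week for MyWeb2.0); the variable names are just for this sketch.</p>

```python
DELICIOUS_POSTS_PER_DAY = 55_000          # average from the del.icio.us stats
MYWEB_POSTS_PER_WEEK = (10_000, 20_000)   # range quoted from Dorrian Porter

# Bring both services to the same unit: posts per week.
delicious_per_week = DELICIOUS_POSTS_PER_DAY * 7

# Growth ratio for the low and the high end of the MyWeb2.0 range.
ratios = [delicious_per_week / per_week for per_week in MYWEB_POSTS_PER_WEEK]
print(delicious_per_week, ratios)
```

<p>Depending on which end of the quoted range you take, del.icio.us grows somewhere between about 19 and 39 times as fast.</p>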
]]></content:encoded>
			<wfw:commentRss>http://www.pui.ch/phred/archives/2005/12/delicious-statistics.html/feed</wfw:commentRss>
		<slash:comments>15</slash:comments>
		</item>
		<item>
		<title>How tagging could gain ground</title>
		<link>http://www.pui.ch/phred/archives/2005/11/how-tagging-could-gain-ground.html</link>
		<comments>http://www.pui.ch/phred/archives/2005/11/how-tagging-could-gain-ground.html#comments</comments>
		<pubDate>Tue, 29 Nov 2005 20:54:28 +0000</pubDate>
		<dc:creator>Philipp Keller</dc:creator>
				<category><![CDATA[Clustering]]></category>
		<category><![CDATA[Del.icio.us]]></category>
		<category><![CDATA[Tags]]></category>

		<guid isPermaLink="false">http://www.pui.ch/phred/?p=35</guid>
		<description><![CDATA[Is the revolution stuck?
When I first heard about del.icio.us (and after those few days when I didn&#8217;t get it..) I thought: &#8220;This is revolutionary&#8221;. There were many things tags made possible that were just not possible until that day.
Joshua Schachter was the guy that invented tags (or at least that&#8217;s how the story is being [...]]]></description>
			<content:encoded><![CDATA[<h2>Is the revolution stuck?</h2>
<p>When <a href="http://www.pui.ch/phred/archives/2005/02/delicious_is_te.html">I first heard about del.icio.us</a> (and after those few days when I didn&#8217;t get it..) I thought: &#8220;This is revolutionary&#8221;. There were many things tags made possible that were just not possible until that day.</p>
<p><a href="http://burri.to/~joshua/">Joshua Schachter</a> was the guy that invented tags (or at least that&#8217;s how the story is being told). Originally <a href="http://loosewire.typepad.com/blog/2005/01/the_tag_report__3.html">thought of as a way to organize one&#8217;s own bookmarks</a>, tagging soon showed an obvious social effect:</p>
<blockquote><p>If everyone tags, the &#8220;community&#8221; profits.</p></blockquote>
<p>Now, we have del.icio.us. Now we organize our bookmarks with tags. <a href="http://www.flickr.com">And our photos</a>.<br />
And our <a href="http://www.librarything.com/">books</a>, <a href="http://www.millionsofgames.com/">our games</a>, <a href="http://myprogs.net/">our software</a>, <a href="http://supr.c.ilio.us/">our tagging sites</a>, and <a href="http://bulldogster.ning.com/">also your bulldogs</a>, if you have any.</p>
<p>However, as we have tagged our whole life, what do we do with it? What is it good for?<br />
I fear the tagging revolution is about to calm down. And I believe that&#8217;s because many people don&#8217;t see the advantages of tagging. I believe that <strong>many many</strong> things can be made possible by using tag-based systems. If we realized this, tagging would get some fresh air and would eventually go mainstream.</p>
<p>Is it just me, or is the tagging revolution really stuck? I desperately miss new, visionary, inventive articles on tags.</p>
<ul>
<li>To all smart people, where are your ideas?</li>
<li>To all programming geeks: Where are your algorithms, your &#8220;proof of concept&#8221; web services?</li>
</ul>
<p>I could stop here with my article, but, hey, I don&#8217;t want to be the grumbling guy that sits and waits for new things coming up, so here I am, trying to expose my brain to you.<br />
In this article I want to take a look at what areas tags are already strong in and how tagging could gain ground in these areas.<br />
<span id="more-35"></span></p>
<h2>Tags help you to organize</h2>
<p>When Joshua came up with the idea of tags, it was purely meant for organizing. It was only when other people also started organizing by tags that the whole idea of &#8220;folksonomy&#8221; came up.<br />
What does organizing mean? It is like tidying one&#8217;s room: you put every paper and pencil you have in a place that seems logical to you, so you can easily remember where you have put that thing. Now, as we are not limited by physical constraints when we organize data, we have many new possibilities. There are already <a href="http://wiki.osafoundation.org/bin/view/Journal/HierarchyVersusFacetsVersusTags">good articles</a> about this so I won&#8217;t discuss it in detail here.</p>
<p>At the end of the day, the question arises: Is organizing your bookmarks by tags really that good? </p>
<p>Just to make the point, let me come up with another way to remember things: while browsing, the browser could save all pages in a cache, and when you are searching for a page you have visited (which is why you originally bookmark the page anyway), you run a full-text search through all your cached pages. It&#8217;s a kind of &#8220;Google search&#8221; over pages you have already visited. I know this would have some downsides but it would have some advantages too. I have often searched for a page I bookmarked and couldn&#8217;t remember the tag I used. This problem wouldn&#8217;t occur in the &#8220;searching through cache&#8221; system.</p>
<p>What I am trying to say is: <strong>If tagging were solely for organizing your stuff, it wouldn&#8217;t be worth the trouble</strong>.</p>
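<p>To make the &#8220;searching through cache&#8221; idea a bit more concrete, here is a toy sketch of a full-text index over visited pages. No browser works exactly like this; the class name <code>PageCache</code>, its methods and the example URLs are invented for illustration, and a real system would also need ranking and persistent storage.</p>

```python
import re
from collections import defaultdict


class PageCache:
    """Toy inverted index: maps each word to the URLs of visited pages containing it."""

    def __init__(self):
        self.index = defaultdict(set)

    def add(self, url, text):
        """Remember a visited page by indexing every word of its text."""
        for word in re.findall(r"\w+", text.lower()):
            self.index[word].add(url)

    def search(self, query):
        """Return the URLs of pages that contain all words of the query."""
        words = re.findall(r"\w+", query.lower())
        if not words:
            return set()
        results = set(self.index.get(words[0], set()))
        for word in words[1:]:
            results = results.intersection(self.index.get(word, set()))
        return results


cache = PageCache()
cache.add("http://example.org/crypto", "An Overview of Cryptography")
cache.add("http://example.org/clouds", "Tag clouds: an overview")
print(cache.search("overview cryptography"))
```

<p>Note that no tag has to be remembered here: any word that appeared on the page finds it again.</p>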
<h2>Folksonomy &#8211; Classification of the masses</h2>
<p><a href="http://en.wikipedia.org/wiki/Folksonomy">Folksonomy</a> is &#8211; as I understand it &#8211; the distributed classification of data by the big mass of people who tag stuff. Folksonomy is often described as a new way to build a <a href="http://en.wikipedia.org/wiki/Taxonomy">taxonomy</a>. It&#8217;s like building the <a href="http://dmoz.org/about.html">Open Directory</a> by having thousands of people tag stuff.</p>
<p>What is folksonomy good for? Why do we want to put bookmarks into categories?</p>
<h3>Folksonomy enables exploration</h3>
<p>Where do you go to start building a new expertise? Is it del.icio.us? Is it Google?<br />
Let&#8217;s say your boss tells you that the data your software saves in the database should be encrypted. You never thought much about cryptography; it was merely a topic that you &#8220;should know about&#8221; but were never really interested in (I&#8217;m speaking for myself here.. :-) ). You don&#8217;t really know where to start. You know you want to know something about cryptography, but you don&#8217;t know exactly what.<br />
A good list of articles or even starting points could shorten your learning curve.<br />
Thereafter, you may want to &#8220;travel through the cryptography universe&#8221;. And to travel means knowing which articles are related to the one you just read and are so enthusiastic about. You need a map of the cryptography universe, you want to know what is left and right, top and bottom, you want to know everything and everyone related to &#8220;cryptography&#8221;.<br />
Now then: What would you do?</p>
<h3>Do tag systems help you to explore?</h3>
<h4>Delicious on cryptography</h4>
<div class="caption"><a href="http://del.icio.us/tag/cryptography"><img src="/phred/modules/delicious_cryptography.png" alt="delicious results on cryptography" title="delicious results on cryptography" /></a><br />
Delicious results on &laquo;cryptography&raquo;</div>
<p>I would go to <a href="http://del.icio.us/tag/cryptography+introduction">del.icio.us/tag/cryptography+introduction</a>. There I find a nice article titled &#8220;<a href="http://www.garykessler.net/library/crypto.html">An Overview of Cryptography</a>&#8221;. I guess I&#8217;m lucky! If I read the article, I&#8217;d probably find out which subtopics exist, how cryptography is related to similar issues and so on. I&#8217;d kind of get this &#8220;map of the cryptography universe&#8221;. But this map is drawn by only one author. Perhaps I don&#8217;t trust him (though I probably should, after reading his <a href="http://www.garykessler.net/resume.html">cv</a>), or I simply do not have the time and/or energy to read through 44 pages, although the article looks good. I&#8217;ll probably <a href="http://del.icio.us/tag/cryptography">go back to delicious and find out</a> that the related tags of &#8220;cryptography&#8221; are:</p>
<ul>
<li>security</li>
<li>reference</li>
<li>encryption</li>
<li>crypto</li>
<li>algorithms</li>
<li>computing</li>
<li>software</li>
<li>nsa</li>
<li>tutorial</li>
<li>kids</li>
<li>education</li>
</ul>
<p>Now this is not very convincing, is it? You argue:</p>
<blockquote><p>Yeah, but this is far better than what I get on Google.</p></blockquote>
<h4>Google on cryptography</h4>
<div class="caption"><a href="http://www.google.ch/search?q=cryptography"><img src="/phred/modules/google_cryptography.png" alt="Google results on cryptography" title="Google results on cryptography" /></a><br />
Google results on &laquo;cryptography&raquo;</div>
<p><a href="http://www.google.ch/search?q=cryptography">It is</a>. When looking at these Google results I remember that Google is meant for searching when I already know what I am searching for. But now I am at a different stage. I don&#8217;t know exactly what to search for, because I don&#8217;t have any expertise in cryptography. BTW: Google does come up with an article that looks like a good introduction to cryptography as well..</p>
<h4>Open directory on security</h4>
<p>What about <a href="http://dmoz.org/about.html">open directory</a>? Let&#8217;s give it a try: After typing in &#8220;cryptography&#8221; I find out that this topic is classified in <a href="http://www.google.com/Top/Science/Math/Applications/Communication_Theory/Cryptography">Science &gt; Math &gt; Applications &gt; Communication Theory &gt; Cryptography</a>. Clicking this link gets you what you were probably looking for.<br />
You get a nice overview:
<div class="caption"><a href="http://www.google.com/Top/Science/Math/Applications/Communication_Theory/Cryptography"><img src="/phred/modules/google_directory_cryptography.png" alt="Google open directory on cryptography" title="Google open directory on cryptography"/></a><br />
Google open directory on &laquo;cryptography&raquo;</div>
<ul>
<li>Algorithms</li>
<li>Books</li>
<li>Events</li>
<li>Historical</li>
<li>Journals</li>
<li>People</li>
<li>Programming Libraries</li>
<li>Research Groups</li>
<li>Theory</li>
</ul>
<p>Now you stand at a guidepost. You see the &#8220;cryptography universe&#8221;. You probably don&#8217;t see what is left and right of cryptography, but here you have &#8220;cryptography at a glance&#8221;.<br />
Now it&#8217;s up to you: Do you want to explore &#8220;algorithm land&#8221;, or take the shortcut and download the programming library for the language of your choice? Or do you even want to get advice from people who are experts on the matter?<br />
Even if the links provided here don&#8217;t give you what you are looking for, you get a clue about what you should look for.</p>
<h4>Comparing the three</h4>
<p>Let&#8217;s compare browsing to a real-life quest: Finding out where your next conference will take place. Say you want to go to the next <a href="http://conferences.oreillynet.com/etech/">etech conference</a>, you don&#8217;t know where it is and you are not an American citizen.</p>
<div class="caption"><img src="/phred/modules/too_near.png" alt="Ouch, nearly bumped my head into horton plaza!" title="Ouch, nearly bumped my head into horton plaza!"/><br />
Ouch, nearly bumped my head into horton plaza!</div>
<p>Conference websites often show a map of the venue as seen from about 10 meters above the ground. This map <strong>is</strong> helpful, but only once you are already quite close to the conference.</p>
<div class="caption"><img src="/phred/modules/too_far.png" alt="Help, I cannot breathe out there!" title="Help, I cannot breathe out there!"/><br />
Help, I cannot breathe out there!</div>
<p>Then, when you desperately search for a more general map, you&#8217;ll possibly find a map of how it looks from outer space. Yeah, I know that San Diego is in the US, but I&#8217;d like to know which airport is nearest to the conference.</p>
<div class="caption"><img src="/phred/modules/web_organization.png" alt="Distances between observer and data" title="Distances between observer and data" /><br />
Distances between observer and data</div>
<p>That&#8217;s quite similar to the views we have with del.icio.us and open directory.<br />
Delicious would tell you: &#8220;the roads nearby are &#8216;union street&#8217;, &#8216;Broadway circle&#8217; and &#8216;Broadway&#8217;&#8230;&#8221;, open directory proclaims: &#8220;we have five continents in the world: &#8216;America&#8217;, &#8216;Asia&#8217;, &#8216;Africa&#8217;, &#8216;Australia&#8217; and &#8216;Europe&#8217;&#8230;&#8221;. Now, I&#8217;m exaggerating a bit but you get the point: Sometimes you need a map that lies between the too detailed and the too general map.<br />
Looking for this type of view is like saying: &#8220;I want a bit more <a href="http://en.wikipedia.org/wiki/Ontology">ontology</a> than tags but not as much <a href="http://en.wikipedia.org/wiki/Taxonomy">taxonomy</a> as open directory&#8221;. That&#8217;s where I&#8217;ve put the question mark. It&#8217;s not that you always want to see the data at that distance but sometimes you desperately want to have that viewpoint.</p>
<p>Now, what has this to do with tagging? I believe that this missing in-between view can be won by analyzing tags.<br />
Have you noticed how flickr does this in-between view?<br />
When you search for love, <a href="http://www.flickr.com/photos/tags/love/clusters/">flickr cluster</a> asks you: &#8220;What do you mean by &#8216;love&#8217;?&#8221;:</p>
<div class="caption"><a href="http://www.flickr.com/photos/tags/love/clusters/"><img src="/phred/modules/flickr_clusters.png" alt="flickr clusters on love" title="flickr clusters on love" /></a><br />
flickr cluster results on &laquo;love&raquo;</div>
<ul>
<li>a <strong>couple</strong> <strong>kiss</strong>ing?</li>
<li>a <strong>mother</strong> holding her <strong>baby</strong>?</li>
<li>a <strong>red</strong> <strong>heart</strong>?</li>
</ul>
<p>&#8220;Wait: Flickr is a bit different from del.icio.us&#8221;, you say. Yup. Flickr uses a <a href="http://www.personalinfocloud.com/2005/02/explaining_and_.html">narrow</a>, del.icio.us a broad <a href="http://www.personalinfocloud.com/2005/02/explaining_and_.html">folksonomy</a> system.<br />
But I believe that the clusters flickr creates from its narrow folksonomy data can also be generated from delicious&#8217; broad folksonomy data. I am programming an algorithm that computes del.icio.us clusters. I&#8217;m still at an early stage but I get clusters like this &#8220;shopping cluster&#8221;:</p>
<div class="caption"><img src='/phred/modules/shopping_cluster.png' alt="shopping cluster" title="shopping cluster" /><br />
&laquo;shopping&raquo; cluster</div>
<p>I realize that even if the cluster data is available, there&#8217;s the question how to navigate through the data. The &#8220;zooming in&#8221; and &#8220;zooming out&#8221; won&#8217;t be as easy as with Google maps.<br />
But anyway, here is the land no one has explored before. I think this is the area we should talk about. Here is room for improvement.</p>
<h3>Folksonomy helps you to stay informed about a certain topic</h3>
<p>Back to what folksonomies are good for: If you have built an expertise in cryptography, you want to stay informed. If <a href="http://en.wikipedia.org/wiki/RSA">RSA</a> is hacked, you certainly want to be informed.<br />
Delicious has got an &#8220;<a href="http://del.icio.us/inbox/phred">Inbox</a>&#8221; where you can subscribe to a tag, e.g. &#8220;cryptography&#8221;.<br />
Each bookmark that is tagged &#8220;cryptography&#8221; lands in your inbox. That&#8217;s a great way to <strong>stay</strong> informed. Alternatively you have a list of <a href="http://del.icio.us/popular/cryptography">recent popular sites</a> tagged &#8220;cryptography&#8221;. You can subscribe to these lists using RSS and hopefully you get informed in time if RSA is hacked..</p>
<h3>Do tag systems keep you informed?</h3>
<p>I think the comparison with the distance to the data applies here too:<br />
If I <a href="http://del.icio.us/rss/tag/cryptography">subscribed to cryptography</a>, I&#8217;d probably miss some important items, just because the guy who bookmarked them used the tag &#8220;crypto&#8221;. On the other hand, I do not want to be informed about yet another <a href="http://en.wikipedia.org/wiki/Rijndael">Rijndael</a> algorithm; I want to narrow the incoming links to articles or essays that deal with cryptography.<br />
Delicious already offers a way to narrow results: I could <a href="http://del.icio.us/rss/tag/cryptography+essay">subscribe to &laquo;cryptography&raquo; and &laquo;essay&raquo;</a>, and, once delicious supports union (and it will, <a href="http://lists.del.icio.us/pipermail/discuss/2005-November/004390.html">as Joshua promises</a>), I could also <a>subscribe to (cryptography or crypto) and (essay or article)</a>, but you see that this doesn&#8217;t really solve the problem.<br />
I imagine that one day you can say:</p>
<blockquote><p>I want to keep being informed about cryptography</p></blockquote>
<p>and the service asks you:</p>
<blockquote><p>Should I keep you informed about</p>
<ul>
<li>new implementations</li>
<li>new articles/essays</li>
<li>security issues</li>
</ul>
</blockquote>
<p>And I believe this is possible. Flickr already asks you this when you are searching for <a href="http://www.flickr.com/photos/tags/love/clusters/">love pictures</a>. I guess it will be based on clusters again.</p>
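<p>The &#8220;(cryptography or crypto) and (essay or article)&#8221; subscription mentioned above can be sketched in a few lines of Python. This is only an illustration: the sample posts and the names <code>POSTS</code>, <code>QUERY</code> and <code>matches</code> are made up here, and it says nothing about how delicious would implement union.</p>

```python
# Hypothetical sample posts: (title, set of tags). Not real del.icio.us data.
POSTS = [
    ("Why hash functions matter", {"crypto", "essay"}),
    ("Rijndael in 200 lines of C", {"cryptography", "code"}),
    ("A short history of ciphers", {"cryptography", "article"}),
    ("CSS tricks", {"css", "article"}),
]


def matches(tags, query):
    """True if the tags satisfy every OR-group in the query (an AND of ORs)."""
    return all(any(tag in tags for tag in group) for group in query)


# (cryptography OR crypto) AND (essay OR article)
QUERY = [("cryptography", "crypto"), ("essay", "article")]

selected = [title for title, tags in POSTS if matches(tags, QUERY)]
print(selected)
```

<p>Such a filter still only looks at the literal tags, which is exactly why it doesn&#8217;t replace the cluster-based question the service could ask.</p>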
<h2>Tags help you share lists</h2>
<p>Back to what tags are good for: They help you build lists. Let&#8217;s name a few examples:</p>
<ul>
<li><strong>Wish lists</strong>: I know that <a href="http://www.amazon.com/exec/obidos/wishlist">numerous</a> <a href="http://froogle.google.com/shoppinglist">online</a> <a href="http://www.giftboxhome.com/">shops</a> let you build wish lists. But I&#8217;d like to have a wish list that&#8217;s not bound to a company, one that I can arrange and rearrange. <a href="http://del.icio.us/mpe/whishlist">Many</a> <a href="http://del.icio.us/janson/wishlist/">are</a> <a href="http://del.icio.us/Lillith_Within/whishlist">already</a> <a href="http://del.icio.us/a9bejo/whishlist">using</a> del.icio.us to store their wish list.</li>
<li><strong>Share your bookmarks</strong>: A friend asked me for some links to javascript WYSIWYG editors. <a href="http://del.icio.us/phred/javascript+editor">I gave him a list</a> of all my bookmarks tagged <code>javascript</code> and <code>editor</code>.</li>
<li><strong>Offer viewpoints of your data</strong>: Let&#8217;s say your favourite CMS features tagging (<a href="http://dema.ruby.com.br/articles/2005/08/27/easy-tagging-with-rails">featured in many of those new fancy ruby on rails applications</a>); I&#8217;m not speaking about blogs here. To allow &#8220;normal&#8221; visitors to view your data, you&#8217;ll add a navigation that provides starting points to your entries: specific locations where a visitor can jump in and bathe in your articles. You would probably add a link to all items tagged &#8220;references&#8221; and &#8220;networking&#8221; to achieve that.</li>
</ul>
<h3>How can tag lists be improved?</h3>
<p>I&#8217;m often annoyed that I cannot put my del.icio.us links in a specific order. I <a href="http://www.pui.ch/del_list/">wrote a little script</a> that puts my newest bookmark at the bottom but it doesn&#8217;t fully solve the problem.<br />
What I&#8217;d actually like is to be able to compose a <a href="http://en.wikipedia.org/wiki/View_%28database%29">view</a> of tagged bookmarks. For example, I want to offer a list of all firms our company has built the network for:</p>
<blockquote>
<h3>Networking references</h3>
<h4>Big firms</h4>
<ul>
<li><a href="http://www.ubs.ch">UBS</a></li>
<li><a href="http://www.migros.ch">Migros</a></li>
<li><a href="http://www.abb.ch">ABB</a></li>
</ul>
<h4>Medium-sized firms</h4>
<ul>
<li><a href="http://www.stadlerrail.ch/">Stadlerrail</a></li>
<li><a href="http://www.search.ch/rim.html">Räber Information Management GmbH</a></li>
</ul>
<h4>Small firms</h4>
<ul>
<li><a href="http://www.citrin.ch">Citrin Informatik GmbH</a></li>
<li><a href="http://www.thildykeller.ch">Goldschmiedeatelier Thildy Keller</a></li>
<li><a href="http://www.minifruits.ch">Mini Fruits Trading</a></li>
</ul>
</blockquote>
<p>At the moment, such a list can&#8217;t be generated automatically from my bookmarks, but it could be, by letting me configure my view as <code>myView = (references+networking, "Networking References", (big_firms, medium-sized_firms, small_firms))</code>.<br />
I know it&#8217;s not a <strong>big</strong> challenge to program such a thing, but nonetheless it doesn&#8217;t exist, as far as I know.</p>
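<p>To show how small such a thing could be, here is a minimal Python sketch of the <code>myView</code> idea. The bookmark data, the tuple layout of <code>MY_VIEW</code> and the function <code>render_view</code> are assumptions made for this example; no existing service offers this interface as far as I know.</p>

```python
# Hypothetical bookmarks: (title, set of tags). The tag names mirror the
# myView example above; the data itself is made up for illustration.
BOOKMARKS = [
    ("UBS", {"references", "networking", "big_firms"}),
    ("Stadlerrail", {"references", "networking", "medium-sized_firms"}),
    ("Citrin Informatik GmbH", {"references", "networking", "small_firms"}),
    ("Some CSS tutorial", {"css", "tutorial"}),
]

# Same shape as the myView tuple: (required tags, title, section tags).
MY_VIEW = (
    {"references", "networking"},
    "Networking References",
    ("big_firms", "medium-sized_firms", "small_firms"),
)


def render_view(view, bookmarks):
    """List bookmarks that carry all required tags, one heading per section tag."""
    required, title, sections = view
    lines = [title]
    for section in sections:
        lines.append("  " + section)
        for name, tags in bookmarks:
            # A bookmark belongs here if it has every required tag and the section tag.
            if required.issubset(tags) and section in tags:
                lines.append("    " + name)
    return lines


view_lines = render_view(MY_VIEW, BOOKMARKS)
print("\n".join(view_lines))
```

<p>The order of the sections comes from the view definition, not from the bookmarks, which is the part that a plain tag list cannot give you.</p>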
<h2>Bottom line</h2>
<p>It appears to me that there hasn&#8217;t been much progress in tagging systems lately. What has improved instead is the <a href="http://blog.del.icio.us/blog/2005/11/find_the_url_of.html">embedding of tagging systems into already existing technologies such as search</a>. This gives the impression that the core issues are solved and that there&#8217;s not much room for improvement. In this article I wanted to disprove this.<br />
I think that there&#8217;s much, much more than I have written here. I even believe that today&#8217;s tagging applications cover just about 5% of all the features tagging makes possible. Thus, let&#8217;s gain ground.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.pui.ch/phred/archives/2005/11/how-tagging-could-gain-ground.html/feed</wfw:commentRss>
		<slash:comments>9</slash:comments>
		</item>
		<item>
		<title>Does del.icio.us scale?</title>
		<link>http://www.pui.ch/phred/archives/2005/08/does-delicious-scale.html</link>
		<comments>http://www.pui.ch/phred/archives/2005/08/does-delicious-scale.html#comments</comments>
		<pubDate>Wed, 31 Aug 2005 06:12:50 +0000</pubDate>
		<dc:creator>Philipp Keller</dc:creator>
				<category><![CDATA[Clustering]]></category>
		<category><![CDATA[Del.icio.us]]></category>
		<category><![CDATA[MySQL]]></category>
		<category><![CDATA[Performance]]></category>
		<category><![CDATA[Tags]]></category>

		<guid isPermaLink="false">http://www.pui.ch/phred/archives/2005/08/does-delicious-scale.html</guid>
		<description><![CDATA[Lately it has become very quiet around del.icio.us. There are some new features but nothing groundbreaking. Either people are used to it and use it as a daily tool and there&#8217;s no need for new things, or folks just don&#8217;t have faith in the future of del.icio.us.
I am a big fan of delicious. I&#8217;ve got [...]]]></description>
			<content:encoded><![CDATA[<p>Lately it has become very quiet around <a href="http://del.icio.us">del.icio.us</a>. There are <a href="http://blog.del.icio.us/blog/2005/08/we_rolling.html">some</a> <a href="http://blog.del.icio.us/blog/2005/08/search_me.html">new</a> <a href="http://blog.del.icio.us/blog/2005/08/people_who_like.html">features</a> but nothing groundbreaking. Either people are used to it and use it as a daily tool and there&#8217;s no need for new things, or folks just don&#8217;t have faith in the future of del.icio.us.</p>
<p>I am a big fan of delicious. I&#8217;ve got 1.5K bookmarks there, I like its spirit and how open everything is. This article isn&#8217;t meant to criticize, but I think delicious is facing some problems.<br />
<span id="more-34"></span></p>
<h2>Performance scale</h2>
<p>You might have read my article about <a href="http://www.pui.ch/phred/archives/2005/06/tagsystems-performance-tests.html">Tag system performance</a>. To summarize my tests: MySQL is just not built for large tag systems. It scales up to about 1 million items, but delicious has far more posts than that.<br />
I am pretty sure delicious is still on the MySQL train. This strong belief comes from my performance tests: The MySQL schemas I tested really have the same characteristics as delicious.<br />
I fear delicious faces a performance dead end: They <a href="http://blog.del.icio.us/blog/2005/06/moving_to_new_s.html">have put more servers in the mix</a>, they cache quite a bit, and it is still slow. I strongly believe that for delicious to have a future it must become much faster. For me this is the number one downside of delicious. I dream of a bookmark service that has billions of bookmark posts yet still performs nicely. I think it is time for new tag systems to come up. On the <a href="http://lists.tagschema.com/mailman/listinfo/tagdb">tagdb mailing list</a>, there are very good ideas on how large-scale tagging systems should work (e.g. systems powered by <a href="http://lucene.apache.org/">Lucene</a>).</p>
<h2>Popular link scale</h2>
<p>I think one of the coolest features of delicious is the <a href="http://del.icio.us/popular/">popular</a> page. When you read this page regularly you are up to date.. wait: you are up to date concerning CSS tips and firefox and life hacks. You all know that if delicious went mainstream that page wouldn&#8217;t be that interesting any more. It has already become a bit boring. As someone put it: </p>
<blockquote><p>I particularly cannot look at that CSS link lists anymore</p></blockquote>
<p>I think this page doesn&#8217;t scale. It is stuck. And moreover it&#8217;s a pity that the coolest page on delicious is not about tags. At first glance you don&#8217;t even see what tags a popular link has.<br />
IMHO what is needed here are clusters. Bookmarks go into categories: &#8220;browsers&#8221;, &#8220;programming&#8221;, &#8220;design&#8221; but also &#8220;health&#8221;, &#8220;politics&#8221;. When delicious gets mainstream there most certainly will be &#8220;sports&#8221; or &#8220;stars&#8221;.<br />
One should then be able to subscribe to certain clusters or, better, have this subscription derived automatically from the tags in a user&#8217;s bookmarks.</p>
<h2>Bottom line</h2>
<p>I think there are some fundamental things that must be rearranged at delicious, otherwise there will be</p>
<ul>
<li>a) a big competitor (Google? Yahoo? Microsoft?) coming up or </li>
<li>b) people will spread to different bookmark services that concentrate on certain clusters. Probably some meta-sites will arise where you can have an overview over all the different sites</li>
</ul>
<p>I think these problems will arise for every bigger tag system. I hope that people will not sniff at tagging systems thinking that they don&#8217;t perform well enough..</p>
]]></content:encoded>
			<wfw:commentRss>http://www.pui.ch/phred/archives/2005/08/does-delicious-scale.html/feed</wfw:commentRss>
		<slash:comments>13</slash:comments>
		</item>
		<item>
		<title>Analyzing tag-connections</title>
		<link>http://www.pui.ch/phred/archives/2005/07/analyzing-tag-connections.html</link>
		<comments>http://www.pui.ch/phred/archives/2005/07/analyzing-tag-connections.html#comments</comments>
		<pubDate>Sun, 17 Jul 2005 18:03:43 +0000</pubDate>
		<dc:creator>Philipp Keller</dc:creator>
				<category><![CDATA[Clustering]]></category>
		<category><![CDATA[Del.icio.us]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Tags]]></category>

		<guid isPermaLink="false">http://www.pui.ch/phred/archives/2005/07/analyzing-tag-connections.html</guid>
		<description><![CDATA[When you tag an item, for instance a bookmark, you give it several tags. For instance, I tagged the bookmark for &#8220;How to Write More Clearly, Think More Clearly, and Learn Complex Material More Easily&#8221; (you know this link if you pay attention to delicious popular.. :-)) with 
&#8220;writing&#8221;, &#8220;toread&#8221;, &#8220;productivity&#8221;, &#8220;language&#8221;
Now what instantly pops [...]]]></description>
			<content:encoded><![CDATA[<p>When you tag an item, for instance a bookmark, you give it several tags. For instance, I tagged the bookmark for &#8220;<a href="http://www.ai.uga.edu/mc/WriteThinkLearn_files/frame.htm">How to Write More Clearly, Think More Clearly, and Learn Complex Material More Easily</a>&#8221; (you know this link if you pay attention to <a href="http://del.icio.us/popular">delicious popular</a>.. :-)) with </p>
<blockquote><p>&#8220;writing&#8221;, &#8220;toread&#8221;, &#8220;productivity&#8221;, &#8220;language&#8221;</p></blockquote>
<p>Now what instantly pops into my mind is that the tag &#8220;toread&#8221; is quite different from the other tags. In fact it describes something I want to do with this bookmark later on. I name this type of tag &#8220;<strong>adjective</strong>&#8221; (I will come back to that name later on..). The other tags I consider &#8220;<strong>categories</strong>&#8221;.<br />
Now you&#8217;ll probably say &#8220;ah, this is a rare exception&#8221;. This is not true. I often tag items with &#8220;blog&#8221; because the interesting page I found about my favourite hobby happens to be a blog. That is why I named this type of tag &#8220;adjective&#8221;: it is a description of the item rather than a category for it.<br />
Other tags often used as adjectives are &#8220;reference&#8221;, &#8220;tutorial&#8221;, &#8220;fun&#8221;, &#8220;cool&#8221;, &#8220;news&#8221;, &#8220;free&#8221;..<br />
<span id="more-33"></span><br />
Now this categorization is not entirely correct. Sometimes I use &#8220;blog&#8221; not as an adjective. That is the case when I want to bookmark a blog whose content doesn&#8217;t interest me but which just looks good. Then I&#8217;ll probably tag it as &#8220;design blog&#8221;. On the day I redesign my blog, I want to search for all the design blogs I tagged..<br />
You see: it all lies in the connection between those tags, not in the tags themselves. This is IMO pretty important.</p>
<h2>What is that for?</h2>
<p></p>
<h3>Clusters</h3>
<p>You probably tried to cluster your bookmarks by using <a href="http://laurie.informatik.uni-bremen.de/clusty/">clusty</a>. What this service does: It tries to put your tags into separate clouds. You know the &#8220;<a href="http://lists.del.icio.us/pipermail/discuss/2005-March/002266.html">tag-bundles</a>&#8221; of delicious? This is something like an &#8220;auto-tag-bundle&#8221; feature. Try it out, if you haven&#8217;t already, and see the problems that arise..<br />
I think the key problem with this cluster service is that it considers all connections (including the adjectives). But it shouldn&#8217;t do so! Adjectives aren&#8217;t tags I want in my clusters. Adjectives are spread all over my tags, so they should first be cut away from my &#8220;tag-tree&#8221; (the tree that is built out of the tag-connections you create by tagging bookmarks).</p>
<h3>Similar items</h3>
<p>This categorization is also important when you search for &#8220;similar&#8221; items of a bookmark. When I want to search for similar items of that &#8220;how to write more clearly&#8221;-article, I&#8217;ll search for &#8220;writing+productivity+language&#8221; and will leave out the &#8220;toread&#8221; tag (adjective).<br />
Probably this made you realize that categorizing tag-connections is an important task. </p>
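<p>As a tiny illustration of the &#8220;similar items&#8221; search described above, the following Python sketch drops the adjective tags and joins the rest into a query. The set <code>ADJECTIVES</code> is hard-coded here as an assumption; ideally it would come out of the tag-connection analysis, and <code>similar_items_query</code> is just a name chosen for this example.</p>

```python
# Assumed list of "adjective" tags; here simply hard-coded for illustration.
ADJECTIVES = {"toread", "blog", "reference", "tutorial", "fun", "cool", "news", "free"}


def similar_items_query(tags, adjectives=ADJECTIVES):
    """Join the category tags with '+' and leave out the adjective tags."""
    return "+".join(tag for tag in tags if tag not in adjectives)


query = similar_items_query(["writing", "toread", "productivity", "language"])
print(query)
```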
<h3>Tag clouds</h3>
<p>Now there are those tag clouds. When I look at <a href="http://kevan.org/extispicious.cgi?name=phred">my tag cloud</a>, the &#8220;biggest&#8221; tag is &#8220;resource&#8221;. Now tag clouds are there to easily find bookmarks (I never search my bookmarks solely for &#8220;resource&#8221;) or to have a map of your main interests (&#8220;what is your hobby?&#8221; &#8220;ah, I am a big fan of resources&#8221;.. :-) ). I am sure you were also annoyed by that. I want those adjective-tags cut away..!</p>
<h2>Synonyms</h2>
<p>Now back to some theory: There is a third type of tag-connection: Synonyms. &#8220;delicious&#8221; and &#8220;del.icio.us&#8221; are classic synonyms. But I consider &#8220;ruby&#8221; and &#8220;rails&#8221; synonyms too (no, they aren&#8217;t synonyms but up to now they are used as synonyms). You type in the second tag just to be sure that you won&#8217;t later search for it and find nothing.. I don&#8217;t think this type is too important for the cluster-task but I name it here because I&#8217;ll use it further on.</p>
<h2>Example</h2>
<p>Let&#8217;s go for an example.<br />
Let&#8217;s consider the tags that are connected to the tag &#8220;ajax&#8221;. I gathered some tag-connection data from delicious (via its <a href="http://del.icio.us/rss/">rss-feed</a>) and ran a query on my statistical data. This is data gathered over a period of one week. It is not complete, but our experiment will work anyway:</p>
<table>
<thead>
<tr>
<td>tag-connection</td>
<td>weight</td>
<td>type</td>
</tr>
</thead>
<tbody>
<tr>
<td><strong>ajax-javascript</strong></td>
<td>234</td>
<td>synonym</td>
</tr>
<tr>
<td><strong>ajax-web</strong></td>
<td>105</td>
<td>category</td>
</tr>
<tr>
<td><strong>ajax-programming</strong></td>
<td>100</td>
<td>category</td>
</tr>
<tr>
<td><strong>ajax-xmlhttprequest</strong></td>
<td>52</td>
<td>synonym</td>
</tr>
<tr>
<td>ajax-css</td>
<td>51</td>
<td>adjective</td>
</tr>
<tr>
<td>ajax-design</td>
<td>46</td>
<td>adjective</td>
</tr>
<tr>
<td>ajax-php</td>
<td>44</td>
<td>adjective</td>
</tr>
<tr>
<td>ajax-development</td>
<td>36</td>
<td>adjective</td>
</tr>
<tr>
<td>ajax-xml</td>
<td>34</td>
<td>adjective</td>
</tr>
<tr>
<td>ajax-DHTML</td>
<td>33</td>
<td>adjective</td>
</tr>
<tr>
<td>ajax-webdev</td>
<td>33</td>
<td>adjective</td>
</tr>
<tr>
<td>ajax-webdesign</td>
<td>31</td>
<td>adjective</td>
</tr>
<tr>
<td>ajax-google</td>
<td>23</td>
<td>adjective</td>
</tr>
<tr>
<td>ajax-HTML</td>
<td>21</td>
<td>adjective</td>
</tr>
<tr>
<td>ajax-tutorial</td>
<td>14</td>
<td>adjective</td>
</tr>
</tbody>
</table>
<p>Column &#8220;tag-connection&#8221; shows the tag connected to &#8220;ajax&#8221; (e.g. javascript), column &#8220;weight&#8221; gives the number of times this connection occurred in a bookmark-post on delicious. The rows are ordered by weight. In column &#8220;type&#8221; you see the result of my computations for this tag-connection. Just to make it clear: These are all tags connected to the tag &#8220;ajax&#8221;, ordered by the number of occurrences of the connection. If a bookmark-post somebody made on delicious is tagged with &#8220;ajax&#8221; and &#8220;javascript&#8221;, that gives one point in the &#8220;weight&#8221; column for &#8220;ajax-javascript&#8221;.<br />
The outcome is quite good, I think (I must admit that I have taken the example that worked out best :-))<br />
There are some errors, sure: ajax-xml should be of type &#8220;category&#8221; as well. But we are looking at the usage of these tags, not their &#8220;real&#8221; meaning (whatever that is).</p>
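<p>The &#8220;weight&#8221; column can be computed by counting how often two tags appear in the same post. The following Python sketch shows the idea on a handful of made-up posts; <code>POSTS</code>, <code>connection_weights</code> and <code>connections_of</code> are names invented for this example and are not the query I actually ran.</p>

```python
from collections import Counter
from itertools import combinations

# Hypothetical bookmark posts, each reduced to its set of tags.
POSTS = [
    {"ajax", "javascript", "web"},
    {"ajax", "javascript"},
    {"ajax", "programming", "tutorial"},
    {"css", "design"},
]


def connection_weights(posts):
    """Count how often each unordered pair of tags occurs in the same post."""
    weights = Counter()
    for tags in posts:
        # Sorting makes ("ajax", "web") and ("web", "ajax") the same key.
        for pair in combinations(sorted(tags), 2):
            weights[pair] += 1
    return weights


def connections_of(tag, weights):
    """Return (other_tag, weight) pairs for one tag, heaviest first."""
    found = []
    for (a, b), weight in weights.items():
        if tag == a:
            found.append((b, weight))
        elif tag == b:
            found.append((a, weight))
    # Ties are broken alphabetically so the order is deterministic.
    return sorted(found, key=lambda item: (-item[1], item[0]))


weights = connection_weights(POSTS)
ajax_connections = connections_of("ajax", weights)
print(ajax_connections)
```

<p>Each post that carries both &#8220;ajax&#8221; and &#8220;javascript&#8221; adds one point to that connection, exactly as described for the table.</p>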
<h2>Computation</h2>
<p></p>
<h3>Synonyms</h3>
<p>To compute this categorization I first went for the &#8220;synonyms&#8221;. The connection &#8220;ajax-javascript&#8221; is considered a synonym because &#8220;ajax-javascript&#8221; is the &#8220;number one connection&#8221; of all connections that ajax is part of. And when considering the connections of &#8220;javascript&#8221; (the &#8220;vice-versa-connection&#8221;), ajax is number two.<br />
I consider two tags synonyms if &#8220;in one direction&#8221; the other tag is number one and in the other &#8220;direction&#8221; the other tag is in the top 10. I made up this rule because I think that in most cases there is one &#8220;stronger&#8221; synonym that is used most of the time when the &#8220;weaker&#8221; one is used. The fact that the tag &#8220;ajax&#8221; is mostly used with the tag &#8220;javascript&#8221; could also mean that &#8220;javascript&#8221; is a supercategory of ajax (which it somehow is). To avoid treating such sub/super-category connections as synonyms, we make sure that &#8220;ajax&#8221; is also important for &#8220;javascript&#8221;, so that ajax is not too &#8220;sub&#8221; to javascript.. I hope you can follow :-)</p>
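<p>Written down as code, the synonym rule above might look like the following Python sketch. The rankings in <code>RANKED</code> are invented for illustration (only &#8220;javascript is number one for ajax&#8221; and &#8220;ajax is number two for javascript&#8221; come from my data), and <code>is_synonym</code> is a hypothetical helper, not my original script.</p>

```python
# Hypothetical rankings: for each tag, its connected tags ordered by weight.
RANKED = {
    "ajax": ["javascript", "web", "programming", "xmlhttprequest", "css"],
    "javascript": ["programming", "ajax", "web", "css"],
    "xmlhttprequest": ["ajax", "javascript"],
    "web": ["design", "css", "html", "webdesign", "programming", "development",
            "tools", "internet", "reference", "blog", "software", "ajax"],
}


def is_synonym(tag_a, tag_b, ranked, top_n=10):
    """Number one in one direction, and within the top N in the other."""
    rank_a = ranked.get(tag_a, [])
    rank_b = ranked.get(tag_b, [])
    a_first = bool(rank_a) and rank_a[0] == tag_b and tag_a in rank_b[:top_n]
    b_first = bool(rank_b) and rank_b[0] == tag_a and tag_b in rank_a[:top_n]
    return a_first or b_first


print(is_synonym("ajax", "javascript", RANKED))
print(is_synonym("ajax", "xmlhttprequest", RANKED))
print(is_synonym("ajax", "web", RANKED))
```

<p>With these made-up rankings the first two pairs count as synonyms, while &#8220;ajax-web&#8221; does not, because neither tag is the other&#8217;s number one connection.</p>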
<h3>Category/Adjective</h3>
<p>Then I compute the &#8220;category&#8221;. Let&#8217;s put the values of the above table into a graph.<br />
<img src="/phred/modules/ajax_dist.png" alt="distribution of tags related to ajax" title="distribution of tags related to ajax"/><br />
On the x-axis you see the tags: The tick 1 stands for &#8220;web&#8221;, 2 for &#8220;programming&#8221;, 3 for &#8220;css&#8221;, 4=&#8220;design&#8221;, 5=&#8220;php&#8221; and so on. You see I removed the synonym-connections &#8220;ajax-javascript&#8221; and &#8220;ajax-xmlhttprequest&#8221; as I think they &#8220;disturb&#8221; the distribution.<br />
The y-axis depicts the weight of the connection: ajax-web has weight &#8220;105&#8221;, ajax-programming has weight &#8220;100&#8221; and so on.<br />
The black line is the &#8220;weight&#8221;-column of the table above, the red one is the first <a href="http://en.wikipedia.org/wiki/Derivative">derivative</a>, the blue one the second derivative of the weight function.<br />
This graph makes it clear that &#8220;web&#8221; and &#8220;programming&#8221; are used quite often in combination with &#8220;ajax&#8221;, then, there is quite a &#8220;gap&#8221; followed by the &#8220;adjective tail&#8221;. I categorize the connections in this tail as &#8220;adjective&#8221;. The tags in this tail are used &#8220;out of context&#8221;: They don&#8217;t really belong to the &#8220;ajax-cluster&#8221;. They sometimes occur together with ajax, but just sometimes. Mostly not. Therefore they are considered &#8220;adjectives&#8221;.<br />
Now the task is to find this &#8220;gap&#8221;. In my experiments I tried to find the last gap. To find it I started at the end of the tail and searched for the first peak of the first derivative (that is where the second derivative goes from positive to negative) and checked whether the peak was high enough. If these two conditions were fulfilled, I split the connections into two parts: the &#8220;pre-gap&#8221; connections (category) and the &#8220;post-gap&#8221; connections (adjective).<br />
The same computation has to be made for the &#8220;vice-versa&#8221; connection. I considered a connection a &#8220;category&#8221; if at least one of the two computations said that it is a &#8220;category&#8221;.</p>
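<p>The gap search can be sketched in Python with discrete differences standing in for the derivatives. The weights are the ones from the table (synonyms removed). The threshold <code>min_drop=20</code> and the decision to leave everything in the first group when no gap is found are assumptions made for this example, since the text above only says the peak must be &#8220;high enough&#8221;.</p>

```python
# Weights of the connections to "ajax" from the table, synonyms removed.
CONNECTIONS = [
    ("web", 105), ("programming", 100), ("css", 51), ("design", 46),
    ("php", 44), ("development", 36), ("xml", 34), ("DHTML", 33),
    ("webdev", 33), ("webdesign", 31), ("google", 23), ("HTML", 21),
    ("tutorial", 14),
]


def split_at_last_gap(connections, min_drop):
    """Split weight-ordered connections at the last sufficiently large drop.

    Scans from the tail towards the head and stops at the first local
    maximum of the drop between neighbours that is at least min_drop.
    """
    weights = [weight for _, weight in connections]
    # Discrete stand-in for the first derivative: drop from one tag to the next.
    drops = [weights[i] - weights[i + 1] for i in range(len(weights) - 1)]
    for i in range(len(drops) - 1, -1, -1):
        left = drops[i - 1] if i > 0 else 0
        right = drops[i + 1] if i + 1 < len(drops) else 0
        is_peak = drops[i] >= left and drops[i] >= right
        if is_peak and drops[i] >= min_drop:
            return connections[: i + 1], connections[i + 1:]
    # Assumption: without a clear gap, nothing is moved to the tail.
    return connections, []


categories, adjectives = split_at_last_gap(CONNECTIONS, min_drop=20)
print([tag for tag, _ in categories])
```

<p>On this data the only drop that passes the threshold is the one between &#8220;programming&#8221; (100) and &#8220;css&#8221; (51), so the split lands where the graph shows the gap. A different threshold would of course move the cut.</p>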
<p><ins datetime="2005-07-18T15:43:36-02:00"></p>
<h2>Further processing: Ambiguous tags</h2>
<p>To achieve good clustering results, I think we need to check whether a tag is used in different ways. The prominent example of this is &#8220;apple&#8221;. For now, while delicious is still restricted to the blog world, it is clear that apple means Mac-apple. But in the future this may change. To recognize whether a tag is used in different environments, the algorithm would have to check the &#8220;neighbours of neighbours&#8221; (<a href="http://blog.pietrosperoni.it/2004/09/19/clustering-delicious-tags/">as suggested by Pietro Speroni</a>). For ajax, that means: check whether the neighbours of &#8220;javascript&#8221; are more or less the same as the neighbours of &#8220;web&#8221;. You see that it all lies in the connections between tags. The tag per se is not well-defined, but the tag in connection with another tag defines it quite well. Therefore, for clustering, I&#8217;m proposing to split up ambiguous tags. That would make the resulting clusters much simpler.</ins></p>
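<p>One possible way to check the &#8220;neighbours of neighbours&#8221; is to compare the neighbour sets with a simple overlap measure. The Python sketch below uses the Jaccard index and a greedy grouping; both are my own choice for illustration and not necessarily what Pietro Speroni suggests. The data in <code>NEIGHBOURS</code> and the threshold of 0.3 are made up.</p>

```python
# Hypothetical neighbour sets: for each tag, the tags it is connected to.
NEIGHBOURS = {
    "apple": {"mac", "osx", "fruit", "recipes"},
    "mac": {"apple", "osx", "software"},
    "osx": {"apple", "mac", "software"},
    "fruit": {"apple", "recipes", "health"},
    "recipes": {"apple", "fruit", "health"},
}


def jaccard(a, b):
    """Share of common elements between two sets (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def neighbour_groups(tag, neighbours, threshold=0.3):
    """Greedily group the neighbours of a tag by how much their own neighbours overlap."""
    groups = []
    for other in sorted(neighbours[tag]):
        own = neighbours.get(other, set()) - {tag}
        for group in groups:
            # Compare against the first member of each existing group.
            representative = neighbours.get(group[0], set()) - {tag}
            if jaccard(own, representative) >= threshold:
                group.append(other)
                break
        else:
            groups.append([other])
    return groups


groups = neighbour_groups("apple", NEIGHBOURS)
print(groups)
```

<p>If a tag ends up with more than one group of neighbours, as &#8220;apple&#8221; does here, that is a hint that it is used in different environments and could be split up.</p>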
<h2>We are onto something</h2>
<p>I&#8217;m pretty sure we are onto something. I think this is the direction it should go. Computations over tag-connection distributions are cool. Users shouldn&#8217;t have to enter this information when posting bookmarks. Posting should stay easy. I&#8217;m not that sure about the &#8220;synonym&#8221; computation but I think the &#8220;category&#8221; computation turned out pretty well. I tried to build some clusters by hand just by considering the category and synonym connections and I found a completely detached cluster consisting of the tags &#8220;cooking&#8221;, &#8220;health&#8221;, &#8220;recipes&#8221;, &#8220;diet&#8221; and &#8220;food&#8221;. As I said, I think we are onto something..</p>
<h2>Further reading</h2>
<ul>
<li><a href="http://www.rashmisinha.com/archives/05_02/tag-sorting.html">Building tag clusters by hand</a></li>
<li><a href="http://blog.pietrosperoni.it/2004/09/19/clustering-delicious-tags/">Pietro Speroni&#8217;s different approach to clustering tags (with java-mindmap-visualisation!)</a>
</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.pui.ch/phred/archives/2005/07/analyzing-tag-connections.html/feed</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Tagsystems: performance tests</title>
		<link>http://www.pui.ch/phred/archives/2005/06/tagsystems-performance-tests.html</link>
		<comments>http://www.pui.ch/phred/archives/2005/06/tagsystems-performance-tests.html#comments</comments>
		<pubDate>Sun, 19 Jun 2005 14:09:53 +0000</pubDate>
		<dc:creator>Philipp Keller</dc:creator>
				<category><![CDATA[Del.icio.us]]></category>
		<category><![CDATA[MySQL]]></category>
		<category><![CDATA[Performance]]></category>
		<category><![CDATA[Tags]]></category>

		<guid isPermaLink="false">http://www.pui.ch/phred/archives/2005/06/tagsystems-performance-tests.html</guid>
		<description><![CDATA[In my previous article named "Tags: database schemas" we analysed different database schemas on how they could meet the needs of tag systems. In this article, the focus is on performance (speed).]]></description>
			<content:encoded><![CDATA[<p>In my <a title="Tags: database schemas" href="http://www.pui.ch/phred/archives/2005/04/tags-database-schemas.html">previous article named &#8220;Tags: database schemas&#8221;</a> we analysed how different database schemas could meet the needs of tag systems. In this article, the focus is on performance (speed). That is: if you want to build a tag system that performs well with about 1 million items (bookmarks, for instance), then you may want to have a look at the following results of my performance tests.<br />
In this article I tested tagging of bookmarks, but as you can tag pretty much anything, this goes for tagging systems in general.</p>
<p><span id="more-32"></span></p>
<p>I tested the following schemas (I keep the naming from the previous article):</p>
<ul>
<li><strong>mysqlicious</strong>: One table. Tags are space separated in column &#8220;tags&#8221;; <a href="http://www.pui.ch/phred/archives/2005/04/tags-database-schemas.html#mysqlicious">as introduced</a></li>
<li><strong>mysqlicious fulltext</strong>: Same schema but with <a href="http://dev.mysql.com/doc/mysql/en/fulltext-search.html">mysql fulltext</a> on the tag column; <a href="http://www.pui.ch/phred/archives/2005/05/tags-with-mysql-fulltext.html">as introduced</a></li>
<li><strong>scuttle</strong>: Two tables: One for bookmarks, one for tags. Tag-table has foreign key to bookmark table; <a href="http://www.pui.ch/phred/archives/2005/04/tags-database-schemas.html#scuttle">as introduced</a></li>
<li><strong>toxi</strong>: Three tables: One for bookmarks, one for tags, one for junction; <a href="http://www.pui.ch/phred/archives/2005/04/tags-database-schemas.html#toxi">as introduced</a></li>
</ul>
<p>For the details of the schemas, have a close look at the <a href="http://www.pui.ch/phred/modules/tag_database_schemas.sql">sql-create-table-queries</a>.</p>
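<p>To make the shape of the three table layouts concrete, here is a rough sketch in Python. It uses SQLite in memory as a stand-in for MySQL, and all table and column names are my assumptions for illustration; the linked sql file has the real definitions. The fulltext variant is left out because its index is MySQL-specific.</p>
<pre>
```python
import sqlite3

# Hypothetical, simplified layouts; the real ones are in the linked sql file.
SCHEMAS = {
    # one table, tags space separated in one column
    "mysqlicious": """
        CREATE TABLE bookmark (id INTEGER PRIMARY KEY, url TEXT, tags TEXT);
    """,
    # two tables, the tag table points at the bookmark table
    "scuttle": """
        CREATE TABLE bookmark (id INTEGER PRIMARY KEY, url TEXT);
        CREATE TABLE tag (bookmark_id INTEGER REFERENCES bookmark(id), tag TEXT);
        CREATE INDEX tag_name ON tag (tag);
    """,
    # three tables: bookmarks, tags and a junction table
    "toxi": """
        CREATE TABLE bookmark (id INTEGER PRIMARY KEY, url TEXT);
        CREATE TABLE tag (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
        CREATE TABLE tagmap (bookmark_id INTEGER REFERENCES bookmark(id),
                             tag_id INTEGER REFERENCES tag(id));
        CREATE INDEX tagmap_tag ON tagmap (tag_id);
    """,
}


def create(schema):
    """Create an empty in-memory database for the named schema."""
    db = sqlite3.connect(":memory:")
    db.executescript(SCHEMAS[schema])
    return db


def table_names(db):
    """List the tables of a database, sorted by name."""
    rows = db.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")
    return [row[0] for row in rows]
```
</pre>
<p>Calling <code>table_names(create("toxi"))</code> lists the three tables of the toxi layout, which is a quick way to compare the variants side by side.</p>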
<p>But let&#8217;s go directly to the results. The details about the setup of these tests are given at the <a href="#setup">end of this article</a>. The x-axis depicts the number of bookmarks in the corresponding database; on the y-axis you see how much time each query took to execute.</p>
<h3><a name="#results"></a>Results</h3>
<h4>Intersection: 250 tag set</h4>
<p><img title="Intersection test with 300 queries, up to three tags in query, 250 tags in small dataset" alt="Intersection test with 300 queries, up to three tags in query, 250 tags in small dataset" src="/phred/modules/intersection_250_3_i300.png" /></p>
<p>The first two tests are done with 250 tags in the small dataset (<a href="#setup">see below</a> for explanation). I think the &#8220;1 million bookmarks database&#8221; is the only size we should pay attention to. I mean, if you have a small number of bookmarks, performance isn&#8217;t really something to worry about..</p>
<p>We run intersection queries, like</p>
<blockquote><p>I want to search for bookmarks tagged with &#8220;design&#8221; and &#8220;html&#8221;</p></blockquote>
<p>You see that, not surprisingly, mysqlicious with its <code>WHERE tag LIKE "% tag %"</code> is very slow. That is because MySQL has to go through the whole dataset and test each bookmark against the query.<br />
What actually <strong>is</strong> surprising me is that the fulltext search of mysql is not that fast. In fact it is not faster than the <code>LIKE</code>-query in the MySQLicious DB. This really disappointed me. I tried every trick I could think of to make this faster, as <a href="http://www.pui.ch/phred/archives/2005/05/tags-with-mysql-fulltext.html">I think a tag-database-system with mysql fulltext would be very easy and pretty much the only thing you should head for..</a>.<br />
What is surprising me too is that the queries on the three-table schema are about twice as fast as the ones on the two-table schema (<a href="http://www.pui.ch/phred/modules/schemas.inc.phps">take a look at the queries</a> if you think you could give me a hint on this). It is noticeable that in the scuttle and toxi variants, the more queries were run, the faster they were. I didn&#8217;t do any tests with queries and inserts mixed, so this may just come from plain good caching, and this effect possibly doesn&#8217;t show up on live bookmark management systems.</p>
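<p>For readers who want to see what an intersection query can look like on each layout, here is a small sketch in Python with SQLite in memory instead of MySQL. Table names, column names and the GROUP BY/HAVING formulation are my assumptions; the queries actually used in the tests are in the linked source and may be written differently (with self-joins, for instance).</p>
<pre>
```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    -- mysqlicious: tags space separated, padded with a space on both sides
    CREATE TABLE m_bookmark (id INTEGER PRIMARY KEY, tags TEXT);
    -- scuttle: one row per (bookmark, tag)
    CREATE TABLE s_tag (bookmark_id INTEGER, tag TEXT);
    -- toxi: tag table plus junction table
    CREATE TABLE t_tag (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    CREATE TABLE t_tagmap (bookmark_id INTEGER, tag_id INTEGER);
""")

DATA = {1: ["design", "html"], 2: ["design", "css"], 3: ["html", "design", "web"]}
for bid, tags in DATA.items():
    db.execute("INSERT INTO m_bookmark VALUES (?, ?)",
               (bid, " " + " ".join(tags) + " "))
    for tag in tags:
        db.execute("INSERT INTO s_tag VALUES (?, ?)", (bid, tag))
        db.execute("INSERT OR IGNORE INTO t_tag (name) VALUES (?)", (tag,))
        db.execute("INSERT INTO t_tagmap SELECT ?, id FROM t_tag WHERE name = ?",
                   (bid, tag))


def ids(sql, params):
    """Run a query and return the bookmark ids in sorted order."""
    return sorted(row[0] for row in db.execute(sql, params))


def intersect_mysqlicious(tags):
    # one LIKE per tag, all of them must match
    where = " AND ".join("tags LIKE ?" for _ in tags)
    return ids(f"SELECT id FROM m_bookmark WHERE {where} LIMIT 50",
               [f"% {t} %" for t in tags])


def intersect_scuttle(tags):
    # keep bookmarks that carry every requested tag
    marks = ", ".join("?" for _ in tags)
    return ids(f"SELECT bookmark_id FROM s_tag WHERE tag IN ({marks}) "
               "GROUP BY bookmark_id HAVING COUNT(DISTINCT tag) = ? LIMIT 50",
               [*tags, len(tags)])


def intersect_toxi(tags):
    # same idea, but tag names are resolved through the junction table
    marks = ", ".join("?" for _ in tags)
    return ids("SELECT m.bookmark_id FROM t_tagmap m "
               "JOIN t_tag t ON t.id = m.tag_id "
               f"WHERE t.name IN ({marks}) "
               "GROUP BY m.bookmark_id HAVING COUNT(DISTINCT t.name) = ? LIMIT 50",
               [*tags, len(tags)])
```
</pre>
<p>All three functions return the same bookmark ids for the same tags; only the amount of work the database has to do differs, and that difference is what the graphs measure.</p>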
<h4>Intersection: 999 tag set</h4>
<p><img title="Intersection test with 300 queries, up to three tags in query, 999 tags in small dataset" alt="Intersection test with 300 queries, up to three tags in query, 999 tags in small dataset" src="/phred/modules/intersection_999_3_300.png" /><br />
Now have a look at what happens if we broaden our small tag set: MySQLicious with fulltext suddenly becomes the performance leader. That means, if you have a bookmark management system with diverse tags (this most probably comes from the fact that there are many users), the fulltext solution is possibly the way to go.<br />
So now, as you see, choosing the right schema is all about tag distribution. In my previous post about guessing the overall tag distribution on <a href="http://del.icio.us">del.icio.us</a>, I came to the conclusion that delicious&#8217; most popular tag &#8220;design&#8221; shows up in 3.2% of all bookmarks on <a href="http://del.icio.us">del.icio.us</a>. So then, what is the mean tag distribution?</p>
<ul>
<li>If we say 1% (a tag shows up in 1/100 of all bookmarks on an average) that makes our small tag set 250 tags big</li>
<li>If we say 0.25%, the small tag set grows to a size of 1000</li>
<li>If we say 0.1%, the small set will contain 2500 tags</li>
</ul>
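<p>The three figures in this list can be reproduced from the test setup described at the end of the article. The following Python sketch rests on my own assumptions: the number of tags per bookmark is spread evenly between one and ten, every even-numbered tag is drawn uniformly from the small set, and duplicate draws are ignored. The function names are hypothetical.</p>
<pre>
```python
def small_tags_per_bookmark(max_tags=10):
    """Average number of small-set tags per bookmark.

    A bookmark with n tags has n // 2 tags from the small set, because
    only the even-numbered positions draw from it.
    """
    counts = [n // 2 for n in range(1, max_tags + 1)]
    return sum(counts) / len(counts)


def small_set_size(mean_share, max_tags=10):
    """Size of the small set at which one tag shows up in mean_share of all bookmarks."""
    return round(small_tags_per_bookmark(max_tags) / mean_share)


for share in (0.01, 0.0025, 0.001):
    print(f"{share:.2%} per tag: about {small_set_size(share)} tags in the small set")
```
</pre>
<p>Under these assumptions a bookmark carries 2.5 small-set tags on average, so a mean share of 1% corresponds to 250 tags, 0.25% to 1000 tags and 0.1% to 2500 tags.</p>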
<p>So I&#8217;d suggest that if your average distribution is 1%, take &#8220;toxi&#8221;; if the distribution is broader, take &#8220;MySQLicious fulltext&#8221;.<br />
If you take a closer look, you can see that the fulltext schema stayed as fast as when queried with the 250 tag set. That means, if you want to be sure your tag system responds ok in every situation, you should go with the &#8220;mysql fulltext&#8221; schema.<br />
<ins datetime="2005-06-26T14:52:55-02:00"><a href="http://hannes.magiccards.info/get/results.html">Hannes has done some further investigation on mysql fulltext running on MySQL 4.1</a> (my tests were on MySQL 4.0.21)</ins></p>
<h4>Union</h4>
<p><img alt="Union test with 250 tags in small dataset" src="/phred/modules/union_full_250_3.png" /><br />
When doing a union query we say</p>
<blockquote><p>I want to search for all the bookmarks that are tagged either with &#8220;delicious&#8221; or &#8220;del.icio.us&#8221;</p></blockquote>
<p>These queries, you guessed it, are handled the fastest by the &#8220;MySQLicious schema&#8221; with its <code>LIKE</code>-queries: MySQL scans through the bookmarks, harvesting all bookmarks with one of the given tags, and says &#8220;I&#8217;m finished!&#8221; when it is at bookmark number #968, because it has found 50 bookmarks. In the other schemas, by contrast, MySQL has to join the tags with the bookmarks first and only then can search through the result..</p>
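<p>As an illustration of the union case, here is a sketch in Python with SQLite in memory instead of MySQL. The table and column names are assumptions, and the queries are my own formulation, not the ones from the test source. The <code>LIMIT</code> is what allows the single-table scan to stop as soon as enough bookmarks are found.</p>
<pre>
```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE m_bookmark (id INTEGER PRIMARY KEY, tags TEXT);  -- mysqlicious
    CREATE TABLE s_tag (bookmark_id INTEGER, tag TEXT);           -- scuttle
""")

DATA = {1: ["delicious", "web"], 2: ["music"], 3: ["del.icio.us"],
        4: ["delicious", "del.icio.us"]}
for bid, tags in DATA.items():
    db.execute("INSERT INTO m_bookmark VALUES (?, ?)",
               (bid, " " + " ".join(tags) + " "))
    db.executemany("INSERT INTO s_tag VALUES (?, ?)", [(bid, t) for t in tags])


def union_mysqlicious(tags, limit=50):
    # one LIKE per tag, any of them may match
    where = " OR ".join("tags LIKE ?" for _ in tags)
    sql = f"SELECT id FROM m_bookmark WHERE {where} LIMIT ?"
    params = [*(f"% {t} %" for t in tags), limit]
    return sorted(row[0] for row in db.execute(sql, params))


def union_scuttle(tags, limit=50):
    # DISTINCT because a bookmark can carry several of the requested tags
    marks = ", ".join("?" for _ in tags)
    sql = f"SELECT DISTINCT bookmark_id FROM s_tag WHERE tag IN ({marks}) LIMIT ?"
    return sorted(row[0] for row in db.execute(sql, [*tags, limit]))
```
</pre>
<p>Both functions return bookmarks 1, 3 and 4 for the tags &#8220;delicious&#8221; and &#8220;del.icio.us&#8221;. With a limit smaller than the number of matches, which bookmarks come back depends on the scan order, which is fine for a sketch but worth an <code>ORDER BY</code> in a real application.</p>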
<h4>Insert</h4>
<p><img alt="Setup database schemas with the data: 250 tags in small dataset" src="/phred/modules/setup_250.png" /><br />
When comparing the different schemas on the time of the insert &#8220;statements&#8221; for one bookmark, the result isn&#8217;t very surprising (notice that I&#8217;ve changed the scale of the y-axis).<br />
Mysqlicious with its one table is very fast indeed; its variation with fulltext has to create the fulltext index and therefore is a bit slower. Scuttle with its two tables and toxi with its three tables are at least two and three times as slow, respectively. I have to remark that I used quite a bit of caching for the toxi schema, as I didn&#8217;t want to wait hours to have the data ready..</p>
<p>I guess it doesn&#8217;t really make sense to base your decision on which schema to take on the time for an insert: bookmark inserts are about 100 times as fast as the intersection queries..</p>
<h4>«What? That slow??»</h4>
<p>You said it. You don&#8217;t want your intersection queries to take 0.2 seconds each. That would bring your system to its knees.<br />
<ins datetime="2005-07-08T08:33:32-02:00"><br />
There are some recipes to avoid that:</ins></p>
<h5><ins datetime="2005-07-08T08:33:32-02:00">Caching</ins></h5>
<p><ins datetime="2005-07-08T08:33:32-02:00">I think you can&#8217;t get around good old caching. I think that you could cache the results of a query like &#8220;mysql+tagging&#8221; for about an hour or so. If a user queries his own items, I would lower the cache time (as up-to-dateness is more important with his own items).<br />
Then, I expect that if you, for instance, cache the items per tag and intersect them with a decent algorithm, that could be faster.. </ins></p>
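<p>Both caching ideas can be sketched in a few lines of Python. This is an in-process illustration only: the class and function names are hypothetical, <code>ids_for_tag</code> stands for whatever database lookup you use, and a real system would more likely keep the cache in a shared store.</p>
<pre>
```python
import time


class TtlCache:
    """Tiny in-process cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self.entries = {}

    def get(self, key, compute):
        """Return the cached value for key, calling compute() when it is missing or stale."""
        now = self.clock()
        hit = self.entries.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]
        value = compute()
        self.entries[key] = (now + self.ttl, value)
        return value


def intersect_cached(tags, ids_for_tag, cache):
    """Intersect the cached id sets of all tags, starting with the smallest set."""
    sets = sorted(
        (cache.get(tag, lambda tag=tag: frozenset(ids_for_tag(tag))) for tag in tags),
        key=len)
    if not sets:
        return set()
    result = set(sets[0])
    for ids in sets[1:]:
        result.intersection_update(ids)
    return result
```
</pre>
<p>A cache for public queries could be created with <code>TtlCache(3600)</code> and one for a user&#8217;s own items with a much shorter lifetime. Starting with the smallest set keeps the intermediate result small, which is the cheap part of a &#8220;decent algorithm&#8221;.</p>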
<h5><ins datetime="2005-07-08T08:33:32-02:00">The Best Of Both Worlds</ins></h5>
<p><ins datetime="2005-07-08T08:33:32-02:00">I think you could have &#8220;mysqlicious fulltext&#8221; and &#8220;toxi&#8221; running at the same time. That means you have to update/insert in both schemas, but when you have to query, you could take the one you think is faster: for simple unions mysqlicious without fulltext search, for intersection queries with common tags toxi, and for those with uncommon tags the mysqlicious fulltext variant. </ins></p>
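<p>The routing decision could look roughly like this in Python. The function name, the 1% threshold (borrowed from the distribution estimate in the results) and the rule that a single uncommon tag already sends the query to the fulltext variant are all my assumptions, not measured results.</p>
<pre>
```python
def choose_schema(kind, tags, tag_share, common_threshold=0.01):
    """Pick the schema to query.

    kind is "union" or "intersection"; tag_share maps a tag to the
    fraction of all bookmarks that carry it (unknown tags count as 0).
    """
    if kind == "union":
        return "mysqlicious"
    if all(tag_share.get(tag, 0.0) >= common_threshold for tag in tags):
        return "toxi"
    return "mysqlicious fulltext"


# Made-up shares for illustration.
shares = {"design": 0.032, "css": 0.015, "aardvark": 0.0001}
common_choice = choose_schema("intersection", ["design", "css"], shares)
rare_choice = choose_schema("intersection", ["design", "aardvark"], shares)
```
</pre>
<p>The shares themselves would have to be kept up to date somewhere, for example by counting tags once a day, since the decision only needs a rough figure.</p>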
<h5><ins datetime="2005-07-08T08:33:32-02:00">Slicing and dicing</ins></h5>
<p><ins datetime="2005-07-08T08:33:32-02:00">You could &#8220;slice and dice&#8221; data (as Nitin proposed in <a href="http://tagschema.com/blogs/tagschema/2005/06/slicing-and-dicing-data-20-part-1.html">two</a> of his <a href="http://tagschema.com/blogs/tagschema/2005/06/slicing-and-dicing-data-20-part-2.html">posts</a>). That is: you slice your user/tag/item space and build fact tables. You &#8220;prebuild&#8221; your results, in a way. This way, inserts take long but the queries themselves should be much faster. In our examples, you would for instance first query the tag intersections on &#8220;toxi&#8221; and then get the facts about each bookmark out of the &#8220;mysqlicious&#8221; fact table. But you really should read Nitin&#8217;s posts, as they give a lot of insight.</ins><br />
<ins datetime="2006-05-01T10:01:24+00:00"></p>
<h5>Using a non RDBMS system</h5>
<p><strong>Update:</strong> It&#8217;s been about a year since I wrote that article, and during that year I came to the conclusion that <a href="http://en.wikipedia.org/wiki/RDBMS">RDBMS</a> systems don&#8217;t scale well in systems that have more than 1 million items. Yes, this is a warning: If you are planning to build a large scale system then look for alternatives to <a href="http://en.wikipedia.org/wiki/RDBMS">RDBMS</a> systems. To quote Joshua Schachter, founder of <a href="http://del.icio.us">delicious</a>:</p>
<blockquote><p>«tags doesn&#8217;t map to sql at all. so use partial indexing.»[<a href="http://www.redmonk.com/jgovernor/2006/02/08/things-weve-learned-josh-schachter-quotes-of-the-day/">Joshua Schachter at Carson Summit</a>]</p></blockquote>
<p>I didn&#8217;t try any of the non-RDBMS systems, but <a href="http://lucene.apache.org/java/docs/">Apache Lucene</a> and <a href="http://lucene.apache.org/hadoop/">Hadoop</a> look like candidates. There has been <a href="http://nelson.textdrive.com/pipermail/tagdb/2006-March/thread.html#164">a discussion on the Tagdb Mailing list</a> about these solutions.</p>
<p></ins></p>
<h4>«I don&#8217;t believe you! I want to try it at home»</h4>
<p><a href="http://www.pui.ch/phred/modules/tag_schemas_performance_test.tar.gz">Download the source code (PHP)</a> I used to run the queries and test yourself, extend them as you like. The source is published as <a href="http://en.wikipedia.org/wiki/LGPL">LGPL</a>.</p>
<h3><a name="setup"></a>Performance Tests Setup</h3>
<p>Now, if you have read this far, you probably want to know some background information: As you noticed, for each schema, I set up 4 databases, one database holding 1000 bookmarks, the next 10&#8242;000, then 100&#8242;000 and the fourth 1 million bookmarks. The inserted tags (as well as urls) are random English words taken from two sets of tags:</p>
<ul>
<li>the large set containing about 44&#8242;000 tags (that are simple English words)</li>
<li>the small set is varying in size (the results shown here are taken from 250 and 999 tag sets)</li>
</ul>
<p>Every bookmark gets one to ten tags attached. The tags are taken alternately from the two sets: every odd tag comes from the large set, every even tag from the small set. Every schema got exactly the same bookmarks and tag data.</p>
<p>Then every schema got queried with queries of alternately one to three tags. So the first query is for instance just &#8220;blog&#8221;, the second &#8220;design+css&#8221;, the third &#8220;webdesign+music+software&#8221;, the fourth again with just one tag, and so forth..<br />
All the tags for the queries are taken from the small set so that the queries don&#8217;t all end in empty results..<br />
All the queries are tested and work. The outcome of each query on the three schemas is exactly the same.</p>
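<p>The data and query generation described above can be sketched like this in Python. The real test code is the PHP source linked above; the function names, the tiny word lists and the fixed seed here are only illustrative assumptions.</p>
<pre>
```python
import random


def make_tags(rng, small_set, large_set):
    """Pick one to ten tags; odd positions (1st, 3rd, ...) use the large set."""
    count = rng.randint(1, 10)
    return [rng.choice(large_set if position % 2 == 1 else small_set)
            for position in range(1, count + 1)]


def make_queries(rng, small_set, how_many):
    """Queries cycle through one, two and three tags, all from the small set."""
    return [rng.sample(small_set, index % 3 + 1) for index in range(how_many)]


rng = random.Random(2005)  # fixed seed so every schema gets the same data
small = ["blog", "design", "css", "music", "software"]
large = ["aardvark", "abacus", "abbey", "zebra"]
bookmark_tags = [make_tags(rng, small, large) for _ in range(1000)]
queries = ["+".join(tags) for tags in make_queries(rng, small, 6)]
```
</pre>
<p>Note that this sketch may give a bookmark the same tag twice; with a large set of about 44&#8242;000 words that hardly matters, but a stricter generator would draw without replacement.</p>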
<h4>Mysql Setup</h4>
<p>I used mysql 4.0.21.<br />
An excerpt from <code>/etc/my.cnf</code> (I think these are the relevant settings to this performance test)</p>
<pre>key_buffer=300M
query_cache_size=30M
query_cache_limit=30M
table_cache = 64
ft_min_word_len = 2
ft_stopword_file = ''</pre>
<h4>System</h4>
<blockquote><p>CPU: 3GHz Dual Xeon<br />
Cache: 1MB<br />
Harddisk: SCSI Ultra 320 Atlas 10K, no RAID<br />
RAM: 3GB</p></blockquote>
<h4>Assumptions</h4>
<ul>
<li>Queries select just the id of a bookmark. I assume that you have to do a second query to get all the data you want to display. I know that this is not fair towards the mysqlicious schema.</li>
<li>I left out user data, as I assume user data columns wouldn&#8217;t change the outcome of these tests. I wanted to keep the schemas as simple as possible.</li>
<li>Each query is done with <code>LIMIT 50</code> as I assume that a normal application doesn&#8217;t want to get all bookmarks. I assume nobody wants to <code>order</code> bookmarks by any dimension, because this would be <strong>very</strong> expensive (ever wondered why you cannot sort bookmarks on <a href="http://del.icio.us">del.icio.us</a> by date or similar? You get it..)</li>
</ul>
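<p>The first and the third assumption together describe a two-step pattern: fetch at most 50 ids, then fetch the display data for exactly those ids. A minimal Python sketch with SQLite in memory instead of MySQL, using made-up table and column names:</p>
<pre>
```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE bookmark (id INTEGER PRIMARY KEY, url TEXT, title TEXT);
    CREATE TABLE tag (bookmark_id INTEGER, tag TEXT);
""")
db.executemany("INSERT INTO bookmark VALUES (?, ?, ?)",
               [(i, f"http://example.com/{i}", f"Bookmark {i}")
                for i in range(1, 101)])
db.executemany("INSERT INTO tag VALUES (?, ?)",
               [(i, "design") for i in range(1, 101)])


def find_ids(tag, limit=50):
    """Step one: the tag query returns nothing but bookmark ids."""
    rows = db.execute("SELECT bookmark_id FROM tag WHERE tag = ? LIMIT ?",
                      (tag, limit))
    return [row[0] for row in rows]


def fetch_details(ids):
    """Step two: load the display data for the ids found in step one."""
    if not ids:
        return []
    marks = ", ".join("?" for _ in ids)
    sql = f"SELECT id, url, title FROM bookmark WHERE id IN ({marks})"
    return db.execute(sql, ids).fetchall()
```
</pre>
<p>There is deliberately no <code>ORDER BY</code> in step one, in line with the assumption that sorting is too expensive; which 50 bookmarks come back is therefore up to the database.</p>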
<h3>Acknowledgements</h3>
<p>Thanks to <a href="http://www.citrin.ch">Citrin</a>, the company I work for, for letting me use our new server to run the queries. The server didn&#8217;t have much else to do, so the results should be accurate.<br />
The graphs are done using <a href="http://www.aditus.nu/jpgraph/">JpGraph</a>. Very easy to use and produces beautiful images.</p>
<h3>Further reading</h3>
<ul>
<li><a href="http://www.niallkennedy.com/blog/archives/2004/10/flickr_architec.html">Flickr architecture</a></li>
<li><a href="http://labnotes.blogsome.com/2005/06/06/lab-notes-5-fulltext-not-so-fast/">Lab notes: Fulltext not so fast</a>: Fulltext performance issues</li>
<li><a href="http://www.webmasterworld.com/forum23/3557.htm">WebmasterWorld forum: mysql fulltext performance issues</a></li>
<li><a href="http://vegan.net/tony/supersmack/">Mysql Supersmack: Mysql performance tool</a></li>
<li><a href="http://dev.mysql.com/doc/mysql/en/mysql-benchmarks.html">Mysql Benchmark</a></li>
<li><a href="http://jeremy.zawodny.com/mysql/mysql-optimization.html">Powerpoint article of jeremy zawodny</a> on Mysql optimisation</li>
<li><a href="http://www.petefreitag.com/item/389.cfm">Pete Freitag did a sort of review of this article</a></li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.pui.ch/phred/archives/2005/06/tagsystems-performance-tests.html/feed</wfw:commentRss>
		<slash:comments>51</slash:comments>
		</item>
	</channel>
</rss>

