<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Then each went to his own home &#187; MySQL</title>
	<atom:link href="http://www.pui.ch/phred/archives/category/mysql/feed" rel="self" type="application/rss+xml" />
	<link>http://www.pui.ch/phred</link>
	<description>Philipp Keller's weblog</description>
	<lastBuildDate>Tue, 17 Aug 2010 19:58:15 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.2</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Does del.icio.us scale?</title>
		<link>http://www.pui.ch/phred/archives/2005/08/does-delicious-scale.html</link>
		<comments>http://www.pui.ch/phred/archives/2005/08/does-delicious-scale.html#comments</comments>
		<pubDate>Wed, 31 Aug 2005 06:12:50 +0000</pubDate>
		<dc:creator>Philipp Keller</dc:creator>
				<category><![CDATA[Clustering]]></category>
		<category><![CDATA[Del.icio.us]]></category>
		<category><![CDATA[MySQL]]></category>
		<category><![CDATA[Performance]]></category>
		<category><![CDATA[Tags]]></category>

		<guid isPermaLink="false">http://www.pui.ch/phred/archives/2005/08/does-delicious-scale.html</guid>
		<description><![CDATA[Lately it has become very quiet around del.icio.us. There are some new features, but nothing groundbreaking. Either people have got used to it as a daily tool and see no need for new things, or they just don&#8217;t have faith in the future of del.icio.us.
I am a big fan of delicious. I&#8217;ve got [...]]]></description>
			<content:encoded><![CDATA[<p>Lately it has become very quiet around <a href="http://del.icio.us">del.icio.us</a>. There are <a href="http://blog.del.icio.us/blog/2005/08/we_rolling.html">some</a> <a href="http://blog.del.icio.us/blog/2005/08/search_me.html">new</a> <a href="http://blog.del.icio.us/blog/2005/08/people_who_like.html">features</a>, but nothing groundbreaking. Either people have got used to it as a daily tool and see no need for new things, or they just don&#8217;t have faith in the future of del.icio.us.</p>
<p>I am a big fan of delicious. I&#8217;ve got 1.5K bookmarks there, and I like its spirit and how open everything is. This article isn&#8217;t meant as criticism, but I think delicious is facing some problems.<br />
<span id="more-34"></span></p>
<h2>Performance scale</h2>
<p>You might have read my article about <a href="http://www.pui.ch/phred/archives/2005/06/tagsystems-performance-tests.html">Tag system performance</a>. To summarize my tests: MySQL is just not built for large tag systems. It scales up to about 1 million items, but delicious has far more posts than that.<br />
I am pretty sure delicious is still on the MySQL train. This strong belief comes from my performance tests: the MySQL schemas I tested show the same performance characteristics as delicious does.<br />
I fear delicious is facing a performance dead end: they <a href="http://blog.del.icio.us/blog/2005/06/moving_to_new_s.html">have put more servers in the mix</a> and they cache quite a bit, yet it is still slow. I strongly believe that delicious must become much faster to have a future. For me this is its number one downside. I dream of a bookmark service that holds billions of bookmark posts and still performs nicely. I think it is time for new tag systems to come up. On the <a href="http://lists.tagschema.com/mailman/listinfo/tagdb">tagdb mailing list</a> there are very good ideas about how large-scale tagging systems should work (e.g. systems powered by <a href="http://lucene.apache.org/">Lucene</a>).</p>
<h2>Popular link scale</h2>
<p>I think one of the coolest features of delicious is the <a href="http://del.icio.us/popular/">popular</a> page. When you read this page regularly you are up to date.. wait: you are up to date on CSS tips, Firefox and life hacks. You all know that if delicious went mainstream, that page wouldn&#8217;t be that interesting any more. It has already become a bit boring. As someone put it: </p>
<blockquote><p>I particularly cannot look at that CSS link lists anymore</p></blockquote>
<p>I think this page doesn&#8217;t scale. It is stuck. Moreover, it&#8217;s a pity that the coolest page on delicious is not about tags. At first glance you don&#8217;t even see which tags a popular link has.<br />
IMHO what is needed here are clusters. Bookmarks go into categories: &#8220;browsers&#8221;, &#8220;programming&#8221;, &#8220;design&#8221;, but also &#8220;health&#8221; and &#8220;politics&#8221;. When delicious goes mainstream there will most certainly be &#8220;sports&#8221; or &#8220;stars&#8221;.<br />
One should then be able to subscribe to certain clusters or, better, have this subscription derived automatically from the tags in a user&#8217;s bookmarks.</p>
<h2>Bottom line</h2>
<p>I think some fundamental things must be rearranged at delicious, otherwise one of two things will happen:</p>
<ul>
<li>a) a big competitor (Google? Yahoo? Microsoft?) will come up, or </li>
<li>b) people will spread out to different bookmark services that concentrate on certain clusters. Probably some meta-sites will arise that give an overview of all the different sites</li>
</ul>
<p>I think these problems will arise for every larger tag system. I hope that people will not turn up their noses at tagging systems because they think they don&#8217;t perform well enough.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.pui.ch/phred/archives/2005/08/does-delicious-scale.html/feed</wfw:commentRss>
		<slash:comments>13</slash:comments>
		</item>
		<item>
		<title>Tagsystems: performance tests</title>
		<link>http://www.pui.ch/phred/archives/2005/06/tagsystems-performance-tests.html</link>
		<comments>http://www.pui.ch/phred/archives/2005/06/tagsystems-performance-tests.html#comments</comments>
		<pubDate>Sun, 19 Jun 2005 14:09:53 +0000</pubDate>
		<dc:creator>Philipp Keller</dc:creator>
				<category><![CDATA[Del.icio.us]]></category>
		<category><![CDATA[MySQL]]></category>
		<category><![CDATA[Performance]]></category>
		<category><![CDATA[Tags]]></category>

		<guid isPermaLink="false">http://www.pui.ch/phred/archives/2005/06/tagsystems-performance-tests.html</guid>
		<description><![CDATA[In my previous article named "Tags: database schemas" we analysed different database schemas on how they could meet the needs of tag systems. In this article, the focus is on performance (speed).]]></description>
			<content:encoded><![CDATA[<p>In my <a title="Tags: database schemas" href="http://www.pui.ch/phred/archives/2005/04/tags-database-schemas.html">previous article named &#8220;Tags: database schemas&#8221;</a> we analysed how well different database schemas meet the needs of tag systems. In this article, the focus is on performance (speed). That is: if you want to build a tag system that performs well with about 1 million items (bookmarks, for instance), you may want to have a look at the following results of my performance tests.<br />
In this article I tested tagging of bookmarks, but as you can tag pretty much anything, the results apply to tagging systems in general.</p>
<p><span id="more-32"></span></p>
<p>I tested the following schemas (I keep the naming from the previous article):</p>
<ul>
<li><strong>mysqlicious</strong>: One table. Tags are space separated in column &#8220;tags&#8221;; <a href="http://www.pui.ch/phred/archives/2005/04/tags-database-schemas.html#mysqlicious">as introduced</a></li>
<li><strong>mysqlicious fulltext</strong>: Same schema but with <a href="http://dev.mysql.com/doc/mysql/en/fulltext-search.html">mysql fulltext</a> on the tag column; <a href="http://www.pui.ch/phred/archives/2005/05/tags-with-mysql-fulltext.html">as introduced</a></li>
<li><strong>scuttle</strong>: Two tables: One for bookmarks, one for tags. Tag-table has foreign key to bookmark table; <a href="http://www.pui.ch/phred/archives/2005/04/tags-database-schemas.html#scuttle">as introduced</a></li>
<li><strong>toxi</strong>: Three tables: One for bookmarks, one for tags, one for junction; <a href="http://www.pui.ch/phred/archives/2005/04/tags-database-schemas.html#toxi">as introduced</a></li>
</ul>
<p>For the details of the schemas, have a close look at the <a href="http://www.pui.ch/phred/modules/tag_database_schemas.sql">sql-create-table-queries</a>.</p>
<p>But let&#8217;s go directly to the results. The details of the test setup are described at the <a href="#setup">end of this article</a>. In each graph, the x-axis shows the number of bookmarks in the corresponding database and the y-axis shows how much time each query took to execute.</p>
<h3><a name="#results"></a>Results</h3>
<h4>Intersection: 250 tag set</h4>
<p><img title="Intersection test with 300 queries, up to three tags in query, 250 tags in small dataset" alt="Intersection test with 300 queries, up to three tags in query, 250 tags in small dataset" src="/phred/modules/intersection_250_3_i300.png" /></p>
<p>The first two tests are done with 250 tags in the small dataset (<a href="#setup">see below</a> for explanation). I think the &#8220;1 million bookmarks&#8221; database is the only size we should pay attention to: if you have a small number of bookmarks, performance isn&#8217;t really something to worry about.</p>
<p>We run intersection queries, like</p>
<blockquote><p>I want to search for bookmarks tagged with &#8220;design&#8221; and &#8220;html&#8221;</p></blockquote>
<p>You see that, not surprisingly, mysqlicious with its <code>WHERE tag LIKE "% tag %"</code> is very slow: MySQL has to go through the whole dataset and test each bookmark against the query.<br />
What actually <strong>is</strong> surprising to me is that the fulltext search of MySQL does not perform that well. In fact it is not faster than the <code>LIKE</code> query in the MySQLicious DB. This really disappointed me. I tried every trick I could think of to make it faster, as <a href="http://www.pui.ch/phred/archives/2005/05/tags-with-mysql-fulltext.html">I thought a tag database system with MySQL fulltext would be very easy to build and pretty much the only thing you should go for</a>.<br />
What also surprises me is that the queries on the three-table schema are about twice as fast as the ones on the two-table schema (<a href="http://www.pui.ch/phred/modules/schemas.inc.phps">take a look at the queries</a> if you think you could give me a hint on this). It is noticeable that in the scuttle and toxi variants, the more queries were run, the faster they got. I didn&#8217;t do any tests with queries and inserts mixed, so this may just come from plain good caching, and the effect possibly doesn&#8217;t show up on live bookmark management systems.</p>
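<p>For readers who want to see what an intersection query on a three-table layout can look like, here is a minimal sketch. It is not the query used in the tests (those are in the linked source): SQLite stands in for MySQL, and the table names, column names and both helper functions are made up for illustration. The <code>ORDER BY</code> is only there to make the output predictable.</p>

```python
import sqlite3

# Hypothetical "toxi"-style layout: bookmarks, tags and a junction table.
# Assumption: SQLite replaces MySQL; all names here are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE bookmark (id INTEGER PRIMARY KEY, url TEXT);
    CREATE TABLE tag (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    CREATE TABLE tagmap (
        bookmark_id INTEGER REFERENCES bookmark(id),
        tag_id INTEGER REFERENCES tag(id),
        PRIMARY KEY (bookmark_id, tag_id)
    );
""")


def add_bookmark(url, tags):
    """Insert one bookmark and its junction rows, creating tags on demand."""
    bookmark_id = conn.execute(
        "INSERT INTO bookmark (url) VALUES (?)", (url,)
    ).lastrowid
    for name in tags:
        conn.execute("INSERT OR IGNORE INTO tag (name) VALUES (?)", (name,))
        tag_id = conn.execute(
            "SELECT id FROM tag WHERE name = ?", (name,)
        ).fetchone()[0]
        conn.execute(
            "INSERT OR IGNORE INTO tagmap VALUES (?, ?)", (bookmark_id, tag_id)
        )
    return bookmark_id


def intersect(tags, limit=50):
    """Return ids of bookmarks that carry every tag in `tags`."""
    wanted = sorted(set(tags))
    if not wanted:
        return []
    placeholders = ", ".join("?" for _ in wanted)
    sql = f"""
        SELECT m.bookmark_id
        FROM tagmap AS m JOIN tag AS t ON t.id = m.tag_id
        WHERE t.name IN ({placeholders})
        GROUP BY m.bookmark_id
        HAVING COUNT(*) = ?
        ORDER BY m.bookmark_id  -- only for predictable output
        LIMIT ?
    """
    rows = conn.execute(sql, (*wanted, len(wanted), limit))
    return [row[0] for row in rows]
```

<p>After adding a few bookmarks with <code>add_bookmark</code>, a call such as <code>intersect(["design", "html"])</code> returns only the bookmarks that carry both tags.</p>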
<h4>Intersection: 999 tag set</h4>
<p><img title="Intersection test with 300 queries, up to three tags in query, 999 tags in small dataset" alt="Intersection test with 300 queries, up to three tags in query, 999 tags in small dataset" src="/phred/modules/intersection_999_3_300.png" /><br />
Now have a look at what happens if we broaden our small tag set: MySQLicious with fulltext suddenly becomes the performance leader. That means that if you have a bookmark management system with diverse tags (which most probably comes from having many users), the fulltext solution is possibly the way to go.<br />
So, as you see, choosing the right schema is all about tag distribution. In my previous post about guessing the overall tag distribution on <a href="http://del.icio.us">del.icio.us</a>, I came to the conclusion that delicious&#8217; most popular tag, &#8220;design&#8221;, shows up in 3.2% of all bookmarks on <a href="http://del.icio.us">del.icio.us</a>. So what is the mean tag distribution?</p>
<ul>
<li>If we say 1% (a tag shows up in 1/100 of all bookmarks on average), that makes our small tag set 250 tags big</li>
<li>If we say 0.25%, the small tag set grows to a size of 1000</li>
<li>If we say 0.1%, the small set will contain 2500 tags</li>
</ul>
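<p>One way to reproduce the three set sizes in this list from the test setup described at the end of this article: every bookmark gets one to ten tags, taken alternately from the large and the small set. The sketch below assumes that each tag count from one to ten is equally likely (the setup only states the range), which gives 2.5 small-set tags per bookmark on average. Both function names are made up for illustration.</p>

```python
from fractions import Fraction


def mean_small_tags_per_bookmark(max_tags=10):
    """Average number of small-set tags per bookmark.

    Assumption: tag counts 1..max_tags are equally likely, and tags alternate
    large, small, large, ... so a bookmark with n tags has n // 2 small-set tags.
    """
    total = sum(n // 2 for n in range(1, max_tags + 1))
    return Fraction(total, max_tags)


def small_set_size(mean_frequency, max_tags=10):
    """Small-set size at which an average small-set tag appears in
    `mean_frequency` of all bookmarks (pass a Fraction or a string like "1/100")."""
    return mean_small_tags_per_bookmark(max_tags) / Fraction(mean_frequency)
```

<p>Under that assumption, <code>small_set_size("1/100")</code> gives 250, <code>small_set_size("1/400")</code> (0.25%) gives 1000 and <code>small_set_size("1/1000")</code> gives 2500.</p>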
<p>So I&#8217;d suggest that if your average distribution is about 1%, take &#8220;toxi&#8221;; if the distribution is broader, take &#8220;MySQLicious fulltext&#8221;.<br />
If you take a closer look, you can see that the fulltext schema stayed as fast as it was with the 250 tag set. That means that if you want to be sure your tag system responds OK in every situation, you should go with the &#8220;mysql fulltext&#8221; schema.<br />
<ins datetime="2005-06-26T14:52:55-02:00"><a href="http://hannes.magiccards.info/get/results.html">Hannes has done some further investigation on mysql fulltext running on MySQL 4.1</a> (my tests were on MySQL 4.0.21)</ins></p>
<h4>Union</h4>
<p><img alt="Union test with 250 tags in small dataset" src="/phred/modules/union_full_250_3.png" /><br />
When doing a union query we say</p>
<blockquote><p>I want to search for all the bookmarks that are tagged either with &#8220;delicious&#8221; or &#8220;del.icio.us&#8221;</p></blockquote>
<p>These queries, as you may have guessed, are handled fastest by the &#8220;MySQLicious&#8221; schema with its <code>LIKE</code> queries: MySQL scans through the bookmarks, collects all bookmarks with one of the given tags, and says &#8220;I&#8217;m finished!&#8221; at, say, bookmark #968, because by then it has found 50 bookmarks. In the other schemas, by contrast, MySQL first has to join the tags with the bookmarks and can only then search through the result.</p>
<h4>Insert</h4>
<p><img alt="Setup database schemas with the data: 250 tags in small dataset" src="/phred/modules/setup_250.png" /><br />
When comparing the schemas by the time the insert statements for one bookmark take, the result isn&#8217;t very surprising (notice that I&#8217;ve changed the scale of the y-axis).<br />
Mysqlicious with its one table is very fast indeed; its fulltext variant has to update the fulltext index and is therefore a bit slower. Scuttle with its two tables and toxi with its three tables are at least two and three times as slow, respectively. I have to remark that I used quite a bit of caching for the toxi schema, as I didn&#8217;t want to wait hours to have the data ready.</p>
<p>I guess it doesn&#8217;t really make sense to base your choice of schema on insert time: bookmark inserts are about 100 times as fast as the intersection queries.</p>
<h4>«What? That slow??»</h4>
<p>You said it. You don&#8217;t want your intersection queries to take 0.2 seconds each. That would bring your system to its knees.<br />
<ins datetime="2005-07-08T08:33:32-02:00"><br />
There are some recipes to avoid that:</ins></p>
<h5><ins datetime="2005-07-08T08:33:32-02:00">Caching</ins></h5>
<p><ins datetime="2005-07-08T08:33:32-02:00">I think you can&#8217;t get around good old caching. You could cache the results of a query like &#8220;mysql+tagging&#8221; for about an hour or so. If a user queries his own items, I would lower the cache time (as being up to date matters more for his own items).<br />
Also, I expect that if you, for instance, cache the items per tag and intersect them with a decent algorithm, that could be faster. </ins></p>
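<p>A minimal sketch of the per-tag caching idea, under a few assumptions: the cache lives in the application process, entries expire after one hour, and <code>load_ids</code> is a hypothetical callable that runs the real single-tag query and returns bookmark ids. The class and its method names are made up for illustration; intersecting the smallest set first is just one reasonable choice of algorithm.</p>

```python
import time


class TagCache:
    """Tiny in-process cache of bookmark-id sets, keyed by tag.

    `load_ids` is a hypothetical callable that runs the real single-tag
    query against the database and returns an iterable of bookmark ids.
    """

    def __init__(self, load_ids, ttl_seconds=3600, clock=time.monotonic):
        self._load_ids = load_ids
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # tag -> (expiry time, frozenset of ids)

    def ids_for(self, tag):
        """Return the id set for one tag, reloading it once it has expired."""
        now = self._clock()
        entry = self._entries.get(tag)
        if entry is None or now >= entry[0]:
            entry = (now + self._ttl, frozenset(self._load_ids(tag)))
            self._entries[tag] = entry
        return entry[1]

    def intersect(self, tags, limit=50):
        """Intersect the per-tag sets, smallest first; return up to `limit` ids."""
        sets = sorted((self.ids_for(tag) for tag in tags), key=len)
        if not sets:
            return []
        result = set(sets[0])
        for ids in sets[1:]:
            result.intersection_update(ids)
            if not result:
                break  # nothing can match any more
        return sorted(result)[:limit]
```

<p>A shorter <code>ttl_seconds</code> could be used for a second cache instance that serves a user&#8217;s own items.</p>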
<h5><ins datetime="2005-07-08T08:33:32-02:00">The Best Of Both Worlds</ins></h5>
<p><ins datetime="2005-07-08T08:33:32-02:00">I think you could have &#8220;mysqlicious fulltext&#8221; and &#8220;toxi&#8221; running at the same time. That means you have to update/insert in both schemas, but when you query, you can pick the one you expect to be faster: for simple unions mysqlicious without fulltext search, for intersection queries with common tags toxi, and for those with uncommon tags the mysqlicious fulltext variant. </ins></p>
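<p>The routing rule could be sketched as follows. The function name, the way tag frequencies are passed in and the 1% cut-off between &#8220;common&#8221; and &#8220;uncommon&#8221; tags are all assumptions for illustration; a real threshold would have to come from your own measurements.</p>

```python
def pick_schema(operation, tag_frequencies, common_threshold=0.01):
    """Choose which schema should answer a query.

    operation        -- "union" or "intersection"
    tag_frequencies  -- share of bookmarks carrying each queried tag (0..1)
    common_threshold -- hypothetical cut-off between common and uncommon tags
    """
    if operation == "union":
        return "mysqlicious"  # plain LIKE queries
    if operation != "intersection":
        raise ValueError(f"unknown operation: {operation!r}")
    if all(freq >= common_threshold for freq in tag_frequencies):
        return "toxi"  # every queried tag is common
    return "mysqlicious fulltext"  # at least one uncommon tag
```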
<h5><ins datetime="2005-07-08T08:33:32-02:00">Slicing and dicing</ins></h5>
<p><ins datetime="2005-07-08T08:33:32-02:00">You could &#8220;slice and dice&#8221; data (as Nitin proposed in <a href="http://tagschema.com/blogs/tagschema/2005/06/slicing-and-dicing-data-20-part-1.html">two</a> of his <a href="http://tagschema.com/blogs/tagschema/2005/06/slicing-and-dicing-data-20-part-2.html">posts</a>). That is, you slice your user/tag/item space and build fact tables. In a way, you &#8220;prebuild&#8221; your results. This way, inserts take long but the queries themselves should be much faster. In our examples, you would for instance first query the tag intersections on &#8220;toxi&#8221; and then get the facts about each bookmark out of the &#8220;mysqlicious&#8221; fact table. But you really should read Nitin&#8217;s posts, as they give a lot of insight.</ins><br />
<ins datetime="2006-05-01T10:01:24+00:00"></p>
<h5>Using a non RDBMS system</h5>
<p><strong>Update:</strong> It&#8217;s been about a year since I wrote this article, and during that year I came to the conclusion that <a href="http://en.wikipedia.org/wiki/RDBMS">RDBMS</a> systems don&#8217;t scale well in systems that have more than 1 million items. Yes, this is a warning: if you are planning to build a large-scale system, then look for alternatives to <a href="http://en.wikipedia.org/wiki/RDBMS">RDBMS</a> systems. To quote Joshua Schachter, founder of <a href="http://del.icio.us">delicious</a>:</p>
<blockquote><p>«tags doesn&#8217;t map to sql at all. so use partial indexing.»[<a href="http://www.redmonk.com/jgovernor/2006/02/08/things-weve-learned-josh-schachter-quotes-of-the-day/">Joshua Schachter at Carson Summit</a>]</p></blockquote>
<p>I didn&#8217;t try any of the non-RDBMS systems, but <a href="http://lucene.apache.org/java/docs/">Apache Lucene</a> and <a href="http://lucene.apache.org/hadoop/">Hadoop</a> look like candidates. There has been <a href="http://nelson.textdrive.com/pipermail/tagdb/2006-March/thread.html#164">a discussion on the Tagdb Mailing list</a> about these solutions.</p>
<p></ins></p>
<h4>«I don&#8217;t believe you! I want to try it at home»</h4>
<p><a href="http://www.pui.ch/phred/modules/tag_schemas_performance_test.tar.gz">Download the source code (PHP)</a> I used to run the queries, test it yourself and extend it as you like. The source is published under the <a href="http://en.wikipedia.org/wiki/LGPL">LGPL</a>.</p>
<h3><a name="setup"></a>Performance Tests Setup</h3>
<p>Now, if you have read this far, you probably want some background information. As you noticed, I set up four databases for each schema: one holding 1,000 bookmarks, the next 10,000, then 100,000 and the fourth 1 million bookmarks. The inserted tags (as well as the URLs) are random English words taken from two sets of tags:</p>
<ul>
<li>the large set, containing about 44,000 tags (simple English words)</li>
<li>the small set, which varies in size (the results shown here are taken from the 250 and 999 tag sets)</li>
</ul>
<p>Every bookmark gets one to ten tags attached. The tags are taken alternately from the two sets: every odd tag comes from the large set, every even tag from the small set. Every schema got exactly the same bookmarks and tag data.</p>
<p>Then every schema got queried with queries of alternately one, two and three tags. So the first query is for instance just &#8220;blog&#8221;, the second &#8220;design+css&#8221;, the third &#8220;webdesign+music+software&#8221;, the fourth again has just one tag, and so forth.<br />
All the tags for the queries are taken from the small set so that the queries don&#8217;t all end in empty results.<br />
All the queries are tested and work. The outcome of each query on the three schemas is exactly the same.</p>
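<p>The data generation described above could be sketched like this. It is not the PHP code from the download: the function names are made up, and the sketch assumes that each tag count from one to ten is equally likely and that a tag may be drawn more than once for the same bookmark.</p>

```python
import random


def make_bookmark_tags(rng, small_set, large_set, max_tags=10):
    """Draw the tags for one bookmark.

    Tags alternate between the two sets: the 1st, 3rd, 5th ... tag comes
    from the large set, the 2nd, 4th, 6th ... from the small set.
    """
    count = rng.randint(1, max_tags)  # assumption: every count is equally likely
    return [
        rng.choice(large_set if position % 2 == 1 else small_set)
        for position in range(1, count + 1)
    ]


def make_queries(rng, small_set, how_many):
    """Build queries that cycle through 1, 2, 3, 1, 2, 3 ... tags.

    Query tags come from the small set only, so results are rarely empty.
    The small set needs at least three entries.
    """
    return [
        rng.sample(small_set, (index % 3) + 1)
        for index in range(how_many)
    ]
```

<p>Passing a seeded <code>random.Random</code> instance as <code>rng</code> makes it possible to feed exactly the same bookmarks and queries to every schema.</p>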
<h4>Mysql Setup</h4>
<p>I used mysql 4.0.21.<br />
An excerpt from <code>/etc/my.cnf</code> (I think these are the relevant settings to this performance test)</p>
<pre>key_buffer=300M
query_cache_size=30M
query_cache_limit=30M
table_cache = 64
ft_min_word_len = 2
ft_stopword_file = ''</pre>
<h4>System</h4>
<blockquote><p>CPU: 3GHz Dual Xeon<br />
Cache: 1MB<br />
Harddisk: SCSI Ultra 320 Atlas 10K, no RAID<br />
RAM: 3GB</p></blockquote>
<h4>Assumptions</h4>
<ul>
<li>Queries select just the id of a bookmark. I assume that you have to do a second query to get all the data you want to display. I know that this is not fair towards the mysqlicious schema.</li>
<li>I left out user data, as I assume user data columns wouldn&#8217;t change the outcome of these tests. I wanted to keep the schemas as simple as possible.</li>
<li>Each query is done with <code>LIMIT 50</code>, as I assume that a normal application doesn&#8217;t want to get all bookmarks. I assume nobody wants to <code>order</code> bookmarks by any dimension, because this would be <strong>very</strong> expensive (ever wondered why you cannot sort bookmarks on <a href="http://del.icio.us">del.icio.us</a> by date or similar? Now you know.)</li>
</ul>
<h3>Acknowledgements</h3>
<p>Thanks to <a href="http://www.citrin.ch">Citrin</a>, the company I work for, for letting me use our new server to run the queries. The server didn&#8217;t have much else to do, so the results should be accurate.<br />
The graphs are done using <a href="http://www.aditus.nu/jpgraph/">JpGraph</a>, which is very easy to use and produces beautiful images.</p>
<h3>Further reading</h3>
<ul>
<li><a href="http://www.niallkennedy.com/blog/archives/2004/10/flickr_architec.html">Flickr architecture</a></li>
<li><a href="http://labnotes.blogsome.com/2005/06/06/lab-notes-5-fulltext-not-so-fast/">Lab notes: Fulltext not so fast</a>: Fulltext performance issues</li>
<li><a href="http://www.webmasterworld.com/forum23/3557.htm">WebmasterWorld forum: mysql fulltext performance issues</a></li>
<li><a href="http://vegan.net/tony/supersmack/">Mysql Supersmack: Mysql performance tool</a></li>
<li><a href="http://dev.mysql.com/doc/mysql/en/mysql-benchmarks.html">Mysql Benchmark</a></li>
<li><a href="http://jeremy.zawodny.com/mysql/mysql-optimization.html">Powerpoint article by Jeremy Zawodny</a> on MySQL optimisation</li>
<li><a href="http://www.petefreitag.com/item/389.cfm">Pete Freitag did a sort of review of this article</a></li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.pui.ch/phred/archives/2005/06/tagsystems-performance-tests.html/feed</wfw:commentRss>
		<slash:comments>49</slash:comments>
		</item>
		<item>
		<title>Tags with MySQL fulltext</title>
		<link>http://www.pui.ch/phred/archives/2005/05/tags-with-mysql-fulltext.html</link>
		<comments>http://www.pui.ch/phred/archives/2005/05/tags-with-mysql-fulltext.html#comments</comments>
		<pubDate>Thu, 05 May 2005 16:09:53 +0000</pubDate>
		<dc:creator>Philipp Keller</dc:creator>
				<category><![CDATA[MySQL]]></category>
		<category><![CDATA[Tags]]></category>

		<guid isPermaLink="false">http://www.pui.ch/phred/archives/2005/05/tags-with-mysql-fulltext.html</guid>
		<description><![CDATA[While setting up the promised performance test in my last post, I did some tests with the MySQL fulltext features and it seems that they are built for tagging systems. Take a look at the queries (if it is not clear for you what is done here, please read my previous post).

I took the MySQLicious [...]]]></description>
			<content:encoded><![CDATA[<p>While setting up the promised performance test from my <a href="http://www.pui.ch/phred/archives/2005/04/tags-database-schemas.html">last post</a>, I did some tests with the <a href="http://dev.mysql.com/doc/mysql/en/fulltext-search.html">MySQL fulltext features</a>, and it seems as if they were built for tagging systems. Take a look at the queries (if it is not clear to you what is done here, please read <a href="http://www.pui.ch/phred/archives/2005/04/tags-database-schemas.html">my previous post</a>).<br />
<span id="more-30"></span><br />
I took the <a href="http://nanovivid.com/projects/mysqlicious/">MySQLicious</a> schema and added <code>ALTER TABLE `delicious` ADD FULLTEXT (`tags`)</code>.<br />
The full schema:</p>
<blockquote><p><code>CREATE TABLE `delicious` (<br />
  `id` int(11) NOT NULL auto_increment,<br />
  `url` text,<br />
  `description` text,<br />
  `extended` text,<br />
  `tags` text,<br />
  `date` datetime default NULL,<br />
  `hash` varchar(255) default NULL,<br />
  PRIMARY KEY  (`id`),<br />
  KEY `date` (`date`),<br />
  FULLTEXT KEY `tags` (`tags`)<br />
) ENGINE=MyISAM</code></p></blockquote>
<h2>Queries</h2>
<p>&nbsp;<br />
<h3>Intersection</h3>
<p>Intersections can be done using <a href="http://dev.mysql.com/doc/mysql/en/fulltext-boolean.html">boolean fulltext search</a> (since MySQL 4.0.1):<br />
Query for semweb+search:<br />
<code>SELECT * FROM delicious WHERE MATCH (tags) AGAINST ('+semweb +search' IN BOOLEAN MODE)</code><br />
Now this was easy. And, you guessed it, Minus is very similar:</p>
<h3>Minus</h3>
<p>Query for search+webservice-semweb:<br />
<code>SELECT * FROM delicious WHERE MATCH (tags) AGAINST ('+search +webservice -semweb' IN BOOLEAN MODE)</code></p>
<h3>Brackets</h3>
<p>Even brackets are possible:<br />
Query for (del.icio.us|delicious)+(webservice|project):<br />
<code>SELECT * FROM delicious WHERE MATCH (tags) AGAINST ('+(del.icio.us delicious) +(webservice project)' IN BOOLEAN MODE)</code></p>
<h3>Union</h3>
<p><img src='/phred/modules/union_result.png' alt='union DB result' /><br />
For union you could use the already mentioned boolean mode, but if you want to have the results ordered so that the bookmark with the most &#8220;hits&#8221; is the first entry of the result try this sort of query:<br />
<code>SELECT * FROM delicious WHERE MATCH (tags) AGAINST ('delicious clone project webservice')</code><br />
If you take a look at the screenshot of the first 7 results of the query run on my DB, you can see that the first hit has got all four tags we searched for, the second has got two and the rest has got just one of them. Like this you can do a &#8220;find similar entries&#8221; very easily.</p>
<h2>Downsides and problems</h2>
<p>There are two points where difficulties can occur: when MySQL builds its index out of the tags, and when searching for specific tags. I stumbled on three problems:</p>
<h3>Stopcharacters</h3>
<p>If you insert tags with characters like &#8220;-&#8221; (as in &#8220;my-comment&#8221;), then MySQL will make two index entries: one for &#8220;my&#8221; and one for &#8220;comment&#8221;. Conversely, if you search for &#8220;my-comment&#8221; you&#8217;ll find bookmarks with the tag &#8220;my&#8221; and those with the tag &#8220;comment&#8221;. It seems that this problem can be eliminated by <a href="http://dev.mysql.com/doc/mysql/en/fulltext-search.html">setting the character set of the column &#8220;tags&#8221; to <code>latin1_bin</code></a>, but this feature is not available before MySQL 4.1.<br />
But nonetheless this shouldn&#8217;t be a showstopper. You could replace &#8220;-&#8221; with a string, say &#8220;_minus_&#8221;. This is ugly but should do the job.</p>
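<p>The replacement workaround could look like the following sketch in application code. The two function names are made up, and the sketch assumes that no real tag ever contains the placeholder string itself; it rejects such tags rather than guessing.</p>

```python
# Placeholder suggested in the text; assumed never to occur in a real tag.
PLACEHOLDER = "_minus_"


def encode_tag(tag):
    """Replace "-" before the tag is stored or searched."""
    if PLACEHOLDER in tag:
        raise ValueError(f"tag {tag!r} already contains {PLACEHOLDER!r}")
    return tag.replace("-", PLACEHOLDER)


def decode_tag(stored):
    """Turn a stored tag back into what the user typed."""
    return stored.replace(PLACEHOLDER, "-")
```

<p>With this, &#8220;my-comment&#8221; is stored and searched as &#8220;my_minus_comment&#8221; and shown to the user as &#8220;my-comment&#8221; again.</p>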
<h3>Stopwords</h3>
<p>When searching for or indexing tags like &#8220;against&#8221; or &#8220;brief&#8221; (<a href="http://www.databasejournal.com/features/mysql/article.php/1578331">full list of stopwords</a>), these tags are ignored.<br />
Since MySQL 4.0.10 you can <a href="http://dev.mysql.com/doc/mysql/en/fulltext-fine-tuning.html">customize your stopwordlist</a>.</p>
<h3>Minimum length of a tag</h3>
<p>By default, the minimum length of a word indexed by MySQL fulltext is 4 characters. You should therefore <a href="http://dev.mysql.com/doc/mysql/en/fulltext-fine-tuning.html">edit <code>my.cnf</code></a> in order to set the minimum tag length to 1.</p>
<h2>Performance</h2>
<p>This solution scales ok. I did tests with tables from 1000 to 1 million bookmarks.<br />
The time for inserting a bookmark is the same for small and for big tables. An intersection query took 0.001 seconds (finding 0.7 URLs on average) in the 1000-bookmark table and 0.1 seconds in the 1 million-bookmark table (finding 70 bookmarks on average). There are some <a href="http://dev.mysql.com/doc/mysql/en/fulltext-search.html">discussions about whether MySQL&#8217;s fulltext search is fast or not (have a look at the user comments)</a>. Quick performance tests showed that it is about 10 times as fast as the LIKE queries mentioned in <a href="http://www.pui.ch/phred/archives/2005/04/tags-database-schemas.html">my previous post</a>. But I guess it is not fast enough for web services like <a href="http://del.icio.us">del.icio.us</a>: I guess these services have to run more than 10 queries a second, and then this solution is too slow.<br />
But anyway: I will do an article on the performance tests with more accurate data.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.pui.ch/phred/archives/2005/05/tags-with-mysql-fulltext.html/feed</wfw:commentRss>
		<slash:comments>22</slash:comments>
		</item>
		<item>
		<title>Tags: Database schemas</title>
		<link>http://www.pui.ch/phred/archives/2005/04/tags-database-schemas.html</link>
		<comments>http://www.pui.ch/phred/archives/2005/04/tags-database-schemas.html#comments</comments>
		<pubDate>Sun, 24 Apr 2005 13:35:44 +0000</pubDate>
		<dc:creator>Philipp Keller</dc:creator>
				<category><![CDATA[Del.icio.us]]></category>
		<category><![CDATA[MySQL]]></category>
		<category><![CDATA[Tags]]></category>

		<guid isPermaLink="false">http://www.pui.ch/phred/archives/2005/04/tags-database-schemas.html</guid>
		<description><![CDATA[Recently, on del.icio.us mailinglist, I asked the question &#8220;Does anyone know the database schema of del.icio.us?&#8221; .
I got a few private responses so I wanted to share the knowledge with the world.
The Problem: You want to have a database schema where you can tag a bookmark (or a blog post or whatever) with as many [...]]]></description>
			<content:encoded><![CDATA[<p>Recently, <a href="http://lists.del.icio.us/pipermail/discuss/2005-April/002827.html">on the del.icio.us mailing list</a>, I asked the question &#8220;Does anyone know the database schema of del.icio.us?&#8221;.<br />
I got a few private responses so I wanted to share the knowledge with the world.</p>
<p>The Problem: You want to have a database schema where you can tag a bookmark (or a blog post or whatever) with as many <a href="http://en.wikipedia.org/wiki/Tags">tags</a> as you want. Later, you want to run queries to constrain the bookmarks to a <a href="http://en.wikipedia.org/wiki/Union_%28set_theory%29">union</a> or <a href="http://en.wikipedia.org/wiki/Intersection_%28set_theory%29">intersection</a> of tags. You also want to exclude (say: minus) some tags from the search result.<br />
<span id="more-29"></span><br />
Apparently there are three different solutions (<strong>Attention:</strong> if you are building a website that allows users to tag, be sure to have a look at <a href="http://www.pui.ch/phred/archives/2005/06/tagsystems-performance-tests.html">my performance tests</a>, as performance seems to be a problem on larger-scale sites).</p>
<h2><a name="mysqlicious">&#8220;MySQLicious&#8221; solution</a></h2>
<p><img src='/phred/modules/mysqlicious_data.png' alt='mysqlicious sample data' /><img src='/phred/modules/mysqlicious_structure.png' alt='mysqlicious database structure' /><br />
In this solution, the schema has got just one table, it is <a href="http://en.wikipedia.org/wiki/Denormalization">denormalized</a>.<br />
This type is called &#8220;MySQLicious solution&#8221; because <a href="http://nanovivid.com/projects/mysqlicious/">MySQLicious</a> imports del.icio.us data into a table with this structure.</p>
<h3>Intersection (AND)</h3>
<p>Query for &#8220;search+webservice+semweb&#8221;:<br />
<code>SELECT *<br />
FROM `delicious`<br />
WHERE tags LIKE "%search%"<br />
AND tags LIKE "%webservice%"<br />
AND tags LIKE "%semweb%"</code></p>
<h3>Union (OR)</h3>
<p>Query for &#8220;search|webservice|semweb&#8221;:</p>
<p><code>SELECT *<br />
FROM `delicious`<br />
WHERE tags LIKE "%search%"<br />
OR tags LIKE "%webservice%"<br />
OR tags LIKE "%semweb%"</code></p>
<h3>Minus</h3>
<p>Query for &#8220;search+webservice-semweb&#8221;:<br />
<code>SELECT *<br />
FROM `delicious`<br />
WHERE tags LIKE "%search%"<br />
AND tags LIKE "%webservice%"<br />
AND tags NOT LIKE "%semweb%"</code></p>
<h3>Conclusion</h3>
<p>The advantages of this solution:</p>
<ul>
<li>just one table</li>
<li>the queries are very straightforward</li>
<li>one can also get the results via full-text search. That might be a little faster.</li>
<li>I guess the queries are <del datetime="2005-04-25T12:16:58-02:00">pretty fast (also referring to <a href="http://www.petercooper.co.uk/archives/000648.html">a blog entry of Peter Cooper</a>: section &#8220;Denormalize! Denormalize! Denormalize!&#8221;)</del> <ins datetime="2005-04-25T12:16:58-02:00">quite slow according to <a href="http://www.pui.ch/phred/archives/2005/04/tags-database-schemas.html#comment-57">good</a> <a href="http://www.pui.ch/phred/archives/2005/04/tags-database-schemas.html#comment-62">arguments</a>. Full-text search would speed things up a bit. </ins><ins datetime="2005-06-20T12:43:04-02:00">I <a href="http://www.pui.ch/phred/archives/2005/06/tagsystems-performance-tests.html">did some performance tests</a> to check that.</ins></li>
<li>
<ins datetime="2005-05-06T06:02:15-02:00"><a href="http://www.pui.ch/phred/archives/2005/05/tags-with-mysql-fulltext.html">In my follow-up post I dealt with MySQL full-text search for tagging</a>.</ins></li>
</ul>
<p>Disadvantages:</p>
<ul>
<li>You have a limit on the number of tags per bookmark. Normally you use a <code>VARCHAR</code> field of up to 255 characters in your DB. Otherwise, if you took a <code>text</code> field or similar, the query times would slow down, I suppose.</li>
<li><ins datetime="2005-04-25T12:24:14-02:00">If you paid attention (<a href="http://www.pui.ch/phred/archives/2005/04/tags-database-schemas.html#comment-63">as Patrice did</a>) you noticed that <code>LIKE "%search%"</code> will also find bookmarks tagged &#8220;websearch&#8221;. If you alter the query to <code>LIKE "% search %"</code> you end up with a messy solution: you have to add a space to the beginning and the end of the tags value to make this work.</ins></li>
</ul>
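<p>The substring problem from the last point can be reproduced with a minimal sketch. This is not MySQLicious code: it uses Python's <code>sqlite3</code> module as a stand-in for MySQL, and the table layout, the sample rows and the helper <code>ids_matching</code> are assumptions made up for illustration. In MySQL you would pad the column with <code>CONCAT(' ', tags, ' ')</code> instead of the <code>||</code> operator used here.</p>

```python
import sqlite3

# Hypothetical one-table layout in the spirit of the "MySQLicious" schema.
# SQLite stands in for MySQL; the LIKE patterns behave the same way here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE delicious (id INTEGER PRIMARY KEY, url TEXT, tags TEXT)")
conn.executemany(
    "INSERT INTO delicious (id, url, tags) VALUES (?, ?, ?)",
    [
        (1, "http://example.com/a", "search webservice"),
        (2, "http://example.com/b", "websearch semweb"),
    ],
)


def ids_matching(where_clause, params):
    """Return the sorted bookmark ids selected by the given WHERE clause."""
    sql = "SELECT id FROM delicious WHERE " + where_clause + " ORDER BY id"
    return [row[0] for row in conn.execute(sql, params)]


# Naive substring match: bookmark 2 is a false positive ("websearch").
naive_ids = ids_matching("tags LIKE ?", ("%search%",))

# Pad the stored value with a space on both sides so that every tag,
# including the first and the last one, is surrounded by delimiters.
padded_ids = ids_matching("' ' || tags || ' ' LIKE ?", ("% search %",))

print(naive_ids)   # [1, 2]
print(padded_ids)  # [1]
```

<p>Padding at query time keeps the stored data clean, but it still forces a scan of every row, so it does nothing for the speed concerns mentioned above.</p>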
<h2><a name="scuttle">&#8220;Scuttle&#8221; solution</a></h2>
<p>Scuttle organizes its data in two tables. The table &#8220;scCategories&#8221; is the &#8220;tag&#8221; table and has a foreign key to the &#8220;bookmark&#8221; table. <img src='/phred/modules/scuttle_structure.png' alt='database structure of scuttle' /></p>
<h3>Intersection (AND)</h3>
<p>Query for &#8220;bookmark+webservice+semweb&#8221;:<br />
<code>SELECT b.*<br />
FROM scBookmarks b, scCategories c<br />
WHERE c.bId = b.bId<br />
AND (c.category IN ('bookmark', 'webservice', 'semweb'))<br />
GROUP BY b.bId<br />
HAVING COUNT( b.bId )=3</code></p>
<p>First, all bookmark-tag combinations are selected where the tag is &#8220;bookmark&#8221;, &#8220;webservice&#8221; or &#8220;semweb&#8221; (<code>c.category IN ('bookmark', 'webservice', 'semweb')</code>); then only the bookmarks that carry all three of the requested tags are kept (<code>HAVING COUNT(b.bId)=3</code>).</p>
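<p>To see the <code>GROUP BY</code>/<code>HAVING</code> trick at work, here is a small sketch, again with Python's <code>sqlite3</code> module standing in for MySQL. Only the names <code>scBookmarks</code>, <code>scCategories</code>, <code>bId</code> and <code>category</code> come from the queries in this article; the <code>url</code> column, the sample rows and the helper <code>bookmarks_with_all</code> are assumptions. Note that counting only works if a bookmark cannot carry the same tag twice, which the sketch enforces with a <code>UNIQUE</code> constraint.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Simplified two-table layout; the url column is a placeholder.
    CREATE TABLE scBookmarks (bId INTEGER PRIMARY KEY, url TEXT);
    CREATE TABLE scCategories (
        bId INTEGER REFERENCES scBookmarks (bId),
        category TEXT,
        UNIQUE (bId, category)  -- one row per bookmark/tag pair
    );
    INSERT INTO scBookmarks VALUES (1, 'http://example.com/a');
    INSERT INTO scBookmarks VALUES (2, 'http://example.com/b');
    INSERT INTO scBookmarks VALUES (3, 'http://example.com/c');
    INSERT INTO scCategories VALUES (1, 'bookmark');
    INSERT INTO scCategories VALUES (1, 'webservice');
    INSERT INTO scCategories VALUES (1, 'semweb');
    INSERT INTO scCategories VALUES (2, 'bookmark');
    INSERT INTO scCategories VALUES (2, 'webservice');
    INSERT INTO scCategories VALUES (3, 'semweb');
""")


def bookmarks_with_all(tags):
    """Return ids of bookmarks that carry every tag in `tags`."""
    placeholders = ", ".join("?" for _ in tags)
    sql = (
        "SELECT b.bId FROM scBookmarks b, scCategories c "
        "WHERE c.bId = b.bId AND c.category IN (" + placeholders + ") "
        "GROUP BY b.bId HAVING COUNT(b.bId) = ? ORDER BY b.bId"
    )
    # The last parameter is the number of requested tags: a bookmark
    # qualifies only if it matched once per requested tag.
    return [row[0] for row in conn.execute(sql, [*tags, len(tags)])]


all_three = bookmarks_with_all(["bookmark", "webservice", "semweb"])
first_two = bookmarks_with_all(["bookmark", "webservice"])
print(all_three)  # [1]
print(first_two)  # [1, 2]
```

<p>Passing the tag count as a parameter avoids hard-coding the <code>3</code>, so the same statement serves any number of tags.</p>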
<h3>Union (OR)</h3>
<p>Query for &#8220;bookmark|webservice|semweb&#8221;:<br />
Just leave out the <code>HAVING</code> clause and you have union:<br />
<code>SELECT b.*<br />
FROM scBookmarks b, scCategories c<br />
WHERE c.bId = b.bId<br />
AND (c.category IN ('bookmark', 'webservice', 'semweb'))<br />
GROUP BY b.bId</code></p>
<h3>Minus (Exclusion)</h3>
<p>Query for &#8220;bookmark+webservice-semweb&#8221;, that is: bookmark AND webservice AND NOT semweb.<br />
<code>SELECT b.*<br />
FROM scBookmarks b, scCategories c<br />
WHERE b.bId = c.bId<br />
AND (c.category IN ('bookmark', 'webservice'))<br />
AND b.bId NOT<br />
IN (SELECT b.bId FROM scBookmarks b, scCategories c WHERE b.bId = c.bId AND c.category = 'semweb')<br />
GROUP BY b.bId<br />
HAVING COUNT( b.bId ) =2<br />
</code><br />
Leaving out the <code>HAVING COUNT</code> clause gives the query for &#8220;bookmark|webservice-semweb&#8221;.<br />
Credits go to <a href="http://www.metafilter.com/user/26222">Rhomboid</a> for <a href="http://ask.metafilter.com/mefi/34897#544185">helping me out with this query</a>.</p>
<h3>Conclusion</h3>
<p>I guess the main advantage of this solution is that it is more normalized than the first solution, and that you can have an unlimited number of tags per bookmark.</p>
<h2><a name="toxi">&#8220;Toxi&#8221; solution</a></h2>
<p><img src='/phred/modules/toxi_structure.png' alt='' /><br />
<a href="http://toxi.co.uk/">Toxi</a> came up with a three-table structure. Via the table &#8220;tagmap&#8221;, bookmarks and tags are in an n-to-m relation: each tag can be used with many bookmarks and vice versa. This DB schema is also used by <a href="http://wordpress.org/">WordPress</a>.<br />
The queries are almost the same as in the &#8220;scuttle&#8221; solution.</p>
<h3>Intersection (AND)</h3>
<p>Query for &#8220;bookmark+webservice+semweb&#8221;:<br />
<code>SELECT b.*<br />
FROM tagmap bt, bookmark b, tag t<br />
WHERE bt.tag_id = t.tag_id<br />
AND (t.name IN ('bookmark', 'webservice', 'semweb'))<br />
AND b.id = bt.bookmark_id<br />
GROUP BY b.id<br />
HAVING COUNT( b.id )=3</code></p>
<h3>Union (OR)</h3>
<p>Query for &#8220;bookmark|webservice|semweb&#8221;:<br />
<code>SELECT b.*<br />
FROM tagmap bt, bookmark b, tag t<br />
WHERE bt.tag_id = t.tag_id<br />
AND (t.name IN ('bookmark', 'webservice', 'semweb'))<br />
AND b.id = bt.bookmark_id<br />
GROUP BY b.id</code></p>
<h3>Minus (Exclusion)</h3>
<p>Query for &#8220;bookmark+webservice-semweb&#8221;, that is: bookmark AND webservice AND NOT semweb.<br />
<code><br />
SELECT b.*<br />
FROM bookmark b, tagmap bt, tag t<br />
WHERE b.id = bt.bookmark_id<br />
AND bt.tag_id = t.tag_id<br />
AND (t.name IN ('bookmark', 'webservice'))<br />
AND b.id NOT IN (SELECT b.id FROM bookmark b, tagmap bt, tag t WHERE b.id = bt.bookmark_id AND bt.tag_id = t.tag_id AND t.name = 'semweb')<br />
GROUP BY b.id<br />
HAVING COUNT( b.id ) =2</code><br />
Leaving out the <code>HAVING COUNT</code> clause gives the query for &#8220;bookmark|webservice-semweb&#8221;.<br />
Credits go to <a href="http://www.metafilter.com/user/26222">Rhomboid</a> for <a href="http://ask.metafilter.com/mefi/34897#544185">helping me out with this query</a>.</p>
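<p>The exclusion query can be tried out with the following sketch. As before, Python's <code>sqlite3</code> module stands in for MySQL; the table and column names follow the queries in this section, while the <code>url</code> column, the sample rows and the helper <code>bookmarks_matching</code> are assumptions for illustration only.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Three-table layout; the url column is a placeholder.
    CREATE TABLE bookmark (id INTEGER PRIMARY KEY, url TEXT);
    CREATE TABLE tag (tag_id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    CREATE TABLE tagmap (
        bookmark_id INTEGER REFERENCES bookmark (id),
        tag_id INTEGER REFERENCES tag (tag_id),
        PRIMARY KEY (bookmark_id, tag_id)
    );
    INSERT INTO bookmark VALUES (1, 'http://example.com/a');
    INSERT INTO bookmark VALUES (2, 'http://example.com/b');
    INSERT INTO bookmark VALUES (3, 'http://example.com/c');
    INSERT INTO tag VALUES (1, 'bookmark');
    INSERT INTO tag VALUES (2, 'webservice');
    INSERT INTO tag VALUES (3, 'semweb');
    -- Bookmark 1 has all three tags, 2 lacks semweb, 3 has only 'bookmark'.
    INSERT INTO tagmap VALUES (1, 1);
    INSERT INTO tagmap VALUES (1, 2);
    INSERT INTO tagmap VALUES (1, 3);
    INSERT INTO tagmap VALUES (2, 1);
    INSERT INTO tagmap VALUES (2, 2);
    INSERT INTO tagmap VALUES (3, 1);
""")


def bookmarks_matching(required, excluded):
    """Ids of bookmarks that have every tag in `required` but not `excluded`."""
    required_marks = ", ".join("?" for _ in required)
    sql = (
        "SELECT b.id FROM bookmark b, tagmap bt, tag t "
        "WHERE b.id = bt.bookmark_id AND bt.tag_id = t.tag_id "
        "AND t.name IN (" + required_marks + ") "
        "AND b.id NOT IN ("
        "  SELECT bt2.bookmark_id FROM tagmap bt2, tag t2 "
        "  WHERE bt2.tag_id = t2.tag_id AND t2.name = ?"
        ") "
        "GROUP BY b.id HAVING COUNT(b.id) = ? ORDER BY b.id"
    )
    # Parameters follow the order of the placeholders in the statement.
    params = [*required, excluded, len(required)]
    return [row[0] for row in conn.execute(sql, params)]


minus_result = bookmarks_matching(["bookmark", "webservice"], "semweb")
print(minus_result)  # [2]
```

<p>The subquery here skips the join to <code>bookmark</code>, since <code>tagmap</code> already holds the bookmark ids; the result is the same as with the longer form above.</p>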
<h3>Conclusion</h3>
<p>The advantages of this solution:</p>
<ul>
<li>You can save extra information on each tag (description, tag hierarchy, &#8230;)</li>
<li>This is the most normalized solution (that is, if you go for <a href="http://en.wikipedia.org/wiki/3NF">3NF</a>: take this one :-)</li>
</ul>
<p>Disadvantages:</p>
<ul>
<li>When altering or deleting bookmarks you can end up with orphaned tags, that is, rows in &#8220;tag&#8221; that no bookmark refers to any more.</li>
</ul>
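<p>One way to deal with orphaned tags is to remove them whenever a bookmark is deleted. The sketch below shows the idea with Python's <code>sqlite3</code> module standing in for MySQL; the helper <code>delete_bookmark</code>, the <code>url</code> column and the sample rows are assumptions, not part of Toxi's schema.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE bookmark (id INTEGER PRIMARY KEY, url TEXT);
    CREATE TABLE tag (tag_id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    CREATE TABLE tagmap (
        bookmark_id INTEGER REFERENCES bookmark (id),
        tag_id INTEGER REFERENCES tag (tag_id),
        PRIMARY KEY (bookmark_id, tag_id)
    );
    INSERT INTO bookmark VALUES (1, 'http://example.com/a');
    INSERT INTO bookmark VALUES (2, 'http://example.com/b');
    INSERT INTO tag VALUES (1, 'bookmark');
    INSERT INTO tag VALUES (2, 'semweb');
    -- 'semweb' is used by bookmark 2 only.
    INSERT INTO tagmap VALUES (1, 1);
    INSERT INTO tagmap VALUES (2, 1);
    INSERT INTO tagmap VALUES (2, 2);
""")


def delete_bookmark(bookmark_id):
    """Delete a bookmark, its tag links, and any tags left without links."""
    with conn:  # one transaction: commit on success, roll back on error
        conn.execute("DELETE FROM tagmap WHERE bookmark_id = ?", (bookmark_id,))
        conn.execute("DELETE FROM bookmark WHERE id = ?", (bookmark_id,))
        # A tag is an orphan once no tagmap row points at it.
        conn.execute(
            "DELETE FROM tag WHERE NOT EXISTS "
            "(SELECT 1 FROM tagmap WHERE tagmap.tag_id = tag.tag_id)"
        )


delete_bookmark(2)
remaining_tags = [row[0] for row in conn.execute("SELECT name FROM tag ORDER BY name")]
print(remaining_tags)  # ['bookmark']
```

<p>If you store extra information per tag (a description, say), you may prefer to keep orphans around and run the cleanup only occasionally.</p>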
<p>If you want to run more complicated queries such as (bookmarks OR bookmark) AND (webservice OR WS) AND NOT (semweb OR semanticweb), the SQL tends to become very complicated. In these cases I suggest the following query/computation process:</p>
<ol>
<li>Run a query for each tag appearing in your &#8220;tag-query&#8221;: <code>SELECT b.id FROM tagmap bt, bookmark b, tag t WHERE bt.tag_id = t.tag_id AND b.id = bt.bookmark_id AND t.name = "semweb"</code></li>
<li>Put each id-set from the result into an array (that is: in your favourite programming language). You could cache these arrays if you want.</li>
<li>Constrain the arrays with union or intersection or whatever.</li>
</ol>
<p>In this way, you can also do queries like <code>(del.icio.us|delicious)+(semweb|semantic_web)-search</code>. This type of query (that is: the brackets) cannot be done by using the denormalized &#8220;MySQLicious solution&#8221;.<br />
This is the most flexible data structure and I guess it should scale pretty well (that is: if you do some caching).</p>
<p><ins datetime="2006-05-01T09:20:11+00:00"><strong>Update, May 2006</strong>. This article got quite some attention. I wasn&#8217;t really prepared for that! It seems people keep referring to it, and even some new sites that allow tagging give credit to my articles. I think the real credit goes to the contributors of the different schemas: <a href="http://nanovivid.com/projects/mysqlicious/">MySQLicious</a>, <a href="http://sourceforge.net/projects/scuttle/">scuttle</a>, <a href="http://toxi.co.uk/">Toxi</a> and to everyone who contributed comments (be sure to read them!)</p>
<p>P.S. Thanks to <a href="http://toxi.co.uk/">Toxi</a> for sending me the queries for the three-table schema, Benjamin Reitzammer for pointing me to <a href="http://laughingmeme.org/archives/002918.html">a Laughing Meme article</a> (a good reference for tag queries) and powerlinux for pointing me to <a href="http://sourceforge.net/projects/scuttle/">scuttle</a>.</p>
<h2>Further reading</h2>
<ul>
<li><ins datetime="2005-06-28T09:01:13-02:00"><a href="http://lists.tagschema.com/mailman/listinfo/tagdb">Taglist: a mailing list dedicated to schemas with tagging</a></ins></li>
<li><ins datetime="2005-06-26T15:02:02-02:00"><a href="http://tagschema.com/blogs/tagschema/">Tagschema: A blog dedicated to tagging schemas</a></ins></li>
<li><a href="http://www.bigbold.com/snippets/tags/tagging">Tag-related Queries on Snippets</a>
	</li>
<li><ins datetime="2005-05-08T18:28:16-02:00"><a href="http://www.getluky.net/freetag/">Freetag</a> is a PHP &#8220;library&#8221; with which you can add tags to whatever object you like. It actually uses the &#8220;toxi schema&#8221;.</ins></li>
<li><ins datetime="2005-05-10T09:45:38-02:00">Hammy <a href="http://hellojoseph.com/tags-howto.php">gives an insight</a> into how he built his tagging system with &#8220;less DB and more code&#8221; (that is: regular expressions), interesting!</ins></li>
<li>Brad Choate <a href="http://bradchoate.com/weblog/2004/10/06/delicious">has some ideas</a> about which tag queries should be possible</li>
<li>Feedmarker has written <a href="http://blog.feedmarker.com/2005/04/26/tagging-in-mysql/">a sort of reply to this article</a></li>
</ul>
<p></ins></p>
]]></content:encoded>
			<wfw:commentRss>http://www.pui.ch/phred/archives/2005/04/tags-database-schemas.html/feed</wfw:commentRss>
		<slash:comments>92</slash:comments>
		</item>
	</channel>
</rss>
