<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Then each went to his own home &#187; Performance</title>
	<atom:link href="http://www.pui.ch/phred/archives/category/performance/feed" rel="self" type="application/rss+xml" />
	<link>http://www.pui.ch/phred</link>
	<description>Philipp Keller's weblog</description>
	<lastBuildDate>Tue, 17 Aug 2010 19:58:15 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.2</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Does del.icio.us scale?</title>
		<link>http://www.pui.ch/phred/archives/2005/08/does-delicious-scale.html</link>
		<comments>http://www.pui.ch/phred/archives/2005/08/does-delicious-scale.html#comments</comments>
		<pubDate>Wed, 31 Aug 2005 06:12:50 +0000</pubDate>
		<dc:creator>Philipp Keller</dc:creator>
				<category><![CDATA[Clustering]]></category>
		<category><![CDATA[Del.icio.us]]></category>
		<category><![CDATA[MySQL]]></category>
		<category><![CDATA[Performance]]></category>
		<category><![CDATA[Tags]]></category>

		<guid isPermaLink="false">http://www.pui.ch/phred/archives/2005/08/does-delicious-scale.html</guid>
		<description><![CDATA[Lately it has become very quiet around del.icio.us. There are some new features but nothing groundbreaking. Either people are used to it as a daily tool and there&#8217;s no need for new things, or folks just don&#8217;t have faith in the future of del.icio.us.
I am a big fan of delicious. I&#8217;ve got [...]]]></description>
			<content:encoded><![CDATA[<p>Lately it has become very quiet around <a href="http://del.icio.us">del.icio.us</a>. There are <a href="http://blog.del.icio.us/blog/2005/08/we_rolling.html">some</a> <a href="http://blog.del.icio.us/blog/2005/08/search_me.html">new</a> <a href="http://blog.del.icio.us/blog/2005/08/people_who_like.html">features</a> but nothing groundbreaking. Either people are used to it as a daily tool and there&#8217;s no need for new things, or folks just don&#8217;t have faith in the future of del.icio.us.</p>
<p>I am a big fan of delicious. I&#8217;ve got 1.5K bookmarks there, and I like its spirit and how open everything is. This article isn&#8217;t meant to criticize, but I think delicious is facing some problems.<br />
<span id="more-34"></span></p>
<h2>Performance scale</h2>
<p>You might have read my article about <a href="http://www.pui.ch/phred/archives/2005/06/tagsystems-performance-tests.html">Tag system performance</a>. To summarize my tests: MySQL is just not built for large tag systems. It scales up to about 1 million items, but delicious has far more posts than that.<br />
I am pretty sure delicious is still on the MySQL train. This strong belief comes from my performance tests: the MySQL schemas I tested show the same characteristics as delicious does.<br />
I fear delicious faces a performance dead end: they <a href="http://blog.del.icio.us/blog/2005/06/moving_to_new_s.html">have put more servers in the mix</a> and they cache quite a bit, yet it is still slow. I strongly believe that for delicious to have a future it must become much faster. For me this is the number one downside of delicious. I dream of a bookmark service that has billions of bookmark posts and still performs nicely. I think it is time for new tag systems to come up. On the <a href="http://lists.tagschema.com/mailman/listinfo/tagdb">tagdb mailing list</a> there are very good ideas on how large-scale tagging systems should work (e.g. systems powered by <a href="http://lucene.apache.org/">Lucene</a>).</p>
<h2>Popular link scale</h2>
<p>I think one of the coolest features of delicious is the <a href="http://del.icio.us/popular/">popular</a> page. When you read this page regularly you are up to date.. wait: you are up to date concerning CSS tips, Firefox and life hacks. You all know that if delicious went mainstream, that page wouldn&#8217;t be that interesting any more. It has already become a bit boring. As someone put it: </p>
<blockquote><p>I particularly cannot look at that CSS link lists anymore</p></blockquote>
<p>I think this page doesn&#8217;t scale. It is stuck. Moreover, it&#8217;s a pity that the coolest page on delicious is not about tags. At first glance you don&#8217;t even see what tags a popular link has.<br />
IMHO what is needed here are clusters. Bookmarks go into categories: &#8220;browsers&#8221;, &#8220;programming&#8221;, &#8220;design&#8221; but also &#8220;health&#8221;, &#8220;politics&#8221;. When delicious goes mainstream there will most certainly be &#8220;sports&#8221; or &#8220;stars&#8221;.<br />
One should then be able to subscribe to certain clusters or, better, have the subscription derived automatically from the tags in a user&#8217;s bookmarks.</p>
<h2>Bottom line</h2>
<p>I think there are some fundamental things that must be rearranged at delicious, otherwise</p>
<ul>
<li>a) a big competitor (Google? Yahoo? Microsoft?) will come up, or </li>
<li>b) people will spread to different bookmark services that concentrate on certain clusters. Probably some meta-sites will arise where you can get an overview of all the different sites</li>
</ul>
<p>I think these problems will arise for every bigger tag system. I hope that people will not sniff at tagging systems, thinking that they don&#8217;t perform well enough..</p>
]]></content:encoded>
			<wfw:commentRss>http://www.pui.ch/phred/archives/2005/08/does-delicious-scale.html/feed</wfw:commentRss>
		<slash:comments>13</slash:comments>
		</item>
		<item>
		<title>Tagsystems: performance tests</title>
		<link>http://www.pui.ch/phred/archives/2005/06/tagsystems-performance-tests.html</link>
		<comments>http://www.pui.ch/phred/archives/2005/06/tagsystems-performance-tests.html#comments</comments>
		<pubDate>Sun, 19 Jun 2005 14:09:53 +0000</pubDate>
		<dc:creator>Philipp Keller</dc:creator>
				<category><![CDATA[Del.icio.us]]></category>
		<category><![CDATA[MySQL]]></category>
		<category><![CDATA[Performance]]></category>
		<category><![CDATA[Tags]]></category>

		<guid isPermaLink="false">http://www.pui.ch/phred/archives/2005/06/tagsystems-performance-tests.html</guid>
		<description><![CDATA[In my previous article named "Tags: database schemas" we analysed different database schemas on how they could meet the needs of tag systems. In this article, the focus is on performance (speed).]]></description>
			<content:encoded><![CDATA[<p>In my <a title="Tags: database schemas" href="http://www.pui.ch/phred/archives/2005/04/tags-database-schemas.html">previous article named &#8220;Tags: database schemas&#8221;</a> we analysed different database schemas on how they could meet the needs of tag systems. In this article, the focus is on performance (speed). That is: if you want to build a tag system that performs well with about 1 million items (bookmarks for instance), then you may want to have a look at the following results of my performance tests.<br />
In this article I tested tagging of bookmarks, but as you can tag pretty much anything, the results apply to tagging systems in general.</p>
<p><span id="more-32"></span></p>
<p>I tested the following schemas (I keep the naming from the previous article):</p>
<ul>
<li><strong>mysqlicious</strong>: One table. Tags are space separated in column &#8220;tags&#8221;; <a href="http://www.pui.ch/phred/archives/2005/04/tags-database-schemas.html#mysqlicious">as introduced</a></li>
<li><strong>mysqlicious fulltext</strong>: Same schema but with <a href="http://dev.mysql.com/doc/mysql/en/fulltext-search.html">mysql fulltext</a> on the tag column; <a href="http://www.pui.ch/phred/archives/2005/05/tags-with-mysql-fulltext.html">as introduced</a></li>
<li><strong>scuttle</strong>: Two tables: One for bookmarks, one for tags. Tag-table has foreign key to bookmark table; <a href="http://www.pui.ch/phred/archives/2005/04/tags-database-schemas.html#scuttle">as introduced</a></li>
<li><strong>toxi</strong>: Three tables: One for bookmarks, one for tags, one for junction; <a href="http://www.pui.ch/phred/archives/2005/04/tags-database-schemas.html#toxi">as introduced</a></li>
</ul>
<p>You may want to take a close look at the details of the schemas in the <a href="http://www.pui.ch/phred/modules/tag_database_schemas.sql">sql-create-table-queries</a>.</p>
<p>But let&#8217;s go directly to the results. The details of the test setup are given at the <a href="#setup">end of this article</a>. The x-axis depicts the number of bookmarks in the corresponding database; on the y-axis you see how much time each query took to execute.</p>
<h3><a name="#results"></a>Results</h3>
<h4>Intersection: 250 tag set</h4>
<p><img title="Intersection test with 300 queries, up to three tags in query, 250 tags in small dataset" alt="Intersection test with 300 queries, up to three tags in query, 250 tags in small dataset" src="/phred/modules/intersection_250_3_i300.png" /></p>
<p>The first two tests are done with 250 tags in the small dataset (<a href="#setup">see below</a> for explanation). I think the &#8220;1 million bookmarks&#8221; database is the only size we should pay attention to: if you have a small number of bookmarks, performance isn&#8217;t really something to worry about.</p>
<p>We run intersection queries, like</p>
<blockquote><p>I want to search for bookmarks tagged with &#8220;design&#8221; and &#8220;html&#8221;</p></blockquote>
<p>You see that, not surprisingly, mysqlicious with its <code>WHERE tag LIKE "% tag %"</code> is very slow. That is because MySQL has to go through the whole dataset and test each bookmark against the query.<br />
What actually <strong>is</strong> surprising is that the fulltext search of MySQL is not that fast. In fact it is not faster than the <code>LIKE</code>-query in the MySQLicious DB. This really disappointed me. I tried every trick I could to make this faster, as <a href="http://www.pui.ch/phred/archives/2005/05/tags-with-mysql-fulltext.html">I think a tag database system with MySQL fulltext would be very easy and pretty much the only thing you should head for</a>.<br />
What surprises me too is that the queries on the three-table schema are about twice as fast as the ones on the two-table schema (<a href="http://www.pui.ch/phred/modules/schemas.inc.phps">take a look at the queries</a> if you think you could give me a hint on this). It is noticeable that in the scuttle and toxi variants, the more queries were run, the faster they got. I didn&#8217;t do any tests with queries and inserts mixed, so this may just come from plain good caching, and the effect possibly doesn&#8217;t show up on live bookmark management systems.</p>
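<p>To make the comparison more concrete, here is a minimal sketch of what an intersection query for &#8220;design&#8221; and &#8220;html&#8221; can look like on the three table layouts. This is not the code used for the tests above: it uses Python with an in-memory SQLite database as a stand-in for MySQL, the table and column names are assumptions made up for this illustration, and the data is invented. The fulltext variant is left out because it depends on MySQL&#8217;s fulltext index.</p>
<pre>import sqlite3

# Illustrative data only: bookmark id -> tags
BOOKMARKS = {
    1: ["design", "html", "css"],
    2: ["design", "music"],
    3: ["html", "design"],
    4: ["webdesign", "html"],
}

db = sqlite3.connect(":memory:")
db.executescript("""
    -- mysqlicious: one table, tags space separated in one column
    CREATE TABLE my_bookmark (id INTEGER PRIMARY KEY, tags TEXT);
    -- scuttle: bookmark table plus one row per (bookmark, tag)
    CREATE TABLE sc_bookmark (id INTEGER PRIMARY KEY);
    CREATE TABLE sc_tag (bookmark_id INTEGER, tag TEXT);
    -- toxi: bookmark table, tag table and a junction table
    CREATE TABLE tx_bookmark (id INTEGER PRIMARY KEY);
    CREATE TABLE tx_tag (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    CREATE TABLE tx_tagmap (bookmark_id INTEGER, tag_id INTEGER);
""")

for bid, tags in BOOKMARKS.items():
    # Pad with spaces so LIKE '% tag %' also matches the first and last tag
    db.execute("INSERT INTO my_bookmark VALUES (?, ?)", (bid, " " + " ".join(tags) + " "))
    db.execute("INSERT INTO sc_bookmark VALUES (?)", (bid,))
    db.execute("INSERT INTO tx_bookmark VALUES (?)", (bid,))
    for tag in tags:
        db.execute("INSERT INTO sc_tag VALUES (?, ?)", (bid, tag))
        db.execute("INSERT OR IGNORE INTO tx_tag (name) VALUES (?)", (tag,))
        db.execute(
            "INSERT INTO tx_tagmap SELECT ?, id FROM tx_tag WHERE name = ?", (bid, tag)
        )

# HAVING COUNT(*) = 2 assumes a bookmark never carries the same tag twice
QUERIES = {
    "mysqlicious": """
        SELECT id FROM my_bookmark
        WHERE tags LIKE '% design %' AND tags LIKE '% html %'
        LIMIT 50""",
    "scuttle": """
        SELECT bookmark_id FROM sc_tag
        WHERE tag IN ('design', 'html')
        GROUP BY bookmark_id HAVING COUNT(*) = 2
        LIMIT 50""",
    "toxi": """
        SELECT m.bookmark_id FROM tx_tagmap m
        JOIN tx_tag t ON t.id = m.tag_id
        WHERE t.name IN ('design', 'html')
        GROUP BY m.bookmark_id HAVING COUNT(*) = 2
        LIMIT 50""",
}

# Sort in Python only to compare the results; the queries themselves do not order
results = {name: sorted(row[0] for row in db.execute(sql)) for name, sql in QUERIES.items()}
print(results)</pre>
<p>All three queries return bookmarks 1 and 3. Bookmark 4 shows why the spaces in the <code>LIKE</code> pattern matter: without them, &#8220;webdesign&#8221; would be counted as &#8220;design&#8221;.</p>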
<h4>Intersection: 999 tag set</h4>
<p><img title="Intersection test with 300 queries, up to three tags in query, 999 tags in small dataset" alt="Intersection test with 300 queries, up to three tags in query, 999 tags in small dataset" src="/phred/modules/intersection_999_3_300.png" /><br />
Now have a look at what happens if we broaden our small tag set: MySQLicious with fulltext suddenly becomes the performance leader. That means, if you have a bookmark management system with diverse tags (which most probably comes from having many users), the fulltext solution is possibly the way to go.<br />
So now, as you see, choosing the right schema is all about tag distribution. In my previous post about guessing the overall tag distribution on <a href="http://del.icio.us">del.icio.us</a>, I came to the conclusion that delicious&#8217; most popular tag &#8220;design&#8221; shows up in 3.2% of all bookmarks on <a href="http://del.icio.us">del.icio.us</a>. So then, what is the mean tag distribution?</p>
<ul>
<li>If we say 1% (a tag shows up in 1/100 of all bookmarks on average), that makes our small tag set 250 tags big</li>
<li>If we say 0.25%, the small tag set grows to a size of 1000</li>
<li>If we say 0.1%, the small set will contain 2500 tags</li>
</ul>
<p>So I&#8217;d suggest that if your average distribution is 1%, take &#8220;toxi&#8221;; if the distribution is broader, take &#8220;MySQLicious fulltext&#8221;.<br />
If you take a closer look, you can see that the fulltext schema stayed as fast as it was with the 250 tag set. That means, if you want to be sure your tag system responds OK in every situation, you should go with the &#8220;mysql fulltext&#8221; schema.<br />
<ins datetime="2005-06-26T14:52:55-02:00"><a href="http://hannes.magiccards.info/get/results.html">Hannes has done some further investigation on mysql fulltext running on MySQL 4.1</a> (my tests were on MySQL 4.0.21)</ins></p>
<h4>Union</h4>
<p><img alt="Union test with 250 tags in small dataset" src="/phred/modules/union_full_250_3.png" /><br />
When doing a union query we say</p>
<blockquote><p>I want to search for all the bookmarks that are tagged either with &#8220;delicious&#8221; or &#8220;del.icio.us&#8221;</p></blockquote>
<p>These queries, you guessed it, are handled fastest by the &#8220;MySQLicious schema&#8221; with its <code>LIKE</code>-queries: MySQL scans through the bookmarks, harvesting all bookmarks with one of the given tags, and says &#8220;I&#8217;m finished!&#8221; when it reaches bookmark number 968, because it has found 50 bookmarks. In the other schemas, by contrast, MySQL has to join the tags with the bookmarks first and only then can search through the result..</p>
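<p>The early exit described above can be modelled with a few lines of Python. This is only a sketch of a sequential scan with a limit, not what MySQL does internally (that depends on the query plan), and the data is invented: every 20th bookmark carries one of the wanted tags.</p>
<pre>def union_scan(bookmarks, wanted, limit=50):
    """Return (matching ids, rows inspected), stopping at `limit` matches."""
    wanted = set(wanted)
    hits = []
    inspected = 0
    for bookmark_id, tags in bookmarks:
        inspected += 1
        if wanted.intersection(tags):  # at least one of the tags is present
            hits.append(bookmark_id)
            if len(hits) == limit:
                break  # enough rows for the limit, the rest is never read
    return hits, inspected


# Illustrative data: 10000 bookmarks, ids 1, 21, 41, ... are tagged "delicious"
bookmarks = [
    (i, ["delicious"] if i % 20 == 1 else ["other"]) for i in range(1, 10001)
]

hits, inspected = union_scan(bookmarks, ["delicious", "del.icio.us"])
print(len(hits), inspected)</pre>
<p>With this data the scan stops after 981 of the 10000 rows. The rarer the wanted tags are, the further the scan has to go before it has collected 50 hits.</p>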
<h4>Insert</h4>
<p><img alt="Setup database schemas with the data: 250 tags in small dataset" src="/phred/modules/setup_250.png" /><br />
When comparing the different schemas on the time of the insert statements for one bookmark, the result isn&#8217;t very surprising (notice that I&#8217;ve changed the scale of the y-axis).<br />
Mysqlicious with its single table is very fast indeed; its fulltext variant has to maintain the fulltext index and is therefore a bit slower. Scuttle with its two tables and toxi with its three tables are at least two and three times as slow, respectively. I have to remark that I used quite a bit of caching for the toxi schema, as I didn&#8217;t want to wait hours for the data to be ready..</p>
<p>I guess it doesn&#8217;t really make sense to base your choice of schema on the time for an insert: bookmark inserts are about 100 times as fast as the intersection queries..</p>
<h4>«What? That slow??»</h4>
<p>You said it. You don&#8217;t want your intersection queries to take 0.2 seconds each. That would bring your system to its knees.<br />
<ins datetime="2005-07-08T08:33:32-02:00"><br />
There are some recipes to avoid that:</ins></p>
<h5><ins datetime="2005-07-08T08:33:32-02:00">Caching</ins></h5>
<p><ins datetime="2005-07-08T08:33:32-02:00">I think there is no way around good old caching. You could cache the results of a query like &#8220;mysql+tagging&#8221; for about an hour or so. If a user queries his own items, I would lower the cache time (as being up to date matters more with his own items).<br />
Then, I expect that if you cache the items per tag, for instance, and intersect them with a decent algorithm, that could be faster.. </ins></p>
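<p>As a rough sketch of the second idea: cache the bookmark ids per tag for a while and intersect the cached sets in the application. Everything here is an assumption for illustration; <code>TagCache</code> and <code>fetch_ids</code> are hypothetical names, the one hour lifetime is just the figure mentioned above, and a real system would also need to limit the cache size.</p>
<pre>import time


class TagCache:
    """Caches the bookmark ids of a single tag for `ttl` seconds."""

    def __init__(self, fetch_ids, ttl=3600, clock=time.monotonic):
        self.fetch_ids = fetch_ids  # hypothetical callback that asks the database
        self.ttl = ttl
        self.clock = clock
        self.entries = {}  # tag -> (expiry time, frozenset of ids)

    def ids_for(self, tag):
        now = self.clock()
        entry = self.entries.get(tag)
        if entry is None or now >= entry[0]:
            # Missing or expired: ask the database and remember the answer
            entry = (now + self.ttl, frozenset(self.fetch_ids(tag)))
            self.entries[tag] = entry
        return entry[1]

    def intersection(self, tags):
        # Start with the smallest set so the intermediate result stays small
        id_sets = sorted((self.ids_for(tag) for tag in tags), key=len)
        if not id_sets:
            return frozenset()
        return id_sets[0].intersection(*id_sets[1:])


# Stand-in for the database: tag -> bookmark ids (illustrative data)
FAKE_DB = {"mysql": [1, 2, 3, 5], "tagging": [2, 3, 4]}
calls = []


def fetch_ids(tag):
    calls.append(tag)
    return FAKE_DB.get(tag, [])


cache = TagCache(fetch_ids, ttl=3600)
first = cache.intersection(["mysql", "tagging"])
second = cache.intersection(["mysql", "tagging"])  # served from the cache
print(sorted(first), len(calls))</pre>
<p>The second call does not touch the database at all. For a user&#8217;s own items one could create a second cache with a smaller <code>ttl</code>.</p>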
<h5><ins datetime="2005-07-08T08:33:32-02:00">The Best Of Both Worlds</ins></h5>
<p><ins datetime="2005-07-08T08:33:32-02:00">I think you could have &#8220;mysqlicious fulltext&#8221; and &#8220;toxi&#8221; running at the same time. That means you have to update/insert in both schemas, but when you have to query, you can take the one you expect to be faster: for a simple union the mysqlicious schema without fulltext search, for intersection queries with common tags the toxi schema, and for those with uncommon tags the mysqlicious fulltext variant. </ins></p>
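<p>A small sketch of how such a dispatch could look. The function name, the way tag frequencies are passed in and the 1% threshold are assumptions for illustration (the threshold is borrowed from the suggestion in the results section); only the 3.2% for &#8220;design&#8221; comes from this article, the other frequencies are made up.</p>
<pre>def choose_schema(tags, mode, tag_share, common_threshold=0.01):
    """Pick the schema to query.

    mode is "union" or "intersection". tag_share maps a tag to the fraction
    of all bookmarks carrying it; unknown tags count as very rare.
    """
    if not tags:
        raise ValueError("at least one tag is needed")
    if mode == "union":
        return "mysqlicious"
    if mode != "intersection":
        raise ValueError("unknown mode: " + mode)
    shares = [tag_share.get(tag, 0.0) for tag in tags]
    # Assumption: the query counts as "common" only if every tag is common
    if min(shares) >= common_threshold:
        return "toxi"
    return "mysqlicious fulltext"


tag_share = {"design": 0.032, "html": 0.015, "zeppelin": 0.0001}

common = choose_schema(["design", "html"], "intersection", tag_share)
rare = choose_schema(["design", "zeppelin"], "intersection", tag_share)
union = choose_schema(["delicious", "del.icio.us"], "union", tag_share)
print(common, "|", rare, "|", union)</pre>
<p>Where exactly the threshold should sit is something you would have to measure on your own data.</p>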
<h5><ins datetime="2005-07-08T08:33:32-02:00">Slicing and dicing</ins></h5>
<p><ins datetime="2005-07-08T08:33:32-02:00">You could &#8220;slice and dice&#8221; data (as Nitin proposed in <a href="http://tagschema.com/blogs/tagschema/2005/06/slicing-and-dicing-data-20-part-1.html">two</a> of his <a href="http://tagschema.com/blogs/tagschema/2005/06/slicing-and-dicing-data-20-part-2.html">posts</a>). That is: you slice your user/tag/item space and build fact tables. In a way, you &#8220;prebuild&#8221; your results. This way, inserts take long but the queries themselves should be much faster. In our examples, you would for instance first query the tag intersections on &#8220;toxi&#8221; and then get the facts about each bookmark out of the &#8220;mysqlicious&#8221; fact table. But you really should read Nitin&#8217;s posts, as they give a lot of insight.</ins><br />
<ins datetime="2006-05-01T10:01:24+00:00"></p>
<h5>Using a non RDBMS system</h5>
<p><strong>Update:</strong> It&#8217;s been about a year since I wrote this article, and during that year I came to the conclusion that <a href="http://en.wikipedia.org/wiki/RDBMS">RDBMS</a> systems don&#8217;t scale well in systems that have more than 1 million items. Yes, this is a warning: if you are planning to build a large-scale system, then look for alternatives to <a href="http://en.wikipedia.org/wiki/RDBMS">RDBMS</a> systems. To quote Joshua Schachter, founder of <a href="http://del.icio.us">delicious</a>:</p>
<blockquote><p>«tags doesn&#8217;t map to sql at all. so use partial indexing.»[<a href="http://www.redmonk.com/jgovernor/2006/02/08/things-weve-learned-josh-schachter-quotes-of-the-day/">Joshua Schachter at Carson Summit</a>]</p></blockquote>
<p>I didn&#8217;t try any of the non-RDBMS systems, but <a href="http://lucene.apache.org/java/docs/">Apache Lucene</a> and <a href="http://lucene.apache.org/hadoop/">Hadoop</a> look like candidates. There has been <a href="http://nelson.textdrive.com/pipermail/tagdb/2006-March/thread.html#164">a discussion on the Tagdb Mailing list</a> about these solutions.</p>
<p></ins></p>
<h4>«I don&#8217;t believe you! I want to try it at home»</h4>
<p><a href="http://www.pui.ch/phred/modules/tag_schemas_performance_test.tar.gz">Download the source code (PHP)</a> I used to run the queries and test for yourself; extend it as you like. The source is published under the <a href="http://en.wikipedia.org/wiki/LGPL">LGPL</a>.</p>
<h3><a name="setup"></a>Performance Tests Setup</h3>
<p>Now, if you have read this far, you probably want to know some background information: As you noticed, for each schema I set up 4 databases, one holding 1,000 bookmarks, the next 10,000, then 100,000 and the fourth 1 million bookmarks. The inserted tags (as well as URLs) are random English words taken from two sets of tags:</p>
<ul>
<li>the large set containing about 44,000 tags (that are simple English words)</li>
<li>the small set is varying in size (the results shown here are taken from 250 and 999 tag sets)</li>
</ul>
<p>Every bookmark gets one to ten tags attached. The tags are taken alternately from the two sets, with every odd tag coming from the large set. Every schema got exactly the same bookmarks and tag data.</p>
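<p>This procedure also explains the percentages in the results section (1% for a 250 tag set, and so on). The following sketch works through the arithmetic. It assumes that the number of tags per bookmark is spread evenly over one to ten and that tags are drawn evenly from the small set; neither is stated explicitly above, so treat the result as an approximation.</p>
<pre>from fractions import Fraction


def small_set_tags(n_tags):
    """Tags taken from the small set when a bookmark has n_tags tags.

    Odd positions (1st, 3rd, ...) come from the large set, even positions
    from the small set, so the small set supplies n_tags // 2 of them.
    """
    return n_tags // 2


# Assumption: 1 to 10 tags per bookmark, each count equally likely
mean_small = Fraction(sum(small_set_tags(n) for n in range(1, 11)), 10)


def share_per_tag(small_set_size):
    """Expected fraction of bookmarks carrying one given small-set tag."""
    return mean_small / small_set_size


for size in (250, 1000, 2500):
    print(size, float(share_per_tag(size)))</pre>
<p>On average a bookmark gets 2.5 tags from the small set, so a single tag out of 250 shows up in about 2.5 / 250 = 1% of all bookmarks, one out of 1000 in 0.25% and one out of 2500 in 0.1%.</p>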
<p>Then every schema was queried with queries of alternately one, two and three tags. So the first query is for instance just &#8220;blog&#8221;, the second &#8220;design+css&#8221;, the third &#8220;webdesign+music+software&#8221;, the fourth again has just one tag, and so forth..<br />
All the tags for the queries are taken from the small set so that the queries don&#8217;t all end in empty results..<br />
All the queries are tested and work. The outcome of each query on the three schemas is exactly the same.</p>
<h4>Mysql Setup</h4>
<p>I used mysql 4.0.21.<br />
An excerpt from <code>/etc/my.cnf</code> (I think these are the relevant settings to this performance test)</p>
<pre>key_buffer=300M
query_cache_size=30M
query_cache_limit=30M
table_cache = 64
ft_min_word_len = 2
ft_stopword_file = ''</pre>
<h4>System</h4>
<blockquote><p>CPU: 3GHz Dual Xeon<br />
Cache: 1MB<br />
Harddisk: SCSI Ultra 320 Atlas 10K, no RAID<br />
RAM: 3GB</p></blockquote>
<h4>Assumptions</h4>
<ul>
<li>Queries select just the id of a bookmark. I assume that you have to do a second query to get all the data you want to display. I know that this is not fair towards the mysqlicious schema.</li>
<li>I left out user data, as I assume user data columns wouldn&#8217;t change the outcome of these tests. I wanted to keep the schemas as simple as possible.</li>
<li>Each query is done with <code>LIMIT 50</code> as I assume that a normal application doesn&#8217;t want to get all bookmarks. I assume nobody wants to <code>order</code> bookmarks by any dimension, because this would be <strong>very</strong> expensive (ever wondered why you cannot sort bookmarks on <a href="http://del.icio.us">del.icio.us</a> by date or similar? You get it..)</li>
</ul>
<h3>Acknowledgements</h3>
<p>Thanks to <a href="http://www.citrin.ch">Citrin</a>, the company I work for, for letting me use our new server to run the queries. The server didn&#8217;t have much else to do, so the results should be accurate.<br />
The graphs are done using <a href="http://www.aditus.nu/jpgraph/">JpGraph</a>. Very easy to use and produces beautiful images.</p>
<h3>Further reading</h3>
<ul>
<li><a href="http://www.niallkennedy.com/blog/archives/2004/10/flickr_architec.html">Flickr architecture</a></li>
<li><a href="http://labnotes.blogsome.com/2005/06/06/lab-notes-5-fulltext-not-so-fast/">Lab notes: Fulltext not so fast</a>: Fulltext performance issues</li>
<li><a href="http://www.webmasterworld.com/forum23/3557.htm">WebmasterWorld forum: mysql fulltext performance issues</a></li>
<li><a href="http://vegan.net/tony/supersmack/">Mysql Supersmack: Mysql performance tool</a></li>
<li><a href="http://dev.mysql.com/doc/mysql/en/mysql-benchmarks.html">Mysql Benchmark</a></li>
<li><a href="http://jeremy.zawodny.com/mysql/mysql-optimization.html">Powerpoint article of jeremy zawodny</a> on Mysql optimisation</li>
<li><a href="http://www.petefreitag.com/item/389.cfm">Pete Freitag did a sort of review of this article</a></li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.pui.ch/phred/archives/2005/06/tagsystems-performance-tests.html/feed</wfw:commentRss>
		<slash:comments>49</slash:comments>
		</item>
	</channel>
</rss>
