<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: Tagsystems: performance tests</title>
	<atom:link href="http://www.pui.ch/phred/archives/2005/06/tagsystems-performance-tests.html/feed" rel="self" type="application/rss+xml" />
	<link>http://www.pui.ch/phred/archives/2005/06/tagsystems-performance-tests.html</link>
	<description>Philipp Keller&#039;s weblog</description>
	<lastBuildDate>Wed, 30 Jun 2010 09:40:18 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.2</generator>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
		<item>
		<title>By: Philipp Keller</title>
		<link>http://www.pui.ch/phred/archives/2005/06/tagsystems-performance-tests.html/comment-page-1#comment-130806</link>
		<dc:creator>Philipp Keller</dc:creator>
		<pubDate>Sat, 30 Jan 2010 21:59:37 +0000</pubDate>
		<guid isPermaLink="false">http://www.pui.ch/phred/archives/2005/06/tagsystems-performance-tests.html#comment-130806</guid>
		<description>@c-a: thanks for the hint, I&#039;ve corrected the link</description>
		<content:encoded><![CDATA[<p>@c-a: thanks for the hint, I&#8217;ve corrected the link</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: c-a</title>
		<link>http://www.pui.ch/phred/archives/2005/06/tagsystems-performance-tests.html/comment-page-1#comment-130804</link>
		<dc:creator>c-a</dc:creator>
		<pubDate>Mon, 18 Jan 2010 23:01:38 +0000</pubDate>
		<guid isPermaLink="false">http://www.pui.ch/phred/archives/2005/06/tagsystems-performance-tests.html#comment-130804</guid>
		<description>«tags doesn’t map to sql at all. so use partial indexing.»[Joshua Schachter at Carson Summit]

FYI the link to Joshua is dead.

Thanks for all the useful information.</description>
		<content:encoded><![CDATA[<p>«tags doesn’t map to sql at all. so use partial indexing.»[Joshua Schachter at Carson Summit]</p>
<p>FYI the link to Joshua is dead.</p>
<p>Thanks for all the useful information.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Pat</title>
		<link>http://www.pui.ch/phred/archives/2005/06/tagsystems-performance-tests.html/comment-page-1#comment-130720</link>
		<dc:creator>Pat</dc:creator>
		<pubDate>Mon, 08 Jun 2009 22:59:32 +0000</pubDate>
		<guid isPermaLink="false">http://www.pui.ch/phred/archives/2005/06/tagsystems-performance-tests.html#comment-130720</guid>
		<description>Only 2 years since the last comment and 4 since the post.

At about the time of this post, I was working for LinkedIn.com. I ran some performance tests comparing &lt;a href=&quot;http://lucene.apache.org&quot; rel=&quot;nofollow&quot;&gt;Lucene&lt;/a&gt;, MySQL full-text search, and Oracle full-text search for searching people&#039;s profiles. Hands down, Lucene was the winner.

Ever wonder why there is no obvious way to break a connection in LinkedIn? It&#039;s because the Lucene index is added to incrementally. Removing a connection from the search results is an expensive operation.

Of course, things change. It would be interesting to see the results of 4 years&#039; worth of work on all three products.</description>
		<content:encoded><![CDATA[<p>Only 2 years since the last comment and 4 since the post.</p>
<p>At about the time of this post, I was working for LinkedIn.com. I ran some performance tests comparing <a href="http://lucene.apache.org" rel="nofollow">Lucene</a>, MySQL full-text search, and Oracle full-text search for searching people&#8217;s profiles. Hands down, Lucene was the winner.</p>
<p>Ever wonder why there is no obvious way to break a connection in LinkedIn? It&#8217;s because the Lucene index is added to incrementally. Removing a connection from the search results is an expensive operation.</p>
<p>Of course, things change. It would be interesting to see the results of 4 years&#8217; worth of work on all three products.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Peufeu</title>
		<link>http://www.pui.ch/phred/archives/2005/06/tagsystems-performance-tests.html/comment-page-1#comment-92353</link>
		<dc:creator>Peufeu</dc:creator>
		<pubDate>Mon, 15 Oct 2007 10:45:50 +0000</pubDate>
		<guid isPermaLink="false">http://www.pui.ch/phred/archives/2005/06/tagsystems-performance-tests.html#comment-92353</guid>
		<description>Efficiently handling tags is similar to attributes in dating sites (ie. +blonde +tits -fat) except there are a lot more tags than profile attributes.

In order to extract any kind of acceptable performance from a SQL database, you will have to forget about LIKE (full table scan), and foreign keys (scuttle/toxi solution).

Basically, you need a SQL database which supports one of the following:

- efficient star join support (that means Oracle),
- bitmap index support (coming up in Postgres),
- efficient fulltext search support (i.e. Postgres),
- vectors (arrays) as column types, with specific index methods for making boolean queries on the values contained in those vectors (i.e. Postgres).

MySQL is not part of the solution; besides, MySQL FULLTEXT is ludicrous.

A better solution may be to use a full-text search engine. I tried Xapian and found that, on large data sets consisting of up to a million forum posts, it massively outperformed PostgreSQL&#039;s fulltext search, which itself massively outperformed MySQL&#039;s fulltext search. This can be used for tags, and obviously to search the articles&#039; full text. Lucene is also a solution; however, it is less user-friendly than Xapian (uses Java, bleh, hard to interface with Python for update scripts, etc.).</description>
		<content:encoded><![CDATA[<p>Efficiently handling tags is similar to attributes in dating sites (ie. +blonde +tits -fat) except there are a lot more tags than profile attributes.</p>
<p>In order to extract any kind of acceptable performance from a SQL database, you will have to forget about LIKE (full table scan), and foreign keys (scuttle/toxi solution).</p>
<p>Basically, you need a SQL database which supports one of the following:</p>
<p>- efficient star join support (that means Oracle),<br />
- bitmap index support (coming up in Postgres),<br />
- efficient fulltext search support (i.e. Postgres),<br />
- vectors (arrays) as column types, with specific index methods for making boolean queries on the values contained in those vectors (i.e. Postgres).</p>
<p>MySQL is not part of the solution; besides, MySQL FULLTEXT is ludicrous.</p>
<p>A better solution may be to use a full-text search engine. I tried Xapian and found that, on large data sets consisting of up to a million forum posts, it massively outperformed PostgreSQL&#8217;s fulltext search, which itself massively outperformed MySQL&#8217;s fulltext search. This can be used for tags, and obviously to search the articles&#8217; full text. Lucene is also a solution; however, it is less user-friendly than Xapian (uses Java, bleh, hard to interface with Python for update scripts, etc.).</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: orderlord</title>
		<link>http://www.pui.ch/phred/archives/2005/06/tagsystems-performance-tests.html/comment-page-1#comment-87607</link>
		<dc:creator>orderlord</dc:creator>
		<pubDate>Wed, 26 Sep 2007 05:57:05 +0000</pubDate>
		<guid isPermaLink="false">http://www.pui.ch/phred/archives/2005/06/tagsystems-performance-tests.html#comment-87607</guid>
		<description>What about benchmarks when the queries have ORDER BY (such as ORDER BY date)?

For example, when a user wants to see all items with certain tags, sorted newest item first. How is performance then?</description>
		<content:encoded><![CDATA[<p>What about benchmarks when the queries have ORDER BY (such as ORDER BY date)?</p>
<p>For example, when a user wants to see all items with certain tags, sorted newest item first. How is performance then?</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Ryan</title>
		<link>http://www.pui.ch/phred/archives/2005/06/tagsystems-performance-tests.html/comment-page-1#comment-87321</link>
		<dc:creator>Ryan</dc:creator>
		<pubDate>Tue, 25 Sep 2007 03:19:23 +0000</pubDate>
		<guid isPermaLink="false">http://www.pui.ch/phred/archives/2005/06/tagsystems-performance-tests.html#comment-87321</guid>
		<description>It looks like the first commenter (&quot;Go back to school dude!
Rework the queries for the toxi schema and use JOINs.&quot;) never pursued his argument, and it looks like you never &quot;got&quot; what he was really saying. 

I think what he meant was that a query of this form:
SELECT ... WHERE name IN (&quot;word1&quot;, &quot;word2&quot;, &quot;word3&quot;)
HAVING ...

should instead be rewritten as:
SELECT ... WHERE name = &quot;word1&quot;
INNER JOIN
SELECT ... WHERE name = &quot;word2&quot;
INNER JOIN
SELECT ... WHERE name = &quot;word3&quot;

Same thing for the UNION query. Use actual UNIONS instead of &quot;WHERE name IN ...&quot;.

I think this is supposed to give you a good performance boost.
Would you consider redoing your benchmarks on the TOXI schema using the queries above?</description>
		<content:encoded><![CDATA[<p>It looks like the first commenter (&#8220;Go back to school dude!<br />
Rework the queries for the toxi schema and use JOINs.&#8221;) never pursued his argument, and it looks like you never &#8220;got&#8221; what he was really saying. </p>
<p>I think what he meant was that a query of this form:<br />
SELECT &#8230; WHERE name IN (&quot;word1&quot;, &quot;word2&quot;, &quot;word3&quot;)<br />
HAVING &#8230;</p>
<p>should instead be rewritten as:<br />
SELECT &#8230; WHERE name = &quot;word1&quot;<br />
INNER JOIN<br />
SELECT &#8230; WHERE name = &quot;word2&quot;<br />
INNER JOIN<br />
SELECT &#8230; WHERE name = &quot;word3&quot;</p>
<p>Same thing for the UNION query. Use actual UNIONS instead of &#8220;WHERE name IN &#8230;&#8221;.</p>
<p>I think this is supposed to give you a good performance boost.<br />
Would you consider redoing your benchmarks on the TOXI schema using the queries above?</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Philipp Keller</title>
		<link>http://www.pui.ch/phred/archives/2005/06/tagsystems-performance-tests.html/comment-page-1#comment-73758</link>
		<dc:creator>Philipp Keller</dc:creator>
		<pubDate>Sun, 22 Jul 2007 14:39:12 +0000</pubDate>
		<guid isPermaLink="false">http://www.pui.ch/phred/archives/2005/06/tagsystems-performance-tests.html#comment-73758</guid>
		<description>Geoff: Sure, go for the indexes! Have a look at http://www.pui.ch/phred/modules/tag_database_schemas.sql. I added indexes on tag.tagname, bookmark_tag.tag, bookmark_tag.bookmark and bookmark.url. If you can improve the performance by altering those indexes, let me know.</description>
		<content:encoded><![CDATA[<p>Geoff: Sure, go for the indexes! Have a look at <a href="http://www.pui.ch/phred/modules/tag_database_schemas.sql" rel="nofollow">http://www.pui.ch/phred/modules/tag_database_schemas.sql</a>. I added indexes on tag.tagname, bookmark_tag.tag, bookmark_tag.bookmark and bookmark.url. If you can improve the performance by altering those indexes, let me know.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Geoff</title>
		<link>http://www.pui.ch/phred/archives/2005/06/tagsystems-performance-tests.html/comment-page-1#comment-71502</link>
		<dc:creator>Geoff</dc:creator>
		<pubDate>Thu, 12 Jul 2007 19:17:42 +0000</pubDate>
		<guid isPermaLink="false">http://www.pui.ch/phred/archives/2005/06/tagsystems-performance-tests.html#comment-71502</guid>
		<description>This is probably a dumb question but here goes. If I opt for the toxi approach, is there any performance benefit to indexing any of the columns?

G</description>
		<content:encoded><![CDATA[<p>This is probably a dumb question but here goes. If I opt for the toxi approach, is there any performance benefit to indexing any of the columns?</p>
<p>G</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Octablog &#187; A review of the Zend Framework - Part 3</title>
		<link>http://www.pui.ch/phred/archives/2005/06/tagsystems-performance-tests.html/comment-page-1#comment-64802</link>
		<dc:creator>Octablog &#187; A review of the Zend Framework - Part 3</dc:creator>
		<pubDate>Tue, 19 Jun 2007 23:06:53 +0000</pubDate>
		<guid isPermaLink="false">http://www.pui.ch/phred/archives/2005/06/tagsystems-performance-tests.html#comment-64802</guid>
		<description>[...] Zend_Search_Lucene allowed me to tackle the complicated issue of tagging - Tagging is a known problem to map effectively to databases (A dude named Philipp Keller wrote a blog on different tagging schemas, and conducted a performance comparison of the schemas. Another dude named Nitin Borwankar suggested yet another schema for tagging. The tagging issue is a long and complicated one.) To quote del.icio.us creator, Joshua Schachter - &#8220;tags don&#8217;t map to sql at all. so use partial indexing.&#8221; Using Zend_Search_Lucene to index tagged items allowed us to implement tags in the Octabox project while still enjoying high performance, which was something that I was quite worried over before. [...]</description>
		<content:encoded><![CDATA[<p>[...] Zend_Search_Lucene allowed me to tackle the complicated issue of tagging &#8211; Tagging is a known problem to map effectively to databases (A dude named Philipp Keller wrote a blog on different tagging schemas, and conducted a performance comparison of the schemas. Another dude named Nitin Borwankar suggested yet another schema for tagging. The tagging issue is a long and complicated one.) To quote del.icio.us creator, Joshua Schachter &#8211; &#8220;tags don&#8217;t map to sql at all. so use partial indexing.&#8221; Using Zend_Search_Lucene to index tagged items allowed us to implement tags in the Octabox project while still enjoying high performance, which was something that I was quite worried over before. [...]</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Tjerk</title>
		<link>http://www.pui.ch/phred/archives/2005/06/tagsystems-performance-tests.html/comment-page-1#comment-14212</link>
		<dc:creator>Tjerk</dc:creator>
		<pubDate>Wed, 08 Nov 2006 15:11:27 +0000</pubDate>
		<guid isPermaLink="false">http://www.pui.ch/phred/archives/2005/06/tagsystems-performance-tests.html#comment-14212</guid>
		<description>I just finished a course on advanced database systems, and we learned how to increase performance for specific queries and schemas.

For example, do you know which query plans were used?
You can check that with the SQL EXPLAIN command.

I would recommend a hash index on the tag column, because you are not doing any range searches; searching for a specific tag would then require constant time.

Also, it is better to measure your performance tests in the number of page transfers between the hard disk and memory/CPU, because this is the bottleneck in performance. The milliseconds say more about your system than about those queries.

Anyway, another question: which indexes did you use? B+ indexes? ISAM indexes? Hash indexes? What were the search keys for these indexes?</description>
		<content:encoded><![CDATA[<p>I just finished a course on advanced database systems, and we learned how to increase performance for specific queries and schemas.</p>
<p>For example, do you know which query plans were used?<br />
You can check that with the SQL EXPLAIN command.</p>
<p>I would recommend a hash index on the tag column, because you are not doing any range searches; searching for a specific tag would then require constant time.</p>
<p>Also, it is better to measure your performance tests in the number of page transfers between the hard disk and memory/CPU, because this is the bottleneck in performance. The milliseconds say more about your system than about those queries.</p>
<p>Anyway, another question: which indexes did you use? B+ indexes? ISAM indexes? Hash indexes? What were the search keys for these indexes?</p>
]]></content:encoded>
	</item>
</channel>
</rss>
