<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	>
<channel>
	<title>Comments on: Tagsystems: performance tests</title>
	<atom:link href="http://www.pui.ch/phred/archives/2005/06/tagsystems-performance-tests.html/feed" rel="self" type="application/rss+xml" />
	<link>http://www.pui.ch/phred/archives/2005/06/tagsystems-performance-tests.html</link>
	<description>Philipp Kellers weblog</description>
	<pubDate>Thu, 11 Mar 2010 22:26:26 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.5.1</generator>
		<item>
		<title>By: Philipp Keller</title>
		<link>http://www.pui.ch/phred/archives/2005/06/tagsystems-performance-tests.html#comment-130806</link>
		<dc:creator>Philipp Keller</dc:creator>
		<pubDate>Sat, 30 Jan 2010 21:59:37 +0000</pubDate>
		<guid isPermaLink="false">http://www.pui.ch/phred/archives/2005/06/tagsystems-performance-tests.html#comment-130806</guid>
		<description>@c-a: thanks for the hint, I've corrected the link</description>
		<content:encoded><![CDATA[<p>@c-a: thanks for the hint, I&#8217;ve corrected the link</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: c-a</title>
		<link>http://www.pui.ch/phred/archives/2005/06/tagsystems-performance-tests.html#comment-130804</link>
		<dc:creator>c-a</dc:creator>
		<pubDate>Mon, 18 Jan 2010 23:01:38 +0000</pubDate>
		<guid isPermaLink="false">http://www.pui.ch/phred/archives/2005/06/tagsystems-performance-tests.html#comment-130804</guid>
		<description>«tags doesn’t map to sql at all. so use partial indexing.»[Joshua Schachter at Carson Summit]

FYI the link to Joshua is dead.

Thanks for all the useful information.</description>
		<content:encoded><![CDATA[<p>«tags doesn’t map to sql at all. so use partial indexing.»[Joshua Schachter at Carson Summit]</p>
<p>FYI the link to Joshua is dead.</p>
<p>Thanks for all the useful information.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Pat</title>
		<link>http://www.pui.ch/phred/archives/2005/06/tagsystems-performance-tests.html#comment-130720</link>
		<dc:creator>Pat</dc:creator>
		<pubDate>Mon, 08 Jun 2009 22:59:32 +0000</pubDate>
		<guid isPermaLink="false">http://www.pui.ch/phred/archives/2005/06/tagsystems-performance-tests.html#comment-130720</guid>
		<description>Only 2 years since the last comment and 4 since the post.

But at about the time of this post, I was working for LinkedIn.com. I ran some performance tests comparing &lt;a href="http://lucene.apache.org" rel="nofollow"&gt;Lucene&lt;/a&gt;, MySQL full-text, and Oracle full-text search for searching people's profiles. Hands down, Lucene was the winner.

Ever wonder why there is no obvious way to break a connection in LinkedIn? It's because the Lucene index is incrementally added to. Removing a connection from the search results is an expensive operation.

Of course things change; it would be interesting to see the results of 4 years' worth of work on all three products.</description>
		<content:encoded><![CDATA[<p>Only 2 years since the last comment and 4 since the post.</p>
<p>But at about the time of this post, I was working for LinkedIn.com. I ran some performance tests comparing <a href="http://lucene.apache.org" rel="nofollow">Lucene</a>, MySQL full-text, and Oracle full-text search for searching people&#8217;s profiles. Hands down, Lucene was the winner.</p>
<p>Ever wonder why there is no obvious way to break a connection in LinkedIn? It&#8217;s because the Lucene index is incrementally added to. Removing a connection from the search results is an expensive operation.</p>
<p>Of course things change; it would be interesting to see the results of 4 years&#8217; worth of work on all three products.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Peufeu</title>
		<link>http://www.pui.ch/phred/archives/2005/06/tagsystems-performance-tests.html#comment-92353</link>
		<dc:creator>Peufeu</dc:creator>
		<pubDate>Mon, 15 Oct 2007 10:45:50 +0000</pubDate>
		<guid isPermaLink="false">http://www.pui.ch/phred/archives/2005/06/tagsystems-performance-tests.html#comment-92353</guid>
		<description>Efficiently handling tags is similar to handling attributes on dating sites (e.g. +blonde +tits -fat), except there are a lot more tags than profile attributes.

In order to extract any kind of acceptable performance from a SQL database, you will have to forget about LIKE (full table scan) and foreign keys (scuttle/toxi solution).

Basically you need a SQL database which supports one of the following:

- efficient star join support (that means Oracle),
- bitmap index support (coming up in Postgres),
- efficient fulltext search support (i.e. Postgres),
- vectors (arrays) as column types, with specific index methods for boolean queries on the values contained in said vectors (i.e. Postgres).

MySQL is not part of the solution; besides, MySQL FULLTEXT is ludicrous.

A better solution may be to use a full text search engine. I tried Xapian and found that, on large data sets consisting of up to a million forum posts, it massively outperformed PostgreSQL's fulltext search, which itself massively outperformed MySQL's fulltext search. This can be used for tags, and obviously to search the articles' full text. Lucene is also a solution; however, it is less user-friendly than Xapian (uses Java, bleh, hard to interface with Python for update scripts, etc).</description>
		<content:encoded><![CDATA[<p>Efficiently handling tags is similar to handling attributes on dating sites (e.g. +blonde +tits -fat), except there are a lot more tags than profile attributes.</p>
<p>In order to extract any kind of acceptable performance from a SQL database, you will have to forget about LIKE (full table scan) and foreign keys (scuttle/toxi solution).</p>
<p>Basically you need a SQL database which supports one of the following:</p>
<p>- efficient star join support (that means Oracle),<br />
- bitmap index support (coming up in Postgres),<br />
- efficient fulltext search support (i.e. Postgres),<br />
- vectors (arrays) as column types, with specific index methods for boolean queries on the values contained in said vectors (i.e. Postgres).</p>
<p>MySQL is not part of the solution; besides, MySQL FULLTEXT is ludicrous.</p>
<p>A better solution may be to use a full text search engine. I tried Xapian and found that, on large data sets consisting of up to a million forum posts, it massively outperformed PostgreSQL&#8217;s fulltext search, which itself massively outperformed MySQL&#8217;s fulltext search. This can be used for tags, and obviously to search the articles&#8217; full text. Lucene is also a solution; however, it is less user-friendly than Xapian (uses Java, bleh, hard to interface with Python for update scripts, etc).</p>
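<p>To make the bitmap-index option from the list above a bit more concrete, here is a minimal Python sketch of a boolean tag query (required tags plus excluded tags). It is only an illustration of the idea, not how Postgres or Oracle implement bitmap indexes; the class name, method names and sample tags are all hypothetical.</p>

```python
from collections import defaultdict


class TagBitmapIndex:
    """Toy bitmap index: one Python int per tag, bit i is set when item i has the tag."""

    def __init__(self):
        self.bitmaps = defaultdict(int)  # tag name -> bitmap of item ids
        self.all_items = 0               # bitmap of every indexed item

    def add(self, item_id, tags):
        bit = 1 << item_id
        self.all_items |= bit
        for tag in tags:
            self.bitmaps[tag] |= bit

    def query(self, required=(), excluded=()):
        """Return sorted ids of items that have every required tag and no excluded tag."""
        result = self.all_items
        for tag in required:
            result &= self.bitmaps.get(tag, 0)   # intersection: bitwise AND
        for tag in excluded:
            result &= ~self.bitmaps.get(tag, 0)  # exclusion: AND NOT
        return [i for i in range(result.bit_length()) if (result >> i) & 1]


# Hypothetical sample data: item id -> tags.
index = TagBitmapIndex()
index.add(0, ["python", "sql"])
index.add(1, ["python", "lucene"])
index.add(2, ["python", "sql", "mysql"])

# Items tagged python AND sql but NOT mysql.
print(index.query(required=["python", "sql"], excluded=["mysql"]))
```

<p>Each query costs one bitwise operation per tag regardless of how many rows a join would have touched, which is roughly why bitmap-style indexes suit this kind of +tag/-tag search.</p>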
]]></content:encoded>
	</item>
	<item>
		<title>By: orderlord</title>
		<link>http://www.pui.ch/phred/archives/2005/06/tagsystems-performance-tests.html#comment-87607</link>
		<dc:creator>orderlord</dc:creator>
		<pubDate>Wed, 26 Sep 2007 05:57:05 +0000</pubDate>
		<guid isPermaLink="false">http://www.pui.ch/phred/archives/2005/06/tagsystems-performance-tests.html#comment-87607</guid>
		<description>What about benchmarks when the queries have ORDER BY (such as ORDER BY date)?

For example, when a user wants to see all items with certain tags, sorted newest item first. How is performance then?</description>
		<content:encoded><![CDATA[<p>What about benchmarks when the queries have ORDER BY (such as ORDER BY date)?</p>
<p>For example, when a user wants to see all items with certain tags, sorted newest item first. How is performance then?</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Ryan</title>
		<link>http://www.pui.ch/phred/archives/2005/06/tagsystems-performance-tests.html#comment-87321</link>
		<dc:creator>Ryan</dc:creator>
		<pubDate>Tue, 25 Sep 2007 03:19:23 +0000</pubDate>
		<guid isPermaLink="false">http://www.pui.ch/phred/archives/2005/06/tagsystems-performance-tests.html#comment-87321</guid>
		<description>It looks like the first commenter ("Go back to school dude!
Rework the queries for the toxi schema and use JOINs.") never pursued his argument, and it looks like you never "got" what he was really saying. 

I think what he meant was that a query of this form:
SELECT ... WHERE name IN ("word1", "word2", "word3")
HAVING ...

should instead be rewritten as:
SELECT ... WHERE name = "word1"
INNER JOIN
SELECT ... WHERE name = "word2"
INNER JOIN
SELECT ... WHERE name = "word3"

Same thing for the UNION query. Use actual UNIONS instead of "WHERE name IN ...".

I think this is supposed to give you a good performance boost.
Would you consider redoing your benchmarks on the TOXI schema using these queries above?</description>
		<content:encoded><![CDATA[<p>It looks like the first commenter (&#8220;Go back to school dude!<br />
Rework the queries for the toxi schema and use JOINs.&#8221;) never pursued his argument, and it looks like you never &#8220;got&#8221; what he was really saying. </p>
<p>I think what he meant was that a query of this form:<br />
SELECT &#8230; WHERE name IN ("word1", "word2", "word3")<br />
HAVING &#8230;</p>
<p>should instead be rewritten as:<br />
SELECT &#8230; WHERE name = "word1"<br />
INNER JOIN<br />
SELECT &#8230; WHERE name = "word2"<br />
INNER JOIN<br />
SELECT &#8230; WHERE name = "word3"</p>
<p>Same thing for the UNION query. Use actual UNIONS instead of &#8220;WHERE name IN &#8230;&#8221;.</p>
<p>I think this is supposed to give you a good performance boost.<br />
Would you consider redoing your benchmarks on the TOXI schema using these queries above?</p>
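<p>For anyone who wants to try the two query shapes side by side, here is a small runnable sketch using Python's sqlite3 module. The three-table layout is only an assumption about what a toxi-style schema looks like (the table and column names are made up for the example, not taken from the benchmark), and it checks that both forms return the same bookmarks rather than saying anything about which is faster on MySQL.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Assumed toxi-style layout: items, tags, and a link table between them.
    CREATE TABLE bookmark (id INTEGER PRIMARY KEY, url TEXT);
    CREATE TABLE tag (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    CREATE TABLE bookmark_tag (
        bookmark_id INTEGER REFERENCES bookmark(id),
        tag_id INTEGER REFERENCES tag(id),
        PRIMARY KEY (bookmark_id, tag_id)
    );
    INSERT INTO bookmark VALUES
        (1, 'http://a.example'), (2, 'http://b.example'), (3, 'http://c.example');
    INSERT INTO tag VALUES (1, 'word1'), (2, 'word2'), (3, 'word3');
    -- Bookmark 1 has all three tags, bookmark 2 has two, bookmark 3 has one.
    INSERT INTO bookmark_tag VALUES (1,1), (1,2), (1,3), (2,1), (2,2), (3,3);
""")

# Form 1: filter with IN, then keep only bookmarks matching all three tags.
in_having = """
    SELECT b.id
    FROM bookmark b
    JOIN bookmark_tag bt ON bt.bookmark_id = b.id
    JOIN tag t ON t.id = bt.tag_id
    WHERE t.name IN ('word1', 'word2', 'word3')
    GROUP BY b.id
    HAVING COUNT(DISTINCT t.name) = 3
    ORDER BY b.id
"""

# Form 2: join the link and tag tables once per required tag.
self_joins = """
    SELECT b.id
    FROM bookmark b
    JOIN bookmark_tag bt1 ON bt1.bookmark_id = b.id
    JOIN tag t1 ON t1.id = bt1.tag_id AND t1.name = 'word1'
    JOIN bookmark_tag bt2 ON bt2.bookmark_id = b.id
    JOIN tag t2 ON t2.id = bt2.tag_id AND t2.name = 'word2'
    JOIN bookmark_tag bt3 ON bt3.bookmark_id = b.id
    JOIN tag t3 ON t3.id = bt3.tag_id AND t3.name = 'word3'
    ORDER BY b.id
"""

rows_in_having = [row[0] for row in conn.execute(in_having)]
rows_self_joins = [row[0] for row in conn.execute(self_joins)]
print(rows_in_having, rows_self_joins)
```

<p>Whether the join form actually beats IN plus HAVING depends on the database, the indexes and the data distribution, so it would need to be measured on the real benchmark data.</p>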
]]></content:encoded>
	</item>
	<item>
		<title>By: Philipp Keller</title>
		<link>http://www.pui.ch/phred/archives/2005/06/tagsystems-performance-tests.html#comment-73758</link>
		<dc:creator>Philipp Keller</dc:creator>
		<pubDate>Sun, 22 Jul 2007 14:39:12 +0000</pubDate>
		<guid isPermaLink="false">http://www.pui.ch/phred/archives/2005/06/tagsystems-performance-tests.html#comment-73758</guid>
		<description>Geoff: Sure, go for the indexes! Have a look at http://www.pui.ch/phred/modules/tag_database_schemas.sql. I added indexes on tag.tagname, bookmark_tag.tag, bookmark_tag.bookmark and bookmark.url. If you can improve the performance by altering those indexes, let me know.</description>
		<content:encoded><![CDATA[<p>Geoff: Sure, go for the indexes! Have a look at <a href="http://www.pui.ch/phred/modules/tag_database_schemas.sql" rel="nofollow">http://www.pui.ch/phred/modules/tag_database_schemas.sql</a>. I added indexes on tag.tagname, bookmark_tag.tag, bookmark_tag.bookmark and bookmark.url. If you can improve the performance by altering those indexes, let me know.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Geoff</title>
		<link>http://www.pui.ch/phred/archives/2005/06/tagsystems-performance-tests.html#comment-71502</link>
		<dc:creator>Geoff</dc:creator>
		<pubDate>Thu, 12 Jul 2007 19:17:42 +0000</pubDate>
		<guid isPermaLink="false">http://www.pui.ch/phred/archives/2005/06/tagsystems-performance-tests.html#comment-71502</guid>
		<description>This is probably a dumb question but here goes. If I opt for the toxi approach, is there any performance benefit to indexing any of the columns?

G</description>
		<content:encoded><![CDATA[<p>This is probably a dumb question but here goes. If I opt for the toxi approach, is there any performance benefit to indexing any of the columns?</p>
<p>G</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Octablog &#187; A review of the Zend Framework - Part 3</title>
		<link>http://www.pui.ch/phred/archives/2005/06/tagsystems-performance-tests.html#comment-64802</link>
		<dc:creator>Octablog &#187; A review of the Zend Framework - Part 3</dc:creator>
		<pubDate>Tue, 19 Jun 2007 23:06:53 +0000</pubDate>
		<guid isPermaLink="false">http://www.pui.ch/phred/archives/2005/06/tagsystems-performance-tests.html#comment-64802</guid>
		<description>[...] Zend_Search_Lucene allowed me to tackle the complicated issue of tagging - Tagging is a known problem to map effectively to databases (A dude named Philipp Keller wrote a blog on different tagging schemas, and conducted a performance comparison of the schemas. Another dude named Nitin Borwankar suggested yet another schema for tagging. The tagging issue is a long and complicated one.) To quote del.icio.us creator, Joshua Schachter - &#8220;tags don&#8217;t map to sql at all. so use partial indexing.&#8221; Using Zend_Search_Lucene to index tagged items allowed us to implement tags in the Octabox project while still enjoying high performance, which was something that I was quite worried over before. [...]</description>
		<content:encoded><![CDATA[<p>[...] Zend_Search_Lucene allowed me to tackle the complicated issue of tagging - Tagging is a known problem to map effectively to databases (A dude named Philipp Keller wrote a blog on different tagging schemas, and conducted a performance comparison of the schemas. Another dude named Nitin Borwankar suggested yet another schema for tagging. The tagging issue is a long and complicated one.) To quote del.icio.us creator, Joshua Schachter - &#8220;tags don&#8217;t map to sql at all. so use partial indexing.&#8221; Using Zend_Search_Lucene to index tagged items allowed us to implement tags in the Octabox project while still enjoying high performance, which was something that I was quite worried over before. [...]</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Tjerk</title>
		<link>http://www.pui.ch/phred/archives/2005/06/tagsystems-performance-tests.html#comment-14212</link>
		<dc:creator>Tjerk</dc:creator>
		<pubDate>Wed, 08 Nov 2006 15:11:27 +0000</pubDate>
		<guid isPermaLink="false">http://www.pui.ch/phred/archives/2005/06/tagsystems-performance-tests.html#comment-14212</guid>
		<description>I just finished a course on advanced database systems, and we
learned how to increase performance for specific queries and schemas.

For example, do you know which query plans were used?
You can check that with the SQL EXPLAIN command.

I would recommend a hash index on the tag column because you are
not doing any range searches; searching for a specific tag would then take constant time.

Also, it is better to measure your performance tests in the number of page transfers between the hard disk and memory/CPU, because this is the bottleneck in performance. The milliseconds say more about your system than about those queries.

Anyway, another question: which indexes did you use? B+ indexes? ISAM indexes? Hash indexes? Which were the search keys for these indexes?</description>
		<content:encoded><![CDATA[<p>I just finished a course on advanced database systems, and we<br />
learned how to increase performance for specific queries and schemas.</p>
<p>For example, do you know which query plans were used?<br />
You can check that with the SQL EXPLAIN command.</p>
<p>I would recommend a hash index on the tag column because you are<br />
not doing any range searches; searching for a specific tag would then take constant time.</p>
<p>Also, it is better to measure your performance tests in the number of page transfers between the hard disk and memory/CPU, because this is the bottleneck in performance. The milliseconds say more about your system than about those queries.</p>
<p>Anyway, another question: which indexes did you use? B+ indexes? ISAM indexes? Hash indexes? Which were the search keys for these indexes?</p>
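<p>As a rough illustration of checking the query plan, the sketch below uses SQLite's EXPLAIN QUERY PLAN through Python's sqlite3 module. This is an assumption-laden stand-in: the benchmark ran on MySQL, whose EXPLAIN output looks different, and SQLite only offers B-tree indexes, so the hash index suggested above cannot be shown here. The table, index name and helper function are hypothetical.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tag (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO tag (name) VALUES (?)",
    [("tag%d" % i,) for i in range(1000)],
)


def plan(sql, params=()):
    """Return the human-readable detail column of EXPLAIN QUERY PLAN as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[-1] for row in rows)


query = "SELECT id FROM tag WHERE name = ?"

# Without an index on name, the lookup has to scan the whole table.
plan_before = plan(query, ("tag42",))

# With an index on name, the planner can search the index instead.
conn.execute("CREATE INDEX idx_tag_name ON tag (name)")
plan_after = plan(query, ("tag42",))

print("before:", plan_before)
print("after: ", plan_after)
```

<p>The exact wording of the plan text varies between SQLite versions, but the first plan should report a table scan and the second should mention idx_tag_name.</p>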
]]></content:encoded>
	</item>
</channel>
</rss>
