google
yahoo
bing
June 19th, 2005

Tagsystems: performance tests

In my previous article named “Tags: database schemas” we analysed different database schemas on how they could meet the needs of tag systems. In this article, the focus is on performance (speed). That is: if you want to build a tagsystem that performs good with about 1 million items (bookmarks for instance), then you may want to have a look at the following result of my performance tests.
In this article I tested tagging of bookmarks, but as you can tag pretty much anything, this goes for tagging systems in general.

I tested the following schemas (I keep the naming from the previous article):

You may want to have a close watch at the details of the schemas when having a look at the sql-create-table-queries.

But let’s go directly to the results. The details about the setup of this tests are mentioned at the end of this article. The x-axis depicts the number of bookmarks in the corresponding database, on the y-axis you see how much time each query took to execute.

Results

Intersection: 250 tag set

Intersection test with 300 queries, up to three tags in query, 250 tags in small dataset

The first two tests are done with 250 tags in the small dataset (see below for explanation). I think the queries in the “1 million bookmarks database” are the only size we should pay attention to. I mean if you have a small number of bookmarks, performance isn’t really a thing to bother..

We run intersection queries, like

I want to search for bookmarks tagged with “design” and “html”

You see that, not surprisingly, mysqlicious with its WHERE tag LIKE "% tag %" is very slow. That is, MySQL has to go through the whole dataset and test each bookmark against the query.
What actually is surprising me, is that the fulltext search of mysql is not that high-performance. In fact it is not faster then the LIKE-query in the MySQLicious DB. This really disappointed me. I tried to do any quirks possible to make this faster as I think, a tag-database-system with mysql fulltext would be very easy and like the only thing you should head to...
What is surprising me too, is that the queries on the 3 table schema are about double as fast the the ones on the two-table ones(take a look at the queries if you think you could give me a hint on this). Noticeable is, that in the scuttle and toxi-variant, the more queries were run, the faster they were. I didn’t do any tests with queries and inserts mixed so this may be coming from just plain good caching and this effect possible doesn’t show up on live bookmark management systems.

Intersection: 999 tag set

Intersection test with 300 queries, up to three tags in query, 250 tags in small dataset
Now have a look to what happens if we broaden our small tag set: MySQLicious with fulltext suddenly gets the performance leader. That means, if you have a bookmark management system with diverse tags (this most probably comes from the fact that there are many users), the fulltext solution is possibly the way to go.
So now, as you see, choosing the right schema is all about tag distribution. In my previous post about guessing the overall tag distribution on del.icio.us, I came to the conclusion, that delicious’ most popular tag “design” is showing up in 3.2% of all bookmarks on del.icio.us. So then, what is the mean tag distribution?

So I’d suggest that if your average distribution is 1%, take “toxi”, if the distribution is broader, take “MySQLicious fulltext”.
If you take a closer look, you can see that the fulltext schema stayed as fast as when queried in the 250 tag set. That means, if you want to go sure your tag system responds ok in every situation, you should go with the “mysql fulltext” schema.
Hannes has done some further investigation on mysql fulltext running on MySQL 4.1 (my tests were on MySQL 4.0.21)

Union

Union test with 250 tags in small dataset
When doing a union query we say

I want to search for all the bookmarks that are tagged either with “delicious” or “del.icio.us”

This queries, you guessed, are handled the fastest by “MySQLicious schema” with its LIKE-queries: MySQL seeks through the bookmarks, harvesting all bookmarks with one of the given tags and says “I’m finished!” when it was at bookmark number #968, because it found 50 bookmarks. Whereas in the other schemas, MySQL has to join the tags with the bookmarks first and only then could search though it..

Insert

Setup database schemas with the data: 250 tags in small dataset
When comparing the different schemas on the time of the insert-”statements” of one bookmark, the result isn’t very surprising (notice that I’ve changed the scale of the y-axis).
Mysqlicious with it’s 1 table is very fast indeed, its variation with fulltext had to create the fulltext index and therefore is a bit slower. Scuttle, with its 2 tables and toxi with its 3 tables are at least two respectively three times as slow. I have to remark, that I used quite a bit of caching for the toxi schema, as I didn’t want hours to have the data ready..

I guess it doesn’t really make sense to base your decision, which schema to take on the time for an insert: Bookmark inserts are about 100 times as fast as the intersection queries..

«What? That slow??»

You said it. You don’t want your intersection queries take 0.2 seconds each. That would bring your system to its knees.

There are some recipes to avoid that:

Caching

I think, you don’t come around good old caching. I think that you could cache results to a query like “mysql+tagging” for about an hour or so. If a user queries his own items, I would lower the cache time (as up-to-dateness is more important with his own items).
Then, I expect if you for instance cache items per tag and intersection them with a decent algorithm, that could be faster..

The Best Of Both Worlds

I think you could have “mysqlicous fulltext” and “toxi” running at the same time. That means you have to update/insert in both schemas but when you have to query, you could take the one you think is faster: For simple union the mysqlicious without fulltext search, for intersection queries with common tags the toxi, and for those with uncommon tags the mysqlicious fulltext variant.

Slicing and dicing

You could “slice and dice” data (as Nitin proposed it in two of his posts): That is: you slice your user/tag/item-room and build fact tables. You “prebuild” your results in a way. This way, inserts take long but queries themself should be much faster. In our examples, you would for instance first query the tag-intersections on “toxi” and then get the facts about each bookmark out of the “mysqlicious”-fact-table. But you really should read Nitins posts, as they give a lot of insight.

Using a non RDBMS system

Update: It’s been about a year since I wrote that article, and during that year I came to the conclusion that RDBMS systems don’t scale good in systems that have more than 1 million items. Yes, this is a warning: If you are planning to build a large scale system then look for alternatives to RDBMS systems. To quote Joshua Schachter, founder of delicious:

«tags doesn’t map to sql at all. so use partial indexing.»[Joshua Schachter at Carson Summit]

I didn’t try any of the non-RDBMS system but it looks like Apache Lucene and Hadoop. There has been a discussion on the Tagdb Mailing list about these solutions.

«I don’t believe you! I want to try it at home»

Download the source code (PHP) I used to run the queries and test yourself, extend them as you like. The source is published as LGPL.

Performance Tests Setup

Now, if you have read that far, you probably want to know some background information: As you noticed, for each schema, I set up 4 databases, one database holding 1000 bookmarks, the next 10′000, then 100′000 and the fourth 1 million bookmarks. The inserted tags (as well as urls) are random English words taken from two sets of tags:

Every bookmark gets one to ten tags attached. Every odd tag is from the large set, alternately taken from small and large set. Every schema got exactly the same bookmarks and tag data.

Then every schema got queried with an alternately 1-3 tag query. So the first query is for instance just “blog”, the second “design+css”, the third “webdesign+music+software”, the fourth again with just one tag an so forth..
All the tags for the queries are taken from the small set so that the queries don’t all end in empty results..
All the queries are tested and work. The outcome of each query on the three schemas is exactly the same.

Mysql Setup

I used mysql 4.0.21.
An excerpt from /etc/my.cnf (I think these are the relevant settings to this performance test)

key_buffer=300M
query_cache_size=30M
query_cache_limit=30M
table_cache = 64
ft_min_word_len = 2
ft_stopword_file = ''

System

CPU: 3GHz Dual Xeon
Cache: 1MB
Harddisk: SCSI Ultra 320 Atlas 10K, no RAID
RAM: 3GB

Assumptions

Acknowledgements

Thanks to Citrin, the company I work, to let me use our new server to run the queries. The server didn’t have much anything else to do so the results should be accurate.
The graphs are done using JpGraph. Very easy to use and produces beautiful images.

Further reading

49 Comments »

  1. Go back to school dude!
    Rework the queries for the toxi schema and use JOINs.

    Comment by no name — June 21, 2005 9:35 am #comment-273

  2. You realize that you can use mysqlicious approach when doing a Union query when you use the mysqlicious fulltext approach? :P

    Comment by no name #2 — June 21, 2005 4:40 pm #comment-281

  3. [...] , MySQLJune 21, 2005 6:37 pm

    Fulltext search redux

    Philipp Keller tackles tags and MySQL again, this time running a benchmark against four diffe [...]

    Pingback by Lab notes :: Fulltext search redux :: June :: 2005 — June 21, 2005 7:33 pm #comment-282

  4. Yeah, you are right! I didn’t thought of that.. But: When doing union with fulltext, it gives you the extra bonus that your result is ordered by “number of tags matching”. That way you can do “find similar entries”.

    Comment by phred — June 22, 2005 7:07 am #comment-286

  5. Intersection isn’t possible on del.icio.us?

    Comment by no name #2 — June 22, 2005 1:30 pm #comment-287

  6. Nor union? (sorry for double post)

    Comment by no name #2 — June 22, 2005 1:31 pm #comment-288

  7. Intersection is possible on del.icio.us. Union not yet.
    I still wonder why del.icio.us has not yet implemented union.
    I don’t get it because the union queries are faster in all schemas then the corresponding intersection queries..

    Comment by phred — June 22, 2005 3:11 pm #comment-293

  8. Intersection: Oh, indeed. Haven’t noticed that yet :)
    Union: It’s not very interesting…what’s the point of Union actually?

    Comment by no name #2 — June 22, 2005 5:37 pm #comment-300

  9. Isn’t the point of union quite obvious from the example in the article? If I wan’t to find bookmarks about delicious I would see them all at one time instead of first look at the ones tagged “delicious” and then those tagged “del.icio.us”.

    Comment by Daniel — June 22, 2005 10:36 pm #comment-302

  10. Must think and read before complain :D

    Comment by no name #2 — June 23, 2005 11:32 am #comment-318

  11. Daniel: Thank you.. :-)
    To clarify:

    Comment by phred — June 23, 2005 12:26 pm #comment-319

  12. What do you mean by:
    “rework your queries and use joins”?
    I altered the toxi query to be something like this:

    SELECT tm.bookmark_id
    FROM tagmap tm INNER JOIN tag t ON t.id=tm.tag_id
    WHERE t.tag IN (“design”, “blog”)
    GROUP BY tm.bookmark_id
    HAVING COUNT(tm.bookmark_id)=2

    But imho that’s the same as when I say t.id=tm.tag_id in the WHERE clause.
    I nontheless did run a test:
    intersection query with join and small tag set of 999
    Notice that that all the queries run faster. Could be that the server while running the tests of the article wasn’t really idle.. :-) But it doesn’t change the results as the ranking stays the same.

    Comment by phred — June 23, 2005 12:35 pm #comment-320

  13. What MySQL version? And some server details? :D

    Comment by no name #2 — June 23, 2005 4:13 pm #comment-323

  14. And you checked EXPLAIN SELECT for all queries?

    Comment by no name #2 — June 23, 2005 7:50 pm #comment-327

  15. Noname2: Added mysql-version and server-details to the article. Thank you for the hint.

    And about the EXPLAIN SELECT:
    I’m not a MySQL expert, here are the results for “toxi”:

    EXPLAIN SELECT tm.bookmark_id
    FROM tagmap tm, tag t
    WHERE tm.tag_id = t.id AND t.tag IN (“design”, “blog”)
    GROUP BY tm.bookmark_id
    HAVING COUNT( tm.bookmark_id ) =2

    table type possible_keys key key_len ref rows Extra
    t range PRIMARY,tag tag 256 NULL 2 Using where; Using temporary; Using filesort
    tm ref tag tag 5 t.id 126 Using where

    I scanned and skimmed MySQL explain manual and found out that the “using filesort” and “using temporary” ain’t that good.. :-)

    Comment by phred — June 24, 2005 7:17 am #comment-339

  16. Try to use MySQL 4.1. IMHO they greatly improved full-text there. I might try a few things too but creating those 1mio tables takes HOURS. :S

    Comment by no name #2 — June 24, 2005 8:53 am #comment-340

  17. I eh created a 1mio mysqlicious fulltext table on my home computer (took like half an hour, Athlon 2400XP, 512MB RAM, old hard disk, lots of programs open :P) with MySQL 4.1 and intersection takes ~0.03 seconds here (with LIMIT 50) and 0.22 seconds with all results (457). I don’t have ANY optimizations at all, just the standard MySQL 4.1 installation.

    Comment by no name #2 — June 24, 2005 11:08 am #comment-342

  18. Sounds good! Did you use my code or how did you set up your bookmarks/tags?
    I’m gonna do tests when we migrate the server to SuSE 9.3. SuSE 9.3 comes with MySQL 4.1.xx

    Comment by phred — June 24, 2005 11:21 am #comment-344

  19. Used your code to generate the bookmarks/tags but executed the queries right in MySQL without a PHP script.

    Comment by no name #2 — June 24, 2005 12:04 pm #comment-347

  20. Ah cool.. glad somebody used my code.. :-)
    Ok, how many tags did you search for?
    There is a big difference between searching for bookmarks that fulfill just a “one tag query” and searching for bookmarks with a 3-tags-query.
    If you use my script to do the intersection test, then you can generate the graph with php graph.php outputfile

    Comment by phred — June 24, 2005 1:00 pm #comment-349

  21. I just looked for two tags. And my machine at home is windows so I don’t have all that graph stuff, my production machine is hosting busy vBulletin forums so I can’t really try there.

    Comment by no name #2 — June 24, 2005 1:09 pm #comment-350

  22. Can you write me an email?

    Comment by phred — June 24, 2005 2:19 pm #comment-353

  23. http://hannes.magiccards.info/get/results.html

    Weird results sometimes. Don’t know how to interprete those.

    Comment by no name #2 — June 25, 2005 5:15 pm #comment-364

  24. Hi Phred,

    I too am very interested in building a massively scalable schema for my “freetag” type opensource project (which allows you to tag anything you want anywhere)

    I have experimented very heavily into a few schema types.

    As a test for my tagging schema/query code, i built a display front end that is similar to delicious, where users can get a list of objects tagged, as well as statistical info such as “tagged by how many others” etc.

    One must note that the schema must not only be able to support
    Intersections/Unions, but also statistical queries such as “COUNT number of objects tagged with this tag”
    “COUNT number of objects tagged with this tag, belonging to certain user”
    “COUNT number of objects tagged with INTERSECTION(TAG1+TAG2)”

    IMHO Counts are the big killers in terms of performance. A truly normalized approach helps a lot, because u can index lots of columns, which helps counts tremenously.

    However, fully normalized = many tables = many joins= BIG problems when u try to join Table A(2million links/objects) with TABLE B(100000users) with TABLE C(1000000tags).

    Initially, i started with a totally normalized approach. This works pretty well up to about to about 10k links.

    An approach suggested by Nitin of tagschema.com suggests a hybrid normalized+denormalized approach. This will help in the sense that u can achieve faster selects by having less joins, but the “denormalized” FACT tables he suggest increase in size VERY VERY, the more columns u put in.

    Up till now noone knows how delicious does it, but all i know is that Joshua has done quite a few DB schema revisions, but his site with >10m links seems to be performing fine under the load.

    Do email me sometime! Thanks

    Comment by powster — June 26, 2005 11:30 am #comment-379

  25. Hannes: Thank you! I added the link to your results to the article.
    The results are indeed strange. I think the fulltext thingie is like a black box. When I first heard about it, I thought it would have about the same performance as the toxi variant..

    Powster: Good point with the counts. Interesting is, that on del.icio.us, counts are shown only on per-user-views of bookmarks (i.e. /phred, and never on a “global” view (i.e. /tag/web).
    I read all articles of Nitin and he has got very good ideas to build a good and fast database schema. Thank you for pointing me to his posts..
    I’ll add some “conclusions” to the articles with insights from his articles. He promised to set up a mailing list, so we could have a nice discussion there..! I’m looking forward to it!

    And I must admit that my performance tests are a bit too “clean”. That is: Your tests, powster, seem to be more accurate.. Perhaps I’ll do some mixed insert/query-tests.. and add COUNT-tests as well.. I think then we’ll have a basis for selecting the right database for the thing we want to do..

    Comment by phred — June 27, 2005 1:07 pm #comment-408

  26. I think all the slow queries appear when the database has to grab a lot of rows. This happens when comparing a large set of tags to a small one for intersection (two bigs sets are easy, because 50 rows are found easily), or merging two huge sets of bookmarks (union) or counting a large set of tags. “Full” table scans are a thing to avoid in ANY case.

    I think you have to try a completely different approach to solve these problems. Because intersection a small set and a huge set is always slow with all of these three approaches.

    You maybe COULD try to have two different types of tags. Common tags and rare tags and try to come up with a schema that uses this. Not sure though.

    Comment by no name #2 — June 27, 2005 3:01 pm #comment-410

  27. Try these:
    http://del.icio.us/tag/web+design
    http://del.icio.us/tag/web+design+tips

    I get server errors on all of them. Corresponds exactly to my results. Comparing huge sets of tags = crap.

    Comment by no name #2 — June 27, 2005 3:07 pm #comment-411

  28. Hannes: Good thoughts.. one has to come up with a different schema. IMHO this is not that easy..

    And delicious has problems with
    http://del.icio.us/tag/web+@de
    http://del.icio.us/tag/web+ckoma
    as well.. as with any tag intersected with “web”.
    Well.. this just tells us that the delicious schema isn’t voodoo but uses standard schema setups.. :-)

    Comment by phred — June 27, 2005 3:59 pm #comment-413

  29. Indeed. Mine were with two huge sets, yours are with one huge and one small set, as I pointed out above (and in my test).

    And I guess if there was a better schema delicious would use it :P

    Comment by no name #2 — June 27, 2005 6:51 pm #comment-417

  30. And I guess if there was a better schema delicious would use it :P

    Perhaps. But we are here to change the world, aren’t we? :-)
    btw: I asked Joshua before I did my performance test if I could help him out to test some things he wanted to figure out.. he didn’t answer so probably he is quite content with his schema :-)

    Comment by phred — June 28, 2005 6:53 am #comment-427

  31. Hey Phred!

    Just wanted to say thanks heaps for doing these tests for real. Very informative.

    So…. Bottom line, which solution would you choose for your next data-heavy folksonomy app?

    PS I realise (now) that my quick reply to your original post was rather flippant and quick compared to people who are actually doing all this stuff in a serious way. The full text indexing does sound very intereting. I wonder if MSSQL Full Text index would work as well as the MySQL index (probably, huh)?

    Comment by Miles Thompson — July 8, 2005 2:32 am #comment-625

  32. Miles: Your comment and also the voice of others got me doing this performance test.. so, thank you too.. :-)

    About your bottom line question: I added some subparagraphs to my “What? That slow?” paragraph. Hope that answers your question?

    I don’t have any experience with MsSQL.. But I guess you could take my php-code and wouldn’t have to change it much to have it run on MsSql.. would be interesting..

    Comment by phred — July 8, 2005 7:56 am #comment-628

  33. [...] Muy bueno el post “Tagsystems: performance tests“, de Philip Keller, una de las personas que ms estn escribiendo ltimamente sobre el trasfondo tcnico de los sistemas de tags. [...]

    Pingback by mildiez.net » Archivo de bitcora » Cmo modelar un sistema de tags — December 8, 2005 2:01 pm #comment-1887

  34. [...] Tagging Technik, freetag-library , datenbanknormalisierung/performance [...]

    Pingback by doomicile » Blog Archive » Webmontag in Hamburg — March 22, 2006 7:45 am #comment-2436

  35. One of the primary tricks for speeding up this sort of thing is to put more data in the index. Instead of using just “tag” as a key, use “tag, bookmark_id” as athe key. The nice thing about this is that the database can suck up the bookmark_id that it needs out of the index rather than having to go back to the underlying row to extract the data that it needs.

    Note that the toxi tag table uses varchar(255) as the datatype while the scuttle tag table uses varchar(20) as the datatype. While this shouldn’t affect performance, it is important in this type of comparison test to use the same data type just in case.

    Comment by cesium62 — June 22, 2006 12:20 am #comment-4219

  36. [...] Еще о тэгах, разные методики организации данных и тесты производительности http://www.pui.ch/phred/archives/2005/04/tags-database-schemas.html http://www.pui.ch/phred/archives/2005/06/tagsystems-performance-tests.html [...]

    Pingback by WEB2.0 на Руси » Производительность в WEB2 — August 8, 2006 10:10 am #comment-5672

  37. [...] The best resource that I've found on the subject is this plog post on the different database schemas .  I'm also quite impressed with the author's breakdown of performance times for tagging datastructures as the number of tags increased. For most of the applications that I build, there would be a minimal number of tags as they will be managed from one central administration (unlike Flickr or Delicious where users can define their own tags). [...]

    Pingback by That’s What She Said — October 16, 2006 1:44 am #comment-12084

  38. Excellent follow-up article, thank you for your efforts in producing the performance audits :-)

    The link to JpGraph has been mangled it seems, the link points to ‘http://http//www.aditus.nu/jpgraph/’ instead of ‘http://www.aditus.nu/jpgraph/’.

    Comment by Zeeshan — November 2, 2006 12:20 pm #comment-13650

  39. Zeeshan: Thank you, fixed the link to jpgraph.

    Comment by Philipp Keller — November 5, 2006 1:16 pm #comment-13910

  40. I just finished a course advanced database systems, and we
    learnede how to increase performance for specific queries schema’s.

    For example do you know wich query-plans where used?
    You can check that with the sql EXPLAIN command.

    I would recommend a hash index on the tag collumn because you are
    not doing any reange searches .. searching for a specific tag would require constant time.

    Als it is better to measure your performance test in the number of page transfers between the hard disc and the memory/cpu. Because this is the bottleneck in performance. The miliseconds say more about your system than about those queries.

    Anyways, another question: Which indexes did you use? B+ indexes? ISAM indexes? Hash indexes? Which where the search-keys for these indexes?

    Comment by Tjerk — November 8, 2006 4:11 pm #comment-14212

  41. [...] Zend_Search_Lucene allowed me to tackle the complicated issue of tagging – Tagging is a known problem to map effectively to databases (A dude named Phillip Keller wrote a blog on different tagging schemas, and conducted a performance comparison of the schemas. Another dude named Nirin Borwankar suggested yet another schema for tagging. The tagging issue is a long and complicated one.) To quote del.icio.us creator, John Schachter – “tags don’t map to sql at all. so use partial indexing.” Using Zend_Search_Lucene to index tagged items allowed us to implement tags in the Octabox project while still enjoying high performance, which was something that I was quite worried over before. [...]

    Pingback by Octablog » A review of the Zend Framework - Part 3 — June 20, 2007 12:06 am #comment-64802

  42. This is probably a dumb question but here goes. If I opt for the toxi approach, is there any performance benefit to indexing any of the columns?

    G

    Comment by Geoff — July 12, 2007 8:17 pm #comment-71502

  43. Geoff: Sure, go for the indexes! Have a look at http://www.pui.ch/phred/modules/tag_database_schemas.sql. I added indexes on tag.tagname, bookmark_tag.tag, bookmark_tag.bookmark and bookmark.url. If you can improve the performence by altering that indexes let me know.

    Comment by Philipp Keller — July 22, 2007 3:39 pm #comment-73758

  44. It looks like the first commenter (“Go back to school dude!
    Rework the queries for the toxi schema and use JOINs.”) never pursued his argument, and it looks like you never “got” what he was really saying.

    I think what he meant was that a query of this form:
    SELECT … WHERE name IN (“word1″, “word2″, “word3″)
    HAVING …

    should instead be rewritten as:
    SELECT … WHERE name = “word1″
    INNER JOIN
    SELECT … WHERE name = “word2″
    INNER JOIN
    SELECT … WHERE name = “word3″

    Same thing for the UNION query. Use actual UNIONS instead of “WHERE name IN …”.

    I think this is supposed to give you a good performnance boost.
    Would you consider redoing your benchmarks on the TOXI schema using these queries above?

    Comment by Ryan — September 25, 2007 4:19 am #comment-87321

  45. What about benchmarks when the queries have ORDER BY (such as ORDER BY date).

    For example, when a user wants to see all items with certain tags, sorted newest item first. How is performance then?

    Comment by orderlord — September 26, 2007 6:57 am #comment-87607

  46. Efficiently handling tags is similar to attributes in dating sites (ie. +blonde +tits -fat) except there are a lot more tags than profile attributes.

    In order to extract any kind of acceptable performance from a SQL database, you will have to forget about LIKE (full table scan), and foreign keys (scuttle/toxi solution).

    Basically you need a SQL database which supports one of the following :

    - efficient star join support (that means Oracle),
    - Bitmap index support (coming up in Postgres),
    - efficient fulltext search support (ie Postgres)
    - vectors (arrays) as column types and specific index methods to make boolean queries (ie. Postgres) on the values contained in said vectors.

    MySQL is not part of the solution ; besides MySQL FULLTEXT is lucicrous.

    A better solution may be to use a full text search engine. I tried Xapian and found that, on large data sets consisting of up to a million forum posts, it massively outperformed Postgresql’s fulltext search, which itself massively outperformed MySQL’s fulltext search. This can be used for tags, and obviously to search the articles’ full text. Obviously, Lucene is also a solution, however it is less user-friendly than Xapian (uses Java, bleh, hard to interface with Python for update scripts, etc).

    Comment by Peufeu — October 15, 2007 11:45 am #comment-92353

  47. only 2 years since last comment and 4 since the post.

    But at about the time of this post, I was working for LinkedIn.com I ran some performance tests comparing Lucene, MySQL FULL Text, Oracle Full Text for searching people’s profile. Hands down Lucene was the winner.

    Ever wonder why there is no obvious way to break a connection in LinkedIn? Its because the Lucene index is incrementally added to. Removing a connection from the search results is an expensive operation.

    Of course things change — would be interesting to see the results of 4 years worth of work on all three products.

    Comment by Pat — June 8, 2009 11:59 pm #comment-130720

  48. «tags doesn’t map to sql at all. so use partial indexing.»[Joshua Schachter at Carson Summit]

    FYI the link to Joshua is dead.

    Thanks for all the useful information.

    Comment by c-a — January 19, 2010 12:01 am #comment-130804

  49. @c-a: thanks for the hint, I’ve corrected the link

    Comment by Philipp Keller — January 30, 2010 10:59 pm #comment-130806

RSS feed for comments on this post.

Leave a comment

This page and it's content is licenced under creative commons