previous post: «del.icio.us statistics (that is: extrapolation)»
In my previous article named “Tags: database schemas” we analysed different database schemas on how they could meet the needs of tag systems. In this article, the focus is on performance (speed). That is: if you want to build a tagsystem that performs good with about 1 million items (bookmarks for instance), then you may want to have a look at the following result of my performance tests.
In this article I tested tagging of bookmarks, but as you can tag pretty much anything, this goes for tagging systems in general.
I tested the following schemas (I keep the naming from the previous article):
You may want to have a close watch at the details of the schemas when having a look at the sql-create-table-queries.
But let’s go directly to the results. The details about the setup of this tests are mentioned at the end of this article. The x-axis depicts the number of bookmarks in the corresponding database, on the y-axis you see how much time each query took to execute.

The first two tests are done with 250 tags in the small dataset (see below for explanation). I think the queries in the “1 million bookmarks database” are the only size we should pay attention to. I mean if you have a small number of bookmarks, performance isn’t really a thing to bother..
We run intersection queries, like
I want to search for bookmarks tagged with “design” and “html”
You see that, not surprisingly, mysqlicious with its WHERE tag LIKE "% tag %" is very slow. That is, MySQL has to go through the whole dataset and test each bookmark against the query.
What actually is surprising me, is that the fulltext search of mysql is not that high-performance. In fact it is not faster then the LIKE-query in the MySQLicious DB. This really disappointed me. I tried to do any quirks possible to make this faster as I think, a tag-database-system with mysql fulltext would be very easy and like the only thing you should head to...
What is surprising me too, is that the queries on the 3 table schema are about double as fast the the ones on the two-table ones(take a look at the queries if you think you could give me a hint on this). Noticeable is, that in the scuttle and toxi-variant, the more queries were run, the faster they were. I didn’t do any tests with queries and inserts mixed so this may be coming from just plain good caching and this effect possible doesn’t show up on live bookmark management systems.

Now have a look to what happens if we broaden our small tag set: MySQLicious with fulltext suddenly gets the performance leader. That means, if you have a bookmark management system with diverse tags (this most probably comes from the fact that there are many users), the fulltext solution is possibly the way to go.
So now, as you see, choosing the right schema is all about tag distribution. In my previous post about guessing the overall tag distribution on del.icio.us, I came to the conclusion, that delicious’ most popular tag “design” is showing up in 3.2% of all bookmarks on del.icio.us. So then, what is the mean tag distribution?
So I’d suggest that if your average distribution is 1%, take “toxi”, if the distribution is broader, take “MySQLicious fulltext”.
If you take a closer look, you can see that the fulltext schema stayed as fast as when queried in the 250 tag set. That means, if you want to go sure your tag system responds ok in every situation, you should go with the “mysql fulltext” schema.
Hannes has done some further investigation on mysql fulltext running on MySQL 4.1 (my tests were on MySQL 4.0.21)

When doing a union query we say
I want to search for all the bookmarks that are tagged either with “delicious” or “del.icio.us”
This queries, you guessed, are handled the fastest by “MySQLicious schema” with its LIKE-queries: MySQL seeks through the bookmarks, harvesting all bookmarks with one of the given tags and says “I’m finished!” when it was at bookmark number #968, because it found 50 bookmarks. Whereas in the other schemas, MySQL has to join the tags with the bookmarks first and only then could search though it..

When comparing the different schemas on the time of the insert-”statements” of one bookmark, the result isn’t very surprising (notice that I’ve changed the scale of the y-axis).
Mysqlicious with it’s 1 table is very fast indeed, its variation with fulltext had to create the fulltext index and therefore is a bit slower. Scuttle, with its 2 tables and toxi with its 3 tables are at least two respectively three times as slow. I have to remark, that I used quite a bit of caching for the toxi schema, as I didn’t want hours to have the data ready..
I guess it doesn’t really make sense to base your decision, which schema to take on the time for an insert: Bookmark inserts are about 100 times as fast as the intersection queries..
You said it. You don’t want your intersection queries take 0.2 seconds each. That would bring your system to its knees.
There are some recipes to avoid that:
I think, you don’t come around good old caching. I think that you could cache results to a query like “mysql+tagging” for about an hour or so. If a user queries his own items, I would lower the cache time (as up-to-dateness is more important with his own items).
Then, I expect if you for instance cache items per tag and intersection them with a decent algorithm, that could be faster..
I think you could have “mysqlicous fulltext” and “toxi” running at the same time. That means you have to update/insert in both schemas but when you have to query, you could take the one you think is faster: For simple union the mysqlicious without fulltext search, for intersection queries with common tags the toxi, and for those with uncommon tags the mysqlicious fulltext variant.
You could “slice and dice” data (as Nitin proposed it in two of his posts): That is: you slice your user/tag/item-room and build fact tables. You “prebuild” your results in a way. This way, inserts take long but queries themself should be much faster. In our examples, you would for instance first query the tag-intersections on “toxi” and then get the facts about each bookmark out of the “mysqlicious”-fact-table. But you really should read Nitins posts, as they give a lot of insight.
Update: It’s been about a year since I wrote that article, and during that year I came to the conclusion that RDBMS systems don’t scale good in systems that have more than 1 million items. Yes, this is a warning: If you are planning to build a large scale system then look for alternatives to RDBMS systems. To quote Joshua Schachter, founder of delicious:
«tags doesn’t map to sql at all. so use partial indexing.»[Joshua Schachter at Carson Summit]
I didn’t try any of the non-RDBMS system but it looks like Apache Lucene and Hadoop. There has been a discussion on the Tagdb Mailing list about these solutions.
Download the source code (PHP) I used to run the queries and test yourself, extend them as you like. The source is published as LGPL.
Now, if you have read that far, you probably want to know some background information: As you noticed, for each schema, I set up 4 databases, one database holding 1000 bookmarks, the next 10′000, then 100′000 and the fourth 1 million bookmarks. The inserted tags (as well as urls) are random English words taken from two sets of tags:
Every bookmark gets one to ten tags attached. Every odd tag is from the large set, alternately taken from small and large set. Every schema got exactly the same bookmarks and tag data.
Then every schema got queried with an alternately 1-3 tag query. So the first query is for instance just “blog”, the second “design+css”, the third “webdesign+music+software”, the fourth again with just one tag an so forth..
All the tags for the queries are taken from the small set so that the queries don’t all end in empty results..
All the queries are tested and work. The outcome of each query on the three schemas is exactly the same.
I used mysql 4.0.21.
An excerpt from /etc/my.cnf (I think these are the relevant settings to this performance test)
key_buffer=300M query_cache_size=30M query_cache_limit=30M table_cache = 64 ft_min_word_len = 2 ft_stopword_file = ''
CPU: 3GHz Dual Xeon
Cache: 1MB
Harddisk: SCSI Ultra 320 Atlas 10K, no RAID
RAM: 3GB
LIMIT 50 as I assume that a normal application doesn’t want to get all bookmarks. I assume nobody wants to order bookmarks by any dimension, because this would be very expensive (ever wondered why you cannot sort bookmarks on del.icio.us by date or similar? You get it..)Thanks to Citrin, the company I work, to let me use our new server to run the queries. The server didn’t have much anything else to do so the results should be accurate.
The graphs are done using JpGraph. Very easy to use and produces beautiful images.
RSS feed for comments on this post.
Go back to school dude!
Rework the queries for the toxi schema and use JOINs.
Comment by no name — June 21, 2005 9:35 am #comment-273
You realize that you can use mysqlicious approach when doing a Union query when you use the mysqlicious fulltext approach? :P
Comment by no name #2 — June 21, 2005 4:40 pm #comment-281
[...] , MySQLJune 21, 2005 6:37 pm
Fulltext search redux
Philipp Keller tackles tags and MySQL again, this time running a benchmark against four diffe [...]
Pingback by Lab notes :: Fulltext search redux :: June :: 2005 — June 21, 2005 7:33 pm #comment-282
Yeah, you are right! I didn’t thought of that.. But: When doing union with fulltext, it gives you the extra bonus that your result is ordered by “number of tags matching”. That way you can do “find similar entries”.
Comment by phred — June 22, 2005 7:07 am #comment-286
Intersection isn’t possible on del.icio.us?
Comment by no name #2 — June 22, 2005 1:30 pm #comment-287
Nor union? (sorry for double post)
Comment by no name #2 — June 22, 2005 1:31 pm #comment-288
Intersection is possible on del.icio.us. Union not yet.
I still wonder why del.icio.us has not yet implemented union.
I don’t get it because the union queries are faster in all schemas then the corresponding intersection queries..
Comment by phred — June 22, 2005 3:11 pm #comment-293
Intersection: Oh, indeed. Haven’t noticed that yet :)
Union: It’s not very interesting…what’s the point of Union actually?
Comment by no name #2 — June 22, 2005 5:37 pm #comment-300
Isn’t the point of union quite obvious from the example in the article? If I wan’t to find bookmarks about delicious I would see them all at one time instead of first look at the ones tagged “delicious” and then those tagged “del.icio.us”.
Comment by Daniel — June 22, 2005 10:36 pm #comment-302
Must think and read before complain :D
Comment by no name #2 — June 23, 2005 11:32 am #comment-318
Daniel: Thank you.. :-)
To clarify:
Comment by phred — June 23, 2005 12:26 pm #comment-319
What do you mean by:
“rework your queries and use joins”?
I altered the toxi query to be something like this:
But imho that’s the same as when I say t.id=tm.tag_id in the

WHEREclause.I nontheless did run a test:
Notice that that all the queries run faster. Could be that the server while running the tests of the article wasn’t really idle.. :-) But it doesn’t change the results as the ranking stays the same.
Comment by phred — June 23, 2005 12:35 pm #comment-320
What MySQL version? And some server details? :D
Comment by no name #2 — June 23, 2005 4:13 pm #comment-323
And you checked EXPLAIN SELECT for all queries?
Comment by no name #2 — June 23, 2005 7:50 pm #comment-327
Noname2: Added mysql-version and server-details to the article. Thank you for the hint.
And about the
EXPLAIN SELECT:I’m not a MySQL expert, here are the results for “toxi”:
I scanned and skimmed MySQL explain manual and found out that the “using filesort” and “using temporary” ain’t that good.. :-)
Comment by phred — June 24, 2005 7:17 am #comment-339
Try to use MySQL 4.1. IMHO they greatly improved full-text there. I might try a few things too but creating those 1mio tables takes HOURS. :S
Comment by no name #2 — June 24, 2005 8:53 am #comment-340
I eh created a 1mio mysqlicious fulltext table on my home computer (took like half an hour, Athlon 2400XP, 512MB RAM, old hard disk, lots of programs open :P) with MySQL 4.1 and intersection takes ~0.03 seconds here (with LIMIT 50) and 0.22 seconds with all results (457). I don’t have ANY optimizations at all, just the standard MySQL 4.1 installation.
Comment by no name #2 — June 24, 2005 11:08 am #comment-342
Sounds good! Did you use my code or how did you set up your bookmarks/tags?
I’m gonna do tests when we migrate the server to SuSE 9.3. SuSE 9.3 comes with MySQL 4.1.xx
Comment by phred — June 24, 2005 11:21 am #comment-344
Used your code to generate the bookmarks/tags but executed the queries right in MySQL without a PHP script.
Comment by no name #2 — June 24, 2005 12:04 pm #comment-347
Ah cool.. glad somebody used my code.. :-)
Ok, how many tags did you search for?
There is a big difference between searching for bookmarks that fulfill just a “one tag query” and searching for bookmarks with a 3-tags-query.
If you use my script to do the intersection test, then you can generate the graph with
php graph.php outputfileComment by phred — June 24, 2005 1:00 pm #comment-349
I just looked for two tags. And my machine at home is windows so I don’t have all that graph stuff, my production machine is hosting busy vBulletin forums so I can’t really try there.
Comment by no name #2 — June 24, 2005 1:09 pm #comment-350
Can you write me an email?
Comment by phred — June 24, 2005 2:19 pm #comment-353
http://hannes.magiccards.info/get/results.html
Weird results sometimes. Don’t know how to interprete those.
Comment by no name #2 — June 25, 2005 5:15 pm #comment-364
Hi Phred,
I too am very interested in building a massively scalable schema for my “freetag” type opensource project (which allows you to tag anything you want anywhere)
I have experimented very heavily into a few schema types.
As a test for my tagging schema/query code, i built a display front end that is similar to delicious, where users can get a list of objects tagged, as well as statistical info such as “tagged by how many others” etc.
One must note that the schema must not only be able to support
Intersections/Unions, but also statistical queries such as “COUNT number of objects tagged with this tag”
“COUNT number of objects tagged with this tag, belonging to certain user”
“COUNT number of objects tagged with INTERSECTION(TAG1+TAG2)”
IMHO Counts are the big killers in terms of performance. A truly normalized approach helps a lot, because u can index lots of columns, which helps counts tremenously.
However, fully normalized = many tables = many joins= BIG problems when u try to join Table A(2million links/objects) with TABLE B(100000users) with TABLE C(1000000tags).
Initially, i started with a totally normalized approach. This works pretty well up to about to about 10k links.
An approach suggested by Nitin of tagschema.com suggests a hybrid normalized+denormalized approach. This will help in the sense that u can achieve faster selects by having less joins, but the “denormalized” FACT tables he suggest increase in size VERY VERY, the more columns u put in.
Up till now noone knows how delicious does it, but all i know is that Joshua has done quite a few DB schema revisions, but his site with >10m links seems to be performing fine under the load.
Do email me sometime! Thanks
Comment by powster — June 26, 2005 11:30 am #comment-379
Hannes: Thank you! I added the link to your results to the article.
The results are indeed strange. I think the fulltext thingie is like a black box. When I first heard about it, I thought it would have about the same performance as the toxi variant..
Powster: Good point with the counts. Interesting is, that on del.icio.us, counts are shown only on per-user-views of bookmarks (i.e. /phred, and never on a “global” view (i.e. /tag/web).
I read all articles of Nitin and he has got very good ideas to build a good and fast database schema. Thank you for pointing me to his posts..
I’ll add some “conclusions” to the articles with insights from his articles. He promised to set up a mailing list, so we could have a nice discussion there..! I’m looking forward to it!
And I must admit that my performance tests are a bit too “clean”. That is: Your tests, powster, seem to be more accurate.. Perhaps I’ll do some mixed insert/query-tests.. and add COUNT-tests as well.. I think then we’ll have a basis for selecting the right database for the thing we want to do..
Comment by phred — June 27, 2005 1:07 pm #comment-408
I think all the slow queries appear when the database has to grab a lot of rows. This happens when comparing a large set of tags to a small one for intersection (two bigs sets are easy, because 50 rows are found easily), or merging two huge sets of bookmarks (union) or counting a large set of tags. “Full” table scans are a thing to avoid in ANY case.
I think you have to try a completely different approach to solve these problems. Because intersection a small set and a huge set is always slow with all of these three approaches.
You maybe COULD try to have two different types of tags. Common tags and rare tags and try to come up with a schema that uses this. Not sure though.
Comment by no name #2 — June 27, 2005 3:01 pm #comment-410
Try these:
http://del.icio.us/tag/web+design
http://del.icio.us/tag/web+design+tips
I get server errors on all of them. Corresponds exactly to my results. Comparing huge sets of tags = crap.
Comment by no name #2 — June 27, 2005 3:07 pm #comment-411
Hannes: Good thoughts.. one has to come up with a different schema. IMHO this is not that easy..
And delicious has problems with
http://del.icio.us/tag/web+@de
http://del.icio.us/tag/web+ckoma
as well.. as with any tag intersected with “web”.
Well.. this just tells us that the delicious schema isn’t voodoo but uses standard schema setups.. :-)
Comment by phred — June 27, 2005 3:59 pm #comment-413
Indeed. Mine were with two huge sets, yours are with one huge and one small set, as I pointed out above (and in my test).
And I guess if there was a better schema delicious would use it :P
Comment by no name #2 — June 27, 2005 6:51 pm #comment-417
Perhaps. But we are here to change the world, aren’t we? :-)
btw: I asked Joshua before I did my performance test if I could help him out to test some things he wanted to figure out.. he didn’t answer so probably he is quite content with his schema :-)
Comment by phred — June 28, 2005 6:53 am #comment-427
Hey Phred!
Just wanted to say thanks heaps for doing these tests for real. Very informative.
So…. Bottom line, which solution would you choose for your next data-heavy folksonomy app?
PS I realise (now) that my quick reply to your original post was rather flippant and quick compared to people who are actually doing all this stuff in a serious way. The full text indexing does sound very intereting. I wonder if MSSQL Full Text index would work as well as the MySQL index (probably, huh)?
Comment by Miles Thompson — July 8, 2005 2:32 am #comment-625
Miles: Your comment and also the voice of others got me doing this performance test.. so, thank you too.. :-)
About your bottom line question: I added some subparagraphs to my “What? That slow?” paragraph. Hope that answers your question?
I don’t have any experience with MsSQL.. But I guess you could take my php-code and wouldn’t have to change it much to have it run on MsSql.. would be interesting..
Comment by phred — July 8, 2005 7:56 am #comment-628
[...] Muy bueno el post “Tagsystems: performance tests“, de Philip Keller, una de las personas que ms estn escribiendo ltimamente sobre el trasfondo tcnico de los sistemas de tags. [...]
Pingback by mildiez.net » Archivo de bitcora » Cmo modelar un sistema de tags — December 8, 2005 2:01 pm #comment-1887
[...] Tagging Technik, freetag-library , datenbanknormalisierung/performance [...]
Pingback by doomicile » Blog Archive » Webmontag in Hamburg — March 22, 2006 7:45 am #comment-2436
One of the primary tricks for speeding up this sort of thing is to put more data in the index. Instead of using just “tag” as a key, use “tag, bookmark_id” as athe key. The nice thing about this is that the database can suck up the bookmark_id that it needs out of the index rather than having to go back to the underlying row to extract the data that it needs.
Note that the toxi tag table uses varchar(255) as the datatype while the scuttle tag table uses varchar(20) as the datatype. While this shouldn’t affect performance, it is important in this type of comparison test to use the same data type just in case.
Comment by cesium62 — June 22, 2006 12:20 am #comment-4219
[...] Еще о тэгах, разные методики организации данных и тесты производительности http://www.pui.ch/phred/archives/2005/04/tags-database-schemas.html http://www.pui.ch/phred/archives/2005/06/tagsystems-performance-tests.html [...]
Pingback by WEB2.0 на Руси » Производительность в WEB2 — August 8, 2006 10:10 am #comment-5672
[...] The best resource that I've found on the subject is this plog post on the different database schemas . I'm also quite impressed with the author's breakdown of performance times for tagging datastructures as the number of tags increased. For most of the applications that I build, there would be a minimal number of tags as they will be managed from one central administration (unlike Flickr or Delicious where users can define their own tags). [...]
Pingback by That’s What She Said — October 16, 2006 1:44 am #comment-12084
Excellent follow-up article, thank you for your efforts in producing the performance audits :-)
The link to JpGraph has been mangled it seems, the link points to ‘http://http//www.aditus.nu/jpgraph/’ instead of ‘http://www.aditus.nu/jpgraph/’.
Comment by Zeeshan — November 2, 2006 12:20 pm #comment-13650
Zeeshan: Thank you, fixed the link to jpgraph.
Comment by Philipp Keller — November 5, 2006 1:16 pm #comment-13910
I just finished a course advanced database systems, and we
learnede how to increase performance for specific queries schema’s.
For example do you know wich query-plans where used?
You can check that with the sql EXPLAIN command.
I would recommend a hash index on the tag collumn because you are
not doing any reange searches .. searching for a specific tag would require constant time.
Als it is better to measure your performance test in the number of page transfers between the hard disc and the memory/cpu. Because this is the bottleneck in performance. The miliseconds say more about your system than about those queries.
Anyways, another question: Which indexes did you use? B+ indexes? ISAM indexes? Hash indexes? Which where the search-keys for these indexes?
Comment by Tjerk — November 8, 2006 4:11 pm #comment-14212
[...] Zend_Search_Lucene allowed me to tackle the complicated issue of tagging - Tagging is a known problem to map effectively to databases (A dude named Phillip Keller wrote a blog on different tagging schemas, and conducted a performance comparison of the schemas. Another dude named Nirin Borwankar suggested yet another schema for tagging. The tagging issue is a long and complicated one.) To quote del.icio.us creator, John Schachter - “tags don’t map to sql at all. so use partial indexing.” Using Zend_Search_Lucene to index tagged items allowed us to implement tags in the Octabox project while still enjoying high performance, which was something that I was quite worried over before. [...]
Pingback by Octablog » A review of the Zend Framework - Part 3 — June 20, 2007 12:06 am #comment-64802
This is probably a dumb question but here goes. If I opt for the toxi approach, is there any performance benefit to indexing any of the columns?
G
Comment by Geoff — July 12, 2007 8:17 pm #comment-71502
Geoff: Sure, go for the indexes! Have a look at http://www.pui.ch/phred/modules/tag_database_schemas.sql. I added indexes on tag.tagname, bookmark_tag.tag, bookmark_tag.bookmark and bookmark.url. If you can improve the performence by altering that indexes let me know.
Comment by Philipp Keller — July 22, 2007 3:39 pm #comment-73758
It looks like the first commenter (”Go back to school dude!
Rework the queries for the toxi schema and use JOINs.”) never pursued his argument, and it looks like you never “got” what he was really saying.
I think what he meant was that a query of this form:
SELECT … WHERE name IN (”word1″, “word2″, “word3″)
HAVING …
should instead be rewritten as:
SELECT … WHERE name = “word1″
INNER JOIN
SELECT … WHERE name = “word2″
INNER JOIN
SELECT … WHERE name = “word3″
Same thing for the UNION query. Use actual UNIONS instead of “WHERE name IN …”.
I think this is supposed to give you a good performnance boost.
Would you consider redoing your benchmarks on the TOXI schema using these queries above?
Comment by Ryan — September 25, 2007 4:19 am #comment-87321
What about benchmarks when the queries have ORDER BY (such as ORDER BY date).
For example, when a user wants to see all items with certain tags, sorted newest item first. How is performance then?
Comment by orderlord — September 26, 2007 6:57 am #comment-87607
Efficiently handling tags is similar to attributes in dating sites (ie. +blonde +tits -fat) except there are a lot more tags than profile attributes.
In order to extract any kind of acceptable performance from a SQL database, you will have to forget about LIKE (full table scan), and foreign keys (scuttle/toxi solution).
Basically you need a SQL database which supports one of the following :
- efficient star join support (that means Oracle),
- Bitmap index support (coming up in Postgres),
- efficient fulltext search support (ie Postgres)
- vectors (arrays) as column types and specific index methods to make boolean queries (ie. Postgres) on the values contained in said vectors.
MySQL is not part of the solution ; besides MySQL FULLTEXT is lucicrous.
A better solution may be to use a full text search engine. I tried Xapian and found that, on large data sets consisting of up to a million forum posts, it massively outperformed Postgresql’s fulltext search, which itself massively outperformed MySQL’s fulltext search. This can be used for tags, and obviously to search the articles’ full text. Obviously, Lucene is also a solution, however it is less user-friendly than Xapian (uses Java, bleh, hard to interface with Python for update scripts, etc).
Comment by Peufeu — October 15, 2007 11:45 am #comment-92353
only 2 years since last comment and 4 since the post.
But at about the time of this post, I was working for LinkedIn.com I ran some performance tests comparing Lucene, MySQL FULL Text, Oracle Full Text for searching people’s profile. Hands down Lucene was the winner.
Ever wonder why there is no obvious way to break a connection in LinkedIn? Its because the Lucene index is incrementally added to. Removing a connection from the search results is an expensive operation.
Of course things change — would be interesting to see the results of 4 years worth of work on all three products.
Comment by Pat — June 8, 2009 11:59 pm #comment-130720
«tags doesn’t map to sql at all. so use partial indexing.»[Joshua Schachter at Carson Summit]
FYI the link to Joshua is dead.
Thanks for all the useful information.
Comment by c-a — January 19, 2010 12:01 am #comment-130804
@c-a: thanks for the hint, I’ve corrected the link
Comment by Philipp Keller — January 30, 2010 10:59 pm #comment-130806