previous post: «Tags: Database schemas»
While setting up the promised performance test in my last post, I did some tests with the MySQL fulltext features and it seems that they are built for tagging systems. Take a look at the queries (if it is not clear for you what is done here, please read my previous post).
I took the MySQLicious schema and added ALTER TABLE `delicious` ADD FULLTEXT (`tags`).
The full schema:
CREATE TABLE `delicious` (
`id` int(11) NOT NULL auto_increment,
`url` text,
`description` text,
`extended` text,
`tags` text,
`date` datetime default NULL,
`hash` varchar(255) default NULL,
PRIMARY KEY (`id`),
KEY `date` (`date`),
FULLTEXT KEY `tags` (`tags`)
) ENGINE=MyISAM
Intersections can be done using boolean fulltext search (since MySQL 4.01):
Query for semweb+search:
SELECT * FROM delicious WHERE MATCH (tags) AGAINST ('+semweb +search' IN BOOLEAN MODE)
Now this was easy. And, you guess it, Minus is very similar:
Query for search+webservice-search:
SELECT * FROM delicious WHERE MATCH (tags) AGAINST ('+search +webservice -search' IN BOOLEAN MODE)
Even brackets are possible:
Query for (del.icio.us|delicious)+(webservice|project):
SELECT * FROM delicious WHERE MATCH (tags) AGAINST ('+(del.icio.us delicious) +(webservice project)' IN BOOLEAN MODE)

For union you could use the already mentioned boolean mode, but if you want to have the results ordered so that the bookmark with the most “hits” is the first entry of the result try this sort of query:
SELECT * FROM delicious WHERE MATCH (tags) AGAINST ('delicious clone project webservice')
If you take a look at the screenshot of the first 7 results of the query run on my DB, you can see that the first hit has got all four tags we searched for, the second has got two and the rest has got just one of them. Like this you can do a “find similar entries” very easily.
There are two points where difficulties can accur: When MySQL builds its index out of the tags and when searching for specific tags. I stumbled on three problems:
If you insert tags with characters like “-” (as in “my-comment”), then MySQL will make two index entries: One for “my” and one for “comment”. Vice versa if you search for “my-comment” you’ll find bookmarks with tag “my” and those with tag “comment”. It seems that this problem can be eliminated by setting the character set of the column “tags” to latin1_bin but this feature is not available before MySQL 4.1.
But nontheless this shouldn’t be a showstopper. You could replace “-” with a string, say “_minus_”. This is ugly but should do it..
When searching for or indexing tags like “against” or “brief” (full list of stopwords), these tags will not be regarded.
Since MySQL 4.0.10 you can customize your stopwordlist.
Per default, the minimal length of a word indexed by MySQL fulltext is 4 characters. You should therefor edit my.cnf in order to set the minimal tag length to 1.
This solution scales ok. I did tests with tables from 1000 to 1 million bookmarks.
The time for inserting a bookmark is the same for small as for big tables. The time for an intersection query was 0.001 (finding 0.7 urls averaged) in the 1000-table and 0.1 seconds in the 1 million-table(finding 70 bookmarks averaged). There are some discussions about if MySQLs fulltext search is fast or not (have a look at the user comments). Quick performance tests showed that it is about 10 times as fast as the LIKE-queries mentioned in my previous post. But I guess it is not fast enough for webservices like del.icio.us, I guess this services have to run more than 10 queries a second and then this solution is too slow..
But anyway: I will do an article on the perfomance tests with more accurate data.
RSS feed for comments on this post. TrackBack URI
[...] the queries (if it is not clear for you what is done here, please read my previous post). Then each went to his own home » Tags with MySQL fulltext [...]
Pingback by HYPERGURU » Database design: Tags with MySQL fulltext — May 10, 2005 9:12 am #comment-95
Fulltext for tag searches
I’ve been looking at different schemas that can support tagging, which I realized that full text search is the perfect application for tagging.
A scheme where each tag (applied to an object by a user) occupies a row in the database is simple and effi…
Trackback by Lab notes — May 30, 2005 8:09 am #comment-108
Thanks for spending all the effort, I really enjoy your “MySQL-series” or should I say “delicious reverse engineering”? ;)
Whatever it is, keep’em coming!
I’ve been playing around with tag-based schemes myself and came to the following conclusions:
- Postgres is too slow. I benchmarked it kinda thoroughly and
there seems to be no way to get near the response times
of mysql. The advanced features come with a pricetag, I suppose.
- The most appropiate way to tackle the tag-problem seems to be
a custom, rather flat database with custom indexes suited
for the task. Unfornationally you (obviously) lose all the
flexibility of SQL when going that way and it’s one hell of an
effort to get right. I’ve spent the last few weeks thinking
it through and came to the conclusion that it should be
doable (constant response times in the two-digit ms-ballpark)
but, as said, some aspects like all “counting” require careful
planning and would be very hard to change later on.
For these reasons I abandoned the approach and am now looking
again into how to squeeze the most out of regular (My)SQL-
Databases.
And the following thoughts regarding delicious:
Some of Joshua’s statements on the mailing list indicate he’s
indeed using MySQL. I assume that either the recent hardware-upgrades were sufficient to mitigate the performance issues with mysql fulltext-search for a little longer
OR
he is in fact using a different schema than those that you have presented so far. I guess some cheating would be possible with very ugly hacks (like, one table per tag or such?)
Well, I, too, would love to knowwhat schema he’s actually using.
Maybe, one day, Joshua will tell. ;-)
Comment by micro — June 21, 2005 7:53 pm #comment-283
Thank you for the long comment!
Good to know that I don’t have to consider prosgres for tagging.
I think too that some heavy index tabling and such would make it much faster. But you are true that this is not easy (I didn’t think of that count feature..)
But as I watch delicious, I see that lot of things don’t update “right on”, so it seems that there are lot of index and cache tables around.
Comment by phred — June 22, 2005 6:57 am #comment-285
Stopcharacters
“If you insert tags with characters like “-” setting the character set of the column “tags” to latin1_bin”
except of “-”,
what about “” “?”, inside XML-FILE.
Comment by gedamis — December 16, 2005 7:26 am #comment-1925
Gedamis: Do you mean questionmarks in tags? I havn’t checked these, but I think before I’d implement tagging with mysql fulltext, I’d check all the different special characters like !,.@* and also éöêï and the likes. Perhaps it would be ok to reject tags with non-supported special characters at the moment when a user adds these to an item.
Comment by phred — December 16, 2005 7:57 am #comment-1926
[...] _ MySQLʸǽǥ¸ ▲ [...]
Pingback by ˤä(2006-03-29) — March 29, 2006 2:09 am #comment-2499
Hi,
An other problem is when you delete a tag you have to update all bookmarks that are associated with. I think such a proceedure is a really performance killer or is there a magic trick behind?
Comment by Armand — May 9, 2006 7:40 am #comment-2876
Yeah, you’d have to go through all your posts with that given tag. If this are about 200 posts I suppose that’d be ok performance wise since you won’t delete a tag all days.
Comment by Philipp Keller — May 9, 2006 7:54 am #comment-2877
Is there an easy way to fetch “most popular tags” ? i think one would have to select all tags from all posts and then count all instances of all words and sort theese values descending, but can this be acomplished in mysql only ? or do i need to make a php function for that purpose ?
Comment by tonny — May 23, 2006 10:47 pm #comment-3134
Tonny: I don’t think MySQL could do that. You could do a PHP script but I suppose you’d do a separate table for that. Everytime a bookmark is added you increment all it’s tags counter in the separate table. This way you could also do a “popular tags per week” or per month table.
Comment by Philipp Keller — May 25, 2006 10:28 am #comment-3157
[...] Tags with MySQL fulltext I did some tests with the MySQL fulltext features and it seems that they are built for tagging systems. (tags: My SQL TAG) [...]
Pingback by All in a days work… » Blog Archive » links for 2006-07-12 — July 12, 2006 3:33 am #comment-4817
[...] However, perhaps more exciting is their promise of every user’s own MySQL container! I hope this means I can finally be able to tinker with my own MySQL server settings! What do I need this for you ask? Well for example for one of my projects I needed to use a MySQL fulltext implementation of a tagging system, and by default, MySQL has its minimum word length limit set to 4, meaning all tags of length 3 or less won’t be searched! So I asked mediatemple to set ft_min_word_length = 3 but of course they wouldn’t because that would affect everybody else on my server (grid?) [...]
Pingback by jon lee dot see eh » Blog Archive » mediatemple woes — February 2, 2007 8:54 pm #comment-30619
I like the fulltext approach to tagging.
With this approach what is the best way to generate a list of unique tags, with counts for each?
Such as:
books (3)
programming (10)
work (1)
Comment by Mark Carey — March 23, 2007 1:10 pm #comment-43807
Mark: You would do that in your program. It is not possible to do this on the database side, that’s actually a downside of this db schema.
Comment by Philipp Keller — March 23, 2007 1:45 pm #comment-43811
Hi guys,
Very informative article and comments!
I will appreciate if you can suggest the most scalable and performance-oriented solution in my scenario. I have a PHP/MySQL based application, with the following DB schema (simplified):
Table: Users
————
UserID
[other misc. fields]
Table: Bookmarks
—————-
BookmarkID
BookmarkURL
[other misc. fields]
Table: UserBookmarks
——————–
UBID
UserID
BookmarkID
BookmarkTags (Varchar)
This schema allows multiple users to manage multiple bookmarks. Now, what approach would be ideal in the following two scenarios:
1. If I want to search for bookmarks based on keywords. Eg: “mysql database” -postgres. Programatically, a boolean search parser can be implemented with either ‘LIKE’ or ‘Fulltext’, but I’m concerned about performance (say with a thousands of bookmarks and tags, and high frequency search queries from the users). I need to show unique bookmarks matching the search query and also the number/count of users who have bookmarked it? What do you guys reckon?
2. If I want to build a tag cloud, displaying all the unique tags and their count per user bookmarks.
Please share your thoughts. Thanks in advance!
Comment by Nickel — June 7, 2007 7:32 pm #comment-61364
Nickel: I did some perfomance measurements. It depends how your data looks and how big your database will become. If you “just” have a few thousand entries, then I would go for the “toxi schema” and would add a “fulltext table” with the columns “bookmark_id”, “user_id” and “tags”. This extra table holds redundant data that helps you with the boolean fulltext queries (much much faster than LIKE queries). If you have less complicated queries (just and), then I’d just stick with the toxi schema and wouldn’t add the extra fulltext table.
As for the tag cloud: If you want to have a “global tag cloud” (whatever this is for, I find them rather silly), you’d have to compute this from time to time. Doing tag clouds with the toxi solution is fairly simple (some group by clauses and you’re in)
Comment by Philipp Keller — June 10, 2007 2:34 pm #comment-62479
Why not have both, fulltext with mysql and a table wich keeps track off all different tags and adds everytime one tag is used.
- Egbert
Comment by Egbert Wietses — July 9, 2007 5:43 am #comment-70680
[...] Tags with MySQL fulltext [...]
Pingback by html-utvikler » Blog Archive » Tags og MySQL - en perfekt kombinasjon — April 9, 2008 8:23 pm #comment-121104
Tags, search and a tagcloud…
I have been thinking about how to implement the Tags and which database structure would be best for it. In the end I settled for a method using fulltext search. Now that I had the tags working and knew how search worked I added the actual search functi…
Trackback by The Experience Game Blog — October 15, 2008 3:23 pm #comment-129997
Nice solution but limited to use one word tag.
I found another good solution to store tags in the same structure separated by comma. And you can use any character includin ‘-’ in tag name.
CREATE TABLE `delicious` (
`id` int(11) NOT NULL auto_increment,
`tags` text
PRIMARY KEY (`id`),
FULLTEXT KEY (`tags`)
) ENGINE=MyISAM
The sql will be
SELECT * FROM `delicious` WHERE concat(‘, ‘,tags,’,') like ‘%, two,%’
and if executed against a table with this data
ID, tags
—————————————
1, ‘one two three, numbers, consecutive
2, ‘two times, three times, another tag
3, ‘tags, two, five, seven’
will show the result:
ID, Tags
—————————————
3, ‘tags, two, five, seven’
Comment by S!ava — July 25, 2009 4:18 pm #comment-130763