previous post: «www2006 and collaborative tagging workshop»
Grigory Begelman (Technion – Israel Institute of Technology Computer Science Dpt), Frank Smadja (RawSugar) and I did a paper for www2006 called “automated tag clustering”. It deals with why clustering the tag space makes sense and how this could be done.
After the presentation at the tagging workshop at www2006 we felt the need to give our paper a more www-friendly, I-don’t-want-to-read-through-those-theoretical-equation-flooded-papers face.
So, here you go: Automated Tag Clustering: Improving search and exploration in the tag space. To read this document you should have a clue what tags are about, you should also know some tag services as delicious or flickr so you can understand the limitations these services currently have.
If you don’t want to read through the whole papers, the numerous figures give you a good summary. Finally, to wet your appetite, here a few excerpts of the document:
Currently tagging services still provide a relatively marginal value for information discovery and we claim that with the use of clustering techniques this can be greatly improved [from introduction]
The whole promise of collaborative tagging is that by exploring the tag space you can discover a lot of useful information you would not find with traditional search engines. When your information need is not well defined, the idea that you can explore and see what other people tagged with certain tags is very attractive. We believe that tagging will be able to reach a very wide audience only when exploration techniques will be effective. [from limited exploration]
Although a great visualization paradigm, we believe that with today’s tagclouds it is hard to find more than one or two tags to click on. Tags are not grouped, there is too much information, so that you find lot of related tags scattered on the tag cloud. One or two popular topics and all their related tags tend to dominate the whole cloud. For example, looking at the del.icio.us tagcloud, one would mostly see tags related to web design and technologies. This is because these topics are overwhelmingly more frequent than anything else. [from limited exploration]
Tag web2.0 nowadays is so popular and is combined wildly with anything. In fact this tag is so overused that if you look at tag bookmarks in the del.icio.us dataset, the most used cotag is web2.0[...]. Basing tag similarity on these numbers often doesn’t make sense at all. The similarity measure should be chosen so the popularity of a tag doesn’t affect the set of a tags related tags. Don’t cut the long tail. The success of blogs is driven by the importance of the long tail. We all know that it is crucial to support the niches. Tagging applications should empower the long tail too. If you just sort by popularity, you’d loose all those niches. [from choosing a similarity measure]
We’d be happy to get any kind of feedback on the article. Just post a comment to this blog post.
RSS feed for comments on this post.
Great paper – I am doing a lot of del.icio.us visualizations and this is a great reference. I would like to chat with you more about the math behind it. Look for an email from me soon.
Comment by Kunal Anand — July 11, 2006 6:55 pm #comment-4803
In the article, you mentioned using the METIS library. Which one of the partitioning routines, or stand alone applications should be used. The all seem to want the number of partitions to create. How should I come up with the number of partitions?
Comment by Brian Schneider — May 30, 2007 8:29 am #comment-58012
I think the basic program called Metis is the program to take. It has to standalone programs pmetis and kmetis. I didn’t try them though. The good point about these partitioning algorithm is, that they scope with sparse matrices and this is the case with tag similarity graphs.
You’re right, they need you to specify the number of clusters. You’d start with 2 clusters, test the quality of the cluster, then increase the number of clusters and test the quality of the cluster again until the quality drops again (or you may try another way to find the absolute maxima of this “quality graph”.
To compute the quality, I took a function called Q (Modularity Function) described in a paper by Scott White. The quality function actually looks quite complicated but it finally comes down in counting edges based on a few criterias.
Be sure to let a comment when you have some results, I am keen to hear some actual usage of these algorithms.
Comment by Philipp Keller — May 31, 2007 7:24 am #comment-58552
Hi Philipp
I am an undergraduate(July,07 passout), CS . I must say: Excellent paper! More importantly, you could bring out the important need for tag clustering. I had similar ideas on the need for tag clustering and so was exploring any existing method. I’ve a strong feeling that tag spamming is also a very important issue.
You have mentioned in your paper, that you ‘were’ looking at the problems of tag spamming. Were you able to document/ publish any of your efforts? I want to delve further into that topic. I came across only 2-3 relevant papers about tag spamming. More recent one being (Combating spam in tag spamming by Koutrika et al). Most of them though, follow the method of user’s approach towards tagging. The partitioning is based on bad user and good user.
I feel there is a strong need to follow an approach similar to your clustering approach for handling tag spamming. What are your thoughts on the same?
- Niket Tandon.
Comment by Niket Tandon — November 4, 2007 5:57 am #comment-96521
Hi Niket. I’m glad you liked our paper. The only thing I did concerning tag spamming was looking at tag similarities that were above a certain level. As I remember I took the dice similarity measure and looked at tag connections with a similarity close to 1. At that time there were many spammers that added more or less the same tags to every bookmark. I looked up who is contributing to the found tag connection and reported those usernames to delicious and removed this tag connection in my dataset.
But I’m sure as soon as the service providers add this heuristic to block spam, spammers will adapt. It’s the same as with mail spam or comment spam.
Otis, the owner of simpy.com, does some tag spam prevention, he will surely answer some of your question. He just opened the question of “user contributing to find spammers” on the simpy-user mailing list.
Comment by Philipp Keller — November 4, 2007 4:01 pm #comment-96635
Hi Philipp.
Sorry for the late response. I contacted Otis, and glad that he could answer promptly. I now want to workout some basics.
[1] As you said: “”I remember I took the dice similarity measure and looked at tag connections with a similarity close to 1.At that time there were many spammers that added more or less the same tags to every bookmark. “”
—-In this case, if someone is trying to take the dice similarity measure, then you mean to say: a possible counter attack by spammers could be that they try to make a cluster of the related but incorrect tags?
I think then what we essentially will have are: two or three large clusters. It then, needs to be decided which one is spam and which is not.
Is this what you meant?
[2] You have also written in the article:
“We not yet have conclusions about which similarity measure to take for which data.” Have you been able to document/ publish any results about the same now?
[3] Regarding partitioning and clustering for tags: I am new to it’s mathematical understanding and also the implementation. I have a very fundamental idea of clustering. I would really appreciate if you could please suggest any existing library, would be great if it’s in Java. Metis is implemented in C. Also, any other tutorial material on clustering ( which needs to be used in tag clustering ) would be really helpful to me.
[4] The results of the experiments: Unfortunately, the page of rawSugar labs has been moved now I guess. http://www.rawsugar.com/lab
Could you please suggest any other links?
Also,Query tag: sports:
* hockey, nhl
* baseball, mlb, triple
* basketball, nba, nbdl, wnba
* football, nfl
* alcohol, beer, tv, food, bar
The last category of tags ideally doesnt fall into sports category. But in most cases, to a large extent, the sub-clusters are almost perfect.
Query tag: health:
* shopping, research
* nutrition, food, diet
* fitness, workout, running
As we can see in this case, there are exceptions like the first sub-cluster.. ( shopping , research). But mostly it works well.
I think all this shows that the clustering algorithm is good, Though the retrieval algorithm is not so precise.
Comment by Niket Tandon — December 4, 2007 10:21 am #comment-101767
Hi
Is there any paper/article which tells us comparatively, which clustering algorithm is best suited for tag clustering?
Thanks,
Niket
Comment by Niket Tandon — December 10, 2007 11:56 am #comment-102658
Hi Niket.
I don’t know another paper or article that deals exactly with that problem.
About your questions:
[1] What I meant was: When I looked to bookmark spam, the spammers often used the same tag combination, e.g. health-education. They added hundreds of bookmarks with tag “health” and “education”. This tag spam caused those two tags to have a very high similarity number. This number was so high that it was over the normal range of similarity. That’s how I spotted the “tag-connection-spams”. But the counter attack for spammers is very easy: Don’t use the same tag combinations too often.
[2] No
[3] I don’t know any of these programs. Even metis I never tried out. I’d need to find a library yourself. Maybe this list of Open Source Clustering Algorithms can help you?
[4] This part of the paper were findings of Grigory Begelmann. You have to ask him about these peculiarities..
Comment by Philipp Keller — December 22, 2007 8:37 pm #comment-103943
Thanks a lot Philipp,
I was able to contact Grigory and he’s helping me. Sorry for the late response as I working out Eigenvectors and other mathematics concept, and implementing clustering algorithms.
Reg: Tag Spam- You said: “But the counter attack for spammers is very easy: Don’t use the same tag combinations too often.”
So, what you essentially mean is: the spammers would tag resources with unrelated tags to an extent when it is not so large. So that one can’t raise any suspicion.
Do you think that thinking of all these factors, isn’t there an opportunity for semantics of tags to come into the picture? Or can this in some way handled by some intelligent similarity measure?
Thanks in advance
Niket.
Comment by Niket Tandon — January 23, 2008 5:49 am #comment-106958
Niket: No, this can’t be done by any automatic process. It’s the same with email spam: You can tune your spam filter but the email spammer will notice and change their algorithms and hence you have to change yours, etc.
Although this way is tedious I guess all public services offering user generated content and/or public APIs have to go down that road.
Comment by Philipp Keller — February 9, 2008 4:40 pm #comment-109679
Hi Philipp,
I was looking around to find a library/tool that could perform tag clustering. As I understand I need something that deals with graphs.
I had a look at METIS (pmetis), but it partitions the graph into k equal size parts, and it doesn’t seem to perform the task of finding communities in graphs.
I just found that the algorithm you mention (described in the paper “Spectral Clustering Approach to Finding Communities in Graphs”) was implemented in igraph library http://cneurocvs.rmki.kfki.hu/igraph/ (at least they say that in their docs).
So my question is – do you know this lib? If yes, what is your recommendation?
Comment by Maria Grineva — February 21, 2008 12:16 pm #comment-112180
Hi Maria,
igraph implements a different spectral clustering based on the algorithm of Newman (see http://arxiv.org/abs/physics/0605087). It also has implementations for a couple of other clustering algorithms (e.g., the fast greedy modularity optimization of Clauset et al, the walktrap community detection of Latapy & Pons, the spinglass clustering of Reichardt & Bornholdt). If you have similarity measures, then the walktrap community detection or the spectral algorithm of Newman may be of some use to you.
Comment by Tamás Nepusz — March 3, 2008 7:01 pm #comment-114349
Hi everyone,
as already pointed out METIS is minimizing the edges between different clusters while balancing the size of the clusters. Thus it isn’t very useful for community detection.
Most of the community clustering algorithms implemented in igraph optimize the same quality measure “modularity” of Newman. It describes mathematically what we may expect from a good clustering: That the vertices (tags) have stronger connections (edge weight) to the other vertices of their cluster than in a graph where all edges are equally distributed (no structure, called null-model).
In a perfect world all algorithms should produce the same result. But it’s a NP-complete problem and thus they have to guess and differ in how well they achieve the optimization aim. Beware that some implementations in igraph ignore the edge weight which produces additional structure where none is (i.e. edge weight 0.01 vs. 1.0).
The good news is, that it isn’t necessary to use eigenvectors to find good modularity clusterings. Some “simpler” algorithms exist with comparable or better results and without manually choosing the number of clusters.
Randolf
Comment by Randolf Rotta — April 17, 2008 3:51 pm #comment-123012
Hi everyone,
I tried using kmetis from metis-4.0 suit and discovered it entertains only integer edge weights.
In case one opt for dice similarity the edge weights ought to have floating point values.
I was told by authors that hmetis2.0 supports floating weights. But unfortunately there are no windows binaries for the same yet and hence I could not give it a shot.
I plan to switch to igraph now. I will report the results when available.
Mean while if any one has any suggestion or comment please respond.
By the way I am trying to use the tag clustering in my image search system. I am quite optimistic about the results.
Best,
Shrijeet Paliwal
CS@Stony Brook University
Comment by Shrijeet Paliwal — November 5, 2008 4:58 pm #comment-130015
Hi,
I was exploring the tagging world and really like ur work. I have question about the Cluster Graph. how you generate the graph (i mean visually) with weighted edges.
Thanks in advance
—
Noor
Aachen, Germany
Comment by Noor — July 15, 2009 12:49 pm #comment-130758
@Noor
The graph was done with GraphViz. You can laben the edges and even influence how long they should be, e.g. look at this example: http://www.graphviz.org/Gallery/undirected/ER.html
The coloring I did by hand with Inkscape
Comment by Philipp Keller — July 29, 2009 9:35 pm #comment-130765