July 17th, 2005

Analyzing tag-connections

When you tag an item, for instance a bookmark, you usually give it several tags. For instance, I tagged the bookmark for “How to Write More Clearly, Think More Clearly, and Learn Complex Material More Easily” (you know this link if you pay attention to delicious popular.. :-)) with

“writing”, “toread”, “productivity”, “language”

Now what instantly pops into my mind is that the tag “toread” is quite different from the other tags. It says what I want to do with this bookmark later on, not what the bookmark is about. I call this type of tag an “adjective” (I will come back to that name later on..). The other tags I consider “categories”.
Now you’ll probably say “ah, this is a rare exception”. It isn’t. I often tag items with “blog” simply because the interesting page I found about my favourite hobby happens to be a blog. That is why I call this type of tag an “adjective”: it describes the item rather than categorizing it.
Other tags often used as adjectives are “reference”, “tutorial”, “fun”, “cool”, “news”, “free”..

Now this categorization is not always correct. Sometimes I use “blog” not as an adjective. That is the case when I bookmark a blog whose content doesn’t interest me but which just looks good. Then I’ll probably tag it as “design blog”. On the day I redesign my blog, I want to search for all the design blogs I tagged..
You see: it all lies in the connection between those tags, not in the tags themselves. This is IMO pretty important.

What is that for?

Clusters

You have probably tried to cluster your bookmarks using clusty. What this service does: it tries to put your tags into separate clouds. You know the “tag-bundles” of delicious? This is something like an “auto-tag-bundle” feature. Try it out if you haven’t already, and see the problems that arise..
I think the key problem of this cluster service is that it considers all connections (including the adjectives). But it shouldn’t! Adjectives aren’t tags I want in my clusters. Adjectives are spread all over my tags, so they should first be cut away from my “tag-tree” (the tree built out of the tag-connections you create by tagging bookmarks).

Similar items

This categorization is also important when you search for items “similar” to a bookmark. When I want to find items similar to that “how to write more clearly” article, I’ll search for “writing+productivity+language” and leave out the “toread” tag (an adjective).
This has probably made you realize that categorizing tag-connections is an important task.
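
The “similar items” search can be sketched in a few lines of Python. This is only an illustration: the function name `similar_query` and the hand-written set of adjectives are my assumptions, and in practice that set would come out of the computation described further down.

```python
def similar_query(tags, adjectives):
    """Build a "tag1+tag2+..." search string from the non-adjective tags."""
    return "+".join(tag for tag in tags if tag not in adjectives)

# Hypothetical adjective set, written by hand for this example.
adjectives = {"toread", "blog", "reference", "tutorial"}
tags = ["writing", "toread", "productivity", "language"]

query = similar_query(tags, adjectives)
print(query)  # writing+productivity+language
```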

Tag clouds

Now there are those tag clouds. When I look at my tag cloud, the “biggest” tag is “resource”. But tag clouds are there to help you find bookmarks easily (I never search my bookmarks for “resource” alone) or to give a map of your main interests (“what is your hobby?” “ah, I am a big fan of resources”.. :-)). I am sure you have been annoyed by that too. I want those adjective-tags cut away..!

Synonyms

Now back to some theory: there is a third type of tag-connection, synonyms. “delicious” and “del.icio.us” are classic synonyms. But I consider “ruby” and “rails” synonyms too (no, they aren’t really synonyms, but up to now they are used as such). You type in the second tag just to be sure that you won’t search for it later and find nothing.. I don’t think this type is too important for the cluster task, but I name it here because I’ll use it further on.

Example

Let’s go for an example.
Let’s consider the tags that are connected to the tag “ajax”. I gathered some tag-connection data from delicious (via its rss-feed) and ran a query on it. The data was gathered over a period of one week. It is not complete, but our experiment will work anyway:

tag-connection        weight  type
ajax-javascript       234     synonym
ajax-web              105     category
ajax-programming      100     category
ajax-xmlhttprequest   52      synonym
ajax-css              51      adjective
ajax-design           46      adjective
ajax-php              44      adjective
ajax-development      36      adjective
ajax-xml              34      adjective
ajax-DHTML            33      adjective
ajax-webdev           33      adjective
ajax-webdesign        31      adjective
ajax-google           23      adjective
ajax-HTML             21      adjective
ajax-tutorial         14      adjective

Column “tag-connection” names the tag connected to “ajax” (e.g. javascript); column “weight” gives the number of times this connection occurred in a bookmark-post on delicious. Column “type” shows the result of my computations for this tag-connection. Just to make it clear: these are all the tags connected to the tag “ajax”, ordered by the number of occurrences of the connection. If a bookmark-post somebody made on delicious is tagged with both “ajax” and “javascript”, that gives one point in the “weight” column for “ajax-javascript”.
The outcome is quite good, I think (I must admit that I picked the example that worked out best :-))
There are some errors, sure: ajax-xml should be of type “category” as well. But we are looking at the usage of these tags, not their “real” meaning (whatever that is).
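
How the “weight” column comes about can be sketched roughly like this in Python. The posts below are made up for illustration (they are not the delicious data), and the function names `connection_weights` and `neighbours` are mine, not part of any existing tool.

```python
from collections import Counter
from itertools import combinations

def connection_weights(posts):
    """Count how often each unordered pair of tags appears on the same post."""
    weights = Counter()
    for tags in posts:
        # Deduplicate and sort so ("ajax", "javascript") and
        # ("javascript", "ajax") land on the same key.
        for a, b in combinations(sorted(set(tags)), 2):
            weights[(a, b)] += 1
    return weights

def neighbours(weights, tag):
    """Return (other_tag, weight) pairs for `tag`, heaviest first."""
    found = []
    for (a, b), weight in weights.items():
        if a == tag:
            found.append((b, weight))
        elif b == tag:
            found.append((a, weight))
    # Ties are broken alphabetically to keep the order stable.
    return sorted(found, key=lambda pair: (-pair[1], pair[0]))

# Made-up bookmark-posts, one list of tags per post.
posts = [
    ["ajax", "javascript", "web"],
    ["ajax", "javascript"],
    ["ajax", "tutorial"],
    ["css", "design"],
]
weights = connection_weights(posts)
print(neighbours(weights, "ajax"))
# [('javascript', 2), ('tutorial', 1), ('web', 1)]
```

Each post contributes exactly one point per tag pair, which is the counting rule described above.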

Computation

Synonyms

To compute this categorization I first went for the “synonyms”. The connection “ajax-javascript” is considered a synonym because it is the “number one connection” among all connections that ajax is part of. And when considering the connections of “javascript” (the “vice-versa connection”), ajax is number two.
I consider two tags synonyms if in one “direction” the other tag is number one and in the other “direction” the other tag is in the top 10. I made up this rule because I think that in most cases there is one “stronger” synonym that is used most of the time when the “weaker” one is used. The fact that the tag “ajax” is mostly used with the tag “javascript” could also mean that “javascript” is a supercategory of ajax (which it somehow is). To avoid treating these sub/super-category connections as synonyms, we make sure that “ajax” is also important for “javascript”, so that ajax is not too “sub” to javascript.. I hope you can follow :-)
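
The synonym rule could be written down roughly as follows. This is a sketch under assumptions: the data layout (a dict of dicts) and the names `ranked` and `is_synonym` are mine. The ajax weights are taken from the table above; the weights for “javascript”, “web” and “xmlhttprequest” are invented so that the example has a “vice-versa” direction to look at.

```python
def ranked(neighbour_weights, tag):
    """Neighbours of `tag`, heaviest connection first."""
    by_weight = neighbour_weights.get(tag, {})
    return sorted(by_weight, key=lambda other: (-by_weight[other], other))

def is_synonym(neighbour_weights, a, b, top_n=10):
    """True if one tag is the other's number one and the reverse rank is within top_n."""
    rank_a = ranked(neighbour_weights, a)
    rank_b = ranked(neighbour_weights, b)
    if b not in rank_a or a not in rank_b:
        return False
    a_to_b = rank_a.index(b)  # 0 means "b is number one for a"
    b_to_a = rank_b.index(a)
    return (a_to_b == 0 and b_to_a < top_n) or (b_to_a == 0 and a_to_b < top_n)

neighbour_weights = {
    # From the table above (truncated).
    "ajax": {"javascript": 234, "web": 105, "programming": 100, "xmlhttprequest": 52},
    # Invented weights for the vice-versa direction.
    "javascript": {"programming": 300, "ajax": 234, "web": 200},
    "web": {"design": 900, "css": 800, "ajax": 105},
    "xmlhttprequest": {"ajax": 52},
}

print(is_synonym(neighbour_weights, "ajax", "javascript"))      # True
print(is_synonym(neighbour_weights, "ajax", "web"))             # False
print(is_synonym(neighbour_weights, "ajax", "xmlhttprequest"))  # True
```

In the third call “ajax” is number one for “xmlhttprequest” and “xmlhttprequest” is within the top 10 for “ajax”, so the rule fires in the other direction.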

Category/Adjective

Then I compute the “category”. Let’s put the values of the above table into a graph.
[Figure: distribution of tags related to ajax]
On the x-axis you see the tags: tick 1 stands for “web”, 2 for “programming”, 3 for “css”, 4 for “design”, 5 for “php” and so on. You see I removed the synonym-connections “ajax-javascript” and “ajax-xmlhttprequest”, as I think they “disturb” the distribution.
The y-axis shows the weight of the connection: ajax-web has weight 105, ajax-programming has weight 100 and so on.
The black line is the “weight” column of the table above, the red one is the first derivative and the blue one the second derivative of the weight function.
This graph makes it clear that “web” and “programming” are used quite often in combination with “ajax”; then there is quite a “gap”, followed by the “adjective tail”. I categorize the connections in this tail as “adjective”. The tags in the tail are used “out of context”: they don’t really belong to the “ajax-cluster”. They sometimes occur together with ajax, but only sometimes. Mostly not. Therefore they are considered “adjectives”.
Now the task is to find this “gap”. In my experiments I tried to find the last gap. To find it, I started at the end of the tail and searched for the first peak of the first derivative (that is where the second derivative goes from positive to negative) and checked whether the peak was high enough. If these two conditions were fulfilled, I split the connections into two parts: the “pre-gap” connections (category) and the “post-gap” connections (adjective).
The same computation has to be made for the “vice-versa” connection. I considered a connection a “category” if either of the two computations said that it is a “category”.
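
Here is one possible reading of the gap search as Python, not the original code. Two things are assumptions on my side: the derivative is taken as the plain drop between neighbouring weights, and “high enough” is interpreted as at least a quarter of the largest weight (`min_drop_ratio`). The search keeps walking towards the head until a peak passes that check.

```python
def find_gap(weights, min_drop_ratio=0.25):
    """Return the index of the first "adjective" entry, or None if no gap is found.

    `weights` must be sorted in descending order, synonyms already removed.
    """
    if len(weights) < 2:
        return None
    # Discrete first derivative, written as positive drops between neighbours.
    drops = [weights[i] - weights[i + 1] for i in range(len(weights) - 1)]
    threshold = min_drop_ratio * weights[0]  # assumed meaning of "high enough"
    # Walk from the end of the tail towards the head.
    for i in range(len(drops) - 1, -1, -1):
        left = drops[i - 1] if i > 0 else float("-inf")
        right = drops[i + 1] if i + 1 < len(drops) else float("-inf")
        is_peak = drops[i] > left and drops[i] >= right
        if is_peak and drops[i] >= threshold:
            return i + 1
    return None

# Values from the table above, without the two synonym connections.
tags = ["web", "programming", "css", "design", "php", "development", "xml",
        "DHTML", "webdev", "webdesign", "google", "HTML", "tutorial"]
weights = [105, 100, 51, 46, 44, 36, 34, 33, 33, 31, 23, 21, 14]

gap = find_gap(weights)
categories, adjectives = tags[:gap], tags[gap:]
print(categories)  # ['web', 'programming']
```

The small peaks in the tail (drops of 8 and 7) are below the threshold, so only the drop of 49 between “programming” and “css” counts as the gap. The sketch covers one direction only; the “vice-versa” computation would run the same function on the other tag’s connections.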

Further processing: Ambiguous tags

To achieve good clustering results, I think we need to check whether a tag is used in different ways. The prominent example is “apple”. For now, while delicious is still restricted to the blog world, it is clear that apple means the Mac apple. But in future this may change. To recognize whether a tag is used in different environments, the algorithm would have to check the “neighbours of neighbours” (as suggested by Pietro Speroni). For ajax that means: check whether the neighbours of “javascript” are more or less the same as the neighbours of “web”. You see that it all lies in the connections between tags. The tag per se is not well-defined, but the tag in connection with another tag defines it quite well. Therefore, for clustering I propose splitting up ambiguous tags. That would make the resulting clusters much simpler.
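
One way to make “more or less the same neighbours” measurable is the Jaccard overlap of the two neighbour sets. That choice is my assumption, not something the neighbours-of-neighbours idea prescribes, and the little graph below is invented for the example.

```python
def jaccard(a, b):
    """Size of the intersection divided by size of the union (1.0 for two empty sets)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def neighbour_overlap(graph, tag, first, second):
    """Overlap of the neighbours of two neighbours of `tag`.

    `tag` itself and the two compared tags are left out, since they
    would be shared (or excluded) by construction.
    """
    n1 = graph.get(first, set()) - {tag, second}
    n2 = graph.get(second, set()) - {tag, first}
    return jaccard(n1, n2)

# Invented connections: tag -> set of connected tags.
graph = {
    "apple": {"osx", "mac", "fruit", "recipes"},
    "osx": {"apple", "mac", "software"},
    "mac": {"apple", "osx", "software"},
    "fruit": {"apple", "recipes", "food"},
    "recipes": {"apple", "fruit", "food"},
}

print(neighbour_overlap(graph, "apple", "osx", "mac"))    # 1.0
print(neighbour_overlap(graph, "apple", "osx", "fruit"))  # 0.0
```

A low overlap between two neighbours of the same tag hints that the tag lives in two different environments and is a candidate for splitting.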

We are onto something

I’m pretty sure we are onto something. I think this is the direction it should go. Computations over tag-connection distributions are cool. Users shouldn’t have to enter this information when posting bookmarks. Posting should stay easy. I’m not that sure about the “synonym” computation, but I think the “category” computation turned out pretty well. I tried to build some clusters by hand, considering only the category and synonym connections, and I found a completely detached cluster consisting of the tags “cooking”, “health”, “recipes”, “diet” and “food”. As I said, I think we are onto something..
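
What I did by hand could be automated along these lines: drop the adjective connections, then collect the connected components of what is left. This is a sketch, not a finished clustering algorithm. The function name `clusters` is made up, and the typed connections in the example are invented (only the tag names of the food cluster come from my experiment).

```python
from collections import defaultdict

def clusters(typed_connections, keep=("category", "synonym")):
    """Connected components of the graph built from the kept connection types."""
    graph = defaultdict(set)
    for a, b, kind in typed_connections:
        if kind in keep:
            graph[a].add(b)
            graph[b].add(a)
    seen = set()
    result = []
    for start in sorted(graph):
        if start in seen:
            continue
        # Depth-first walk over everything reachable from `start`.
        component = set()
        stack = [start]
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            component.add(node)
            stack.extend(graph[node] - seen)
        result.append(sorted(component))
    return result

# Invented (tag, tag, type) triples.
connections = [
    ("ajax", "javascript", "synonym"),
    ("ajax", "web", "category"),
    ("ajax", "tutorial", "adjective"),
    ("cooking", "recipes", "category"),
    ("cooking", "food", "category"),
    ("food", "health", "category"),
    ("health", "diet", "category"),
    ("recipes", "tutorial", "adjective"),
]

print(clusters(connections))
# [['ajax', 'javascript', 'web'], ['cooking', 'diet', 'food', 'health', 'recipes']]
```

With the adjective connections kept, “tutorial” would glue both clusters into one, which is exactly the problem described in the section on clusters.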

4 Comments

  1. Phillip,
    Interesting article, especially the analytical classification into categories, synonyms and adjectives. In the case of delicious, we put two tags into the tag-connection table if they are used to tag the same bookmark. If more people do this for their bookmarks, the connection strengthens.

    say you tagged your article on writing with
    “language” and “writing” and a hundred other links connect these two.

    similarly there is another, mutually exclusive connection
    for “language” and “books”.. Now if I build the tag connections the way suggested, I can’t get a connection between books and writing.
    I do agree that in a truly democratic tagging system the two tags above, “writing” and “books”, would eventually get connected. But do you think it’s necessary to examine more than one level of connections? Do you think delicious does it this way?

    Comment by vivek krishna, December 21, 2005, 2:19 pm

  2. By 2nd level, do you mean that if “language” and “books” are connected and “books” and “writing” are connected, then “language” and “writing” should be connected too?

    I don’t think it’s necessary or even helpful to go to the second level. If tag1-tag2 and tag2-tag3 are each strongly connected and there is no link between tag1 and tag3, then I’d suggest that tag1 and tag3 don’t belong together. It could even be that tag2 is used differently in the contexts “tag1” and “tag3”.
    Given that tag2=“apple”:
    tag1 would be “osx” and tag3 “red”. Then I don’t want to connect osx and red. This example shows that if a tag has strong connections to two tags which themselves are not connected, this tag is “ambiguous” and can have different meanings.
    I think the ambiguity algorithm has to go a little bit further..

    I suppose that del.icio.us doesn’t go to that “2nd level”. But I didn’t test that out.

    Comment by phred, December 21, 2005, 3:26 pm

  3. Regarding “Kinds of Tags” there is an interesting section in “The Structure of Collaborative Tagging Systems” from Golder and Huberman. They identified seven different types of tags. Follow the website link and you’ll find the reference.

    I think, that you’re right in saying, that for computations on tags one has to regard these differences between types of tags. For example, the categorical tags of two resources having both “toread” as tag have probably nothing in common.

    On the other hand, the use of these “adjectives” differs from user to user, but the one of “categories” from resource to resource, so that it should not bias the global view that much.

    Comment by Robert, April 7, 2006, 12:38 pm

  4. Found your article useful. Thanks!

    I had done some similar work sometime back. I have blogged about it here.

    Comment by threepointsomething, May 20, 2006, 6:56 pm

This page and its content are licensed under Creative Commons