Search, don't Sort

Published June 23, 2009
One of the major philosophical elements of the V5 design is borrowed from Google: Search, don't sort.

The problems with rigid categorization - sorting content items into distinct categories as 'containers' - are fairly well-known:

  • How do you decide what categories there should be? GDNet only creates new forums when there's sufficient traffic in one area to warrant it; we do this for good reason, but until traffic reaches critical mass, a topic's category is less precise than it could be.

  • How do you decide which category something should be in? When you've got categories as vaguely defined as 'Game Programming' and 'General Programming,' it's easy to see how people can get confused.

  • What do you do when a content item should appear in more than one category? And what if it should appear in each category to an unequal extent?

  • How do categories relate to one another? If something in one category is commonly in another category, perhaps they should be nested? If something is in the nested category, is it always also in the parent category?



A different approach is flexible category annotations, or 'tags.' Instead of viewing categories as containers that content items are sorted into, they're viewed as indexes into the content pool, fuzzy sets that describe the data rather than housing it.
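To make the contrast concrete, here is a minimal sketch of the two models. The URIs, tag names, and weights are made up for illustration, and treating a weight as a degree of membership between 0.0 and 1.0 is an assumption, not something the V5 design specifies.

```python
# Container model: each item is housed in exactly one category.
containers = {
    "Game Programming": ["/topic/1"],
    "General Programming": ["/topic/2"],
}

# Tag model: each item carries a degree of membership in any number of tags,
# so an item can belong to several "categories" to unequal extents.
tags = {
    "/topic/1": {"game-programming": 0.9, "general-programming": 0.4},
    "/topic/2": {"general-programming": 1.0},
}


def members(tag, threshold=0.0):
    """Return items whose membership in `tag` exceeds `threshold`, strongest first."""
    hits = [(w[tag], uri) for uri, w in tags.items() if w.get(tag, 0.0) > threshold]
    return [uri for _, uri in sorted(hits, reverse=True)]
```

Here the tag acts as an index into the pool: `members("general-programming")` reaches both topics, while raising the threshold narrows the set without moving anything between containers.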

What am I telling you this for? It's pretty well-known stuff by now, I guess. I'm bringing it up because over the past few days I've been working mostly on the tagging and search engines for V5.

The tagging engine has a pretty simple set of responsibilities:

  • Store and retrieve the tags associated by a user with a given resource.

  • Calculate some set of 'aggregated' tags for a resource, using the tags applied to the item by all users.

  • Find the resources most relevant to a tag or set of tags.


The implementation I've written so far is a naive one, but it'll suffice for the time being. The aggregation process simply averages the weights that users have applied to each tag: crude, but open to tweaking later. Finding the most relevant resources is little more than a SELECT query, scoring relevance by taking the mean squared error between each resource's tag set and the supplied search tags. There are problems, but they can be fixed later.
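As a rough illustration of that naive approach, the sketch below averages tag weights across users and ranks resources by mean squared error against the search tags. It works on in-memory dictionaries rather than a SELECT query, and it assumes that a tag missing from a set counts as weight 0.0; the function names are hypothetical, not the actual V5 code.

```python
from collections import defaultdict


def aggregate(user_tagsets):
    """Average each tag's weight over all users' tag sets.

    Assumption: a user who did not apply a tag contributes 0.0 for it.
    """
    if not user_tagsets:
        return {}
    totals = defaultdict(float)
    for tagset in user_tagsets:
        for tag, weight in tagset.items():
            totals[tag] += weight
    return {tag: total / len(user_tagsets) for tag, total in totals.items()}


def mean_squared_error(resource_tags, query_tags):
    """Mean squared weight difference over the union of both tag sets."""
    all_tags = set(resource_tags) | set(query_tags)
    if not all_tags:
        return 0.0
    return sum(
        (resource_tags.get(t, 0.0) - query_tags.get(t, 0.0)) ** 2 for t in all_tags
    ) / len(all_tags)


def most_relevant(tagsets_by_uri, query_tags, limit=10):
    """Return up to `limit` URIs, lowest error (most relevant) first."""
    scored = sorted(
        (mean_squared_error(tags, query_tags), uri)
        for uri, tags in tagsets_by_uri.items()
    )
    return [uri for _, uri in scored[:limit]]
```

One visible weakness of this scoring is that a resource is penalised for carrying extra tags the search never mentioned, which is the kind of problem that can be tuned later.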

One nice consequence of the RESTful schema for the site is that each resource has a clear URI, ideal for use as a key. Each tag set is therefore just a set of (Tag, Weight) pairs associated with a URI. The result is completely content-agnostic; the tagging engine knows nothing about the kinds of content the site offers.
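A minimal sketch of that storage shape, assuming an in-memory store keyed by URI and then by user. The class name, method names, and example URIs are all hypothetical; the point is only that nothing in it knows what kind of resource sits behind a URI.

```python
from dataclasses import dataclass, field


@dataclass
class TagStore:
    """Maps a resource URI to each user's {tag: weight} pairs for it."""

    _by_uri: dict = field(default_factory=dict)  # uri -> user -> {tag: weight}

    def set_tags(self, uri, user, tags):
        """Store (replacing any previous set) the tags `user` applied to `uri`."""
        self._by_uri.setdefault(uri, {})[user] = dict(tags)

    def get_tags(self, uri, user):
        """Retrieve the tags `user` applied to `uri`, or an empty dict."""
        return dict(self._by_uri.get(uri, {}).get(user, {}))

    def all_user_tagsets(self, uri):
        """Every user's tag set for `uri`, ready to be aggregated."""
        return [dict(tags) for tags in self._by_uri.get(uri, {}).values()]
```

A forum thread and a journal entry are stored identically, for example `store.set_tags("/forums/topic/123", "superpig", {"graphics": 0.8})` and `store.set_tags("/journal/entry/9", "Aardvajk", {"ai": 0.5})`.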

The tagging engine's last responsibility - finding resources - is closely related to the search engine. But not all searches are tag-related: Active Topics, for example, is a search for all discussion threads updated in the past 24 hours, and it's easy to imagine other searches based on the author of the content or similar. So there is a separate search service that stores, maintains, and performs all saved and transient searches, using the tagging engine when appropriate.
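One way to picture that split is the sketch below: a search is just a callable over the resource pool, and only the tag-based kind delegates to the tagging engine. Everything here is an assumption made for illustration, including the `Resource` fields, the `SearchService` API, and the `find_by_tags` callable standing in for the tagging engine.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Resource:
    uri: str
    kind: str        # e.g. "thread" or "journal" (hypothetical values)
    author: str
    updated: datetime


def active_topics(now, window=timedelta(hours=24)):
    """Search for discussion threads updated within `window` of `now`."""
    return lambda resources: [
        r.uri for r in resources if r.kind == "thread" and now - r.updated <= window
    ]


def by_author(author):
    """Search for everything written by `author`."""
    return lambda resources: [r.uri for r in resources if r.author == author]


def by_tags(find_by_tags, query_tags):
    """Tag search: delegate ranking to the tagging engine (`find_by_tags`)."""
    def search(resources):
        known = {r.uri for r in resources}
        return [uri for uri in find_by_tags(query_tags) if uri in known]
    return search


class SearchService:
    """Stores named (saved) searches and runs saved or transient ones."""

    def __init__(self):
        self._saved = {}

    def save(self, name, search):
        self._saved[name] = search

    def run_saved(self, name, resources):
        return self._saved[name](resources)

    def run(self, search, resources):
        return search(resources)
```

With this shape, Active Topics is a saved search like any other, for example `service.save("Active Topics", active_topics(now))`, and a one-off author lookup is a transient `service.run(by_author("superpig"), resources)`.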

Comments

Aardvajk
Now we're talking. This is the first time I've started getting a bit excited about V5.

So rather than Game Programming, DirectX and XNA etc, I'll be able to have my own Aardvajk's Interests that pulls up any topics that relate to things I've selected?

I guess the success or failure of this, like any similar system, will depend on the level of "incorrect" matches (incorrect is probably a subjective term in this context) that the tagging engine makes.

If it works, it will be fantastic, but if I went into my own selection and found myself trawling through a load of topics about AngelScript or Web Development because they commonly have several of my tags attached, it would be a bit annoying.
June 23, 2009 02:22 PM
superpig
Yep, this is why the tags have weights applied. You'll be able to express Aardvajk's Interests as a set of weighted tags: maybe you want it mostly about graphics, and a bit about AI and gameplay programming; you also don't want to include content that is newbie-oriented (where the participants are lowly tagged in the areas that the content is tagged in). That can be saved as your own custom saved search, can generate RSS feeds, notifications, and so on.

The quality of the results will probably be a bit lackluster at first, but will improve over time. I've got lots of ideas about how the system can calculate relevance, and about how users can train it to be better...
June 23, 2009 08:23 PM
Aardvajk
All sounds top notch. I look forward to it. [smile]
June 24, 2009 02:26 AM
Staffan E
I think it sounds wonderful. Off the top of my head though I have a few concerns.

Would the set of available tags be predefined or defined by the user, like custom tags? I can see the openness-benefit of letting the user supply their own tags, but difficulty in getting meaningful searches with all the different tags.

I also guess moderation would benefit from a fixed set of tags, where a moderator could be assigned a few tag areas which they are familiar with, and check that thread contents follow the tags.

... now I'm just rambling. It's a really cool idea.
June 27, 2009 07:55 AM
Kylotan
It's worth looking over at StackOverflow at how they handle this. Ability for moderators to add or remove tags is handy, and they have a fairly decent UI for selecting tag combinations etc.
June 30, 2009 10:49 AM