05.24.05
Distribution of Content Authorship, Part II
In the previous post, I wrote about the trend toward broadening the distribution of content authorship, and how the emergence of blogs caused an explosive acceleration of that trend. In general, putting authorship into the hands of individual authors is important and valuable. If content is authored “at the source,” you have accomplished a kind of disintermediation with all its attendant rewards: reduced expense, more timely dissemination, etc. But it’s not all rosy.
Imagine the world of a Billion Blogs, each one posted to with varying frequency. Authors are diligently linking to each other’s posts in relevant ways. Readers are leaving comments. Trackbacks are happening. Now, if you’re looking for some particular piece of information in this space, what do you do? Google (at least in its current form) isn’t an appropriate resource.
Instead, you probably turn to a service like Technorati or Feedster. These services have solved at least one part of the problem where Google would currently fail: by implementing a “ping” interface, they are able to know the instant that a posting has been made on any one of those billion blogs. The URL of your new post immediately goes into a queue to be processed and then included in an index. There’s obviously a question of scale here, but clearly this is a better way to index rapidly changing information than the old approach of just sending a spider around on a random walk through a web of links to find new content.
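To make the ping-then-index flow concrete, here is a minimal sketch in Python. It is not how Technorati or Feedster actually work internally; the class name, the method names, and the example URLs are all made up for illustration, and the "index" is just a word-to-URL lookup.

```python
from collections import deque


class PingIndexer:
    """Toy model of a ping-driven indexer (all names here are hypothetical)."""

    def __init__(self):
        self.queue = deque()  # URLs of new posts waiting to be processed
        self.index = {}       # word -> set of URLs containing that word

    def ping(self, url):
        """Called by a blog the moment it publishes a new post."""
        self.queue.append(url)

    def process(self, fetch):
        """Drain the queue: fetch each post and add its words to the index."""
        while self.queue:
            url = self.queue.popleft()
            for word in fetch(url).lower().split():
                self.index.setdefault(word, set()).add(url)

    def search(self, word):
        """Return the set of URLs whose text contains the given word."""
        return self.index.get(word.lower(), set())


# Stand-in for the web: a dict lookup plays the role of fetching a page.
posts = {
    "http://example.com/blog/1": "Thoughts on distributed authorship",
    "http://example.com/blog/2": "Trackbacks and comments",
}

indexer = PingIndexer()
indexer.ping("http://example.com/blog/1")
indexer.ping("http://example.com/blog/2")
indexer.process(posts.__getitem__)
print(indexer.search("authorship"))
```

The point of the sketch is that the indexer never goes looking for new content. It only does work when a blog tells it something has changed, which is why a new post can be searchable almost immediately.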
But another problem remains. View the entire collection of blog posts as an ever-expanding ball. Older posts are at the core, newer posts are on the surface. In general, links point inward, toward the core. If you use a system like Google’s PageRank algorithm, you’ll tend to favor posts at the core. Which is not really what you want: at any given moment, the most interesting action is at the surface. But, while you can include those surface posts in an index the moment they appear, you can’t really apply the same kind of PageRank metric to them, because no one else has linked to them yet. You can apply a proxy for this missing linkage: assume that blogs that achieved good linkage in the past are more relevant, useful, or trustworthy now. But in the world of a Billion Blogs, that approximation will probably break down: you won’t reliably be able to assume that someone who once made a popular post will continue to issue popular posts. Indeed, you may already find that searching blog postings on Technorati or Feedster is less satisfying than searching the good ol’ web on Google.
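The "ball" effect is easy to see with a toy example. Below is a simplified PageRank-style calculation in Python, written as a sketch and not as Google's actual algorithm. The three post names, the damping factor of 0.85, and the iteration count are my own assumptions. Links only point from newer posts to older ones, as described above.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by repeated redistribution of rank.

    links maps each post to the list of posts it links to. Every linked
    post must also appear as a key. A post with no outbound links spreads
    its rank evenly over all posts.
    """
    nodes = list(links)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every post starts each round with the same small base score.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node, targets in links.items():
            recipients = targets if targets else nodes
            share = damping * rank[node] / len(recipients)
            for target in recipients:
                new_rank[target] += share
        rank = new_rank
    return rank


# Links point inward: newer posts cite older ones, never the reverse.
links = {
    "old_post": [],                             # at the core
    "middle_post": ["old_post"],
    "new_post": ["middle_post", "old_post"],    # on the surface
}

ranks = pagerank(links)
for post in sorted(ranks, key=ranks.get, reverse=True):
    print(post, round(ranks[post], 3))
```

The newest post ends up with the lowest score simply because nothing has had time to link to it, regardless of how good it is. That is the gap the "past linkage" proxy tries to paper over.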
So what’s the solution? How do we get back to Google-level performance in finding relevant information in the world of a Billion Blogs?
That’s for the next post. But I’ll give you a hint: it’s about semantics and structure.