05.24.05
Structure and Semantics: The Semantic Web
Long before the explosion of the blogosphere, some very smart people were thinking about how to make information on the web more useful. In 2000, most people were finding information on the web by using one of the then-popular search engines, like Yahoo or Excite. But as the web had grown, the experience of using these systems had deteriorated. It wasn’t that these systems couldn’t find pages somehow relevant to what you were looking for; to the contrary, the problem was that there was too much information on just about any topic you could conceive. We were beginning to hit a fundamental limit on the utility of pure text retrieval systems given the size of the corpus. And that corpus was huge.
Google’s rise to prominence during this period can largely be attributed to its PageRank algorithm, which allowed Google to establish the relative value of elements within a result set based on linkage. If something is well-linked, it’s probably a better resource than something that’s not. It was a simple idea, but it really improved the quality of search results. Nonetheless, the underlying mechanism used by Google is still limited by the quality and quantity of the corpus. That corpus keeps getting larger, and its quality is arguably getting worse given the distribution of content authorship. So while PageRank may have temporarily restored the utility of full-text information retrieval systems (a.k.a. search engines), Google and company are fighting a never-ending battle against the corpus; eventually, they too will succumb.
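To make the "well-linked means more valuable" idea concrete, here is a minimal sketch of the textbook, simplified form of PageRank, computed by repeated iteration over a toy link graph. It is not a description of what Google actually runs. The function name, the four-page graph, the damping value of 0.85, and the iteration count are all assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration.

    links maps each page to the list of pages it links to.
    damping and iterations are illustrative defaults, not tuned values.
    """
    # Collect every page that appears as a source or a target.
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start from a uniform score

    for _ in range(iterations):
        # Every page keeps a small baseline score regardless of links.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its score evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its score over all pages.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: three pages link to "c", and "c" links to "a".
toy_web = {
    "a": ["c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}

ranks = pagerank(toy_web)
best = max(ranks, key=ranks.get)
print(best)
```

In this toy graph the page with the most inbound links ends up with the highest score. The calculation never looks at what any page says, which is why the approach remains bounded by the quality of the corpus it ranks.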
Fortunately, some very smart people have been thinking about this problem for years. In their view, the way to make information online more useful is to make it more readable (and that means, in some sense, understandable) by machine. One accomplishes this feat by authoring content such that entities (say, any proper noun) are identified in a unique and unambiguous fashion (everything gets a URI: you, your pet, the color blue) and the relationships between entities are likewise made explicit and identified uniquely and unambiguously. This effort has been dubbed the Semantic Web. Its creators envision a parallel web (they call it an extension), built on RDF and XML, where content is authored in a much more sophisticated way than it is today. Because meaning is explicitly infused into content as it’s authored, would-be consumers of that content will be able to find and navigate through vast seas of data on the Semantic Web in ways that will make our current approaches to search look like blind men tapping around with canes.
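A rough sketch of what "everything gets a URI" looks like in practice: RDF models each statement as a subject, predicate, object triple, and the snippet below imitates that with plain Python tuples rather than a real RDF library. The example.org namespace, every URI under it, and the match helper are hypothetical names invented for this illustration.

```python
# Hypothetical namespace; none of these URIs refer to real resources.
EX = "http://example.org/"

# Each statement is a (subject, predicate, object) triple of URIs.
triples = {
    (EX + "people/alice", EX + "terms/owns", EX + "pets/rex"),
    (EX + "pets/rex", EX + "terms/hasColor", EX + "colors/blue"),
    (EX + "people/alice", EX + "terms/authorOf", EX + "posts/42"),
}


def match(statements, subject=None, predicate=None, obj=None):
    """Return the triples that fit a pattern; None acts as a wildcard."""
    return sorted(
        t for t in statements
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    )


# "What does Alice own?" is answered by structure, not by text search.
owned = match(triples, subject=EX + "people/alice", predicate=EX + "terms/owns")
for _, _, thing in owned:
    print(thing)
```

Because both the entities and the relationship are named unambiguously, the query cannot confuse this "blue" or this "owns" with any other use of the same words. That precision is the payoff, and having to write out every statement this way is the authoring burden.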
The only problem is it will never happen.
That’s not quite accurate. For certain domains, the Semantic Web is already providing a brilliant platform on which to publish information in a way that really does enable consumers to maximize its utility. Those domains, however, tend to be large bodies of technical data published by resource-laden institutions, in many respects the exact opposite of the self-published blog that we’ve been focusing on here. The complexity of authoring documents in RDF and the dearth of authoring tools suggest that the Semantic Web is not destined for immediate widespread acceptance.