While reading an interesting post by Chris Dixon titled "To make smarter systems, it's all about the data", I came across a link in one of the comments to the Google Research blog post on "The Unreasonable Effectiveness of Data".
The post links to a PDF document written by three Google researchers, which covers a subject I've been experimenting with a lot lately - the semantic extraction of data.
The document is a nice read, though probably too technical for most, and it brings up the difficulty of implementation as one of the barriers to taking the next step towards structured data on the web. The argument is that most small content publishers do not possess the knowledge and expertise required to publish their content in a semantically meaningful way.
It seems to me that there's an easy solution for that - and that is to embed semantic awareness in the publishing tools themselves. Most people publish content through one of several high-profile content management systems (WordPress, Movable Type, Blogger, etc.), meaning it is possible to reach a very large segment of content publishers through relatively few integration points.
Expecting people to learn about and implement web semantics is unreasonable, as the document suggests. Delegating that responsibility to the authoring tools, by enhancing the backend logic and the interface, is very much doable. Need to add a calendar event? Allow the interface to add it in the proper microformat. Want to affect the styling of your content? Allow the interface to offer several semantically significant options (headers, paragraphs etc.). Most of those options are available today, yet they are not obvious enough to be used in a consistent manner.
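To make the calendar example concrete, here is a minimal sketch of what an authoring tool could do behind the scenes when the user fills in an "add event" form. The function name render_hcalendar_event and its parameters are my own invention for illustration and not part of any CMS; the class names (vevent, summary, dtstart, dtend, location) are the ones the hCalendar microformat uses.

```python
from datetime import datetime
from html import escape


def render_hcalendar_event(summary, start, end, location):
    """Return an HTML snippet marked up with hCalendar class names.

    Hypothetical helper: a CMS would call something like this when the
    author fills in an event form, so the author never sees the markup.
    """
    return (
        '<div class="vevent">\n'
        f'  <span class="summary">{escape(summary)}</span>\n'
        # The abbr pattern keeps a machine-readable ISO 8601 date in the
        # title attribute and a human-readable date as the visible text.
        f'  <abbr class="dtstart" title="{start.isoformat()}">{start:%d %b %Y, %H:%M}</abbr>\n'
        f'  <abbr class="dtend" title="{end.isoformat()}">{end:%d %b %Y, %H:%M}</abbr>\n'
        f'  <span class="location">{escape(location)}</span>\n'
        '</div>'
    )


if __name__ == "__main__":
    print(render_hcalendar_event(
        "Semantic web meetup",
        datetime(2009, 6, 1, 18, 0),
        datetime(2009, 6, 1, 20, 0),
        "Community hall",
    ))
```

The author only types a title, picks two dates and a place; the structured markup comes for free, which is the whole point of pushing semantics into the tool.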
HTML5 is around the corner, with more semantically relevant tags (such as header, footer, nav and section). Integrating support for it in those content management systems is a good first step towards more semantically accessible content.
Going even further with this, content management systems could provide more structured data on demand - or wait, they are already doing this! XML feeds, anyone? It seems to me that if you poke a little beneath the surface, you find that there is actually a lot of structure in modern online content. What's lacking is uniformity, but even so, the number of dominant standards is small.
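As a rough illustration of how much structure a feed already carries, the sketch below pulls titles, links and dates out of an RSS 2.0 document using only Python's standard library. The sample feed and the extract_items helper are made up for this example; a real feed would be fetched from the blog's feed URL, and Atom feeds use different element names and an XML namespace, so they would need their own handling.

```python
import xml.etree.ElementTree as ET

# Made-up RSS 2.0 feed, standing in for what a typical CMS publishes.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example blog</title>
    <link>http://example.com/</link>
    <item>
      <title>First post</title>
      <link>http://example.com/first-post</link>
      <pubDate>Mon, 01 Jun 2009 18:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Second post</title>
      <link>http://example.com/second-post</link>
      <pubDate>Tue, 02 Jun 2009 09:30:00 GMT</pubDate>
    </item>
  </channel>
</rss>"""


def extract_items(feed_xml):
    """Return one dict per <item> element found in an RSS 2.0 document."""
    root = ET.fromstring(feed_xml)
    return [
        {
            # findtext returns None when the child element is missing.
            "title": item.findtext("title"),
            "link": item.findtext("link"),
            "published": item.findtext("pubDate"),
        }
        for item in root.iter("item")
    ]


if __name__ == "__main__":
    for entry in extract_items(SAMPLE_FEED):
        print(entry["published"], "|", entry["title"], "|", entry["link"])
```

No scraping or guessing is involved here: the publishing tool has already labeled what is a title, what is a link and what is a date.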
Who will be "the next Google" that can examine this structure and extract meta-meaning that provides value? You can be certain some bright minds are already on the case.