In this message I talk about indexing, searching and retrieving large quantities of semi-structured XML data (mostly documents).

XML documents can be seen as text and indexed as such. Unfortunately, this destroys most of their usefulness, since <b>Hello</b> would not be indexed as 'hello' inside 'b' but simply as that literal textual token. Such a model results in a very poor search experience, since the user must know in advance how the markup is used. Add namespaces to the picture, and it becomes clearly useless.

- o -

An improved model indexes the results of the parsing stage, separating the 'textual content' (element content and attribute values) from the 'contextual content' (element names and attribute names).

This is, of course, better indexing, but it still has a few defects:

1) markup information is lost
2) summary information is hard to recreate because of #1

One solution is to avoid indexing attributes, hoping that element content is more structured in itself and that the summary therefore gives better results. Unfortunately, attributes convey content as much as elements do. One cannot assume or force the opposite.

- o -

I propose a solution which I call 'Semantic Relevance Rating' (SRR for short, which you can pronounce as 'SoRRy' :)

SRR adds the ability to merge semantic information from the markup structure into the index, so that the contextual information embedded in the markup is not lost.

Let me show you an example of what this is. Suppose we have to index this document:

How <keyword>SRR</keyword> works
Here I explain how Semantic Relevance Rating works.
...
This is a usage case:
...
Just like a browser wouldn't be able to present the above document without style information, an indexer is not able to 'understand' or interpret the semantic relevance of each piece of text without some information about its context.

Following the stylesheet concept, I propose the creation of a 'relevance-sheet' which contains this information for the indexer, allowing it to index the structured content the way the document authors intended.

Here is an example relevance-sheet for the above document:

document.srr:

metadata.srr:

The way the SRR sheet is associated with the document is not defined here since it is another concern.

- o -

The SRR solution yields a few interesting results:

1) the cost of 'semantizing' the information is proportional to the number of schemas in the corpus to index, unlike RDF-like solutions, whose costs are proportional to the total amount of information in the corpus.

For example, in a system with 10 different schemas and a million documents, the cost lies in creating SRR relevance-sheets for those 10 schemas, compared to the cost of adding semantic RDF information to each and every file.

This is the same concept as the SoC (separation of concerns) between content and style, here applied to the separation between content and the interpretation of its semantic relevance.

2) the user experience is no different from the one users are used to: they don't need to know the schema of the documents, nor anything about metadata or metadata fields, in order to obtain the information.

3) the SRR sheet drives the indexing behavior and indicates whether some information should be indexed at all. Since the indexer multiplies the relevance factors, text in a context of 'zero' semantic relevance is skipped and does not pollute the index with unwanted information that might 'get in the way'.
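To make point 3 more concrete, here is a minimal sketch of how multiplying relevance factors down the element tree could work. It is only an illustration under stated assumptions: the relevance-sheet syntax is not shown in this message, so a plain Python dict stands in for it, and the element names (document, title, para, comment), the factor values and the "keep the highest factor per token" rule are all hypothetical. Only the keyword element and the two sentences of text come from the example document.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical stand-in for a relevance sheet: element name -> factor.
# Elements that are not listed get a neutral factor of 1.0.
RELEVANCE = {
    "title": 3.0,
    "keyword": 2.0,
    "para": 1.0,
    "comment": 0.0,  # zero relevance: this text never reaches the index
}

def _add(out, text, factor):
    """Record each token of a text run with the factor of its context."""
    if not text or factor == 0.0:
        return  # skip empty runs and zero-relevance contexts
    for token in re.findall(r"\w+", text.lower()):
        # Assumption: a token seen in several contexts keeps its best factor.
        out[token] = max(out.get(token, 0.0), factor)

def rate(element, inherited=1.0, out=None):
    """Walk the tree, multiplying factors from the root down to each text run."""
    if out is None:
        out = {}
    factor = inherited * RELEVANCE.get(element.tag, 1.0)
    # Text directly inside this element carries this element's factor.
    _add(out, element.text, factor)
    for child in element:
        rate(child, factor, out)
        # Tail text follows the child but belongs to the enclosing element.
        _add(out, child.tail, factor)
    return out

# Hypothetical markup wrapped around the text of the example document.
SAMPLE = """<document>
  <title>How <keyword>SRR</keyword> works</title>
  <para>Here I explain how Semantic Relevance Rating works.</para>
  <comment>internal draft note</comment>
</document>"""

weights = rate(ET.fromstring(SAMPLE))
for token, weight in sorted(weights.items(), key=lambda item: -item[1]):
    print(f"{token:10s} {weight}")
```

With these made-up factors, 'srr' ends up with the highest weight because it sits inside both a title and a keyword, while the text of the comment element is skipped entirely. A real indexer would hand these weights to the search engine instead of printing them.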
- o -

I'm currently working with the Lucene guys to add the ability to 'rate' textual input in the indexer. When this is done, adding SRR capabilities to Cocoon is a matter of writing the relevance sheets for our DTDs.

Comments?

--
Stefano Mazzocchi
One must still have chaos in oneself to be able to give birth to a dancing star.
Friedrich Nietzsche