From: David Crossley <crossley@indexgeo.com.au>
To: cocoon-dev@xml.apache.org
Subject: Re: [RT] semantic searching
Date: Thu, 1 Nov 2001 19:27:29 +1100

Yes Stefano, both of your postings came through. This might be the delivery delay glitch that we have been seeing.

Did you catch the end of the thread "Lucene as Avalon Component?"
http://marc.theaimsgroup.com/?l=xml-cocoon-dev&m=100433830923480&w=2
Therein I raised the possibility of integrating Z39.50 search with Cocoon. You seem to have missed that in your excellent summary.

--David

Stefano wrote:
> Ciao,
>
> Bernhard started a great thread about adding search capabilities with Lucene, but I'd love to give some more impressions on that.
>
> Bernhard Huber wrote:
> > > In a perfect world (but we aim for that, right?) we should have an abstracted search engine behavioral interface (future-compatible with semantic capabilities?) and then have an Avalon component (block?) to implement it.
> >
> > and the search-engine understands your queries, semantically :-)
>
> Yeah right :)
>
> > But perhaps an advantage could be that a group of documents might already present some semantic keywords, stored in the documents, like author and title. So searching for these keywords will give very good results.
>
> I see several levels of search, from the least semantic to the most semantic:
>
> 1) regexp matching (e.g. grep): no semantics are associated with the search, since it's up to the user to perform the semantic analysis that leads to the creation of the regexp query to match. This results in boolean search (either it matches or it doesn't) and assumes the content is stored in textual formats.
>
> 2) text search engines (e.g. AltaVista): heuristics are used to extract sort-of semantic content from some known document types (mostly HTML) and associate some indexing value with them. This leads to an easier user experience.
>
> 3) metadata-based search engines (e.g. MetaCrawler): same as above, but with the use of the HTML <meta> tag to associate keywords with higher weights in the search. This normally gives better searches, even if the keywords are sometimes misleading.
>
> 4) hyperlink-topology-based search engines (e.g. Google): they have the ability to estimate the importance of a page given the links that refer to it. Obviously, this can only happen when you have a "huge" pool of pages, as Google does. Note that Google is also able to parse and index PDF and extract heuristics from the internal graphics (font size, bold, italic and so on).
>
> This is the state of the art.
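To make the bottom of that ladder concrete, here is a minimal Java sketch of level 1 (the names are illustrative, not taken from any of the tools mentioned). The engine assigns no meaning and no ranking; the result is a bare yes/no per document, and all the semantics live in whoever wrote the regexp:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Pattern;

    public class Level1Search {
        /** Boolean regexp matching: a document either matches or it doesn't. */
        static List<String> grep(List<String> documents, String regexp) {
            Pattern pattern = Pattern.compile(regexp);
            List<String> hits = new ArrayList<String>();
            for (String document : documents) {
                if (pattern.matcher(document).find()) {
                    hits.add(document); // no score and no ordering: pure boolean
                }
            }
            return hits;
        }
    }

Everything above level 1 tries to move some of that semantic work from the user into the engine.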
> Google is, by far, the most advanced searching solution available, but due to its nature it cannot be applied to a small site without losing the power of topological analysis (thus we go back to number 3).
>
> Web crawlers are forced to obtain the web site information by "crawling" it from the outside, since they don't know the internals of the site.
>
> But local search solutions can have access to the web site from the backside and index it there (see htdig, for example, or the Oracle text search tools if the text is stored in their databases).
>
> All these solutions work as a restricted version of #3 above, but they are based on the assumption that the URI space can be easily mapped to the internal request.
>
> Apache might show you the opposite (at first!), but Cocoon shows this is very unlikely to be the case, since it's generally a mistake to map a file system (or a directory server, or a database repository) one-to-one onto the URI space: it leads to easily broken links and potential security issues.
>
> This is why crawling is the only way to go, but since outside access reduces the visibility of some internal information that might increase the semantic capacity of the indexer, Cocoon provides "views" (you can think of them as "windows", but not in the M$ sense) onto the resources.
>
> This said, we can now have access to the original content of the resource. For example, we can now index the text inside a logo, if we are given the SVG content that generated the raster image. Or we can index the PDF content without having to implement a PDF parser, since we request the "content" view of the resource and obtain an easily parsable XML file.
>
> Now, in a perfect world (again!), we could have a browser that allows us to add specific HTTP headers to the request; therefore, we could have Cocoon react to an HTTP header to know which view (also known as a resource "variant" in the HTTP spec) was requested.
>
> The current way for Cocoon to access views is fixed as a special URI query parameter, "cocoon-view", but I think we should extend the feature to:
>
> 1) react on a "variant" HTTP header (nothing Cocoon-specific, since the concept could be implemented later on by other publishing frameworks);
> 2) react on the URI extension: for example http://host/path/file.view, which is something that I normally do by hand in my sitemaps (where http://host/path/index is the default resource and index.content is the XML view of the content);
> 3) react on a URI query parameter (as we do today).
>
> You could suggest making this user-definable in the sitemap: well, while the views are user-definable (even if a number will be suggested as a solid contract to allow indexing of other Cocoons), I wouldn't like this to become too flexible, since this is a solid contract that, if broken, doesn't allow a crawler to obtain semantic information on a site it doesn't own.
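As a rough sketch of how that precedence might look in servlet terms (everything here is an assumption, not Cocoon code: the class, the "Variant" header name, and the set of recognized view names):

    import javax.servlet.http.HttpServletRequest;

    public class ViewResolver {

        /**
         * Resolve which view of a resource was requested, trying the three
         * proposed mechanisms in order. Returns null when no view is
         * requested, i.e. the default resource should be served.
         */
        static String resolveView(HttpServletRequest request) {
            // 1) a dedicated HTTP header (the header name is assumed)
            String variant = request.getHeader("Variant");
            if (variant != null) {
                return variant;
            }
            // 2) a URI extension, e.g. http://host/path/index.content
            String path = request.getRequestURI();
            int dot = path.lastIndexOf('.');
            if (dot > path.lastIndexOf('/') && isKnownView(path.substring(dot + 1))) {
                return path.substring(dot + 1);
            }
            // 3) the query parameter Cocoon already supports today
            return request.getParameter("cocoon-view");
        }

        /** The "solid contract" view names; the exact set is an assumption. */
        private static boolean isKnownView(String name) {
            return name.equals("content") || name.equals("links") || name.equals("schema");
        }
    }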
> Ok, now, let us suppose we have our good Cocoon in place with a bunch of XML content and a way (through resource views) to obtain the most semantic version of this content. What can we do with it?
>
> 5) schema-based search engines: since markup is two-dimensional (text + tags), we can now look for the text "text" inside the tag "tag". So, if you know the schema used (say, docbook), you can place a query such as
>
>   search for "cocoon"
>   in elements "title|subtitle"
>   of namespace "http://docbook.org/*"
>   with xml:lang "undefined|EN"
>
> which will return the documents that happen to have the text "cocoon" inside their "title" or "subtitle" elements associated with a namespace starting with the "http://docbook.org/" URL, and which use the English language or have no language definition.
>
> I call this "schema-based", assuming that each schema has an associated namespace.
>
> Note that this is also capable of performing metadata evaluation: a query such as
>
>   search for "Stefano" and "Mazzocchi"
>   in elements "author"
>   of namespace "http://dublin-core.org/*"
>
> will work on the metadata markup associated with the Dublin Core namespace.
>
> Note also that, just like in many search engines, this is a very powerful syntax, but it's pretty unlikely that a user with no XML knowledge will be able to use it.
>
> There are possible ways of creating such a query, one being the one used on xyzsearch.com, which creates a complex schema-based query through an incremental process (they claim a patent on that, but you can patent a process, not an idea, and they don't have Cocoon views under their process):
>
> a) search for "Cocoon"
>
>      Search for [Cocoon ]
>
>      search | continue >>
>
> b) it returns the list of schemas associated with the elements where the word Cocoon was found, along with a human-readable definition of each schema. For example:
>
>      Markups where "Cocoon" was found:
>
>      [ ] Zoological Markup Language
>      [ ] Docbook
>      [ ] Motion Pictures Description Language
>
>      << back | search | continue >>
>
> c) then you click on whichever markup you like (hopefully understanding from the human description of the namespace what the language is about).
>
> d) then it provides you the list of languages the term was found in:
>
>      Languages where the term "Cocoon" was found within markup "Docbook":
>
>      [ ] undefined
>      [ ] English (general)
>      [ ] Italian
>
>      << back | search | continue >>
>
> e) then you click on the language and it asks you to indicate which tags you'd like:
>
>      Contexts where the term "Cocoon" was found within markup "Docbook" and language "undefined" or "English":
>
>      [ ] title : the title of the document
>      [ ] subtitle : the subtitle of the document
>      [ ] para : a paragraph
>      [ ] strong : outlines important words
>
>      << back | search | continue >>
>
> And so on, until the user hits the "search" button and the result list is presented.
>
> In order to implement the above we need:
>
> a) a bunch of valid XML documents;
>
> b) a register of namespaces -> schemas, along with some human-readable description of tags and schemas (which can be provided with the XMLSchema schema itself);
>
> c) an XML-based storage system with advanced query capabilities (XPath or, even better, XQL);
>
> d) a view-capable web publishing system;
>
> e) a view-based, schema-aware crawler and indexer;
>
> f) a web application that connects to the indexer and provides the above user experience.
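If the indexer behind e) and f) were Lucene, the schema-based query above might reduce to a boolean query over namespace-qualified fields. A hedged sketch, assuming one Lucene field per namespace-qualified element (roughly the indexing scheme Bernhard drafts further down); the "namespace:element" field-naming convention is invented here, and the namespace wildcard and xml:lang clause are left out for brevity:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;

    public class SchemaQueryBuilder {

        /**
         * Build the equivalent of: search for <text> in elements <elements>
         * of namespace <namespace>, with fields named "namespace:element".
         */
        static Query build(String text, String namespace, String[] elements) {
            BooleanQuery query = new BooleanQuery();
            for (int i = 0; i < elements.length; i++) {
                // optional ("should") clauses: a hit in any listed element qualifies
                query.add(new TermQuery(new Term(namespace + ":" + elements[i], text)),
                          false, false);
            }
            return query;
        }
    }

    // search for "cocoon" in elements "title|subtitle" of the docbook namespace:
    // Query q = SchemaQueryBuilder.build("cocoon", "http://docbook.org/docbook",
    //                                    new String[] { "title", "subtitle" });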
> These are all independent concern islands. The contracts are:
>
> a) and b) are stored in c) (IMO, WebDAV or CVS would be the best contracts here, allowing editors to edit the files as if they were on a file system);
>
> d) uses c) as a semi-structured data repository (the XMLDB API being the contract, or something equivalent);
>
> e) uses d) to obtain the semantic content and index the site (HTTP and views being the contract);
>
> f) uses e) to provide the search experience (no contract defined here; probably the software API or some general-enough searching API, maybe even Lucene's if it is powerful enough).
>
> There is still a long way to go to have the entire system in place, but now that we have both a native XML DB and an indexing engine under Apache, I hope this is going to move faster.
>
> Of course, the editing part remains the most difficult one to solve :/
>
> 6) semantic search engines: if you are reading this far, I presume you'd consider the above #5 a kick-ass search engine and would likely stop there.
>
> Well, there is more, and this is where the semantic web effort kicks in.
>
> The previous technology (#5 from now onward) requires a bunch of software that is yet to be written, but it's very likely to happen. Or, at least, I don't see any technical or social reason why it should not.
>
> This, unfortunately, cannot be said for a semantic search engine (#6).
>
> Let's start from outer space: you know what "semantic networks" are, right? They are also known as "topic maps" (see www.topicmaps.org for more details) and they represent a topological connection of "concepts", along with their relationships.
>
> The basic idea is the following:
>
> 1) suppose you have a bunch of semantically marked-up content;
> 2) each important resource (not a web resource, but a semantic resource, i.e. a word) is properly described in absolute and unique terms, that is, currently, with an associated unique URI;
> 3) there are semantic networks that describe the relationships between these resources.
>
> With this infrastructure in place, it is virtually possible to use basic inference rules to "crawl" the semantic networks and obtain search derivatives which are semantically meaningful.
>
> Let's make an example:
>
> 1) suppose that your homepage states that you have two children: Bob and Susan. Bob is a 6-year-old boy and Susan is a 12-year-old girl. You are 42 and live in Boston.
>
> 2) suppose that you used proper markup (say RDF) to describe these relationships and you used the proper markup to indicate them.
>
> 3) now, a semantic crawler comes along and indexes this information.
>
> 4) it is then virtually possible to ask for something like "give me the names of the men in Boston who have two or more children under 15" without requiring any heuristic artificial intelligence (see the sketch after the requirements list below).
>
> Now, in order to obtain this we need:
>
> a) the infrastructure of #5;
>
> b) a huge list of topics along with their unique meanings (unique in this case means that each topic (say "father") must have one and only one URI associated with it (say "http://www.un.org/topics/mankind/family/father")), or topic maps that state the formal equivalence of topics;
>
> c) topic maps that state the relationships between those topics;
>
> d) a way to create the query in a user-friendly way.
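To see why no heuristic AI is needed for point 4, here is a toy sketch in plain Java: the "semantic network" is just a list of (subject, predicate, object) statements, and the query is a mechanical walk over them. Real systems would use RDF with full URIs for every term; the short names and predicates here are invented for readability:

    import java.util.ArrayList;
    import java.util.List;

    public class SemanticQuerySketch {

        /** A toy statement; real RDF would identify every term by URI. */
        static class Triple {
            final String s, p, o;
            Triple(String s, String p, String o) { this.s = s; this.p = p; this.o = o; }
        }

        public static void main(String[] args) {
            List<Triple> statements = new ArrayList<Triple>();
            // The homepage example, reduced to statements:
            statements.add(new Triple("you",   "gender",   "male"));
            statements.add(new Triple("you",   "livesIn",  "Boston"));
            statements.add(new Triple("you",   "age",      "42"));
            statements.add(new Triple("you",   "parentOf", "bob"));
            statements.add(new Triple("you",   "parentOf", "susan"));
            statements.add(new Triple("bob",   "age",      "6"));
            statements.add(new Triple("susan", "age",      "12"));

            // "give me the men in Boston who have two or more children under 15"
            for (Triple t : statements) {
                if (!t.p.equals("livesIn") || !t.o.equals("Boston")) continue;
                String person = t.s;
                if (!holds(statements, person, "gender", "male")) continue;
                int youngChildren = 0;
                for (Triple c : statements) {
                    if (c.s.equals(person) && c.p.equals("parentOf")
                            && ageOf(statements, c.o) < 15) {
                        youngChildren++;
                    }
                }
                if (youngChildren >= 2) System.out.println(person); // prints "you"
            }
        }

        static boolean holds(List<Triple> ts, String s, String p, String o) {
            for (Triple t : ts) {
                if (t.s.equals(s) && t.p.equals(p) && t.o.equals(o)) return true;
            }
            return false;
        }

        static int ageOf(List<Triple> ts, String subject) {
            for (Triple t : ts) {
                if (t.s.equals(subject) && t.p.equals("age")) {
                    return Integer.parseInt(t.o);
                }
            }
            return Integer.MAX_VALUE; // unknown age: treat as not under 15
        }
    }

The hard part, as items b) and c) say, isn't this walk; it's agreeing on the URIs and getting the statements authored in the first place.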
> Well, given the political problems found in defining even the simplest B2B schema, I strongly doubt we'll ever come this far.
>
> And even if we do come this far and this huge semantic network gets implemented, the problem is making it possible (and profitable!) for authors to mark up their content in such a way that it is semantically friendly in this topic-map sense.
>
> And given the number of people who think that M$ Word is the best authoring tool, well, authoring the information will surely be the worst part of both #5 and #6.
>
> > But I have some troubles with "semantic".
> >
> > As I would say, "semantic" lies in the eye of the observer.
> > But that's more philosophical.
>
> I hope the above explains better my meaning of "semantic".
>
> > Perhaps it would be interesting to gather some ideas about what the aim of using semantic search is.
> >
> > Although the simple textual search gives a lot of bad results, it is simple to use.
>
> Correct. Both #5 and #6 might be extremely powerful but useless if people are unable to search due to usability complexity.
>
> In fact, the weak point of #5 (after talking with my girlfriend about it) is that people might believe it's broken, or that they did something wrong, if they don't see results but a list of contexts to go further into.
>
> Anyway, the above is just an example, not the best way to implement such a system.
>
> > Using a semantic search should give better results, as the elements are taken into account when generating an index, and when evaluating the result of a query.
>
> Well, not really.
>
> Suppose you don't go as far as stating that you want "Cocoon" inside the element "title".
>
> If you find "cocoon" in an HTML <title> you know this is better than finding "cocoon" in <p>, but what if you have a Chinese markup? How do you know?
>
> So, I envision something like a heuristic map for tags and tag inclusions that states the relative value of finding a word in a particular location.
>
> So,
>
>   para -> 1
>   strong -> 1
>   title -> 10
>
> then
>
>   /article/title/strong -> 10 + 1 = 11
>   /para/strong -> 1 + 1 = 2
>   /section/title -> 10
>
> and so on, which might work for every markup and be general enough to allow the inclusion of namespaces, changing the values accordingly.
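As a sketch, that mapping could be as small as a weight table plus a sum over the element path. The numbers are the ones from the example above; everything else (class and method names, the split-on-"/" path format) is assumed:

    import java.util.HashMap;
    import java.util.Map;

    public class PathWeight {

        // Heuristic per-tag values; a real mapping would be keyed by
        // (namespace, element) so different schemas can carry different weights.
        static final Map<String, Integer> WEIGHTS = new HashMap<String, Integer>();
        static {
            WEIGHTS.put("para", 1);
            WEIGHTS.put("strong", 1);
            WEIGHTS.put("title", 10);
        }

        /** Sum the weights of the elements along a path; unknown tags count 0. */
        static int weight(String path) {
            int total = 0;
            for (String tag : path.split("/")) {
                Integer w = WEIGHTS.get(tag);
                if (w != null) total += w;
            }
            return total;
        }

        public static void main(String[] args) {
            System.out.println(weight("/article/title/strong")); // 10 + 1 = 11
            System.out.println(weight("/para/strong"));          // 1 + 1 = 2
            System.out.println(weight("/section/title"));        // 10
        }
    }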
> > But some points to think about:
> >
> > 1. What does the user already need to know about the semantics of the documents?
>
> Exactly: he/she doesn't know, nor should he/she have to. This is what the heuristically associated tag values are for.
>
> > 2. Does he/she have to know that a document has an author, for example?
>
> Well, some metadata (like library indexes, for example) are very well established and might not confuse the user if presented in an advanced query form.
>
> > 3. Does he/she have to know that querying for "author:john" will search the author's name?
>
> Absolutely not! This will be done by the web application.
>
> > Perhaps all 3 issues are just a question of designing the GUI of a semantic search...
>
> Yes and no. 3) calls for a better web app, that's for sure, but 1) IMO calls for a heuristic system that is currently hardwired into the HTML nature of the web content, and that we have to abandon given the flexibility of the XML model.
>
> > Just read now http://localhost:8080/cocoon/documents/emotional-landscapes.html; I see, "semantic" means taking the XML elements into account.
>
> Yes, more or less this is the meaning I give to the word.
>
> > > > How should I index?
> > >
> > > Eh, good question :)
> > >
> > > My suggestion would be to connect the same xlink-based crawling subsystem used for the CLI to Lucene as if it were a file system, but this might require some Inversion of Control (us pushing files into Lucene rather than Lucene crawling them or reading them from disk) and thus some code changes to it.
> >
> > I understand your hint.
>
> Great!
>
> > I must admit that I never understood Cocoon's view concept.
>
> Very few do. In fact, even Giacomo didn't understand views at first when he implemented the sitemap, and they are still left in an unknown state. I hope to be able to provide some docs to shed light on this soon.
>
> > Now I see what I can do using views.
>
> Yes; without views, Cocoon would only be harmful to the semantic web effort (see a pretty old RT, "is Cocoon harmful for the semantic web", on this list, also picked up on xmlhack.com).
>
> > Perhaps adding an example to the view documentation, like
> >
> >   Try using:
> >   http://localhost:8080/cocoon/welcome?cocoon-view=content, or
> >   http://localhost:8080/cocoon/welcome?cocoon-view=links
> >
> > would help a lot. But perhaps I'm just a bit slow....
>
> No, don't worry: the concepts are pretty deep in the abstract reasoning about how a web should work in the future, and there are no docs explaining this.
>
> > I never intended to index the HTML result of a page, but the XML content (ad fontes!). Thus I was thinking about how to index an XML source.
> >
> > Or, to say it more generally: what would be a smart XML indexing strategy?
>
> Ok, second step: the indexing algorithm.
>
> Warning: I know nothing of text indexing, nor of the algorithms associated with these problems!
>
> > Let's take a snippet of http://localhost:8080/cocoon/documents/views.html?cocoon-view=content
> >
> > ----- begin
> > ....
> > <s1 title="The Views">
> > <s2 title="Introduction">
> > <p> Views are yet another sitemap component. Unlike the rest, they
> > are orthogonal to the resource and pipeline definitions. In the
> > ...
> > <s3 title="View Processing">
> > <p>The samples sitemap contains two view definitions. One of them
> > looks like the excerpt below.</p>
> > <source xml:space="preserve">
> >
> > <map:views>
> > <map:view name="content" from-label="content">
> > <map:serialize type="xml"/>
> > </map:view>
> >
> > </source>
> > ....
> > ----- end
> >
> > I see the following options:
> >
> > 1. Index only the bare text. That's simple, and stupid, as a lot of the info entered by the XML generator (human or program) is ignored.
>
> Yes. It's already powerful, as we are able, for example, to index the text of pictures out of SVG files, or PDF content without requiring PDF parsing, but it is admittedly a waste of precious information.
>
> It could be a first step, though.
>
> > 2. Try to take the element's name and/or attributes into account.
> > 3. Try to take the element's path into account.
>
> I would suggest taking the heuristic value of the path into account, rather than the path itself.
>
> > Let's see what queries an engine should answer:
> >
> > ad 1. query: "Intro"; result: all docs containing the text "Intro".
> >
> > ad 2. query: "title:Intro"; result: all docs having title elements with the text "Intro".
> >
> > ad 2. query: "source:view"; result: all docs having some source-code snippet regarding the Cocoon view concept.
> >
> > ad 3. query: "xpath:**/s2/title/Intro"; result: all docs having an s2 title "Intro". Not sure about this one, i.e. how to marry Lucene with XPath.
>
> I don't know the internals of Lucene, but maybe associating some numerical values with the text is useful to improve the ordering by importance. Well, maybe we should ask the Lucene guys about this.
>
> > I will try to implement something like that...
> >
> > Design draft:
> >
> > 1. Crawling:
> > Use the above-described Cocoon view-based crawling subsystem.
> >
> > 2. Indexing:
> > 2.1 Each element name will create a Lucene field having the same name as the element name.
> > (What about the element's namespace, should I take it into account?)
>
> Yes, it should identify the schema used, to get the heuristic mapping. Also, there could be mixed heuristic mappings, for example between the docbook namespace and the Dublin Core namespace.
>
> > 2.2 Each attribute of an element will create a Lucene field having the concatenated name of the element name and the attribute name.
> > 2.3 Have a field named "body" for the bare text.
> >
> > 3. Searching:
> > Just use the Lucene search engine.
>
> I think this is a good starting point, yes.
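A hedged sketch of what 2.1-2.3 might look like as a SAX handler pushing into Lucene; this is also the Inversion of Control mentioned above, since the pipeline feeds the indexer rather than Lucene reading from disk. The field-naming details (the "element@attribute" convention, crediting text only to its innermost element) are assumptions, and the Lucene calls are roughly the 1.x API (Field.Text):

    import java.util.Stack;

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.xml.sax.Attributes;
    import org.xml.sax.helpers.DefaultHandler;

    /** Builds one Lucene Document per XML source, per the design draft. */
    public class IndexingHandler extends DefaultHandler {

        private final Document document = new Document();
        private final StringBuffer body = new StringBuffer();
        private final Stack<StringBuffer> openElements = new Stack<StringBuffer>();

        public void startElement(String uri, String local, String qName,
                                 Attributes atts) {
            openElements.push(new StringBuffer());
            // 2.2: one field per attribute, named "element@attribute"
            for (int i = 0; i < atts.getLength(); i++) {
                document.add(Field.Text(local + "@" + atts.getLocalName(i),
                                        atts.getValue(i)));
            }
        }

        public void characters(char[] ch, int start, int length) {
            if (!openElements.isEmpty()) {
                openElements.peek().append(ch, start, length);
            }
            body.append(ch, start, length); // 2.3: all bare text also goes to "body"
        }

        public void endElement(String uri, String local, String qName) {
            // 2.1: a field named after the element; a namespace-aware variant
            // would prefix the field name with the uri, as discussed above
            String text = openElements.pop().toString().trim();
            if (text.length() > 0) {
                document.add(Field.Text(local, text));
            }
        }

        /** Call after parsing; hand the result to an IndexWriter. */
        public Document getDocument() {
            document.add(Field.Text("body", body.toString()));
            return document;
        }
    }

Option 3 (or rather the heuristic path value) could be layered on top by keeping a stack of open element names and combining it with a weight table like the one sketched earlier.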
query: "xpath:**/s2/title/Intro", result all docs having s2 title > > Intro, not sure about this how to marry lucene with xpath > > don't know the internals of Lucene, but maybe associating some numerical > values to text is useful to increase the ordering of importance. well, > maybe we should ask the lucene guys for this. > > > I will try to implement something like that... > > > > Design-Draft > > > > 1. Crawling: > > Usign the above described cocoon view-based crawling subsystem > > > > 2. Indexing: > > 2.1 Each element-name will create a lucene field having the > > same name as the element-name. > > (?What about the element's name space, should I take it into account?) > > Yes, it should identify the schema used to get the heuristic mapping. > Also, there could be mixed heuristical mappings, for example, between > docbook namespace and dublin core namespace. > > > 2.2 Each attribute of an element will create a lucene field having > > the concated name of the element-name, and the attribute-name. > > 2.3 Having a field named body for the bare text. > > > > 3. Searching > > Just use the lucene search engine. > > I think this is a good starting point, yes. > > > (btw, > > I was already playing with lucene for indexing/searching mail messages > > stored in mbox. This way I was searching the > > http://xml.apache.org/mails/200109.gz, > > > > Wouldn't it be nice to generate FAQ, etc from the mbox mail messages. > > But that's a semantic problem, as the mail messages have poor > > xml-semantic content :-) > > Yes, even if, in theory, we all use things like *STRONG* _emphasis_ LOUD > "quote" and the like. This is, in fact, markup in the most general sense > :) > > > > Note that "dynamic" has a different sense that before and it means > > > thatthe resource result is not dependent on request-based or > > > environmentalparameters (such as user-agent, date, time, machine > > > load, IP address, > > > whatever). A resource that is done aggregating a ton of documents > > > storedon a database must be considered static if it is not > > > dependent of > > > request parameters. > > > > > > For a semantic crawler, instead of asking for the "standard" view, it > > > would ask for semantic-specific views such as "content" (the most > > > semantic stage at pipeline generation, which we already specify in our > > > example sitemaps) or "schema" (not currently implemented as nobody > > > woulduse it today anyway). > > > > > > But the need of resource "views" is the key to the success of proper > > > search capabililities and we must be sure that we use them even for > > > semantically-poor searching solutions like lucene, but that would kick > > > ass anyway on small to medium size web sites. > > > > > > Hope this helps and if you have further questions, don't mind asking. > > > > thanks for your suggestions, helping a lot to understand cocoon better. > > Hope this helps even more :) > > Ciao. > > -- > Stefano Mazzocchi One must still have chaos in oneself to be > able to give birth to a dancing star. > <stefano@apache.org> Friedrich Nietzsche > -------------------------------------------------------------------- --------------------------------------------------------------------- To unsubscribe, e-mail: cocoon-dev-unsubscribe@xml.apache.org For additional commands, email: cocoon-dev-help@xml.apache.org