From: David Crossley <crossley@indexgeo.com.au>
To: cocoon-dev@xml.apache.org
Subject: Re: [RT] semantic searching
Date: Thu, 1 Nov 2001 19:27:29 +1100

Yes Stefano, both of your postings came through. This might be the delivery delay glitch that we have been seeing.

Did you catch the end of the thread "Lucene as Avalon Component?"
http://marc.theaimsgroup.com/?l=xml-cocoon-dev&m=100433830923480&w=2
Therein I raised the possibility of integrating Z39.50 search with Cocoon. You seem to have missed that in your excellent summary.

--David

Stefano wrote:
> Ciao,
>
> Bernhard started a great thread about adding search capabilities with Lucene, but I'd love to give some more impressions on that.
>
> Bernhard Huber wrote:
> > > In a perfect world (but we aim for that, right?) we should have an abstracted search engine behavioral interface (future-compatible with semantic capabilities?) and then have an Avalon component (block?) to implement it.
> >
> > and the search-engine understands your queries, semantically :-)
>
> Yeah right :)
>
> > But perhaps an advantage could be that a group of documents might already present some semantic keywords, stored in the documents, like author and title. So searching for these keywords will give very good results.
>
> I see several levels of search, from the least semantic to the most semantic:
>
> 1) regexp matching (e.g. grep): no semantics are associated with the search, since it's up to the user to perform the semantic analysis that leads to the creation of the regexp query to match. This results in boolean search (either it matches or it doesn't) and assumes the content is stored in textual formats.
>
> 2) text search engines (e.g. AltaVista): heuristics are used to extract sort-of semantic content from some known document types (mostly HTML) and associate some indexing value with them. This leads to an easier user experience.
>
> 3) metadata-based search engines (e.g. MetaCrawler): same as above, but with the use of the HTML <meta> tag to associate keywords with higher weights in the search. This normally gives better searches, even if the keywords are sometimes misleading.
>
> 4) hyperlink-topology-based search engines (e.g. Google): they have the ability to estimate the importance of a page given the links that refer to it. Obviously, this can only happen when you have a "huge" pool of pages, as Google does. Note that Google is also able to parse and index PDF and extract heuristics from the internal graphics (font size, bold, italic and so on).
>
> This is the state of the art.
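To make the bottom of that ladder concrete, here is a minimal Java sketch of level 1 (the names are illustrative, not taken from any of the tools mentioned). The engine assigns no meaning and no ranking; the result is a bare yes/no per document, and all the semantics live in whoever wrote the regexp:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Pattern;

    public class Level1Search {
        /** Boolean regexp matching: a document either matches or it doesn't. */
        static List<String> grep(List<String> documents, String regexp) {
            Pattern pattern = Pattern.compile(regexp);
            List<String> hits = new ArrayList<String>();
            for (String document : documents) {
                if (pattern.matcher(document).find()) {
                    hits.add(document); // no score and no ordering: pure boolean
                }
            }
            return hits;
        }
    }

Everything above level 1 tries to move some of that semantic work from the user into the engine.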
> Google is, by far, the most advanced searching solution available, but due to its nature it cannot be applied to a small site without losing the power of topological analysis (thus we go back to number 3).
>
> Web crawlers are forced to obtain the web site information by "crawling" it from the outside, since they don't know the internals of the site.
>
> But local search solutions can have access to the web site from the backside and index it there (see htdig, for example, or the Oracle text search tools if the text is stored in their databases).
>
> All these solutions work as a restricted version of #3 above, but they are based on the assumption that the URI space can be easily mapped to the internal request.
>
> Apache might show you the opposite (at first!), but Cocoon shows this is very unlikely to be the case, since it's generally a mistake to map a file system (or a directory server, or a database repository) one-to-one onto the URI space: it leads to easily broken links and potential security issues.
>
> This is why crawling is the only way to go, but since outside access reduces the visibility of some internal information that might increase the semantic capacity of the indexer, Cocoon provides "views" (you can think of them as "windows", but not in the M$ sense) onto the resources.
>
> This said, we can now have access to the original content of the resource. For example, we can now index the text inside a logo, if we are given the SVG content that generated the raster image. Or we can index the PDF content without having to implement a PDF parser, since we request the "content" view of the resource and obtain an easily parsable XML file.
>
> Now, in a perfect world (again!), we could have a browser that allows us to add specific HTTP headers to the request; therefore, we could have Cocoon react to an HTTP header to know which view (also known as a resource "variant" in the HTTP spec) was requested.
>
> The current way for Cocoon to access views is fixed as a special URI query parameter, "cocoon-view", but I think we should extend the feature to:
>
> 1) react on a "variant" HTTP header (nothing Cocoon-specific, since the concept could be implemented later on by other publishing frameworks);
> 2) react on the URI extension: for example http://host/path/file.view, which is something that I normally do by hand in my sitemaps (where http://host/path/index is the default resource and index.content is the XML view of the content);
> 3) react on a URI query parameter (as we do today).
>
> You could suggest making this user-definable in the sitemap: well, while the views are user-definable (even if a number will be suggested as a solid contract to allow indexing of other Cocoons), I wouldn't like this to become too flexible, since this is a solid contract that, if broken, doesn't allow a crawler to obtain semantic information on a site it doesn't own.
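As a rough sketch of how that precedence might look in servlet terms (everything here is an assumption, not Cocoon code: the class, the "Variant" header name, and the set of recognized view names):

    import javax.servlet.http.HttpServletRequest;

    public class ViewResolver {

        /**
         * Resolve which view of a resource was requested, trying the three
         * proposed mechanisms in order. Returns null when no view is
         * requested, i.e. the default resource should be served.
         */
        static String resolveView(HttpServletRequest request) {
            // 1) a dedicated HTTP header (the header name is assumed)
            String variant = request.getHeader("Variant");
            if (variant != null) {
                return variant;
            }
            // 2) a URI extension, e.g. http://host/path/index.content
            String path = request.getRequestURI();
            int dot = path.lastIndexOf('.');
            if (dot > path.lastIndexOf('/') && isKnownView(path.substring(dot + 1))) {
                return path.substring(dot + 1);
            }
            // 3) the query parameter Cocoon already supports today
            return request.getParameter("cocoon-view");
        }

        /** The "solid contract" view names; the exact set is an assumption. */
        private static boolean isKnownView(String name) {
            return name.equals("content") || name.equals("links") || name.equals("schema");
        }
    }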
> Ok, now, let us suppose we have our good Cocoon in place with a bunch of XML content and a way (through resource views) to obtain the most semantic version of this content. What can we do with it?
>
> 5) schema-based search engines: since markup is two-dimensional (text + tags), we can now look for the text "text" inside the tag "tag". So, if you know the schema used (say, docbook), you can place a query such as
>
>   search for "cocoon"
>   in elements "title|subtitle"
>   of namespace "http://docbook.org/*"
>   with xml:lang "undefined|EN"
>
> which will return the documents that happen to have the text "cocoon" inside their "title" or "subtitle" elements associated with a namespace starting with the "http://docbook.org/" URL, and which use the English language or have no language definition.
>
> I call this "schema-based", assuming that each schema has an associated namespace.
>
> Note that this is also capable of performing metadata evaluation: a query such as
>
>   search for "Stefano" and "Mazzocchi"
>   in elements "author"
>   of namespace "http://dublin-core.org/*"
>
> will work on the metadata markup associated with the Dublin Core namespace.
>
> Note also that, just like in many search engines, this is a very powerful syntax, but it's pretty unlikely that a user with no XML knowledge will be able to use it.
>
> There are possible ways of creating such a query, one being the one used on xyzsearch.com, which creates a complex schema-based query through an incremental process (they claim a patent on that, but you can patent a process, not an idea, and they don't have Cocoon views under their process):
>
> a) search for "Cocoon"
>
>      Search for [Cocoon ]
>
>      search | continue >>
>
> b) it returns the list of schemas associated with the elements where the word Cocoon was found, along with a human-readable definition of each schema. For example:
>
>      Markups where "Cocoon" was found:
>
>      [ ] Zoological Markup Language
>      [ ] Docbook
>      [ ] Motion Pictures Description Language
>
>      << back | search | continue >>
>
> c) then you click on whichever markup you like (hopefully understanding from the human description of the namespace what the language is about).
>
> d) then it provides you the list of languages the term was found in:
>
>      Languages where the term "Cocoon" was found within markup "Docbook":
>
>      [ ] undefined
>      [ ] English (general)
>      [ ] Italian
>
>      << back | search | continue >>
>
> e) then you click on the language and it asks you to indicate which tags you'd like:
>
>      Contexts where the term "Cocoon" was found within markup "Docbook" and language "undefined" or "English":
>
>      [ ] title : the title of the document
>      [ ] subtitle : the subtitle of the document
>      [ ] para : a paragraph
>      [ ] strong : outlines important words
>
>      << back | search | continue >>
>
> And so on, until the user hits the "search" button and the result list is presented.
>
> In order to implement the above we need:
>
> a) a bunch of valid XML documents;
>
> b) a register of namespaces -> schemas, along with some human-readable description of tags and schemas (which can be provided with the XMLSchema schema itself);
>
> c) an XML-based storage system with advanced query capabilities (XPath or, even better, XQL);
>
> d) a view-capable web publishing system;
>
> e) a view-based, schema-aware crawler and indexer;
>
> f) a web application that connects to the indexer and provides the above user experience.
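If the indexer behind e) and f) were Lucene, the schema-based query above might reduce to a boolean query over namespace-qualified fields. A hedged sketch, assuming one Lucene field per namespace-qualified element (roughly the indexing scheme Bernhard drafts further down); the "namespace:element" field-naming convention is invented here, and the namespace wildcard and xml:lang clause are left out for brevity:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;

    public class SchemaQueryBuilder {

        /**
         * Build the equivalent of: search for <text> in elements <elements>
         * of namespace <namespace>, with fields named "namespace:element".
         */
        static Query build(String text, String namespace, String[] elements) {
            BooleanQuery query = new BooleanQuery();
            for (int i = 0; i < elements.length; i++) {
                // optional ("should") clauses: a hit in any listed element qualifies
                query.add(new TermQuery(new Term(namespace + ":" + elements[i], text)),
                          false, false);
            }
            return query;
        }
    }

    // search for "cocoon" in elements "title|subtitle" of the docbook namespace:
    // Query q = SchemaQueryBuilder.build("cocoon", "http://docbook.org/docbook",
    //                                    new String[] { "title", "subtitle" });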
> These are all independent concern islands. The contracts are:
>
> a) and b) are stored in c) (IMO, WebDAV or CVS would be the best contracts here, allowing editors to edit the files as if they were on a file system);
>
> d) uses c) as a semi-structured data repository (the XMLDB API being the contract, or something equivalent);
>
> e) uses d) to obtain the semantic content and index the site (HTTP and views being the contract);
>
> f) uses e) to provide the search experience (no contract defined here; probably the software API or some general-enough searching API, maybe even Lucene's if it is powerful enough).
>
> There is still a long way to go to have the entire system in place, but now that we have both a native XML DB and an indexing engine under Apache, I hope this is going to move faster.
>
> Of course, the editing part remains the most difficult one to solve :/
>
> 6) semantic search engines: if you are reading this far, I presume you'd consider the above #5 a kick-ass search engine and would likely stop there.
>
> Well, there is more, and this is where the semantic web effort kicks in.
>
> The previous technology (#5 from now onward) requires a bunch of software that is yet to be written, but it's very likely to happen. Or, at least, I don't see any technical or social reason why it should not.
>
> This, unfortunately, cannot be said for a semantic search engine (#6).
>
> Let's start from outer space: you know what "semantic networks" are, right? They are also known as "topic maps" (see www.topicmaps.org for more details) and they represent a topological connection of "concepts", along with their relationships.
>
> The basic idea is the following:
>
> 1) suppose you have a bunch of semantically marked-up content;
> 2) each important resource (not a web resource, but a semantic resource, i.e. a word) is properly described in absolute and unique terms, that is, currently, with an associated unique URI;
> 3) there are semantic networks that describe the relationships between these resources.
>
> With this infrastructure in place, it is virtually possible to use basic inference rules to "crawl" the semantic networks and obtain search derivatives which are semantically meaningful.
>
> Let's make an example:
>
> 1) suppose that your homepage states that you have two children: Bob and Susan. Bob is a 6-year-old boy and Susan is a 12-year-old girl. You are 42 and live in Boston.
>
> 2) suppose that you used proper markup (say RDF) to describe these relationships and you used the proper markup to indicate them.
>
> 3) now, a semantic crawler comes along and indexes this information.
>
> 4) it is then virtually possible to ask for something like "give me the names of the men in Boston who have two or more children under 15" without requiring any heuristic artificial intelligence (see the sketch after the requirements list below).
>
> Now, in order to obtain this we need:
>
> a) the infrastructure of #5;
>
> b) a huge list of topics along with their unique meanings (unique in this case means that each topic (say "father") must have one and only one URI associated with it (say "http://www.un.org/topics/mankind/family/father")), or topic maps that state the formal equivalence of topics;
>
> c) topic maps that state the relationships between those topics;
>
> d) a way to create the query in a user-friendly way.
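To see why no heuristic AI is needed for point 4, here is a toy sketch in plain Java: the "semantic network" is just a list of (subject, predicate, object) statements, and the query is a mechanical walk over them. Real systems would use RDF with full URIs for every term; the short names and predicates here are invented for readability:

    import java.util.ArrayList;
    import java.util.List;

    public class SemanticQuerySketch {

        /** A toy statement; real RDF would identify every term by URI. */
        static class Triple {
            final String s, p, o;
            Triple(String s, String p, String o) { this.s = s; this.p = p; this.o = o; }
        }

        public static void main(String[] args) {
            List<Triple> statements = new ArrayList<Triple>();
            // The homepage example, reduced to statements:
            statements.add(new Triple("you",   "gender",   "male"));
            statements.add(new Triple("you",   "livesIn",  "Boston"));
            statements.add(new Triple("you",   "age",      "42"));
            statements.add(new Triple("you",   "parentOf", "bob"));
            statements.add(new Triple("you",   "parentOf", "susan"));
            statements.add(new Triple("bob",   "age",      "6"));
            statements.add(new Triple("susan", "age",      "12"));

            // "give me the men in Boston who have two or more children under 15"
            for (Triple t : statements) {
                if (!t.p.equals("livesIn") || !t.o.equals("Boston")) continue;
                String person = t.s;
                if (!holds(statements, person, "gender", "male")) continue;
                int youngChildren = 0;
                for (Triple c : statements) {
                    if (c.s.equals(person) && c.p.equals("parentOf")
                            && ageOf(statements, c.o) < 15) {
                        youngChildren++;
                    }
                }
                if (youngChildren >= 2) System.out.println(person); // prints "you"
            }
        }

        static boolean holds(List<Triple> ts, String s, String p, String o) {
            for (Triple t : ts) {
                if (t.s.equals(s) && t.p.equals(p) && t.o.equals(o)) return true;
            }
            return false;
        }

        static int ageOf(List<Triple> ts, String subject) {
            for (Triple t : ts) {
                if (t.s.equals(subject) && t.p.equals("age")) {
                    return Integer.parseInt(t.o);
                }
            }
            return Integer.MAX_VALUE; // unknown age: treat as not under 15
        }
    }

The hard part, as items b) and c) say, isn't this walk; it's agreeing on the URIs and getting the statements authored in the first place.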
> Well, given the political problems found in defining even the simplest B2B schema, I strongly doubt we'll ever come this far.
>
> And even if we do come this far and this huge semantic network gets implemented, the problem is making it possible (and profitable!) for authors to mark up their content in such a way that it is semantically friendly in this topic-map sense.
>
> And given the number of people who think that M$ Word is the best authoring tool, well, authoring the information will surely be the worst part of both #5 and #6.
>
> > But I have some troubles with "semantic".
> >
> > As I would say, "semantic" lies in the eye of the observer.
> > But that's more philosophical.
>
> I hope the above explains better my meaning of "semantic".
>
> > Perhaps it would be interesting to gather some ideas about what the aim of using semantic search is.
> >
> > Although the simple textual search gives a lot of bad results, it is simple to use.
>
> Correct. Both #5 and #6 might be extremely powerful but useless if people are unable to search due to usability complexity.
>
> In fact, the weak point of #5 (after talking with my girlfriend about it) is that people might believe it's broken, or that they did something wrong, if they don't see results but a list of contexts to go further into.
>
> Anyway, the above is just an example, not the best way to implement such a system.
>
> > Using a semantic search should give better results, as the elements are taken into account when generating an index, and when evaluating the result of a query.
>
> Well, not really.
>
> Suppose you don't go as far as stating that you want "Cocoon" inside the element "title".
>
> If you find "cocoon" in an HTML <title> you know this is better than finding "cocoon" in <p>, but what if you have a Chinese markup? How do you know?
>
> So, I envision something like a heuristic map for tags and tag inclusions that states the relative value of finding a word in a particular location.
>
> So,
>
>   para -> 1
>   strong -> 1
>   title -> 10
>
> then
>
>   /article/title/strong -> 10 + 1 = 11
>   /para/strong -> 1 + 1 = 2
>   /section/title -> 10
>
> and so on, which might work for every markup and be general enough to allow the inclusion of namespaces, changing the values accordingly.
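As a sketch, that mapping could be as small as a weight table plus a sum over the element path. The numbers are the ones from the example above; everything else (class and method names, the split-on-"/" path format) is assumed:

    import java.util.HashMap;
    import java.util.Map;

    public class PathWeight {

        // Heuristic per-tag values; a real mapping would be keyed by
        // (namespace, element) so different schemas can carry different weights.
        static final Map<String, Integer> WEIGHTS = new HashMap<String, Integer>();
        static {
            WEIGHTS.put("para", 1);
            WEIGHTS.put("strong", 1);
            WEIGHTS.put("title", 10);
        }

        /** Sum the weights of the elements along a path; unknown tags count 0. */
        static int weight(String path) {
            int total = 0;
            for (String tag : path.split("/")) {
                Integer w = WEIGHTS.get(tag);
                if (w != null) total += w;
            }
            return total;
        }

        public static void main(String[] args) {
            System.out.println(weight("/article/title/strong")); // 10 + 1 = 11
            System.out.println(weight("/para/strong"));          // 1 + 1 = 2
            System.out.println(weight("/section/title"));        // 10
        }
    }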
> > But some points to think about:
> >
> > 1. What does the user already need to know about the semantics of the documents?
>
> Exactly: he/she doesn't know, nor should he/she have to. This is what the heuristically associated tag values are for.
>
> > 2. Does he/she have to know that a document has an author, for example?
>
> Well, some metadata (like library indexes, for example) are very well established and might not confuse the user if presented in an advanced query form.
>
> > 3. Does he/she have to know that querying for "author:john" will search the author's name?
>
> Absolutely not! This will be done by the web application.
>
> > Perhaps all 3 issues are just a question of designing the GUI of a semantic search...
>
> Yes and no. 3) calls for a better web app, that's for sure, but 1) IMO calls for a heuristic system that is currently hardwired into the HTML nature of the web content, and that we have to abandon given the flexibility of the XML model.
>
> > Just read now http://localhost:8080/cocoon/documents/emotional-landscapes.html; I see, "semantic" means taking the XML elements into account.
>
> Yes, more or less this is the meaning I give to the word.
>
> > > > How should I index?
> > >
> > > Eh, good question :)
> > >
> > > My suggestion would be to connect the same xlink-based crawling subsystem used for the CLI to Lucene as if it were a file system, but this might require some Inversion of Control (us pushing files into Lucene rather than Lucene crawling them or reading them from disk) and thus some code changes to it.
> >
> > I understand your hint.
>
> Great!
>
> > I must admit that I never understood Cocoon's view concept.
>
> Very few do. In fact, even Giacomo didn't understand views at first when he implemented the sitemap, and they are still left in an unknown state. I hope to be able to provide some docs to shed light on this soon.
>
> > Now I see what I can do using views.
>
> Yes; without views, Cocoon would only be harmful to the semantic web effort (see a pretty old RT, "is Cocoon harmful for the semantic web", on this list, also picked up on xmlhack.com).
>
> > Perhaps adding an example to the view documentation, like
> >
> >   Try using:
> >   http://localhost:8080/cocoon/welcome?cocoon-view=content, or
> >   http://localhost:8080/cocoon/welcome?cocoon-view=links
> >
> > would help a lot. But perhaps I'm just a bit slow....
>
> No, don't worry: the concepts are pretty deep in the abstract reasoning about how a web should work in the future, and there are no docs explaining this.
>
> > I never intended to index the HTML result of a page, but the XML content (ad fontes!). Thus I was thinking about how to index an XML source.
> >
> > Or, to say it more generally: what would be a smart XML indexing strategy?
>
> Ok, second step: the indexing algorithm.
>
> Warning: I know nothing of text indexing, nor of the algorithms associated with these problems!
>
> > Let's take a snippet of http://localhost:8080/cocoon/documents/views.html?cocoon-view=content
> >
> > ----- begin
> > ....
> > <s1 title="The Views">
> > <s2 title="Introduction">
> > <p> Views are yet another sitemap component. Unlike the rest, they
> > are orthogonal to the resource and pipeline definitions. In the
> > ...
> > <s3 title="View Processing">
> > <p>The samples sitemap contains two view definitions. One of them
> > looks like the excerpt below.</p>
> > <source xml:space="preserve">
> >
> > <map:views>
> > <map:view name="content" from-label="content">
> > <map:serialize type="xml"/>
> > </map:view>
> >
> > </source>
> > ....
> > ----- end
> >
> > I see the following options:
> >
> > 1. Index only the bare text. That's simple, and stupid, as a lot of the info entered by the XML generator (human or program) is ignored.
>
> Yes. It's already powerful, as we are able, for example, to index the text of pictures out of SVG files, or PDF content without requiring PDF parsing, but it is admittedly a waste of precious information.
>
> It could be a first step, though.
>
> > 2. Try to take the element's name and/or attributes into account.
> > 3. Try to take the element's path into account.
>
> I would suggest taking the heuristic value of the path into account, rather than the path itself.
>
> > Let's see what queries an engine should answer:
> >
> > ad 1. query: "Intro"; result: all docs containing the text "Intro".
> >
> > ad 2. query: "title:Intro"; result: all docs having title elements with the text "Intro".
> >
> > ad 2. query: "source:view"; result: all docs having some source-code snippet regarding the Cocoon view concept.
> >
> > ad 3. query: "xpath:**/s2/title/Intro"; result: all docs having an s2 title "Intro". Not sure about this one, i.e. how to marry Lucene with XPath.
>
> I don't know the internals of Lucene, but maybe associating some numerical values with the text is useful to improve the ordering by importance. Well, maybe we should ask the Lucene guys about this.
>
> > I will try to implement something like that...
> >
> > Design draft:
> >
> > 1. Crawling:
> > Use the above-described Cocoon view-based crawling subsystem.
> >
> > 2. Indexing:
> > 2.1 Each element name will create a Lucene field having the same name as the element name.
> > (What about the element's namespace, should I take it into account?)
>
> Yes, it should identify the schema used, to get the heuristic mapping. Also, there could be mixed heuristic mappings, for example between the docbook namespace and the Dublin Core namespace.
>
> > 2.2 Each attribute of an element will create a Lucene field having the concatenated name of the element name and the attribute name.
> > 2.3 Have a field named "body" for the bare text.
> >
> > 3. Searching:
> > Just use the Lucene search engine.
>
> I think this is a good starting point, yes.
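A hedged sketch of what 2.1-2.3 might look like as a SAX handler pushing into Lucene; this is also the Inversion of Control mentioned above, since the pipeline feeds the indexer rather than Lucene reading from disk. The field-naming details (the "element@attribute" convention, crediting text only to its innermost element) are assumptions, and the Lucene calls are roughly the 1.x API (Field.Text):

    import java.util.Stack;

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.xml.sax.Attributes;
    import org.xml.sax.helpers.DefaultHandler;

    /** Builds one Lucene Document per XML source, per the design draft. */
    public class IndexingHandler extends DefaultHandler {

        private final Document document = new Document();
        private final StringBuffer body = new StringBuffer();
        private final Stack<StringBuffer> openElements = new Stack<StringBuffer>();

        public void startElement(String uri, String local, String qName,
                                 Attributes atts) {
            openElements.push(new StringBuffer());
            // 2.2: one field per attribute, named "element@attribute"
            for (int i = 0; i < atts.getLength(); i++) {
                document.add(Field.Text(local + "@" + atts.getLocalName(i),
                                        atts.getValue(i)));
            }
        }

        public void characters(char[] ch, int start, int length) {
            if (!openElements.isEmpty()) {
                openElements.peek().append(ch, start, length);
            }
            body.append(ch, start, length); // 2.3: all bare text also goes to "body"
        }

        public void endElement(String uri, String local, String qName) {
            // 2.1: a field named after the element; a namespace-aware variant
            // would prefix the field name with the uri, as discussed above
            String text = openElements.pop().toString().trim();
            if (text.length() > 0) {
                document.add(Field.Text(local, text));
            }
        }

        /** Call after parsing; hand the result to an IndexWriter. */
        public Document getDocument() {
            document.add(Field.Text("body", body.toString()));
            return document;
        }
    }

Option 3 (or rather the heuristic path value) could be layered on top by keeping a stack of open element names and combining it with a weight table like the one sketched earlier.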
query: "xpath:**/s2/title/Intro", result all docs having s2 title > > Intro, not sure about this how to marry lucene with xpath > > don't know the internals of Lucene, but maybe associating some numerical > values to text is useful to increase the ordering of importance. well, > maybe we should ask the lucene guys for this. > > > I will try to implement something like that... > > > > Design-Draft > > > > 1. Crawling: > > Usign the above described cocoon view-based crawling subsystem > > > > 2. Indexing: > > 2.1 Each element-name will create a lucene field having the > > same name as the element-name. > > (?What about the element's name space, should I take it into account?) > > Yes, it should identify the schema used to get the heuristic mapping. > Also, there could be mixed heuristical mappings, for example, between > docbook namespace and dublin core namespace. > > > 2.2 Each attribute of an element will create a lucene field having > > the concated name of the element-name, and the attribute-name. > > 2.3 Having a field named body for the bare text. > > > > 3. Searching > > Just use the lucene search engine. > > I think this is a good starting point, yes. > > > (btw, > > I was already playing with lucene for indexing/searching mail messages > > stored in mbox. This way I was searching the > > http://xml.apache.org/mails/200109.gz, > > > > Wouldn't it be nice to generate FAQ, etc from the mbox mail messages. > > But that's a semantic problem, as the mail messages have poor > > xml-semantic content :-) > > Yes, even if, in theory, we all use things like *STRONG* _emphasis_ LOUD > "quote" and the like. This is, in fact, markup in the most general sense > :) > > > > Note that "dynamic" has a different sense that before and it means > > > thatthe resource result is not dependent on request-based or > > > environmentalparameters (such as user-agent, date, time, machine > > > load, IP address, > > > whatever). A resource that is done aggregating a ton of documents > > > storedon a database must be considered static if it is not > > > dependent of > > > request parameters. > > > > > > For a semantic crawler, instead of asking for the "standard" view, it > > > would ask for semantic-specific views such as "content" (the most > > > semantic stage at pipeline generation, which we already specify in our > > > example sitemaps) or "schema" (not currently implemented as nobody > > > woulduse it today anyway). > > > > > > But the need of resource "views" is the key to the success of proper > > > search capabililities and we must be sure that we use them even for > > > semantically-poor searching solutions like lucene, but that would kick > > > ass anyway on small to medium size web sites. > > > > > > Hope this helps and if you have further questions, don't mind asking. > > > > thanks for your suggestions, helping a lot to understand cocoon better. > > Hope this helps even more :) > > Ciao. > > -- > Stefano Mazzocchi One must still have chaos in oneself to be > able to give birth to a dancing star. > <stefano@apache.org> Friedrich Nietzsche > -------------------------------------------------------------------- --------------------------------------------------------------------- To unsubscribe, e-mail: cocoon-dev-unsubscribe@xml.apache.org For additional commands, email: cocoon-dev-help@xml.apache.org