cocoon-dev mailing list archives

From Stefano Mazzocchi <stef...@apache.org>
Subject [RT] semantic searching
Date Wed, 31 Oct 2001 08:51:05 GMT
[resent since apparently it got lost somewhere]

Ciao,

Bernhard started a great thread about adding search capabilities with
lucene, but I'd love to give some more impressions on that.

Bernhard Huber wrote:

> > In a perfect world (but we aim for that, right?) we should have an
> > abstracted search engine behavioral interface (future compatible with
> > semantic capabilities?) and then have an Avalon component (block?) to
> > implement that.
> 
> and the search-engine understands your queries, semantically :-)

Yeah right :)

> But perhaps an advantage could be that a group of documents might
> present already some semantic keywords, stored in the documents,
> like author, and title.
> So searching for these keywords will give very good results.

I see several levels of search from the least semantic to the most
semantic:

1) regexp matching (e.g. grep): no semantics are associated with the
search, since it's up to the user to perform the semantic analysis that
leads to the regexp query to match. This results in a boolean search
(either it matches or it doesn't) and assumes the content is stored in
textual formats.

2) text search engines (e.g. AltaVista): heuristics are used to extract
sort-of semantic content from some known document types (mostly HTML)
and associate some indexing value to them. This leads to an easier user
experience.

3) metadata-based search engines (e.g. MetaCrawler): same as above, but
the <meta> HTML tag is used to give keywords a higher value in the
search. This normally gives better results, even if the keywords are
sometimes misleading.

4) hyperlink-topology based search engines (e.g. Google): they have the
ability to estimate the importance of a page given the links that refer
to it. Obviously, this can only happen when you have a "huge" pool of
pages, as Google does. Note that Google is also able to parse and index
PDF and extract heuristics from the internal graphics (font size, bold,
italic and so on).

This is the state of the art. Google is, by far, the most advanced
searching solution available, but due to its nature it cannot be applied
to a small site without losing the power of topological analysis (thus,
we go back to number 3).

Web crawlers are forced to obtain the web site information by "crawling"
it from the outside since they don't know the internals of the site.

But local search solutions can access the web site from the back end
and index it (see htdig, for example, or Oracle text search tools if
the text is stored in their databases).

All these solutions work as a restricted version of #3 above, but they
are based on the assumption that the URI space can be easily mapped to
the internal request.

Apache might show you the opposite (at first!), but Cocoon shows this is
very unlikely to be the case since it's generally a mistake to map a
file system (or a directory server, or a database repository) one-2-one
with the URI space, since it leads to easily broken links and potential
security issues.

This is why crawling is the only way to go, but since outside access
reduces the visibility of some internal information that might increase
the semantic capacity of the indexer, Cocoon provides "views" (you can
think of them as "windows", but not in the M$ sense) to the resources.

This said, we can now have access to the original content of the
resource. For example, we can now index the text inside a logo, if we
are given the SVG content that generated the raster image. Or can index
the PDF content without having to implement a PDF parser since we
request the "content" view of the resource and we obtain an easily
parsable XML file.
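
A minimal sketch of the logo case, assuming the "content" view hands
back the SVG source as plain XML. The sample logo and the function name
are made up for illustration; only the Python standard library is used:

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"

# Hypothetical SVG source of a logo, as a "content" view might return it.
LOGO_SVG = """<svg xmlns="http://www.w3.org/2000/svg" width="200" height="60">
  <rect width="200" height="60" fill="navy"/>
  <text x="10" y="40" font-size="32">Apache <tspan font-weight="bold">Cocoon</tspan></text>
</svg>"""


def svg_text(svg_source):
    """Return the text carried by every <text> element of an SVG document."""
    root = ET.fromstring(svg_source)
    texts = []
    for node in root.iter("{%s}text" % SVG_NS):
        # itertext() also walks nested elements such as <tspan>;
        # split/join normalizes whitespace.
        texts.append(" ".join("".join(node.itertext()).split()))
    return texts


print(svg_text(LOGO_SVG))
```

The words in the logo become indexable without any image analysis,
which is the whole point of asking for the source rather than the
raster result.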

Now, in a perfect world (again!), we could have a browser that allows us
to add specific HTTP headers to the request; therefore, we could have
Cocoon react to an HTTP header to know which view (also known as
resource "variant" in the HTTP spec) was requested.

The current way for Cocoon to access views is fixed as a special URI
query parameter "cocoon-view", but I think we should extend the feature
to:

 1) react on a "variant" HTTP header (nothing cocoon specific since the
concept could be implemented later on by other publishing frameworks)
 2) react on URI extension: for example http://host/path/file.view, that
is something that I normally do by hand in my sitemaps (where
http://host/path/index is the default resource and index.content is the
XML view of the content).
 3) react on URI query parameter (as we do today).
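
A rough sketch of how the three lookups could be combined. This is not
Cocoon code: the header name "variant", the set of known views and the
precedence order (header, then extension, then query parameter) are
assumptions made for the example.

```python
from urllib.parse import urlsplit, parse_qs

KNOWN_VIEWS = {"content", "links"}   # view names that appear in this thread
VIEW_HEADER = "variant"              # hypothetical header name


def resolve_view(url, headers=None):
    """Return (view, resource_path); view is None for the default view."""
    headers = {k.lower(): v for k, v in (headers or {}).items()}
    parts = urlsplit(url)
    path = parts.path

    # 1) a "variant" HTTP header
    view = headers.get(VIEW_HEADER)
    if view in KNOWN_VIEWS:
        return view, path

    # 2) a URI extension naming a view, e.g. /path/index.content
    base, dot, ext = path.rpartition(".")
    if dot and ext in KNOWN_VIEWS:
        return ext, base

    # 3) the "cocoon-view" query parameter (what Cocoon does today)
    values = parse_qs(parts.query).get("cocoon-view", [])
    if values and values[0] in KNOWN_VIEWS:
        return values[0], path

    return None, path


print(resolve_view("http://host/path/index.content"))
print(resolve_view("http://host/path/index?cocoon-view=links"))
```

Restricting the extension check to known view names keeps ordinary
extensions such as ".html" pointing at the default resource.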

You could suggest making this user-definable in the sitemap: well,
while the views are user definable (even if a number will be suggested
as a solid contract to allow indexing of other cocoons), I wouldn't like
this to become too flexible since this is a solid contract that, if
broken, doesn't allow a crawler to obtain semantic information on a site
it doesn't own.

Ok, now, let us suppose we have our good Cocoon in place with a bunch of
XML content and a way (thru resource views) to obtain the most semantic
version of this content. What can we do with it?

5) schema based search engines: as markup is bidimensional (text + tag),
we can now look for the text "text" inside the tag "tag". So, if you
know the schema used (say, docbook), you can place a query such as 

 search for "cocoon" 
  in elements "title|subtitle" 
  of namespace "http://docbook.org/*" 
  with xml:lang "undefined|EN"
 
that will return the documents that happen to have the text "cocoon"
inside their "title" or "subtitle" elements, associated with a namespace
starting with the "http://docbook.org/" URL, and using the English
language or having no language definition.

I call this "schema based" assuming that each schema has an associated
namespace.
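
To make the query above concrete, here is a small sketch that evaluates
it against a single document held in memory. It assumes the "*" in the
namespace means "starts with", that a missing xml:lang counts as
"undefined", and that matching is case-insensitive; the sample document
and the function name are invented.

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical document, only for illustration.
DOC = """<article xmlns="http://docbook.org/ns/docbook">
  <title>Cocoon views</title>
  <subtitle xml:lang="it">Introduzione a Cocoon</subtitle>
  <para>Cocoon is a publishing framework.</para>
</article>"""


def find(xml_source, text, elements, namespace_prefix, langs):
    """Return (element, lang, content) for every element matching the query."""
    wanted_langs = {lang.lower() for lang in langs}
    hits = []

    def walk(node, inherited_lang):
        lang = node.get(XML_LANG, inherited_lang)   # xml:lang is inherited
        if node.tag.startswith("{"):
            ns, _, local = node.tag[1:].partition("}")
        else:
            ns, local = "", node.tag
        content = "".join(node.itertext())
        if (local in elements
                and ns.startswith(namespace_prefix)
                and lang.lower() in wanted_langs
                and text.lower() in content.lower()):
            hits.append((local, lang, " ".join(content.split())))
        for child in node:
            walk(child, lang)

    walk(ET.fromstring(xml_source), "undefined")
    return hits


print(find(DOC, "cocoon", {"title", "subtitle"},
           "http://docbook.org/", {"undefined", "EN"}))
```

Only the title is reported: the subtitle is in Italian and the
paragraph is not one of the requested elements.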

Note that this is also capable of performing metadata evaluation: a
query such as

 search for "Stefano" and "Mazzocchi"
  in elements "author"
  of namespace "http://dublin-core.org/*"

will work on the metadata markup associated with the dublin core
namespace.

Note also that, as with many search engines, this is a very powerful
syntax, but it is pretty unlikely that a user with no XML knowledge will
be able to use it.

There are possible ways of creating such a query, one being the one used
by xyzsearch.com, which builds a complex schema-based query through an
incremental process (they claim a patent on that, but you can patent a
process, not an idea, and they don't have Cocoon views under their
process):

 a) search for "Cocoon"

     Search for [Cocoon         ]

            search | continue >>

 b) it returns the list of schemas associated with the elements where
the word Cocoon was found and lists a human readable definition of that
schema. For example:

    Markups where "Cocoon" was found:
  
        [ ] Zoological Markup Language
        [ ] Docbook
        [ ] Motion Pictures Description Language

   << back | search | continue >>

  c) then you click on the markup you'd like to choose (hopefully
understanding from the human description of the namespace what the
language is about).
  d) then it provides you with the list of languages it was found in:

   Languages where the term "Cocoon" was found within markup "Docbook":

        [ ] undefined
        [ ] English (general)
        [ ] Italian

   << back | search | continue >>

  e) then you click on the language and it asks you to indicate which
tags you'd like

   Contexts where the term "Cocoon" was found within markup "Docbook" 
   and language "undefined" or "English":
 
       [ ] title : the title of the document
       [ ] subtitle : the subtitle of the document
       [ ] para : a paragraph
       [ ] strong : outlines important words

   << back | search | continue >>

And so on, until the user hits the "search" button and then the list is
presented.
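
The incremental process boils down to narrowing a list of hits one
facet at a time. A toy sketch, assuming the indexer can report for each
occurrence of the term its document, markup, language and element (the
data and the names below are invented):

```python
from collections import namedtuple

Hit = namedtuple("Hit", "doc markup lang element")

# Hypothetical index entries for the term "Cocoon".
HITS = [
    Hit("d1.xml", "Docbook", "undefined", "title"),
    Hit("d1.xml", "Docbook", "undefined", "para"),
    Hit("d2.xml", "Docbook", "Italian", "title"),
    Hit("d3.xml", "Zoological Markup Language", "English", "para"),
]


def narrow(hits, **chosen):
    """Keep the hits whose fields are among the values ticked so far."""
    return [h for h in hits
            if all(getattr(h, field) in values
                   for field, values in chosen.items())]


def choices(hits, field):
    """Distinct values of `field`: the checkboxes shown at the next step."""
    return sorted({getattr(h, field) for h in hits})


# step b: which markups contain the term?
print(choices(HITS, "markup"))
# step d: languages within the chosen markup
step_d = narrow(HITS, markup={"Docbook"})
print(choices(step_d, "lang"))
# step e: contexts within the chosen markup and languages
step_e = narrow(HITS, markup={"Docbook"}, lang={"undefined", "English"})
print(choices(step_e, "element"))
# final search
final = narrow(HITS, markup={"Docbook"}, lang={"undefined", "English"},
               element={"title"})
print([h.doc for h in final])
```

Each screen is just choices() over whatever narrow() has left, so the
user can hit "search" at any step and get the current list.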

In order to implement the above we need:

 a) a bunch of valid XML documents

 b) a register of namespaces -> schemas, along with some human readable
description of tags and schemas (which can be provided with the
XMLSchema schema itself)

 c) an xml-based storage system with advanced query capabilities (XPath
or even better, XQL).

 d) a view capable web publishing system.

 e) a view-based schema-aware crawler and indexer.

 f) a web application that connects to the indexer and provides the
above user experience.

These are all independent concern islands. The contracts are:

 a) and b) are stored into c) (IMO, WebDAV or CVS would be the best
contracts here, allowing editors to edit the files as if they were on a
file system)

 d) uses c) as semi-structured data repository (XMLDB API being the
contract, or something equivalent)

 e) uses d) to obtain the semantic content and index the site (HTTP and
views being the contract)

 f) uses e) to provide the search experience (no contract defined here,
probably the software API or some general-enough searching API, maybe
even Lucene's if powerful enough)

There is still a long way to go to have the entire system in place, but
now that we have both a native XML DB and an indexing engine under
Apache, I hope this is going to move faster.

Of course, the editing part remains the most difficult one to solve :/

7) semantic search engine: if you have read this far, I presume you'd
consider the above #6 a kick ass search engine and would likely stop
there.

Well, there is more and this is where the semantic web effort kicks in.

The previous technology (#6 from now onward) requires a bunch of
software that is yet to be written, but it's very much likely to happen.
Or, at least, I don't see any technical nor social reason why this
should not happen.

This, unfortunately, cannot be said for a semantic search engine (#7).

Let's start from outer space: you know what "semantic networks" are,
right? They are also known as "topic maps" (see www.topicmaps.org for
more details) and they represent a topological connection of "concepts",
along with their relationships.

The basic idea is the following:

 1) suppose you have a bunch of semantically marked-up content
 2) each important resource (not a web resource, but a semantic
resource, i.e. a word) is properly described in absolute and unique
terms. That is, currently, with an associated unique URI.
 3) there are semantic networks that describe relationships between
these resources

With this infrastructure in place, it is virtually possible to use basic
inference rules to "crawl" the semantic networks and obtain search
derivatives which are semantically meaningful.

Let's make an example:

 1) suppose that your homepage states that you have two children: Bob
and Susan. Bob is a 6-year-old boy and Susan is a 12-year-old girl.
You are 42 and live in Boston.
 2) suppose that you used proper markup (say RDF) to describe these
relationships.
 
 3) now, a semantic crawler comes and indexes this information.

 4) it is virtually possible, then, to ask for something like "give me
the names of those men in Boston who have two or more children under 15"
without requiring any heuristic artificial intelligence.
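
A sketch of that query over plain (subject, predicate, object) triples,
which is roughly what an RDF description reduces to. The identifiers
("john", "hasChild", "livesIn" and so on) are invented for the example;
a real system would use URIs and an RDF store.

```python
# (subject, predicate, object) triples; all identifiers are hypothetical.
TRIPLES = [
    ("john", "gender", "male"), ("john", "age", 42),
    ("john", "livesIn", "Boston"),
    ("john", "hasChild", "bob"), ("john", "hasChild", "susan"),
    ("bob", "age", 6), ("susan", "age", 12),
    ("ann", "gender", "female"), ("ann", "livesIn", "Boston"),
    ("ann", "hasChild", "tom"), ("tom", "age", 3),
]


def objects(subject, predicate):
    """All objects o such that (subject, predicate, o) is a known triple."""
    return [o for s, p, o in TRIPLES if s == subject and p == predicate]


def men_with_young_children(city, min_children=2, max_age=15):
    """Men living in `city` with at least `min_children` children under `max_age`."""
    result = []
    for person in sorted({s for s, _, _ in TRIPLES}):
        if "male" not in objects(person, "gender"):
            continue
        if city not in objects(person, "livesIn"):
            continue
        young = [child for child in objects(person, "hasChild")
                 if any(age < max_age for age in objects(child, "age"))]
        if len(young) >= min_children:
            result.append(person)
    return result


print(men_with_young_children("Boston"))
```

The answer falls out of following relationships only; nothing in the
code has to guess what the page "means".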

Now, in order to obtain this we need:

 a) the infrastructure of #6

 b) a huge list of topics along with their unique meaning (unique in
this case means that each topic (say "father") must have one and only
one URI (say "http://www.un.org/topics/mankind/family/father")
associated, or topic maps that state the formal equivalence of topics).

 c) topic maps that state the relationships of those topics

 d) a way to create the query in a user-friendly way

Well, given the political problems found in defining even the most
simple B2B schema, I strongly doubt we'll ever come this far.

And even if we do come this far and this huge semantic network gets
implemented, the problem is making it possible (and profitable!) for
authors to markup their content in such a way that they are semantic
friendly in this topic-map sense.

And given the number of people who think that M$ Word is the best
authoring tool, well, authoring the information will surely be the worst
part of both #6 and #7.

> But I have some troubles with "semantic".
> 
> As I would say "semantic" lies in the eye of the observer.
> But that's more philosophical.

I hope the above better explains what I mean by "semantic".
 
> Perhaps it would be interesting to gather some ideas,
> about what's the aim of using semantic search.
> 
> Although the simple textual search gives a lot of bad results,
> it is simple to use.

Correct. Both #6 and #7 might be extremely powerful but useless if
people are unable to search due to usability complexity.

In fact, the weak point of #6 (after talking with my girlfriend about
it) is that people might believe it's broken, or that they did something
wrong, if they see not results but a list of contexts to go further.

Anyway, the above is just an example, not the best way to implement such
a system.
 
> Using a semantic search should give better results, as the
> elements are taken into account when generating an index,
> and when evaluating the result of a query.

Well, not really.

Suppose you don't go as far as stating that you want "Cocoon" inside the
element "title".

If you find "cocoon" in HTML <title> you know this is better than
finding "cocoon" in <p>, but what if you have a Chinese markup? How do
you know?

So, I envision something like a heuristical map for tags and tag
inclusions that states the relative value of finding a word in a
particular location.

So, 

 para -> 1
 strong -> 1
 title -> 10

then

 /article/title/strong -> 10 + 1 = 11
 /para/strong -> 1 + 1 = 2
 /section/title -> 10

and so on, which might work for every markup and be general enough to
allow inclusion of namespaces and change the values depending on this.
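
The arithmetic above is easy to pin down in code. A minimal sketch,
assuming that elements missing from the map (article, section) count as
zero; the per-namespace override is only hinted at through the optional
argument:

```python
# Relative value of finding a word inside a given element.
WEIGHTS = {"para": 1, "strong": 1, "title": 10}


def path_value(path, weights=WEIGHTS):
    """Sum the weights of the elements along a path such as /article/title.

    Elements missing from the map count as 0 (an assumption). A different
    `weights` map could be passed per namespace.
    """
    steps = [step for step in path.split("/") if step]
    return sum(weights.get(step, 0) for step in steps)


for p in ("/article/title/strong", "/para/strong", "/section/title"):
    print(p, "->", path_value(p))
```

This reproduces the three values listed above (11, 2 and 10).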

> But some points to think about:
> 1. What should the user already know about the semantics of the
> documents?

Exactly, he/she doesn't know, nor should he/she. This is what the
heuristic values associated with tags are for.
 
> 2. Does he/she have to know that a document has an author, for example?

Well, some metadata (like library indexes, for example) are very well
established and might not confuse the user if presented in an advanced
query form.
 
> 3. Does he/she have to know that querying for author entering
> "author:john" will search for the author's name?

Absolutely not! This will be done by the web application.
 
> Perhaps all 3 issues are just a question of designing the GUI of
> a semantic search...

Yes and no. 3) calls for a better web app, that's for sure, but 1) IMO
calls for a heuristic system that currently is hardwired into the HTML
nature of the web content, but that we have to abandon given the
flexibility of the XML model.
 
> Just read now
> http://localhost:8080/cocoon/documents/emotional-landscapes.html,
> I see, semantic means taking the xml elements into account.

Yes, more or less this is the meaning I give to the word.
 
> > > How should I index?
> >
> > Eh, good question :)
> >
> > My suggestion would be to connect the same xlink-based crawling
> > subsystem used for CLI to lucene as if it were a file system, but
> > this might require some Inversion of Control (us pushing files into
> > lucene and not lucene to crawl them or read them from disk) thus
> > some code changes to it.
>
> I understand your hint.

Great!
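
The inversion of control in the quoted suggestion could look roughly
like the sketch below: the crawler walks the site and pushes each
document into the indexer, instead of the indexer reading files. The
three callables are placeholders. In a real setup fetch_links and
fetch_content might request the "links" and "content" views over HTTP,
and index would hand the XML to Lucene; none of that is shown here.

```python
from collections import deque


def crawl(start, fetch_links, fetch_content, index):
    """Breadth-first crawl that pushes every resource into `index`."""
    seen, queue = {start}, deque([start])
    while queue:
        uri = queue.popleft()
        index(uri, fetch_content(uri))   # we push; the indexer never pulls
        for link in fetch_links(uri):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return seen


# A fake three-page site standing in for the HTTP requests.
SITE_LINKS = {"/index": ["/a", "/b"], "/a": ["/index", "/b"], "/b": []}
SITE_CONTENT = {"/index": "<page>home</page>",
                "/a": "<page>a</page>",
                "/b": "<page>b</page>"}

indexed = []
crawl("/index", SITE_LINKS.__getitem__, SITE_CONTENT.__getitem__,
      lambda uri, xml: indexed.append(uri))
print(indexed)
```

Keeping the indexer behind a plain callable is what would let the same
crawler feed Lucene or anything else.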

> I must admit that I never understood cocoon's view concept.

Very few do. In fact, even Giacomo didn't understand them at first when
he implemented the sitemap and they are still left in an unknown state.
I hope to be able to provide some docs to shed some light on this soon.

> Now I see what I can do using views.

Yes, without views, Cocoon will be only harmful for the semantic web
effort (see a pretty old RT "is Cocoon harmful for the semantic web" on
this list, also picked up on xmlhack.com).

> Perhaps adding an example in the view documentation, like
> Try using:
> http://localhost:8080/cocoon/welcome?cocoon-view=content, or
> http://localhost:8080/cocoon/welcome?cocoon-view=links
> would help a lot.
> But perhaps I'm just a bit slow....

No, don't worry, the concepts are pretty deep into the abstract
reasoning of how a web should work in the future and there are no docs
explaining this.

> I never supposed to index the html result of a page,
>  but the xml content (ad fontes!).
> Thus I was thinking about how to index an xml source.
> 
> Or, put more generally:
> What would be a smart xml indexing strategy?

Ok, second step: the indexing algorithm.

Warning: I know nothing of text indexing nor the algorithms associated
to these problems!
 
> Lets take an snippet of
> http://localhost:8080/cocoon/documents/views.html?cocoon-view=content
> 
> ----- begin
> ....
> <s1 title="The Views">
> <s2 title="Introduction">
> <p> Views are yet another sitemap component. Unlike the rest, they
>     are orthogonal to the resource and pipeline definitions. In the
> ...
> <s3 title="View Processing">
> <p>The samples sitemap contains two view definitions. One of them
>      looks like the excerpt below.</p>
> <source xml:space="preserve">
> 
>   &lt;map:views&gt;
>      &lt;map:view name="content" from-label="content"&gt;
>      &lt;map:serialize type="xml"/&gt;
>   &lt;/map:view&gt;
> 
>      </source>
> ....
> ----- end
> 
> I see following options:
> 1. Index only the bare text. That's simple, and stupid,
> as a lot of info entered by the xml generator (human, program)
> is ignored.

Yes. It's already powerful as we are able, for example, to index picture
text out of SVG files or PDF files without requiring PDF parsing, but it
is admittedly a waste of precious information.

It could be a first step, though.

> 2. Try to take the element's name, and/or attributes into account.
> 3. Try to take the elements path into account.

I would suggest taking the heuristical value of the path into account,
rather than the path itself.
 
> Let's see what queries an engine should answer:
> ad 1. query: "Intro", result: all docs having text Intro
> 
> ad 2. query: "title:Intro", result: all docs having title elements with
> text Intro.
> 
> ad 2. query: "source:view", result: all docs having some source code
> snippet regarding cocoon view concept.
> 
> ad 3. query: "xpath:**/s2/title/Intro", result all docs having s2 title
> Intro, not sure about this how to marry lucene with xpath

I don't know the internals of Lucene, but maybe associating some
numerical values with text would be useful to order results by
importance. Well, maybe we should ask the Lucene guys about this.
 
> I will try to implement something like that...
> 
> Design-Draft
> 
> 1. Crawling:
>   Using the above described cocoon view-based crawling subsystem
> 
> 2. Indexing:
> 2.1 Each element-name will create a lucene field having the
>   same name as the element-name.
>   (?What about the element's name space, should I take it into account?)

Yes, it should identify the schema used to get the heuristic mapping.
Also, there could be mixed heuristical mappings, for example, between
docbook namespace and dublin core namespace.
 
> 2.2 Each attribute of an element will create a lucene field having
>   the concatenated name of the element-name, and the attribute-name.
> 2.3 Having a field named body for the bare text.
> 
> 3. Searching
>   Just use the lucene search engine.

I think this is a good starting point, yes.
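
A sketch of step 2 of the draft, stopping short of Lucene itself: it
only builds the field name -> values mapping that would then be handed
to the indexer. The "@" separator for attribute fields, the decision to
drop the namespace from field names, and the use of each element's own
text (not that of its descendants) are assumptions, not part of the
draft.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Shortened from the views.html snippet quoted above.
SAMPLE = """<s1 title="The Views">
  <s2 title="Introduction">
    <p>Views are yet another sitemap component.</p>
  </s2>
</s1>"""


def to_fields(xml_source):
    """Map an XML document to {field name: [values]} following the draft."""
    root = ET.fromstring(xml_source)
    fields = defaultdict(list)
    for node in root.iter():
        name = node.tag.rpartition("}")[2]       # drop "{namespace}" if any
        # 2.1 element name -> field holding the element's own text
        own = [node.text or ""] + [child.tail or "" for child in node]
        text = " ".join(" ".join(own).split())
        if text:
            fields[name].append(text)
        # 2.2 element name + attribute name -> field
        for attr, value in node.attrib.items():
            fields[name + "@" + attr.rpartition("}")[2]].append(value)
    # 2.3 a "body" field with the bare text of the whole document
    fields["body"].append(" ".join("".join(root.itertext()).split()))
    return dict(fields)


for field, values in to_fields(SAMPLE).items():
    print(field, values)
```

One open point this makes visible: a document that really has an
element called "body" would collide with the bare-text field, so the
field for 2.3 may need a name that cannot be an element name.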
 
> (btw,
> I was already playing with lucene for indexing/searching mail messages
> stored in mbox. This way I was searching the
> http://xml.apache.org/mails/200109.gz,
> 
> Wouldn't it be nice to generate FAQ, etc from the mbox mail messages.
> But that's a semantic problem, as the mail messages have poor
> xml-semantic content :-)

Yes, even if, in theory, we all use things like *STRONG* _emphasis_ LOUD
"quote" and the like. This is, in fact, markup in the most general sense
:)

> > Note that "dynamic" has a different sense than before and it means
> > that the resource result is not dependent on request-based or
> > environmental parameters (such as user-agent, date, time, machine
> > load, IP address, whatever). A resource that is done aggregating a
> > ton of documents stored on a database must be considered static if
> > it is not dependent on request parameters.
> >
> > For a semantic crawler, instead of asking for the "standard" view, it
> > would ask for semantic-specific views such as "content" (the most
> > semantic stage at pipeline generation, which we already specify in our
> > example sitemaps) or "schema" (not currently implemented as nobody
> > would use it today anyway).
> >
> > But the need of resource "views" is the key to the success of proper
> > search capabilities and we must be sure that we use them even for
> > semantically-poor searching solutions like lucene, but that would kick
> > ass anyway on small to medium size web sites.
> >
> > Hope this helps and if you have further questions, don't mind asking.
> 
> thanks for your suggestions, helping a lot to understand cocoon better.

Hope this helps even more :)

Ciao.

-- 
Stefano Mazzocchi      One must still have chaos in oneself to be
                          able to give birth to a dancing star.
<stefano@apache.org>                             Friedrich Nietzsche
--------------------------------------------------------------------


