From: "Bernhard Huber"
To: cocoon-dev@xml.apache.org
Date: Sun, 28 Oct 2001 21:47:40 GMT
Subject: Re: Subject: Lucene as Avalon Component?

hi, Stefano

----- Original Message -----
From: Stefano Mazzocchi
Date: Sunday, October 28, 2001 12:21 pm
Subject: Re: Subject: Lucene as Avalon Component?

> Bernhard, perfect timing! I was thinking about the same thing the
> other day.
>
> Bernhard Huber wrote:
> >
> > hi,
> > I'm taking a look at Lucene, a nice search engine.
> > As Cocoon2 claims to be an XML publishing engine,
> > some sort of searching feature would be quite nice.
>
> Yes, this is very true.
>
> > Now I'm a bit confused about how to make it usable under Cocoon2.
> > Should I write a generator for the searching part of Lucene?
> > Should I encapsulate the indexing and searching as
> > an Avalon component?
>
> In a perfect world (but we aim for that, right?) we should have an
> abstracted search-engine behavioral interface (future-compatible with
> semantic capabilities?) and then have an Avalon component (block?) to
> implement that.

...and the search engine understands your queries semantically :-)
But perhaps an advantage could be that a group of documents already
presents some semantic keywords stored in the documents, like author and
title. Searching for these keywords will give very good results.

> Then, a cocoon component (a generator or a transformer, depending on
> the syntax of the query language being XML or not) can use the avalon
> component to power itself and generate the XML event stream.

Yup, that would be nice. Moreover, we can use the XML event stream not
only for generating the answer to the search query/request, but also to
evaluate some hit statistics, as the XML event stream can be handled like
any static XML page source.

> Note that both Lucene and dbXML (probably going to be called Apache
> Xindice, from the Latin word "indice" -> "index") could power this: the
> first as an indexer of the textual part (final pipeline results), while
> the second is an indexer of the semantic part (starting pipeline
> sources).
>
> Obviously, a semantic approach is very likely to yield much better
> results, but it requires a completely different way of doing search
> (look at xyzsearch.com, for example), while Lucene is simply doing
> textual heuristics.

I will try to check xyzsearch.com. But I have some trouble with
"semantic": I would say "semantic" lies in the eye of the observer. But
that's more philosophical. Perhaps it would be interesting to gather some
ideas about what the aim of using semantic search is. Although the simple
textual search gives a lot of bad results, it is simple to use.
Using a semantic search should give better results, as the elements are
taken into account when generating the index and when evaluating the
result of a query. But some points to think about:

1. What does the user already have to know about the semantics of the documents?
2. Does he/she have to know that a document has an author, for example?
3. Does he/she have to know that entering "author:john" will search the author's name?

Perhaps all three issues are just a question of designing the GUI of a
semantic search... I have just read
http://localhost:8080/cocoon/documents/emotional-landscapes.html, and I
see: "semantic" means taking the XML elements into account.

> This said, it's also likely that the two approaches are so different
> that a single behavioral interface will be either too general or too
> simple to cover both cases, so, probably, both a textual search
> interface and a markup search interface will be required.
>
> > How should I index?
>
> Eh, good question :)
>
> My suggestion would be to connect the same xlink-based crawling
> subsystem used for CLI to Lucene as if it were a file system, but this
> might require some Inversion of Control (us pushing files into Lucene
> and not Lucene crawling them or reading them from disk), thus some code
> changes to it.

I understand your hint. I must admit that I never understood Cocoon's
view concept; now I see what I can do using views. Perhaps adding an
example to the view documentation, like "Try using
http://localhost:8080/cocoon/welcome?cocoon-view=content or
http://localhost:8080/cocoon/welcome?cocoon-view=links", would help a
lot. But perhaps I'm just a bit slow....

I never intended to index the HTML result of a page, but the XML content
(ad fontes!). Thus I was thinking about how to index an XML source. Or,
put more generally: what would be a smart XML indexing strategy?

Let's take a snippet of
http://localhost:8080/cocoon/documents/views.html?cocoon-view=content:

----- begin ....

Views are yet another sitemap component. Unlike the rest, they are orthogonal to the resource and pipeline definitions. In the ...

The samples sitemap contains two view definitions. One of them looks like the excerpt below.

<map:views>
  <map:view name="content" from-label="content">
    <map:serialize type="xml"/>
  </map:view>
  ....
----- end

I see the following options:

1. Index only the bare text. That's simple, and stupid, as a lot of the
   information entered by the XML generator (human or program) is ignored.
2. Take the element names and/or attributes into account.
3. Take the element path into account.

Let's see what queries an engine should answer:

ad 1. query: "Intro", result: all docs containing the text Intro.
ad 2. query: "title:Intro", result: all docs having title elements with the text Intro.
ad 2. query: "source:view", result: all docs having a source code snippet regarding the Cocoon view concept.
ad 3. query: "xpath:**/s2/title/Intro", result: all docs having an s2 title Intro.
      Not sure how to marry Lucene with XPath here.

> > Let's say I want to provide one or more sub-sitemaps
> > a searching feature, and let's say the index is already
> > generated, how can I calculate from the internal sitemap URL
> > to the public browser URL?
> >
> > For example I have an index over all /docs/samples/*/* files,
> > how can I detect that they are all mapped to the URL
> > http://machine/*/*?
> >
> > any ideas are welcome?
>
> The CLI subsystem works by starting at a URI, asking for the "link"
> view of that URI (Cocoon will then return a newline-separated list of
> linked URIs created out of all those links that contain xlink:href=""
> or src="" or href="" attributes), then recursively calling itself on
> every linked URI.
>
> When it reaches a leaf (a page with no further links, or links that
> were already visited), it asks for the "link-translated" view of the
> URI, passing in POST to the request the newline-separated list of links
> so that Cocoon knows how to regenerate an adapted version of the
> resource (this is useful to maintain link consistency when moved onto a
> file system while working on the original link semantics; it works for
> every file format, even for PDF, because link translation happens
> transparently before serialization takes place).
>
> The last operation is URI mangling where, depending on the given MIME
> type of the returned resource, the proper extension is added to the
> file name and the resource is saved on disk.
>
> Another important feature is that the "link" view also indicates as
> "dynamic" those links that have a particular xlink role (behavior),
> xlink:role="dynamic", so they are skipped by the CLI generation and a
> placeholder is written (that might redirect to the original URI, for
> example).
>
> So, currently, indexers like Lucene assume that what goes out of a web
> server is what is already in (at least for static pages). Cocoon
> doesn't work that way.
>
> So, the indexer should crawl from the end side (the web side, just like
> big search engines do) and not assume anything about how the files are
> generated internally.
>
> The only difference is that Cocoon implements a standard behavior of
> resource views, and we can use those to gain more information about the
> requests without missing the semantic information that Cocoon already
> stores (such as the xlink information).
> So, IMO, the most elegant and effective solution would be to connect
> Lucene to the Cocoon view-based crawling subsystem:
>
> 1) start with some URI (the root, mostly)
> 2) obtain the link view of the resource
> 3) recursively call itself on non-dynamic links until a leaf is reached
> 4) obtain the leaf resource (performing translation to adapt the
>    cocoon-relative URIs to the site-relative URIs) and push it into Lucene
> 5) continue until all leaves are processed.

I will try to implement something like that...

Design-Draft (rough sketches for all three parts are in the PS below):

1. Crawling: use the Cocoon view-based crawling subsystem described above.

2. Indexing:
2.1 Each element name will create a Lucene field having the same name as
    the element name. (What about the element's namespace, should I take
    it into account?)
2.2 Each attribute of an element will create a Lucene field having the
    concatenated name of the element name and the attribute name.
2.3 Have a field named "body" for the bare text.

3. Searching: just use the Lucene search engine.

(Btw, I was already playing with Lucene for indexing/searching mail
messages stored in mbox format. This way I was searching
http://xml.apache.org/mails/200109.gz. Wouldn't it be nice to generate
FAQs etc. from the mbox mail messages? But that's a semantic problem, as
the mail messages have poor XML-semantic content :-) )

> Note that "dynamic" has a different sense than before and it means that
> the resource result is dependent on request-based or environmental
> parameters (such as user-agent, date, time, machine load, IP address,
> whatever). A resource that is created by aggregating a ton of documents
> stored in a database must be considered static if it is not dependent
> on request parameters.
>
> For a semantic crawler, instead of asking for the "standard" view, it
> would ask for semantic-specific views such as "content" (the most
> semantic stage of pipeline generation, which we already specify in our
> example sitemaps) or "schema" (not currently implemented as nobody
> would use it today anyway).
>
> But the need for resource "views" is the key to the success of proper
> search capabilities, and we must be sure that we use them even for
> semantically-poor searching solutions like Lucene, which would kick
> ass anyway on small to medium size web sites.
>
> Hope this helps and if you have further questions, don't mind asking.

thanks for your suggestions, helping a lot to understand cocoon better.
bye
berni
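
PS: To check whether 2.1-2.3 of the Design-Draft can really be written
down that way, here is a first, untested sketch of the indexing part. The
class name is made up (nothing of this exists in Cocoon yet), and I am
assuming the Lucene API roughly as in the current release (Field.Text /
Field.Keyword factories, Document.add).

----- begin LuceneDocumentHandler.java (sketch)
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

/**
 * Untested sketch: builds one Lucene Document per XML source,
 * following the draft rules 2.1-2.3 above.
 */
public class LuceneDocumentHandler extends DefaultHandler {

    private final Document doc = new Document();
    private final StringBuffer body = new StringBuffer();
    private String currentElement;

    public LuceneDocumentHandler(String uri) {
        // store the public URI so a search result can link back to the page
        doc.add(Field.Keyword("url", uri));
    }

    public void startElement(String ns, String local, String qName, Attributes atts) {
        // open question from 2.1: the namespace (ns) is simply ignored for now
        currentElement = local;
        // 2.2: attribute -> field named "<element>@<attribute>" (naming scheme still open)
        for (int i = 0; i < atts.getLength(); i++) {
            doc.add(Field.Text(local + "@" + atts.getQName(i), atts.getValue(i)));
        }
    }

    public void characters(char[] ch, int start, int len) {
        String text = new String(ch, start, len);
        // 2.1: the element name becomes the Lucene field name
        if (currentElement != null) {
            doc.add(Field.Text(currentElement, text));
        }
        // 2.3: everything also goes into the "body" field for bare-text queries
        body.append(text).append(' ');
    }

    public void endDocument() {
        doc.add(Field.Text("body", body.toString()));
    }

    public Document getDocument() {
        return doc;
    }
}
----- end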
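
A similarly rough sketch of the crawling part (1.), using the views as in
the welcome example above. I am assuming here that the links view returns
absolute URIs, one per line, and that a plain GET on "?cocoon-view=links"
and "?cocoon-view=content" is enough; link translation and the "dynamic"
links are ignored completely for now. It feeds the LuceneDocumentHandler
from the indexing sketch.

----- begin ViewCrawler.java (sketch)
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.HashSet;
import java.util.Set;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.apache.lucene.index.IndexWriter;

/**
 * Untested sketch of the crawling part: walk the "links" view and index
 * the "content" view of every visited URI.
 */
public class ViewCrawler {

    private final Set visited = new HashSet();
    private final IndexWriter writer;
    private final SAXParser parser;

    public ViewCrawler(IndexWriter writer) throws Exception {
        this.writer = writer;
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // so the handler sees local element names
        this.parser = factory.newSAXParser();
    }

    public void crawl(String uri) throws Exception {
        if (!visited.add(uri)) {
            return; // already indexed
        }
        // 4) obtain the content view of the resource and push it into Lucene
        LuceneDocumentHandler handler = new LuceneDocumentHandler(uri);
        parser.parse(new URL(uri + "?cocoon-view=content").openStream(), handler);
        writer.addDocument(handler.getDocument());

        // 2) obtain the link view (assumed: absolute URIs, one per line)
        BufferedReader links = new BufferedReader(new InputStreamReader(
                new URL(uri + "?cocoon-view=links").openStream()));
        String link;
        // 3) + 5) recursively visit every linked URI
        // (TODO: skip links marked "dynamic", do the link translation)
        while ((link = links.readLine()) != null) {
            if (link.length() > 0) {
                crawl(link);
            }
        }
        links.close();
    }
}
----- end

The IndexWriter would be opened once outside the crawl, something like
new IndexWriter("index", new StandardAnalyzer(), true), and closed when
all leaves are processed.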
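
And for the searching part (3.), the "ad 1" and "ad 2" query styles from
above should come almost for free from Lucene's QueryParser; only the
"ad 3" xpath idea has no obvious mapping yet. Again just a sketch,
assuming an index with the "body", element-name and stored "url" fields
produced by the indexing sketch.

----- begin SearchSketch.java (sketch)
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

/** Untested sketch of the searching part. */
public class SearchSketch {

    public static void main(String[] args) throws Exception {
        // "index" is the directory written by the indexing/crawling sketches
        IndexSearcher searcher = new IndexSearcher("index");
        StandardAnalyzer analyzer = new StandardAnalyzer();

        Query[] queries = {
            // ad 1: bare text query, "body" is the default field
            QueryParser.parse("Intro", "body", analyzer),
            // ad 2: element name used as field name
            QueryParser.parse("title:Intro", "body", analyzer)
            // ad 3: open -- "xpath:**/s2/title/Intro" would need its own parser
        };

        for (int q = 0; q < queries.length; q++) {
            Hits hits = searcher.search(queries[q]);
            System.out.println(queries[q].toString("body") + ": " + hits.length() + " hits");
            for (int i = 0; i < hits.length(); i++) {
                System.out.println("  " + hits.doc(i).get("url") + "  score=" + hits.score(i));
            }
        }
        searcher.close();
    }
}
----- end

In the Cocoon generator/transformer this main() would of course turn into
SAX events instead of System.out.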