cocoon-dev mailing list archives

From "Bernhard Huber" <bh22...@i-one.at>
Subject Re: Subject: Lucene as Avalon Component?
Date Mon, 29 Oct 2001 23:17:22 GMT
Hi David,
thanks for the links to Z39.50.
I read a bit about that protocol, but as I understand it,
supporting Z39.50 might require writing an
Avalon block implementing the Z39.50 server.
That's a bit too much for me at the moment,
learning Avalon in depth plus Z39.50.
Anyway, thanks!

----- Original Message -----
From: David Crossley <crossley@indexgeo.com.au>
Date: Monday, October 29, 2001 7:52 am
Subject: Re: Subject: Lucene as Avalon Component?

> Structured searching is an obvious beneficiary of a solid
> XML framework. Cocoon would need capability to allow
> such functionality to be implemented by any search system
> of choice.
> 
> I would prefer to utilise the Z39.50 protocol (ISO 23950).
> This is stateful and session-based. It supports both fielded
> and full-text search. It has a powerful boolean and relational
> query syntax and various high-level abstractions.
> 
> Importantly, there are sets of well-known attributes which
> shield the user from how the search is implemented and from
> how the XML records are structured. (Bernhard, this directly
> addresses your three numbered issues below.)
> 
> Of course, this power comes at the cost of potentially
> complex implementation. However, this is eased by the
> availability of solid toolkits and fully blown servers/gateways
> (both open source and the other).
> 
> This is the age-old search and retrieve protocol from the
> library world, so plenty of leverage can be gained.
> Start at: http://lcweb.loc.gov/z395
> Also follow their links to resources/
> I see there at least one appropriate solution for Cocoon which is
> open source and Java (JZKit).
> 
> Thanks Bernhard, for raising this important topic.
> --David Crossley
> 
> Bernhard Huber wrote:
> >  Stefano Mazzocchi wrote:
> > > Bernhard, perfect timing! I was thinking about the same thing the
> > > other day.
> > > 
> > > Bernhard Huber wrote:
> > > > 
> > > > hi,
> > > > I'm taking a look at Lucene, a nice search engine.
> > > > As Cocoon2 claims to be an XML publishing engine,
> > > > some sort of searching feature would be quite nice.
> > > 
> > > Yes, this is very true.
> > > 
> > > > Now I'm a bit confused how to make it usable under Cocoon2.
> > > > Should I write a generator for the searching part of lucene?
> > > > Should I encapsulate the indexing, and searching as
> > > > an avalon component?
> > > 
> > > In a perfect world (but we aim for that, right?) we should have an
> > > abstracted search engine behavioral interface (future compatible with
> > > semantic capabilities?) and then have an Avalon component (block?) to
> > > implement that.
> > 
> > and the search-engine understands your queries, semantically :-)
> > But perhaps an advantage could be that a group of documents might
> > present already some semantic keywords, stored in the documents,
> > like author, and title.
> > So searching for these keywords will give very good results.
> > 
> > > Then, a cocoon component (a generator or a transformer, depending on the
> > > syntax of the query language being XML or not) can use the avalon
> > > component to power itself and generate the XML event stream.
> > 
> > Yup, that would be nice.
> > Moreover we can use the XML event stream not only for generating
> > the answer of the search-query/request, but also to evaluate some hit
> > statistics.
> > 
> > As the XML event stream can be handled as some static XML page source.
> > 
> > > Note that both Lucene and dbXML (probably going to be called Apache
> > > Xindice, from the Latin word "indice" -> "index") could power this: the
> > > first as an indexer of the textual part (final pipeline results) while
> > > the second being an indexer of the semantic part (starting pipeline
> > > sources).
> > > 
> > > Obviously, a semantic approach is very likely to yield much better
> > > results, but it requires a completely different way of doing search
> > > (look at xyzsearch.com, for example), while lucene is simply doing
> > > textual heuristics.
> > I will try to check xyzsearch.com
> > 
> > But I have some troubles with "semantic".
> > 
> > As I would say "semantic" lies in the eye of the observer.
> > But that's more philosophical.
> > 
> > Perhaps it would be interesting to gather some ideas,
> > about what's the aim of using semantic search.
> > 
> > Although the simple textual search gives a lot of bad results,
> > it is simple to use.
> > 
> > Using a semantic search should give better results, as the 
> > elements are taken into account when generating an index,
> > and when evaluating the result of a query.
> > But some points to think about:
> > 1. What should the user already know about the semantics of the
> > documents?
> > 
> > 2. Does he/she have to know that a document has an author, for example?
> > 
> > 3. Does he/she have to know that querying for the author by entering
> > "author:john" will search for the author's name?
> > 
> > Perhaps all 3 issues are just a question of designing the GUI of
> > a semantic search...
> > 
> > Just read now
> > http://localhost:8080/cocoon/documents/emotional-landscapes.html,
> > I see, semantic means taking the XML elements into account.
> > 
> > > This said, it's also likely that the two approaches are so different
> > > that a single behavioral interface will be either too general or too
> > > simple to cover both cases, so, probably, both a textual search
> > > interface and a markup search interface will be required.
> > > 
> > > > How should I index?
> > > 
> > > Eh, good question :)
> > > 
> > > My suggestion would be to connect the same xlink-based crawling
> > > subsystem used for CLI to lucene as if it were a file system, but this
> > > might require some Inversion of Control (us pushing files into lucene and
> > > not lucene to crawl them or read them from disk) thus some code changes
> > > to it.
> > I understand your hint. 
> > I must admit that I never understood cocoon's view concept.
> > Now I see what I can do using views.
> > Perhaps adding an example in the view documentation, like
> > Try using: 
> > http://localhost:8080/cocoon/welcome?cocoon-view=content, or
> > http://localhost:8080/cocoon/welcome?cocoon-view=links
> > would help a lot.
> > But perhaps I'm just a bit slow....
> > 
> > I never intended to index the HTML result of a page,
> >  but the XML content (ad fontes!).
> > Thus I was thinking about how to index an XML source.
> > 
> > Or, saying it more generally:
> > What would be a smart XML indexing strategy?
> > 
> > Let's take a snippet of
> > http://localhost:8080/cocoon/documents/views.html?cocoon-view=content
> > 
> > ----- begin
> > .... 
> > <s1 title="The Views">   
> > <s2 title="Introduction">
> > <p> Views are yet another sitemap component. Unlike the rest, they
> >     are orthogonal to the resource and pipeline definitions. In the
> > ...
> > <s3 title="View Processing">   
> > <p>The samples sitemap contains two view definitions. One of them
> >      looks like the excerpt below.</p>
> > <source xml:space="preserve">
> > 
> >   <map:views>
> >      <map:view name="content" from-label="content">
> >      <map:serialize type="xml"/>
> >   </map:view>
> > 
> >      </source>
> > ....
> > ----- end
> > 
> > I see the following options:
> > 1. Index only the bare text. That's simple, and stupid,
> > as a lot of info entered by the xml generator (human, program)
> > is ignored.
> > 2. Try to take the element's name, and/or attributes into account.
> > 3. Try to take the element's path into account.
> > 
> > Let's see what queries an engine should answer:
> > ad 1. query: "Intro", result: all docs having text Intro
> > 
> > ad 2. query: "title:Intro", result: all docs having title elements with
> > text Intro.
> > 
> > ad 2. query: "source:view", result: all docs having some source code
> > snippet regarding cocoon view concept.
> > 
> > ad 3. query: "xpath:**/s2/title/Intro", result: all docs having s2 title
> > Intro; not sure about how to marry lucene with xpath
> > 
> > > 
> > > > Let's say I want to provide one or more sub-sitemaps
> > > > a searching feature, and let's say the index is already
> > > > generated, how can i calculate from the internal sitemap URL
> > > > to public browser-URL?
> > > > 
> > > > For example I have an index over all /docs/samples/*/* files,
> > > > how can I detect that they are all mapped to the URL
> > > > http://machine/*/*?
> > > > Any ideas are welcome.
> > > 
> > > The CLI subsystem works by starting at a URI, asking for the "link" view
> > > of that URI (cocoon will then return a newline-separated list of linked
> > > URIs created out of all those links that contain xlink:href="" or src=""
> > > or href="" attributes), then recursively call itself on every linked
> > > URI.
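
A minimal sketch of what such a "link" view might compute, in Python with only the standard library. This is not Cocoon's implementation: the function name `link_view`, the sample document, and the choice to drop duplicate and empty links are assumptions made for illustration. Only the three attribute names come from the description above.

```python
import xml.etree.ElementTree as ET

XLINK_NS = "http://www.w3.org/1999/xlink"
# Attributes that carry a link, as listed in the description above.
LINK_ATTRS = ("{%s}href" % XLINK_NS, "src", "href")


def link_view(xml_text):
    """Return a newline-separated list of the URIs linked from an XML document.

    Hypothetical helper: dropping duplicates and empty values is an assumption.
    """
    root = ET.fromstring(xml_text)
    links = []
    for element in root.iter():              # document order, root included
        for attr in LINK_ATTRS:
            uri = element.get(attr)
            if uri and uri not in links:     # skip empty values and repeats
                links.append(uri)
    return "\n".join(links)


# Made-up sample page, not taken from the Cocoon distribution.
sample = """<page xmlns:xlink="http://www.w3.org/1999/xlink">
  <a href="index.html">home</a>
  <img src="logo.png"/>
  <ref xlink:href="views.html"/>
  <a href="index.html">home again</a>
</page>"""

print(link_view(sample))
```

A crawler would split the returned text on newlines and request the link view of each entry in turn.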
> > > 
> > > When it reaches a leaf (a page with no further links or links that were
> > > already visited), it asks for the "link-translated" view of the URI,
> > > passing in POST to the request the newline-separated list of links so
> > > that Cocoon knows how to regenerate an adapted version of the resource
> > > (this is useful to maintain link consistency when moved on a file system
> > > and working on the original link semantics, it works for every file
> > > format, even for PDF, because link translation happens transparently
> > > before serialization takes place).
> > > 
> > > Last operation is URI mangling where, depending on the given MIME type
> > > of the returned resource, the proper extension is added to the file name
> > > and the resource is saved on disk.
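
The URI mangling step could look roughly like the Python sketch below. The `EXTENSIONS` table, the function name `mangle_uri`, the `index` fallback for the root URI, and the decision to drop the query string are all assumptions for illustration. The thread does not show the mapping the real CLI uses.

```python
# Hypothetical MIME-type-to-extension table; the real CLI's mapping is not
# shown in this thread.
EXTENSIONS = {
    "text/html": ".html",
    "text/xml": ".xml",
    "application/pdf": ".pdf",
    "image/png": ".png",
}


def mangle_uri(uri, mime_type):
    """Return a relative file name for `uri` that ends in the extension
    matching `mime_type`. Unknown MIME types leave the name unchanged."""
    # Assumption: the query string is dropped and "/" becomes "index".
    path = uri.split("?", 1)[0].strip("/") or "index"
    extension = EXTENSIONS.get(mime_type, "")
    if extension and not path.endswith(extension):
        path += extension                    # add the extension only if missing
    return path


print(mangle_uri("/cocoon/welcome", "text/html"))
print(mangle_uri("/", "text/html"))
```

Appending rather than replacing an existing extension follows the wording "the proper extension is added", which is one possible reading.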
> > > 
> > > Another important feature is that the "link" view also indicates as
> > > "dynamic" those links that have a particular xlink role (behavior)
> > > xlink:role="dynamic", so they are skipped by the CLI generation and a
> > > placeholder is written (that might redirect to the original URI, for
> > > example).
> > > 
> > > So, currently, indexers like lucene assume that what goes out of a web
> > > server is what is already in (at least, for static pages). Cocoon
> > > doesn't work that way.
> > > 
> > > So, the indexer should crawl from the end side (the web side, just like
> > > big search engines do) and not assume anything about how the files are
> > > generated internally.
> > > 
> > > The only difference is that Cocoon implements a standard behavior of
> > > resource views and we can use those to gain more information about the
> > > requests without missing the semantic information that cocoon already
> > > stores (such as the xlink information).
> > > 
> > > So, IMO, the most elegant and effective solution would be to connect
> > > lucene to the cocoon view-based crawling subsystem:
> > > 
> > > 1) start with some URI (the root, mostly)
> > > 2) obtain the link view of the resource
> > > 3) recursively call itself on non-dynamic links until a leaf is reached
> > > 4) obtain the leaf resource (performing translation to adapt the
> > > cocoon-relative URIs to the site-relative URIs) and push it into lucene
> > > 5) continue until all leaves are processed.
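
The five steps above can be sketched as a depth-first crawl, here in Python. Everything named in the code is hypothetical: `crawl`, the callbacks `get_links`, `get_resource` and `index`, and the toy `SITE` table that stands in for Cocoon's views. One assumption goes beyond the steps as written: every visited resource is pushed to the indexer once its links have been followed, not only the leaves.

```python
def crawl(start_uri, get_links, get_resource, index):
    """Depth-first crawl over link views.

    get_links(uri)    -> iterable of (linked_uri, is_dynamic) pairs (hypothetical)
    get_resource(uri) -> content to index, with links already translated
    index(uri, text)  -> pushes one resource into the search engine
    Returns the URIs in the order they were indexed.
    """
    visited = set()
    indexed = []

    def visit(uri):
        visited.add(uri)
        for link, is_dynamic in get_links(uri):         # step 2: link view
            if not is_dynamic and link not in visited:  # step 3: skip dynamic/seen
                visit(link)
        index(uri, get_resource(uri))                   # step 4: push to indexer
        indexed.append(uri)

    visit(start_uri)                                    # step 1: start URI
    return indexed                                      # step 5: all reached


# Toy in-memory site; the second tuple element marks xlink:role="dynamic".
SITE = {
    "/": [("/docs", False), ("/search", True)],
    "/docs": [("/docs/views", False), ("/", False)],
    "/docs/views": [],
    "/search": [],
}

store = {}
order = crawl("/", lambda u: SITE[u], lambda u: "content of " + u,
              store.__setitem__)
print(order)
```

Because the indexer is passed in as a callback, the crawler pushes content into it instead of the indexer reading files itself, which is the Inversion of Control mentioned earlier in the thread. A very deep site would need an explicit stack instead of recursion.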
> > 
> > I will try to implement something like that...
> > 
> > Design-Draft
> > 
> > 1. Crawling:
> >   Using the above described cocoon view-based crawling subsystem
> > 
> > 2. Indexing:
> > 2.1 Each element-name will create a lucene field having the
> >   same name as the element-name.
> >   (What about the element's namespace, should I take it into account?)
> > 
> > 2.2 Each attribute of an element will create a lucene field having
> >   the concatenated name of the element-name, and the attribute-name.
> > 2.3 Having a field named body for the bare text.
> > 
> > 3. Searching
> >   Just use the lucene search engine.
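
The indexing part of the draft (2.1 to 2.3) can be tried out without Lucene by turning a document into a plain mapping from field names to values. The Python sketch below does only that. The `@` separator for attribute fields, the use of an element's full descendant text as its field value, and the choice to ignore namespaces are assumptions, not decisions made in this thread.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def local_name(tag):
    """Drop a '{namespace}' prefix. Ignoring namespaces is an assumption."""
    return tag.rsplit("}", 1)[-1]


def squeeze(text):
    """Collapse runs of whitespace to single spaces."""
    return " ".join(text.split())


def to_fields(xml_text):
    """Map one XML document to {field name: [values]} following 2.1 to 2.3."""
    root = ET.fromstring(xml_text)
    fields = defaultdict(list)
    for element in root.iter():
        name = local_name(element.tag)
        text = squeeze("".join(element.itertext()))
        if text:
            fields[name].append(text)                             # 2.1
        for attr, value in element.attrib.items():
            fields[name + "@" + local_name(attr)].append(value)   # 2.2
    fields["body"].append(squeeze("".join(root.itertext())))      # 2.3
    return dict(fields)


# Sample shortened from the views.html excerpt quoted earlier in the thread.
sample = """<s1 title="The Views">
  <s2 title="Introduction">
    <p>Views are yet another sitemap component.</p>
  </s2>
</s1>"""

for field, values in sorted(to_fields(sample).items()):
    print(field, values)
```

With such a mapping, a query like "title:Intro" would need a decision on whether it addresses `s2@title` only or any field ending in `@title`. A document that contains an element actually named body would also collide with the bare-text field, so a reserved name might be safer.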
> > 
> > (btw,
> > I was already playing with lucene for indexing/searching mail messages
> > stored in mbox. This way I was searching the
> > http://xml.apache.org/mails/200109.gz,
> > 
> > Wouldn't it be nice to generate FAQ, etc from the mbox mail messages.
> > But that's a semantic problem, as the mail messages have poor
> > xml-semantic content :-)
> > )
> >  
> > > Note that "dynamic" has a different sense than before and it means
> > > that the resource result is not dependent on request-based or
> > > environmental parameters (such as user-agent, date, time, machine load,
> > > IP address, whatever). A resource that is done aggregating a ton of
> > > documents stored on a database must be considered static if it is not
> > > dependent on request parameters.
> > > 
> > > For a semantic crawler, instead of asking for the "standard" view, it
> > > would ask for semantic-specific views such as "content" (the most
> > > semantic stage at pipeline generation, which we already specify in our
> > > example sitemaps) or "schema" (not currently implemented as nobody would
> > > use it today anyway).
> > > 
> > > But the need of resource "views" is the key to the success of proper
> > > search capabilities and we must be sure that we use them even for
> > > semantically-poor searching solutions like lucene, but that would kick
> > > ass anyway on small to medium size web sites.
> > > 
> > > Hope this helps and if you have further questions, don't mind asking.
> > 
> > thanks for your suggestions, helping a lot to understand cocoon better.
> > 
> > bye berni
> 
> 
