From: Stefano Mazzocchi
Date: Sun, 28 Oct 2001 12:21:30 +0100
To: cocoon-dev@xml.apache.org
Subject: Re: Subject: Lucene as Avalon Component?

Bernhard, perfect timing! I was thinking about the same thing the other day.

Bernhard Huber wrote:
>
> hi,
> I'm taking a look at Lucene, a nice search engine.
> As Cocoon2 claims to be an XML publishing engine,
> some sort of searching feature would be quite nice.

Yes, this is very true.

> Now I'm a bit confused about how to make it usable under Cocoon2.
> Should I write a generator for the searching part of Lucene?
> Should I encapsulate the indexing and searching as
> an Avalon component?

In a perfect world (but we aim for that, right?) we should have an abstracted search-engine behavioral interface (future-compatible with semantic capabilities?) and then an Avalon component (block?) that implements it. A Cocoon component (a generator or a transformer, depending on whether the syntax of the query language is XML or not) can then use the Avalon component to power itself and generate the XML event stream.
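To make the shape of that idea concrete, here is a minimal sketch. Every name in it (SearchEngine, NaiveEngine, generateXml) is invented for illustration; this is not actual Avalon, Cocoon, or Lucene API, just the behavioral-interface split described above: a search contract, a toy in-memory engine standing in for a Lucene-backed block, and a generator-like consumer that renders hits as an XML fragment.

```java
// Hypothetical sketch of an abstracted search "behavioral interface" that an
// Avalon block could implement, plus a generator-style consumer that turns
// query hits into XML. All names are invented for illustration only.
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SearchSketch {

    /** The abstracted search contract: Lucene, Xindice, ... would plug in here. */
    interface SearchEngine {
        void index(String uri, String text);   // content is pushed in (IoC)
        List<String> search(String query);     // returns matching URIs
    }

    /** A toy in-memory implementation standing in for a real Lucene-backed block. */
    static class NaiveEngine implements SearchEngine {
        private final Map<String, String> docs = new LinkedHashMap<>();
        public void index(String uri, String text) { docs.put(uri, text.toLowerCase()); }
        public List<String> search(String query) {
            List<String> hits = new ArrayList<>();
            for (Map.Entry<String, String> e : docs.entrySet())
                if (e.getValue().contains(query.toLowerCase())) hits.add(e.getKey());
            return hits;
        }
    }

    /** A generator-like consumer: renders the hits as a small XML fragment. */
    static String generateXml(SearchEngine engine, String query) {
        StringBuilder sb = new StringBuilder("<hits query=\"" + query + "\">");
        for (String uri : engine.search(query)) sb.append("<hit href=\"" + uri + "\"/>");
        return sb.append("</hits>").toString();
    }

    public static void main(String[] args) {
        SearchEngine engine = new NaiveEngine();
        engine.index("/docs/index.html", "Cocoon is an XML publishing engine");
        engine.index("/docs/search.html", "searching with Lucene");
        System.out.println(generateXml(engine, "lucene"));
        // -> <hits query="lucene"><hit href="/docs/search.html"/></hits>
    }
}
```

The point of the split is that Lucene, Xindice, or anything else could sit behind the same interface while the generator stays unchanged.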
Note that both Lucene and dbXML (probably going to be called Apache Xindice, from the Latin word "indice" -> "index") could power this: the first as an indexer of the textual part (final pipeline results), the second as an indexer of the semantic part (starting pipeline sources). Obviously, a semantic approach is very likely to yield much better results, but it requires a completely different way of doing search (look at xyzsearch.com, for example), while Lucene simply does textual heuristics.

This said, it's also likely that the two approaches are so different that a single behavioral interface would be either too general or too simple to cover both cases, so probably both a textual search interface and a markup search interface will be required.

> How should I index?

Eh, good question :) My suggestion would be to connect the same xlink-based crawling subsystem used by the CLI to Lucene as if it were a file system, but this might require some Inversion of Control (us pushing files into Lucene, rather than Lucene crawling them or reading them from disk) and thus some code changes to it.

> Let's say I want to provide one or more sub-sitemaps
> with a searching feature, and let's say the index is already
> generated; how can I map the internal sitemap URL
> to the public browser URL?
>
> For example, I have an index over all /docs/samples/*/* files;
> how can I detect that they are all mapped to the URL http://machine/*/*?
>
> Any ideas are welcome.

The CLI subsystem works by starting at a URI and asking for the "link" view of that URI (Cocoon will then return a newline-separated list of linked URIs, created out of all those links that carry xlink:href="", src="" or href="" attributes), then recursively calling itself on every linked URI.
When it reaches a leaf (a page with no further links, or only links that were already visited), it asks for the "link-translated" view of the URI, passing the newline-separated list of links in a POST to the request, so that Cocoon knows how to regenerate an adapted version of the resource. This is useful to maintain link consistency when the result is moved onto a file system while still working on the original link semantics; it works for every file format, even PDF, because link translation happens transparently before serialization takes place. The last operation is URI mangling: depending on the MIME type of the returned resource, the proper extension is added to the file name and the resource is saved to disk.

Another important feature is that the "link" view also marks as "dynamic" those links that have a particular xlink role (behavior), xlink:role="dynamic"; these are skipped by the CLI generation and a placeholder is written (one that might redirect to the original URI, for example).

So, currently, indexers like Lucene assume that what comes out of a web server is what is already in it (at least for static pages). Cocoon doesn't work that way. So the indexer should crawl from the end side (the web side, just like the big search engines do) and not assume anything about how the files are generated internally. The only difference is that Cocoon implements a standard behavior of resource views, and we can use those to gain more information about the requests without missing the semantic information that Cocoon already stores (such as the xlink information).
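The crawling behavior just described can be sketched roughly as below. Site and Indexer are hypothetical stand-ins, invented for this sketch: Site abstracts the HTTP requests against Cocoon's views (link view, dynamic flag, rendered content), and Indexer is the push-style, Inversion-of-Control interface the search engine would sit behind.

```java
// Sketch of a view-based crawl: start at a root URI, get its "link" view,
// recurse on links that are neither dynamic nor already visited, and push
// every visited resource into an indexer. All interfaces are hypothetical.
import java.util.*;

public class ViewCrawler {
    interface Site {
        List<String> links(String uri);   // the "link" view
        boolean isDynamic(String uri);    // marked via xlink:role="dynamic"
        String content(String uri);       // the rendered resource
    }
    interface Indexer { void add(String uri, String text); }  // content pushed in (IoC)

    static void crawl(Site site, Indexer indexer, String uri, Set<String> visited) {
        if (!visited.add(uri) || site.isDynamic(uri)) return;  // skip visited and dynamic URIs
        indexer.add(uri, site.content(uri));                   // push the resource into the indexer
        for (String link : site.links(uri))                    // recurse on the link view
            crawl(site, indexer, link, visited);
    }

    public static void main(String[] args) {
        // Toy site: "/" links to "/a" and "/b"; "/b" is dynamic; "/a" links back.
        Map<String, List<String>> siteLinks = Map.of(
            "/", List.of("/a", "/b"), "/a", List.of("/"), "/b", List.of());
        Site site = new Site() {
            public List<String> links(String uri) { return siteLinks.get(uri); }
            public boolean isDynamic(String uri) { return uri.equals("/b"); }
            public String content(String uri) { return "content of " + uri; }
        };
        List<String> indexed = new ArrayList<>();
        crawl(site, (uri, text) -> indexed.add(uri), "/", new HashSet<>());
        System.out.println(indexed);   // [/, /a]  ("/b" skipped as dynamic)
    }
}
```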
So, IMO, the most elegant and effective solution would be to connect Lucene to the Cocoon view-based crawling subsystem:

 1) start with some URI (the root, mostly)
 2) obtain the link view of the resource
 3) recursively call itself on non-dynamic links until a leaf is reached
 4) obtain the leaf resource (performing translation to adapt the Cocoon-relative URIs to the site-relative URIs) and push it into Lucene
 5) continue until all leaves are processed.

Note that "dynamic" has a different sense here than before: it means that the resource result depends on request-based or environmental parameters (such as user-agent, date, time, machine load, IP address, whatever). A resource that is built by aggregating a ton of documents stored in a database must be considered static if it does not depend on request parameters.

A semantic crawler, instead of asking for the "standard" view, would ask for semantic-specific views such as "content" (the most semantic stage of pipeline generation, which we already specify in our example sitemaps) or "schema" (not currently implemented, as nobody would use it today anyway). But the notion of resource "views" is the key to the success of proper search capabilities, and we must be sure to use them even for semantically-poor search solutions like Lucene, which would kick ass anyway on small- to medium-size web sites.

Hope this helps, and if you have further questions, don't hesitate to ask.

--
Stefano Mazzocchi

"One must still have chaos in oneself to be able to give birth to a dancing star." -- Friedrich Nietzsche