lucene-dev mailing list archives

From Manfred Schäfer <mschae...@bouncy.com>
Subject Re: Proposal for Lucene / new component
Date Sun, 10 Feb 2002 14:45:03 GMT
Hi,


> I've read your proposal (and all the email related to it). One thing I'd advise is
> to distinguish the crawler and the loader component.
> The crawler is responsible for gathering documents from several sources.
> The loader (or indexer) is responsible for loading the gathered documents into the index
> (I think in batch mode).

I see three different component types:
    - File producer (crawler, database reader, filesystem reader)
    - Document handler (knows the syntax, and maybe the semantics, of the file content)
    - Indexer (Lucene)

Is batch mode really the way to go? I am thinking of something like pipes (but maybe I'm wrong).
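
To make the pipe idea a bit more concrete, here is a minimal sketch in Java of the three component types chained record by record instead of in batches. All names (Pipe, run, PipeDemo) are hypothetical and not part of Lucene, and a record is assumed to be a plain map of field names to values.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;
import java.util.function.Function;

// Hypothetical sketch: none of these types exist in Lucene.
final class Pipe {
    private Pipe() {}

    /**
     * Pulls records from the producer one at a time, passes each through the
     * document handler, and pushes the result to the indexer.
     * Returns the number of records that went through the pipe.
     */
    static int run(Iterator<Map<String, Object>> producer,
                   Function<Map<String, Object>, Map<String, Object>> handler,
                   Consumer<Map<String, Object>> indexer) {
        int count = 0;
        while (producer.hasNext()) {
            indexer.accept(handler.apply(producer.next()));
            count++;
        }
        return count;
    }
}

final class PipeDemo {
    public static void main(String[] args) {
        // Producer: stands in for a crawler, database reader or filesystem reader.
        List<Map<String, Object>> crawled = new ArrayList<>();
        Map<String, Object> raw = new LinkedHashMap<>();
        raw.put("mime", "application/word");
        raw.put("url", "http://www.sample.com/test.doc");
        crawled.add(raw);

        // Handler: copies the record and adds a placeholder text field.
        Function<Map<String, Object>, Map<String, Object>> handler = r -> {
            Map<String, Object> out = new LinkedHashMap<>(r);
            out.put("asText", "... the document as plain text ...");
            return out;
        };

        // Indexer: here it only prints; a real one would write to the index.
        int n = Pipe.run(crawled.iterator(), handler, r -> System.out.println(r));
        System.out.println(n + " record(s) piped");
    }
}
```

Because each stage only sees one record at a time, nothing has to be written to disk between the producer and the indexer, which is the main difference from a batch run.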

>
> I think it's redundant to hardcode the indexing logic into every crawler component (ftp,
> http, jdbc, filesys crawler). An interesting question is how the components can communicate.
> (Don't you think using Avalon is a good way?)

I think that the configuration of the indexing procedure, including the work of all three component
types, is the real adventure. The components themselves are relatively easy to write. I first
thought of Ant as the configuration framework, but I think that would work only for batch mode.
The main question is: what is the production
unit we are talking about? I don't think that it should be simple files. I think it must
be records of String, Date, Integer and Binary fields, which can be mapped to Lucene fields.
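
As a rough illustration of such a production unit, the sketch below models a record that accepts only the four value types mentioned above. The class name FieldRecord and its methods are hypothetical; Binary is assumed to mean a byte array.

```java
import java.util.Collections;
import java.util.Date;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical record type: a named set of String, Date, Integer or binary fields.
final class FieldRecord {
    private final Map<String, Object> fields = new LinkedHashMap<>();

    /** Adds or replaces a field; rejects null and any type outside the four allowed ones. */
    FieldRecord put(String name, Object value) {
        boolean allowed = value instanceof String
                || value instanceof Date
                || value instanceof Integer
                || value instanceof byte[];
        if (!allowed) {
            String type = (value == null) ? "null" : value.getClass().getName();
            throw new IllegalArgumentException(
                    "unsupported type for field '" + name + "': " + type);
        }
        fields.put(name, value);
        return this;
    }

    /** Returns the field value, or null if the field is absent. */
    Object get(String name) {
        return fields.get(name);
    }

    /** Read-only view of all fields, in insertion order. */
    Map<String, Object> asMap() {
        return Collections.unmodifiableMap(fields);
    }
}
```

Restricting the value types up front keeps the later mapping step simple, since every field is known to be one of a small number of kinds.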

OK, I will give some more details:

A crawler will produce something like

mime: application/word
created: 12.1.2001
data: <binary>
url: http://www.sample.com/test.doc


The document handler for Word documents will take this and transform it to

mime: application/word
created: 12.1.2001
url: http://www.sample.com/test.doc
author: Manfred Schäfer
title: '77 secrets of indexing documents'
asText: '... the document as plain text ...'
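
The transformation above could be sketched in Java roughly as follows. MimeHandler and placeholderWordHandler are hypothetical names, records are assumed to be plain maps, and the extractor used here only decodes the bytes as UTF-8; it is a stand-in and does not parse real Word files.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical document handler: replaces the binary "data" field of a record
// with whatever fields the extractor derives from the content.
final class MimeHandler {
    private final String mime;
    private final Function<byte[], Map<String, String>> extractor;

    MimeHandler(String mime, Function<byte[], Map<String, String>> extractor) {
        this.mime = mime;
        this.extractor = extractor;
    }

    /** True if this handler is responsible for the record's mime type. */
    boolean accepts(Map<String, Object> record) {
        return mime.equals(record.get("mime"));
    }

    /** Returns a new record without "data" but with the extracted fields added. */
    Map<String, Object> handle(Map<String, Object> record) {
        if (!accepts(record)) {
            throw new IllegalArgumentException("not a " + mime + " record");
        }
        Object data = record.get("data");
        if (!(data instanceof byte[])) {
            throw new IllegalArgumentException("record has no binary 'data' field");
        }
        Map<String, Object> out = new LinkedHashMap<>(record);
        out.remove("data");
        out.putAll(extractor.apply((byte[]) data));
        return out;
    }

    /** Placeholder only: treats the bytes as UTF-8 text instead of parsing Word. */
    static MimeHandler placeholderWordHandler() {
        return new MimeHandler("application/word", bytes -> {
            Map<String, String> extracted = new LinkedHashMap<>();
            extracted.put("asText", new String(bytes, StandardCharsets.UTF_8));
            return extracted;
        });
    }
}
```

A real handler would also fill in fields such as author and title from the document's metadata; the extractor function is the place where that format-specific knowledge would live.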

Now we come to Lucene: the record fields must be mapped to Lucene fields.

LUCENE-FIELDS -> DOCUMENT-FIELDS
mimetype->mime
created->created
url->url
author->author
default->author, asText
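
A mapping table like the one above could be applied with a small helper such as this Java sketch. FieldMapping is a hypothetical class; it only produces a map of Lucene field names to text values, and creating the actual Lucene document from that map is left out. Joining several source fields with a single space is an assumption.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical mapping from Lucene field names to one or more record fields.
final class FieldMapping {
    private final Map<String, List<String>> targets = new LinkedHashMap<>();

    /** Declares that luceneField is filled from the given record fields, in order. */
    FieldMapping map(String luceneField, String... recordFields) {
        targets.put(luceneField, Arrays.asList(recordFields));
        return this;
    }

    /**
     * Builds the Lucene-side view of a record. Values of several source fields
     * are joined with a space; Lucene fields with no available source are skipped.
     */
    Map<String, String> apply(Map<String, ?> record) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> target : targets.entrySet()) {
            StringBuilder text = new StringBuilder();
            for (String source : target.getValue()) {
                Object value = record.get(source);
                if (value == null) {
                    continue;
                }
                if (text.length() > 0) {
                    text.append(' ');
                }
                text.append(value);
            }
            if (text.length() > 0) {
                out.put(target.getKey(), text.toString());
            }
        }
        return out;
    }
}
```

The table from this message would then read new FieldMapping().map("mimetype", "mime").map("created", "created").map("url", "url").map("author", "author").map("default", "author", "asText").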

Working with Ant in batch mode could make use of XML for the representation of the records
above. Configuring a pipe system with an XML config file is not so simple.
I don't know Avalon, so I can't say anything about it. But I would prefer to have at least
the possibility of working with configuration only, without programming.
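
As one possible way to work with configuration only, the field mapping could be read from a simple text file instead of being written in code. The sketch below uses the standard java.util.Properties format; the class name MappingConfig and the "luceneField = source1, source2" layout are assumptions, not an existing convention.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;

// Hypothetical loader for a mapping written as "luceneField = source1, source2".
final class MappingConfig {
    private MappingConfig() {}

    /** Parses properties-style text into Lucene field name -> list of record fields. */
    static Map<String, List<String>> parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Properties does not keep file order, so the result is sorted by key.
        Map<String, List<String>> mapping = new TreeMap<>();
        for (String luceneField : props.stringPropertyNames()) {
            List<String> sources = new ArrayList<>();
            for (String part : props.getProperty(luceneField).split(",")) {
                String name = part.trim();
                if (!name.isEmpty()) {
                    sources.add(name);
                }
            }
            mapping.put(luceneField, sources);
        }
        return mapping;
    }
}
```

A config file for the example in this message would contain lines such as "mimetype = mime" and "default = author, asText".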

regards,

Manfred







