lucene-dev mailing list archives

From "Andrew C. Oliver" <acoli...@apache.org>
Subject Re: Proposal for Lucene / new component
Date Sun, 24 Feb 2002 16:54:09 GMT
On Sun, 2002-02-10 at 09:45, Manfred Schäfer wrote:
> Hi,
> 
> 
> > I've read your proposal (and all the email related to it). One thing I'd advise
> > is to distinguish the crawler from the loader component.
> > The crawler is responsible for gathering documents from several sources.
> > The loader (or indexer) is responsible for loading the gathered documents into the
> > index (I think in batch mode).
> 
> I see three different component types:
>     - File producer (crawler, database reader, filesystem reader)
>     - Document handler (knows the syntax, and maybe the semantics, of the file content)
>     - Indexer (Lucene)
> 
> Is batch mode really the way? I am thinking of something like pipes (but maybe I'm wrong).
> 

I see something that smells a lot like AWT-style events, only without
(necessarily) any threading.

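A minimal sketch of that event-style wiring, assuming a listener interface in the AWT spirit and synchronous dispatch on the caller's thread. Every name here (`RecordListener`, `RecordSource`, `PlainTextHandler`, `CollectingIndexer`) is hypothetical and not taken from the proposal, and the indexer is a stand-in that only collects records instead of calling Lucene:

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical callback interface: one call per record, like an AWT listener.
interface RecordListener {
    void recordProduced(Map<String, Object> record);
}

// Hypothetical base for any stage that emits records (crawler, DB reader, ...).
class RecordSource {
    private final List<RecordListener> listeners = new ArrayList<>();

    void addRecordListener(RecordListener listener) {
        listeners.add(listener);
    }

    // Dispatches synchronously on the caller's thread: events, but no threading.
    void fire(Map<String, Object> record) {
        for (RecordListener listener : listeners) {
            listener.recordProduced(record);
        }
    }
}

// Hypothetical document handler: listens upstream, re-emits an enriched record.
class PlainTextHandler extends RecordSource implements RecordListener {
    @Override
    public void recordProduced(Map<String, Object> record) {
        Map<String, Object> out = new LinkedHashMap<>(record);
        Object data = out.remove("data");
        if (data instanceof byte[]) {
            out.put("asText", new String((byte[]) data, StandardCharsets.UTF_8));
        }
        fire(out);
    }
}

// Stand-in for the indexer stage; a real one would write to a Lucene index.
class CollectingIndexer implements RecordListener {
    final List<Map<String, Object>> indexed = new ArrayList<>();

    @Override
    public void recordProduced(Map<String, Object> record) {
        indexed.add(record);
    }
}

class EventPipelineSketch {
    // Wires producer -> handler -> indexer and pushes one record through.
    static CollectingIndexer run() {
        RecordSource crawler = new RecordSource();
        PlainTextHandler handler = new PlainTextHandler();
        CollectingIndexer indexer = new CollectingIndexer();
        crawler.addRecordListener(handler);
        handler.addRecordListener(indexer);

        Map<String, Object> record = new LinkedHashMap<>();
        record.put("mime", "text/plain");
        record.put("data", "hello".getBytes(StandardCharsets.UTF_8));
        crawler.fire(record);
        return indexer;
    }

    public static void main(String[] args) {
        System.out.println(run().indexed);
    }
}
```

Because each stage only knows the listener interface, a crawler never has to contain indexing logic, and a threaded or queued dispatcher could later replace `fire` without touching the stages.
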
> >
> > I think it's redundant to hardcode the indexing logic into every crawler component
> > (ftp, http, jdbc, filesys crawler). An interesting question is how the components can
> > communicate. (Don't you think using Avalon is a good way?)
> 
> I think that the configuration of the indexing procedure, including work for all three
> component types, is the real adventure. The components themselves are relatively easy to write.
> I first thought of Ant as the configuration framework, but I think that would work only for batch
> mode. The main question is: what is the production
> unit we are talking about? I don't think that this should be simple files. I think it
> must be records of String, Date, Integer, and Binary fields, which could be mapped to Lucene fields.
> 
> OK, I will give you some more details:
> 
> A crawler will produce something like
> 
> mime: application/word
> created:12.1.2001
> data: <binary>
> url:http://www.sample.com/test.doc
> 
> 
> The document handler for Word docs will take this and transform it to
> 
> mime: application/word
> created:12.1.2001
> url:http://www.sample.com/test.doc
> author:Manfred Schäfer
> title:'77 secrets of indexing documents'
> asText: '... the document as plain text ...'
> 
> Now we come to Lucene: the record fields must be mapped to Lucene fields
> 
> LUCENE-FIELDS -> DOCUMENT-FIELDS
> mimetype->mime
> created->created
> url->url
> author->author
> default->author, asText
> 

Right... did the proposal not say that?  If not, can you patch it to
make it a bit clearer?

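The field mapping quoted above can be sketched as a table from each Lucene field name to the record fields that feed it. This is only an illustration under stated assumptions: `FieldMappingSketch` and its methods are hypothetical, a plain `Map<String, String>` stands in for a Lucene document so that no Lucene API is implied, and joining several source fields with a single space is a guess at what "default->author, asText" means:

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class FieldMappingSketch {
    // Lucene field name -> record fields feeding it, copied from the quoted table.
    static Map<String, List<String>> quotedMapping() {
        Map<String, List<String>> mapping = new LinkedHashMap<>();
        mapping.put("mimetype", Arrays.asList("mime"));
        mapping.put("created", Arrays.asList("created"));
        mapping.put("url", Arrays.asList("url"));
        mapping.put("author", Arrays.asList("author"));
        mapping.put("default", Arrays.asList("author", "asText"));
        return mapping;
    }

    // The record as it leaves the document handler in the quoted example.
    static Map<String, String> sampleRecord() {
        Map<String, String> record = new LinkedHashMap<>();
        record.put("mime", "application/word");
        record.put("created", "12.1.2001");
        record.put("url", "http://www.sample.com/test.doc");
        record.put("author", "Manfred Sch\u00e4fer");
        record.put("title", "77 secrets of indexing documents");
        record.put("asText", "... the document as plain text ...");
        return record;
    }

    // Builds the target fields; missing source fields are skipped, and
    // multiple sources are joined with one space (an assumption).
    static Map<String, String> apply(Map<String, List<String>> mapping,
                                     Map<String, String> record) {
        Map<String, String> target = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> entry : mapping.entrySet()) {
            StringBuilder value = new StringBuilder();
            for (String source : entry.getValue()) {
                String part = record.get(source);
                if (part == null) {
                    continue;
                }
                if (value.length() > 0) {
                    value.append(' ');
                }
                value.append(part);
            }
            if (value.length() > 0) {
                target.put(entry.getKey(), value.toString());
            }
        }
        return target;
    }

    public static void main(String[] args) {
        System.out.println(apply(quotedMapping(), sampleRecord()));
    }
}
```

Note that `title` has no entry in the quoted table, so this sketch drops it; whether that is intended is a question for the proposal.
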
> Working with Ant in batch mode could make use of XML for the representation of the records
> above. Configuring a pipe system with an XML config file is not so simple.
> I don't know Avalon, so I can't say anything about it. But I would favor having at
> least the possibility of working with configuration alone, without programming.
> 

I'm trying to learn enough about Avalon to do this.  I'm having a hard
time of it.  After I read the conceptual documentation and see a couple
of code samples I'm like "now what?"  I need a "hello avalon" tutorial
to help me... Unfortunately I can't write one (chicken-and-egg kind of
thing).  I am still having trouble figuring out how to do something like
this via Ant, or even whether Ant is the right tool.  (I mean, I love it for
builds, but for this??)  Maybe I have a mental block :-)

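One way to get "configuration only, without programming" without settling the Ant or Avalon question is to keep the field mapping in a plain properties file. The sketch below assumes a hypothetical format, `luceneField = recordField[, recordField...]`, and a hypothetical `MappingConfigSketch` class; it uses only `java.util.Properties` from the JDK and is not part of the proposal:

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;

class MappingConfigSketch {
    // Parses lines like "default = author, asText" into target -> source fields.
    static Map<String, List<String>> parse(String configText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(configText));
        } catch (IOException e) {
            // Not expected when reading from an in-memory string.
            throw new UncheckedIOException(e);
        }
        // TreeMap gives a stable (alphabetical) order; Properties itself has none.
        Map<String, List<String>> mapping = new TreeMap<>();
        for (String target : props.stringPropertyNames()) {
            List<String> sources = new ArrayList<>();
            for (String source : props.getProperty(target).split(",")) {
                String trimmed = source.trim();
                if (!trimmed.isEmpty()) {
                    sources.add(trimmed);
                }
            }
            mapping.put(target, sources);
        }
        return mapping;
    }

    public static void main(String[] args) {
        String config =
              "mimetype = mime\n"
            + "created = created\n"
            + "url = url\n"
            + "author = author\n"
            + "default = author, asText\n";
        System.out.println(parse(config));
    }
}
```

A flat file like this covers the field mapping only; wiring the producers and handlers together would still need something richer, which is where the XML and Avalon discussion comes in.
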
> regards,
> 
> Manfred
> 
> 
> 
> 
> 
> 
> --
> To unsubscribe, e-mail:   <mailto:lucene-dev-unsubscribe@jakarta.apache.org>
> For additional commands, e-mail: <mailto:lucene-dev-help@jakarta.apache.org>
> 
-- 
http://www.superlinksoftware.com
http://jakarta.apache.org - port of Excel/Word/OLE 2 Compound Document 
                            format to java
http://developer.java.sun.com/developer/bugParade/bugs/4487555.html 
			- fix java generics!
The avalanche has already started. It is too late for the pebbles to
vote.
-Ambassador Kosh

