From: Manfred Schäfer <mschaefer@bouncy.com>
Date: Sun, 10 Feb 2002 14:45:03 +0000
To: Lucene Developers List <lucene-dev@jakarta.apache.org>
Subject: Re: Proposal for Lucene / new component

Hi,

> I've read your proposal (and all email related to it). One thing I'd
> like to advise is to distinguish the crawler and the loader component.
> The crawler is responsible for gathering documents from several sources.
> The loader (or indexer) is responsible for loading the gathered
> documents into the index (I think in batch mode).

I see three different component types:

- file producer (crawler, database reader, filesystem reader)
- document handler (knows the syntax, and maybe the semantics, of the
  file content)
- indexer (Lucene)

Is batch mode really the way to go? I am thinking of something like
pipes (but maybe I'm wrong).

> I think it's redundant to hardcode the indexing logic into all crawler
> components (ftp, http, jdbc, filesystem crawler).

It's an interesting question how the components can communicate.
(Don't you think using Avalon would be a good way?)

I think that the configuration of the indexing procedure, covering the
work of all three component types, is the real adventure. The
components themselves are relatively easy to write. I first thought of
Ant as the configuration framework, but I think that would only work
for batch mode.

The main question is: what is the production unit we are talking
about? I don't think that this should be simple files. I think it must
be records of String, Date, Integer and Binary fields, which can be
mapped to Lucene fields.

OK, I will give you some more details. A crawler will produce
something like:

    mime:    application/word
    created: 12.1.2001
    data:    url:http://www.sample.com/test.doc

The document handler for Word documents will take this and transform
it to:

    mime:    application/word
    created: 12.1.2001
    url:     http://www.sample.com/test.doc
    author:  Manfred Schäfer
    title:   '77 secrets of indexing documents'
    asText:  '... the document as plain text ...'

Now we come to Lucene: the fields above must be mapped to Lucene
fields.

    LUCENE FIELD -> DOCUMENT FIELDS
    mimetype     -> mime
    created      -> created
    url          -> url
    author       -> author
    default      -> author, asText

Working with Ant in batch mode could make use of XML for the
representation of the records above.
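For example, such a record could be serialized roughly like this (the
element and attribute names are only a sketch, nothing is fixed yet):

    <record>
      <field name="mime"    type="String">application/word</field>
      <field name="created" type="Date">12.1.2001</field>
      <field name="url"     type="String">http://www.sample.com/test.doc</field>
      <field name="author"  type="String">Manfred Schäfer</field>
      <field name="title"   type="String">77 secrets of indexing documents</field>
      <field name="asText"  type="String">... the document as plain text ...</field>
    </record>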
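The mapping step itself should then only be a few lines of Java
against the Lucene API. A rough sketch, assuming the record is simply
a Map of field name to value (only the Lucene calls are real, the
record shape is made up):

    import java.io.IOException;
    import java.util.Map;

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    /** Maps one produced record (a Map of field name -> value) to a Lucene Document. */
    public class RecordIndexer {

        public void index(Map record, IndexWriter writer) throws IOException {
            Document doc = new Document();
            // stored, untokenized fields
            doc.add(Field.Keyword("mimetype", (String) record.get("mime")));
            doc.add(Field.Keyword("created",  (String) record.get("created")));
            doc.add(Field.Keyword("url",      (String) record.get("url")));
            // stored and tokenized
            doc.add(Field.Text("author", (String) record.get("author")));
            // the default search field: tokenized and indexed, but not stored
            doc.add(Field.UnStored("default",
                    record.get("author") + " " + record.get("asText")));
            writer.addDocument(doc);
        }
    }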
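And to make the pipe idea a bit more concrete: the three component
types could be as simple as these interfaces (each in its own source
file; all names invented on the spot):

    public interface RecordProducer {
        /** Returns the next gathered record, or null when the source is exhausted. */
        java.util.Map next() throws java.io.IOException;
    }

    public interface DocumentHandler {
        /** Knows the syntax of the content and enriches the record (author, title, asText, ...). */
        java.util.Map handle(java.util.Map record) throws java.io.IOException;
    }

    public interface Indexer {
        /** Feeds the finished record to Lucene. */
        void index(java.util.Map record) throws java.io.IOException;
    }

A pipe would then just be producer -> handler -> indexer, record by
record, without writing everything to disk in between.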
Configuring a pipe system with an XML config file is not so simple. I
don't know Avalon, so I can't say anything about it. But I would
favour having at least the possibility to work with configuration
only, without programming.

regards,
Manfred