From: Andrzej Bialecki
To: java-user@lucene.apache.org
Subject: Re: Indexing help needed
Date: Fri, 25 May 2007 22:42:08 +0200

jim shirreffs
wrote:
> Thanks for the advice, I just don't see where in the Lucene code I
> should plug OOParser into Lucene.
>
> I've walked the code in LIUS and Nutch (moving on to Solr) trying to
> find common objects. If I can find common objects in Lucene and Nutch
> I'll know where to plug in.

You seem to be somewhat confused about what Lucene really is. It's just a library, not an application. It's up to you to provide the logic and glue, or to extend an existing demo application to accommodate your needs.

It's also a _plain_ _text_ search library. So if you want to index anything else, you first need to convert it to a plain text format. That's essentially what OOParser does in Nutch: it extracts data from OO documents and converts it to plain text. Disregard the other stuff in that plugin; it has to do with how Nutch passes this data to storage (and indexing takes place in a completely different step, so you won't find it there). Just use the parts that extract plain text data, and then use this plain text data to add fields to Lucene documents.

-- 
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
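
To illustrate the "convert it to plain text first" step described above, here is a minimal sketch using only the JDK. It is not the code from Nutch's OO plugin. It assumes the input is an OpenDocument file (a zip archive whose text lives in an entry named content.xml) and that paragraphs and headings use the text:p and text:h elements. The class name OdfTextExtractor and the method extractText are hypothetical names chosen for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.StringReader;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical helper (not part of Lucene or Nutch): plain text from an OpenDocument file. */
final class OdfTextExtractor {

    /** Reads the zip container from the stream and returns the text found in content.xml. */
    static String extractText(InputStream odf) throws Exception {
        try (ZipInputStream zip = new ZipInputStream(odf)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if ("content.xml".equals(entry.getName())) {
                    // Buffer the entry so the XML parser cannot close the zip stream.
                    ByteArrayOutputStream xml = new ByteArrayOutputStream();
                    byte[] chunk = new byte[8192];
                    int n;
                    while ((n = zip.read(chunk)) != -1) {
                        xml.write(chunk, 0, n);
                    }
                    return textOf(xml.toByteArray());
                }
            }
        }
        throw new IOException("no content.xml entry; not an OpenDocument file?");
    }

    /** Collects character data, ending each paragraph or heading with a newline. */
    private static String textOf(byte[] xml) throws Exception {
        final StringBuilder out = new StringBuilder();
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.newSAXParser().parse(new ByteArrayInputStream(xml), new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                out.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                // The default factory is not namespace-aware, so qName carries the prefix.
                if (qName.equals("text:p") || qName.equals("text:h")) {
                    out.append('\n');
                }
            }

            @Override
            public InputSource resolveEntity(String publicId, String systemId) {
                // Never fetch external DTDs or entities; substitute an empty document.
                return new InputSource(new StringReader(""));
            }
        });
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java OdfTextExtractor <file.odt>");
            return;
        }
        try (InputStream in = new FileInputStream(args[0])) {
            System.out.print(extractText(in));
        }
    }
}
```

The string returned by extractText is what you would then hand to Lucene as the value of a field on a document. The exact Field constructor depends on the Lucene version in use, so that call is left out of the sketch.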