From: Otis Gospodnetic
Date: Wed, 23 Jun 2004 01:43:30 -0700 (PDT)
Subject: Re: Storing data in Lucene or Xindice
To: lucene-user@jakarta.apache.org
Cc: Rob Clews
Message-ID: <20040623084330.68095.qmail@web12704.mail.yahoo.com>
In-Reply-To: <1087979333.3483.16.camel@localhost.localdomain>
List-Id: "Lucene Users List"

(redirecting to lucene-user list)

Hello Rob,

I think you will end up with a simpler final result if you try saving everything in a single data source. I have not used Xindice, so I cannot comment on its features, performance, etc., but judging from your description, you could simply use Lucene to index the textual information from XML feeds or HTML.

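As a rough illustration of pulling the indexable text out of an XML feed item, here is a minimal sketch that uses the SAX parser bundled with the JDK. SAX is used here only as an example of a lightweight, streaming parser; it is not the approach from the article mentioned in this thread. The class name FeedTextExtractor and the element names (article, title, body) are made up for the example, and the Lucene calls are left out on purpose because they differ between Lucene versions.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: collects the text of each direct child of the root
// element, so each entry can later become one field of an index document.
public class FeedTextExtractor {

    /** Returns child element name -> text content (nested markup is flattened). */
    public static Map<String, String> extract(String xml) throws Exception {
        final Map<String, String> fields = new LinkedHashMap<>();

        DefaultHandler handler = new DefaultHandler() {
            private int depth = 0;          // 1 = root, 2 = direct child of root
            private String current;         // name of the child being read
            private final StringBuilder text = new StringBuilder();

            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                depth++;
                if (depth == 2) {
                    current = qName;
                    text.setLength(0);
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // Keep text from the child and from anything nested inside it.
                if (depth >= 2) {
                    text.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                if (depth == 2) {
                    // A repeated element name overwrites the earlier value.
                    fields.put(current, text.toString().trim());
                }
                depth--;
            }
        };

        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return fields;
    }

    public static void main(String[] args) throws Exception {
        // Assumed feed shape, for illustration only.
        String xml = "<article><title>Lucene tips</title>"
                   + "<body>Store <b>small</b> fields only.</body></article>";
        System.out.println(extract(xml));
    }
}
```

Each entry of the returned map would then be added to a Lucene document as a field, indexed or only stored as needed. Text obtained from an HTML parser could be fed into the same fields, which keeps everything in one data source.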
For XML parsing and indexing, you can see the article I wrote for IBM developerWorks:

http://www-106.ibm.com/developerworks/java/library/j-lucene/

If you will be doing a lot of parsing, you will want to use something faster than Digester, though. Maybe the Electric XML parser. For HTML you can use NekoHTML, JTidy, htmlparser (sf.net), or Brian Goetz's HTMLParser.

Now that I think about it, I seem to recall that Xindice uses Lucene under the hood... I can't find any information that confirms this now. Maybe I'm mixing something up.

Otis

--- Rob Clews wrote:
> Hi,
>
> I'm currently looking at using Lucene to index some XML feeds we receive
> for content. However, some of the feeds contain the articles contents
> and some don't, the feeds that do contain the contents are in XML, for
> the others we must retrieve them in HTML.
>
> I was originally going to store the XML contents from the feed in
> Xindice and retrieve them for each result from a Lucene query, but I
> guess I could store them in Lucene. We expect to build up a lot of
> content from shortish articles on the web and our main focus is speed,
> so would I be best storing the contents in Lucene or Xindice?
>
> Would storing more data (non-indexable) in Lucene slow it down on
> queries?
>
> Thanks,
> Rob Clews

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org