Mailing-List: contact lucene-user-help@jakarta.apache.org; run by ezmlm
Precedence: bulk
Reply-To: "Lucene Users List" <lucene-user@jakarta.apache.org>
Message-ID: <20040623084330.68095.qmail@web12704.mail.yahoo.com>
Date: Wed, 23 Jun 2004 01:43:30 -0700 (PDT)
From: Otis Gospodnetic <otis_gospodnetic@yahoo.com>
Subject: Re: Storing data in Lucene or Xindice
To: lucene-user@jakarta.apache.org
Cc: Rob Clews <robc@klearsystems.com>
In-Reply-To: <1087979333.3483.16.camel@localhost.localdomain>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii

(redirecting to lucene-user list)

Hello Rob,

I think you will end up with a simpler final result if you try saving
everything in a single data source.  I have not used Xindice, so I
cannot comment on its features, performance, etc., but judging from
your description, you could simply use Lucene to index the textual
information from XML feeds or HTML.

For XML parsing and indexing, you can see the article I wrote for IBM
developerWorks:
http://www-106.ibm.com/developerworks/java/library/j-lucene/

If you will be doing a lot of parsing, you will want to use something
faster than Digester, though, perhaps the Electric XML parser.
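
As a rough illustration of the lighter-weight route, here is a minimal
sketch that pulls the text out of a feed entry with the SAX parser that
ships with the JDK.  Note the assumptions: SAX is my substitution here
(it is neither Digester nor Electric XML), and the class name
FeedTextExtractor and the element names "article", "title" and "body"
are hypothetical, not taken from any real feed.  It only handles simple
leaf elements, not mixed content.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: collects the text of each leaf element in a feed entry.
public class FeedTextExtractor {

    /** Returns element name -> trimmed text for every element that has text. */
    public static Map<String, String> extract(String xml) throws Exception {
        final Map<String, String> fields = new LinkedHashMap<>();

        DefaultHandler handler = new DefaultHandler() {
            private final StringBuilder text = new StringBuilder();

            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attrs) {
                text.setLength(0); // start collecting text for the new element
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // SAX may deliver one element's text in several chunks
                text.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                String value = text.toString().trim();
                if (!value.isEmpty()) {
                    fields.put(qName, value);
                }
                text.setLength(0); // ignore whitespace between sibling elements
            }
        };

        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return fields;
    }

    public static void main(String[] args) throws Exception {
        // Assumed sample entry; real feeds will have their own element names.
        String xml = "<article><title>Lucene tips</title>"
                   + "<body>Index the text.</body></article>";
        Map<String, String> fields = extract(xml);
        System.out.println(fields.get("title"));
        System.out.println(fields.get("body"));
    }
}
```

Each entry in the returned map could then become one field of a Lucene
document, so the extraction step stays independent of whichever parser
you end up choosing.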

For HTML you can use NekoHTML, JTidy, htmlparser (sf.net), or Brian
Goetz's HTMLParser.

Now that I think about it, I seem to recall that Xindice uses Lucene
under the hood, but I can't find anything that confirms this right
now.  Maybe I'm mixing something up.

Otis


--- Rob Clews <robc@klearsystems.com> wrote:
> Hi,
> 
> I'm currently looking at using Lucene to index some XML feeds we
> receive for content. However, some of the feeds contain the articles'
> contents and some don't. The feeds that do contain the contents are
> in XML; for the others we must retrieve them in HTML.
> 
> I was originally going to store the XML contents from the feed in
> Xindice and retrieve them for each result from a Lucene query, but I
> guess I could store them in Lucene. We expect to build up a lot of
> content from shortish articles on the web and our main focus is
> speed, so would I be best storing the contents in Lucene or Xindice?
> 
> Would storing more data (non-indexable) in Lucene slow it down on
> queries?
> 
> Thanks,
> Rob Clews


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org