lucene-java-user mailing list archives

From "Julien Nioche" <>
Subject Re: Lucene & Zend Lucene Search : indexation speed, document parsing
Date Tue, 16 Sep 2008 11:04:38 GMT
Bonjour Romain,

> I'm asking myself a few questions, mainly about speed (indexing time) and
> document parsing (how to index the most commonly used office documents). For
> document parsing, I'm planning to use several open-source libraries. The
> company I'm doing this for will be indexing a few gigabytes of data, around
> 5 GB I think. Any advice about this project? Comments and suggestions are
> welcome.

For the parsing you should have a look at Apache Tika. It supports the most
common formats and wraps the open-source libraries it uses for each format
behind a simple, uniform API. That should spare you the trouble of interfacing
with each individual library.
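
To make the "one API in front of many libraries" idea concrete, here is a
minimal sketch in plain Java. It is not Tika's actual API: the class
ParserFacadeSketch and its register and parseToText methods are hypothetical
names, and the lambdas are stand-ins for the real per-format libraries.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

// Hypothetical illustration of a parsing facade; NOT Tika's real API.
class ParserFacadeSketch {
    // Maps a lower-case file extension to a function that turns raw bytes into plain text.
    private final Map<String, Function<byte[], String>> parsers = new HashMap<>();

    // Registers the parser to use for one file extension (case-insensitive).
    void register(String extension, Function<byte[], String> parser) {
        parsers.put(extension.toLowerCase(Locale.ROOT), parser);
    }

    // Single entry point: choose the parser from the file name and return plain text.
    String parseToText(String fileName, byte[] content) {
        int dot = fileName.lastIndexOf('.');
        String ext = dot < 0 ? "" : fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        Function<byte[], String> parser = parsers.get(ext);
        if (parser == null) {
            throw new IllegalArgumentException("No parser registered for: " + fileName);
        }
        return parser.apply(content);
    }

    public static void main(String[] args) {
        ParserFacadeSketch facade = new ParserFacadeSketch();
        // Plain text needs no real parsing: just decode the bytes.
        facade.register("txt", bytes -> new String(bytes, StandardCharsets.UTF_8));
        // Stand-in for a real HTML library: crude tag stripping, for illustration only.
        facade.register("html", bytes ->
                new String(bytes, StandardCharsets.UTF_8).replaceAll("<[^>]*>", "").trim());

        byte[] page = "<p>Hello</p>".getBytes(StandardCharsets.UTF_8);
        System.out.println(facade.parseToText("page.html", page)); // prints "Hello"
    }
}
```

The indexing code only ever calls parseToText and feeds the returned string
to Lucene, so adding a new format means registering one more parser rather
than touching the indexer.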

DigitalPebble Ltd
