lucene-general mailing list archives

From "Jukka Zitting" <>
Subject Re: Which one is better - Lucene OR Google Search Appliance
Date Fri, 28 Nov 2008 09:43:03 GMT

On Fri, Nov 28, 2008 at 10:28 AM, Mike_SearchGuru
<> wrote:
> 1) Each PDF file is on average about 100 pages long and 4MB in size.
> However, we are not indexing the whole lot. We will only be indexing very
> few parts, i.e. the headlines in the PDF files. So I would say only some 5%
> of each document will ever be indexed.

Do you already have some mechanism for extracting this text content
from the PDF files? Unless your PDF files contain internal tagging
for headings and similar content, it might be difficult to
selectively extract anything at a lower granularity than a page.

Also, the speed of different PDF libraries varies notably depending on
how accurate their results are. PDFBox (the one I know best) is pretty
accurate but may need quite a bit of time to extract all the text
from a multi-megabyte document. On the other hand, you can run as many
text extraction processes in parallel as you have CPU capacity for.
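
To illustrate the parallel part, here is a minimal sketch using a fixed-size
thread pool from java.util.concurrent. The class name ParallelExtraction, the
method extractAll and the extractor function are hypothetical names made up
for this example; the extractor stands in for whatever PDF library call you
end up using, so nothing here is PDFBox's own API.

```java
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical helper: runs a caller-supplied text extractor over many files
// using a fixed number of worker threads.
class ParallelExtraction {

    static Map<Path, String> extractAll(List<Path> files,
                                        Function<Path, String> extractor,
                                        int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Submit one extraction task per file; tasks run concurrently.
            List<Future<String>> pending = new ArrayList<>();
            for (Path file : files) {
                pending.add(pool.submit(() -> extractor.apply(file)));
            }

            // Collect results in the same order the files were given.
            Map<Path, String> results = new LinkedHashMap<>();
            for (int i = 0; i < files.size(); i++) {
                results.put(files.get(i), pending.get(i).get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while extracting", e);
        } catch (ExecutionException e) {
            // The extractor itself failed for one of the files.
            throw new IllegalStateException("Extraction failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

Passing Runtime.getRuntime().availableProcessors() as the thread count is a
reasonable starting point, since text extraction is mostly CPU-bound. Whether
a given PDF library can safely be called from several threads at once is
something to check in its documentation; if in doubt, have each task open its
own document and share nothing between tasks.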


Jukka Zitting
