lucene-general mailing list archives

From Erik Hatcher
Subject Re: Which one is better - Lucene OR Google Search Appliance
Date Fri, 28 Nov 2008 09:19:32 GMT

On Nov 28, 2008, at 4:06 AM, André Warnier wrote:
> A couple more notes :
> - assume it takes just 1 second to read and index one PDF document.  
> You have 8,000,000 documents, and there are 86,400 seconds in a day.  
> Assuming no delays at all in passing these documents over any kind  
> of network, that means that it would take 93 days to index the  
> collection.

Most indexers handle 100 docs / second or more, depending on document
size, etc.  Note that you can parallelize indexing for higher speeds.
So that indexing estimate is way off.
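
To put rough numbers on the two estimates, here is a small Python sketch of the arithmetic. The 8,000,000 documents, the 1 doc / second figure and the 100 docs / second figure come from the thread; the function name, the worker count of 4 and the assumption that throughput scales linearly with workers are illustrative only, and real scaling will be less than linear.

```python
# Back-of-the-envelope indexing time for the collection discussed above.
SECONDS_PER_DAY = 86_400
NUM_DOCS = 8_000_000  # collection size quoted in the thread


def indexing_days(num_docs, docs_per_second, workers=1):
    """Days needed to index num_docs.

    Assumes (optimistically) that throughput scales linearly with
    the number of parallel workers; 'workers' is a hypothetical knob.
    """
    total_seconds = num_docs / (docs_per_second * workers)
    return total_seconds / SECONDS_PER_DAY


if __name__ == "__main__":
    slow = indexing_days(NUM_DOCS, 1)
    fast = indexing_days(NUM_DOCS, 100)
    parallel = indexing_days(NUM_DOCS, 100, workers=4)  # 4 is an assumed value
    print(f"1 doc/s, 1 worker:     {slow:.1f} days")
    print(f"100 docs/s, 1 worker:  {fast * 24:.1f} hours")
    print(f"100 docs/s, 4 workers: {parallel * 24:.1f} hours")
```

At 100 docs / second the same collection takes about 22 hours on a single indexer rather than roughly 93 days.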

> - assume one PDF document contains on average 30 Kb of pure text.  A  
> reasonable average for a full-text indexing, will result in an index  
> that is, in size, approximately 3 times as large as the original text.

That's not quite a fair stat.  It depends on what you're storing.  The
rough guide for index size with Lucene is about 35% of the size of the
original data... assuming the text is just being indexed and not
stored.  Very likely there is no need to store the PDF content in the
Lucene index.  Certainly storage needs must be factored into the
equation, and there are trade-offs: in some places storing a document
in Lucene is useful - hit highlighting, for example.
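
The two size estimates can be compared with the same kind of sketch. The 8,000,000 documents, the 30 KB of text per document, the 3x ratio and the 35% ratio are the figures from the thread; the function name and the use of decimal units (1 GB = 1,000,000 KB) are assumptions made for illustration, and the 35% figure applies only when the text is indexed and not stored.

```python
# Rough index-size comparison using the figures quoted in the thread.
NUM_DOCS = 8_000_000
TEXT_KB_PER_DOC = 30  # average pure text per PDF, as quoted


def index_size_gb(num_docs, text_kb_per_doc, index_ratio):
    """Estimated index size in GB for a given index-to-text ratio.

    Uses decimal units (1 GB = 1,000,000 KB) for simplicity.
    """
    raw_text_gb = num_docs * text_kb_per_doc / 1_000_000
    return raw_text_gb * index_ratio


if __name__ == "__main__":
    raw = index_size_gb(NUM_DOCS, TEXT_KB_PER_DOC, 1.0)
    print(f"raw text:                 {raw:.0f} GB")
    print(f"index at 3x the text:     {index_size_gb(NUM_DOCS, TEXT_KB_PER_DOC, 3.0):.0f} GB")
    print(f"index at 35% of the text: {index_size_gb(NUM_DOCS, TEXT_KB_PER_DOC, 0.35):.0f} GB")
```

With these inputs the raw text is about 240 GB, so the two ratios give roughly 720 GB versus roughly 84 GB.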

