lucene-general mailing list archives

From André Warnier
Subject Re: Which one is better - Lucene OR Google Search Appliance
Date Fri, 28 Nov 2008 10:37:57 GMT
Erik Hatcher wrote:
> On Nov 28, 2008, at 4:06 AM, André Warnier wrote:
>> A couple more notes:
>> - assume it takes just 1 second to read and index one PDF document. 
>> You have 8,000,000 documents, and there are 86,400 seconds in a day. 
>> Assuming no delays at all in passing these documents over any kind of 
>> network, that means that it would take 93 days to index the collection.
> Most indexers handle upwards of 100 docs/second - certainly depending
> on document size, etc.  Note that you can parallelize indexing for
> higher speeds.  So that indexing estimate is way off.
The number I gave above was just an example, to trigger thinking about 
the real issues.  But considering the OP's later information that each 
document is about 4 MB in size, and that before one can extract the text 
of a PDF it needs to be retrieved and read, I would surmise that the 
estimate is indeed way off - but on the optimistic side.
Of course one can parallelise to an extent, but then we start talking 
budget, so perhaps let's not mix the two issues just yet.
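To make the arithmetic above concrete, here is a small back-of-envelope sketch in plain Java (the class is mine; the throughput figures are the assumptions from this thread, not measurements):

```java
// Back-of-envelope indexing-time estimator. The rates used below are
// assumptions taken from this thread, not measured throughput.
public class IndexingEstimate {

    static final double SECONDS_PER_DAY = 86_400.0;

    // Wall-clock days to index `docs` documents, given a sustained
    // per-worker throughput and a number of parallel workers.
    static double daysToIndex(long docs, double docsPerSecondPerWorker, int workers) {
        double totalSeconds = docs / (docsPerSecondPerWorker * workers);
        return totalSeconds / SECONDS_PER_DAY;
    }

    public static void main(String[] args) {
        // Pessimistic figure from the thread: 1 doc/s, single stream.
        System.out.printf("1 doc/s, 1 worker:    %.1f days%n",
                daysToIndex(8_000_000L, 1.0, 1));   // ~92.6 days
        // Erik's figure: ~100 docs/s, still a single stream.
        System.out.printf("100 docs/s, 1 worker: %.1f days%n",
                daysToIndex(8_000_000L, 100.0, 1)); // ~0.9 days
    }
}
```

The point of the parallelism parameter is exactly the budget question: halving the wall-clock time means doubling the indexing streams, and the hardware to feed them.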

>> - assume one PDF document contains on average 30 KB of pure text.  A
>> reasonable estimate for full-text indexing will result in an index
>> approximately 3 times the size of the original text.
> That's not quite a fair stat.  It depends on what you're storing.  The
> rough guide for index size with Lucene is about 35% of the size of the
> original data... assuming the text is just indexed and not stored.
> Very likely there is no need to store the PDF content in the Lucene
> index.  Certainly storage needs must be factored into the equation, and
> there are trade-offs when it comes to storing document content in
> Lucene - hit highlighting, for example.
Same thing.  In a collection that size, to allow meaningful searches one 
would need to store positional information to support proximity 
searches.  I thus question the 35% figure.
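The same kind of arithmetic can be sketched for storage; the 30 KB per document, 35% and 3x figures are the competing assumptions discussed above, not measurements:

```java
// Index-size arithmetic for the figures discussed above (all assumptions).
public class IndexSizeEstimate {

    // Decimal gigabytes of extracted text: docs * KB-per-doc.
    static double textGb(long docs, double kbTextPerDoc) {
        return docs * kbTextPerDoc / 1_000_000.0;
    }

    // Index size as a fraction (0.35) or multiple (3.0) of the raw text.
    static double indexGb(double textGb, double ratio) {
        return textGb * ratio;
    }

    public static void main(String[] args) {
        double text = textGb(8_000_000L, 30);   // 240 GB of pure text
        System.out.printf("text:       %.0f GB%n", text);
        System.out.printf("index @35%%: %.0f GB%n", indexGb(text, 0.35)); // ~84 GB
        System.out.printf("index @3x:  %.0f GB%n", indexGb(text, 3.0));   // 720 GB
        // The PDFs themselves, at ~4 MB each, dwarf either estimate:
        System.out.printf("raw PDFs:   %.0f TB%n", 8_000_000L * 4.0 / 1_000_000); // 32 TB
    }
}
```

Either way, the raw PDFs dominate the storage budget by an order of magnitude; the 35%-vs-3x debate matters mostly for what fits in RAM or on fast disks.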
Anyway, before knowing more about the real document content, the way in 
which users would search, the way in which they would like to see the 
results, etc., much of this is rather speculative.

One aspect not raised so far is what these PDFs really contain.  Is the 
content actually text at all?
With this kind of volume, the OP might be talking about published 
articles, for example, a good proportion of which might very well be 
bitmap images embedded in PDFs rather than real text.
Search_Guru, if that is your case (or even if only a fraction of your 
PDFs are such), then multiply any of the estimates above by at least an 
order of magnitude, because we are then talking OCR, at approximately 10 
seconds per page.
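At 10 seconds per page the arithmetic is brutal. A quick sketch (the 10 s/page figure is from above; the pages-per-document value is my placeholder assumption, since the thread gives no figure):

```java
// OCR wall-clock estimate. The 10 s/page figure comes from the thread;
// the pages-per-document value is a placeholder assumption.
public class OcrEstimate {

    // Days of OCR for `docs` documents, spread over `workers` OCR streams.
    static double ocrDays(long docs, int pagesPerDoc, double secondsPerPage, int workers) {
        return docs * (double) pagesPerDoc * secondsPerPage / workers / 86_400.0;
    }

    public static void main(String[] args) {
        // e.g. 10 pages/doc, 10 s/page, one OCR stream: ~9,260 days (~25 years)
        System.out.printf("1 worker:    %.0f days%n", ocrDays(8_000_000L, 10, 10.0, 1));
        // even with 100 parallel OCR workers, still ~93 days
        System.out.printf("100 workers: %.0f days%n", ocrDays(8_000_000L, 10, 10.0, 100));
    }
}
```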

A practical tip now:

I have not done this in a while, but as I recall, installing Lucene on a 
PC is really easy.  It comes with a trial indexing "robot" which you can 
simply point at a directory of documents, and it will go off and index 
them for you.
So create such a directory, put a sample of your documents in it, let it 
run, and look at the results.  The whole thing will take you an hour at 
most, and it will provide you with a very first estimate of what your 
issues really are, and some real numbers to start thinking from.
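Even before the sample indexing run finishes, a plain directory walk already gives you numbers to extrapolate from. This sketch (stdlib only - it is not the Lucene indexing robot itself, just a companion estimator) counts and sizes the sample, then scales the average up to the full collection:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.atomic.AtomicLong;
import java.util.stream.Stream;

// Walks a sample directory, then extrapolates total storage to the
// full collection size. Purely a first estimate, as discussed above.
public class SampleExtrapolation {

    // Returns {fileCount, totalBytes} for every regular file under dir.
    static long[] sampleStats(Path dir) throws IOException {
        AtomicLong count = new AtomicLong();
        AtomicLong bytes = new AtomicLong();
        try (Stream<Path> paths = Files.walk(dir)) {
            paths.filter(Files::isRegularFile).forEach(p -> {
                count.incrementAndGet();
                try { bytes.addAndGet(Files.size(p)); }
                catch (IOException e) { /* skip unreadable files */ }
            });
        }
        return new long[] { count.get(), bytes.get() };
    }

    // Scale the sample's average file size up to `totalDocs` documents.
    static double extrapolatedGb(long[] stats, long totalDocs) {
        if (stats[0] == 0) return 0;
        double avgBytes = (double) stats[1] / stats[0];
        return avgBytes * totalDocs / 1_000_000_000.0;
    }

    public static void main(String[] args) throws IOException {
        long[] stats = sampleStats(Path.of(args.length > 0 ? args[0] : "."));
        System.out.printf("%d files, %d bytes; extrapolated to 8M docs: %.1f GB%n",
                stats[0], stats[1], extrapolatedGb(stats, 8_000_000L));
    }
}
```

Run it against your sample directory; comparing its extrapolation with the size of the index the trial run produces gives you both sides of the storage estimate.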

It may even be that Solr (which, as I also recall, is among other things 
a web front-end to Lucene) will let you do this in a more user-friendly, 
graphical way.

The above shows one of the benefits of Lucene as compared to GSA: you 
can try it out easily, without committing yourself.

Do not be over-enthusiastic (or disappointed) at first, no matter what 
the results are.  On the one hand, you have to be careful about 
extrapolating results from 500 documents to 8 million; on the other 
hand, there are many ways to configure and tune these things.
But first you need some basic numbers.

And a more general tip:
If you are new to this kind of thing, don't underestimate it.  The topic 
of text indexing and retrieval is, by itself, as vast as the topic of 
relational databases, and it has its own problems, its own techniques, 
its own jargon, its own set of specialists, etc.
I do not mean this in any derogatory way.  Obviously you are not 
entirely new to the subject, since you have come to this list rather 
than some other, more classical-IT-oriented one.  That is already a big 
step in the right direction.
