Erik Hatcher wrote:
>
> On Nov 28, 2008, at 4:06 AM, André Warnier wrote:
>> A couple more notes :
>> - assume it takes just 1 second to read and index one PDF document.
>> You have 8,000,000 documents, and there are 86,400 seconds in a day.
>> Assuming no delays at all in passing these documents over any kind of
>> network, that means that it would take 93 days to index the collection.
>
> Most indexers handle upwards of 100 docs / second or more - certainly
> depending on document size, etc. Note that you can parallelize indexing
> for higher speeds. So that indexing estimate is way off.
>
The number I gave above was just an example, to trigger thinking about
the real issues. But considering the later information of the OP that
each document is about 4 MB in size, and that before one extracts the
text of the PDF, it needs to be retrieved and read, I would surmise that
indeed the estimate is way off, but on the optimistic side.
Of course one can parallelise to an extent, but then we are starting
to talk about budget issues, so let's not mix the two at first.
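
To put rough numbers on the two positions, here is a minimal Python
sketch of the arithmetic. Only the 8,000,000 documents, the 1 document
per second and the 100 documents per second figures come from this
thread; the function name and the `workers` parameter are illustrative
assumptions, and perfectly linear scaling across workers is an
optimistic simplification.

```python
SECONDS_PER_DAY = 86_400


def indexing_days(num_docs, docs_per_second, workers=1):
    """Days needed to index num_docs at a given per-worker throughput.

    Assumes throughput scales linearly with the number of workers,
    which ignores network, disk and coordination overhead.
    """
    if docs_per_second <= 0 or workers < 1:
        raise ValueError("docs_per_second must be > 0 and workers >= 1")
    return num_docs / (docs_per_second * workers) / SECONDS_PER_DAY


if __name__ == "__main__":
    total_docs = 8_000_000  # collection size given in the thread
    for rate in (1, 100):   # the two throughput figures discussed above
        days = indexing_days(total_docs, rate)
        print(f"{rate:>4} docs/s, 1 worker: {days:.2f} days")
```

The real throughput for 4 MB PDFs can only come from a measurement, so
the rates passed in should be replaced by observed values.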
>> - assume one PDF document contains on average 30 Kb of pure text. A
>> reasonable average for a full-text indexing, will result in an index
>> that is, in size, approximately 3 times as large as the original text.
>
> That's not quite a fair stat. It depends on what you're storing. The
> rough guide for index size with Lucene is roughly 35% the size of the
> original data... assuming the text is being just indexed and not
> stored. Very likely there is no need to store the PDF content in the
> Lucene index. Certainly storage needs must be factored into the
> equation, and there are trade-offs when it comes to places where storing
> a document into Lucene - hit highlighting for example.
>
Same thing. In a collection that size, to allow meaningful searches one
would need to store positional information to allow proximity searches.
I thus question the 35% figure.
Anyway, until we know more about the real document content, the way in
which users would search, the way in which they would like to see the
results, etc., much of that is rather speculative.
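
To show how far apart the two index-size figures are, here is a small
Python sketch. The ratios (3 times the text, and 35% of the text) and
the 30 Kb of text per document are the numbers quoted above; treating
30 Kb as 30,000 bytes and reporting decimal gigabytes are assumptions
made for this illustration, as is the function name.

```python
def index_size_gb(num_docs, text_bytes_per_doc, index_ratio):
    """Estimated index size in decimal gigabytes.

    index_ratio is the index size divided by the size of the raw text,
    e.g. 0.35 for an index-only setup or 3.0 for the larger estimate.
    """
    if num_docs < 0 or text_bytes_per_doc < 0 or index_ratio < 0:
        raise ValueError("all arguments must be non-negative")
    return num_docs * text_bytes_per_doc * index_ratio / 1e9


if __name__ == "__main__":
    docs = 8_000_000
    text_bytes = 30_000  # assumption: "30 Kb" read as 30,000 bytes
    for ratio in (0.35, 3.0):
        size = index_size_gb(docs, text_bytes, ratio)
        print(f"ratio {ratio}: about {size:.0f} GB")
```

Whichever ratio turns out to be closer depends on what is stored and
whether positional information is kept, which is the point under
discussion.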
One aspect not raised so far, for instance, is what these PDFs really
contain. For example, is the content really text at all?
With this kind of volume, the OP might be talking about published
articles, for example, a good proportion of which might very well be
bitmap images embedded in PDFs, rather than real text.
Search_Guru, if that is your case (or even if a fraction of your PDFs
are such), then multiply any estimates above by at least an order of
magnitude, because we're then talking OCR, at approximately 10 seconds
per page.
A practical tip now :
I have not done this in a while, but as I recall installing Lucene on a
PC is a really easy thing to do. It comes with a trial indexing "robot",
which you can just point to a directory containing documents, and it
will go off and index them for you.
So create such a directory, put a sample of your documents in it, let it
run and look at the results. The whole thing will take you one hour at
the most, and it will provide you with a very first estimate of what
your issues really are, and give you some real bases to start thinking.
It may even be that Solr (which as I also recall is a web interface to
Lucene among other things), will allow you to do this in a more
user-friendly and graphical way.
The above shows one of the benefits of Lucene as compared to GSA : you
can try it out easily, without committing yourself.
Do not be over-enthusiastic (or disappointed) at first, no matter what
the results are. On the one hand, you have to be careful about
extrapolating results from 500 documents to 8 million, and on the other
hand there are many ways to configure and tune these things.
But first you need some basic numbers.
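
Once a sample run has finished, the basic numbers can be scaled up with
a few lines of Python. This is only a sketch: it assumes that time and
index size grow linearly with the number of documents, which is exactly
the kind of extrapolation to be careful about. The function names are
made up for this example, and the figures in the usage section are
placeholders, not measurements.

```python
import os

SECONDS_PER_DAY = 86_400


def directory_size_bytes(path):
    """Total size in bytes of all files under path (e.g. an index directory)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total


def extrapolate(sample_docs, sample_seconds, sample_index_bytes, target_docs):
    """Naive linear scaling of a sample run to the full collection."""
    if sample_docs <= 0:
        raise ValueError("sample_docs must be > 0")
    factor = target_docs / sample_docs
    return {
        "days": sample_seconds * factor / SECONDS_PER_DAY,
        "index_gb": sample_index_bytes * factor / 1e9,
    }


if __name__ == "__main__":
    # Placeholder values: replace them with what your own trial run shows.
    estimate = extrapolate(
        sample_docs=500,
        sample_seconds=600,
        sample_index_bytes=50_000_000,
        target_docs=8_000_000,
    )
    print(f"about {estimate['days']:.1f} days, "
          f"about {estimate['index_gb']:.0f} GB of index")
```

`directory_size_bytes` can be pointed at the directory where the sample
index was written to obtain `sample_index_bytes`.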
And a more generic tip :
If you are new to this kind of thing, don't underestimate it. The topic
of text indexing and retrieval, by itself, is as vast as the topic of
relational databases, and it has its own set of problems, its own
techniques, its own jargon, its own set of specialists, etc.
I do not mean this in any derogatory way. Obviously you are not totally
innocent in the subject, since you have come to this list, rather than
some other more classical-IT oriented one. That's already a big step in
the right direction.