From André Warnier
Subject Re: Which one is better - Lucene OR Google Search Appliance
Date Fri, 28 Nov 2008 11:37:50 GMT
Andrzej Bialecki wrote:
> André Warnier wrote:
>> Same thing.  In a collection that size, to allow meaningful searches 
>> one would need to store positional information to allow proximity 
>> searches.   I thus question the 35% figure.
> However, Lucene is very good at compressing this data, especially if we 
> consider un-stored fields, i.e. fields that are only indexed. Try it and 
> measure it, and then question ;)
All right, I will retry it.  I guess one should not keep relying on older 
My original figure of 300% was based, I admit, on a relatively 
comprehensive indexing of documents, with distinct search indexes per 
"field", proximity, no stemming, no exclusion of stopwords, etc.
Basically what one needs to search professionally in, e.g., published 
technical literature.
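
To illustrate why storing positions costs space and what it buys, here 
is a minimal sketch in Python. It is not how Lucene implements or 
compresses its index; the function names and the two sample documents 
are made up for illustration. Every occurrence of every term gets a 
position entry (no stemming, no stopword removal), and that is what 
makes a proximity query possible.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map each term to {doc_id: [word positions]}.

    No stemming and no stopword removal, so every occurrence is recorded.
    """
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(position)
    return index


def near(index, term_a, term_b, max_distance):
    """Return sorted doc ids where both terms occur within max_distance words."""
    postings_a = index.get(term_a, {})
    postings_b = index.get(term_b, {})
    hits = []
    # Only documents containing both terms can match.
    for doc_id in postings_a.keys() & postings_b.keys():
        if any(abs(pa - pb) <= max_distance
               for pa in postings_a[doc_id]
               for pb in postings_b[doc_id]):
            hits.append(doc_id)
    return sorted(hits)


# Hypothetical sample documents, for illustration only.
docs = {
    1: "heat transfer in turbine blades",
    2: "turbine maintenance manual with a section on heat",
}
index = build_positional_index(docs)
print(near(index, "heat", "turbine", 3))   # only document 1 is close enough
print(near(index, "heat", "turbine", 10))  # both documents match
```

Without the position lists, the index could only say that both 
documents contain both words; it could not tell them apart.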

>> A practical tip now:
>> I have not done this in a while, but as I recall installing Lucene on 
>> a PC is a really easy thing to do. It comes with a trial indexing 
>> "robot", which you can just point to a directory containing documents, 
>> and it will go off and index them for you.
>> So create such a directory, put a sample of your documents in it, let 
>> it run and look at the results.  The whole thing will take you one 
>> hour at the most, and it will provide you with a very first estimate 
>> of what your issues really are, and give you some real bases to start 
>> thinking.
> This is not good advice. You need to convert the PDFs to plain text 
> first - Lucene as such indexes only plain text, and its demo application 
> can parse at most plain text and simple HTML. It won't be able to parse 
> PDFs, instead it will create a lot of garbage terms.
I did not remember that, sorry.
But I suppose there must exist some kind of "plugin" to process PDFs, no?
Basically I'm trying to help the OP run a test on a limited set 
of his documents (which are PDFs of so far unknown content, but 
supposedly text), with a minimum of installation complications, while 
allowing him to visualise the results, so that he gets a feel for what 
indexing his whole collection would entail.
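
As a rough sketch of a pre-check for such a test, the following Python 
snippet splits a sample directory into files that look like PDFs (and 
would therefore need converting to plain text before indexing) and 
everything else. The function name is hypothetical, and the check is 
only a heuristic based on the `%PDF-` signature at the start of a PDF 
file. It does not convert anything and does not call Lucene.

```python
from pathlib import Path

# A PDF file begins with this signature.
PDF_MAGIC = b"%PDF-"


def triage_sample(directory):
    """Return (pdf_names, other_names) for the regular files in a directory.

    Files whose first bytes match the PDF signature are listed as needing
    conversion; everything else is left for manual inspection.
    """
    needs_conversion, other = [], []
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file():
            continue
        with path.open("rb") as handle:
            header = handle.read(len(PDF_MAGIC))
        if header == PDF_MAGIC:
            needs_conversion.append(path.name)
        else:
            other.append(path.name)
    return needs_conversion, other
```

Feeding only the second list (or the converted text of the first) to 
the indexer avoids the garbage terms mentioned above.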

So, what would be the correct recommendation in this case?

