lucene-general mailing list archives

From André Warnier
Subject Re: Which one is better - Lucene OR Google Search Appliance
Date Fri, 28 Nov 2008 09:06:43 GMT
Mike_SearchGuru wrote:
> OK, basically we have 8 million PDFs to index and we have good technical
> people in our company.
> The question is: is Lucene slower than GSA in terms of indexing PDFs?
> Are there any costs for licenses if used commercially? If yes, then what are
> the costs?
> What are the downsides of Lucene as opposed to GSA? These are my questions
> and if you can answer them then it will be a great help.
> Thanks
> Ali
> Ian Holsman wrote:
>> Mike_SearchGuru wrote:
>>> We are evaluating Lucene at the moment and also considering Google Search
>>> Appliance. Is there anyone who can guide us on which one is better apart
>>> from Google being expensive, as we have 8 million PDFs to index.
>>> Can someone help us by clearly identifying which one is better?
>> Hi Mike.
>> Firstly, GSA is so much more than just a search library, which is what
>> Lucene is. In your analysis you should be looking at things like Solr
>> (which will give you a web interface to the Lucene library), and Tika or
>> Nutch to actually put your documents into the index itself.
>> As for which is better, we have no idea what your requirements are
>> (besides wanting to avoid spending money) or what your
>> organization's technical capabilities are (are you willing to spend 1-3 
>> getting up to speed with the open source tools for example) so it will 
>> be hard for us to judge.
I am not an expert on either GSA or Lucene, but reading your description
above, I would ask myself a couple of questions first of all.

You have 8 million PDFs which you want to index.  That is, presumably, 
to make their content searchable later by some users.
Let's say that you go through the entire collection of PDFs and index
every single word in them, no matter with which tool (both GSA and
Lucene can do that).

Assuming that these 8 million PDFs are all in English, there is a good
chance that just about any word of the English language will occur
thousands of times. So a user searching for something will find
thousands of hits, just like when you search in Google. Will that be
useful to them?
In other words, the question is: do you want some control over how the
8 million PDFs are going to be indexed, or not?

The second question is about access. Once your documents are all
indexed, should any user be able to access any item of the
collection? Or do you want some form of access control, to determine
who gets access to what?

The answers to the above will already give you some basis for making a choice.

A couple more notes:
- Assume it takes just 1 second to read and index one PDF document. You
have 8,000,000 documents, and there are 86,400 seconds in a day.
Assuming no delays at all in passing these documents over any kind of
network, that means it would take about 93 days to index the collection
sequentially.
- Assume one PDF document contains on average 30 KB of pure text. As a
reasonable average, full-text indexing will result in an index
that is approximately 3 times as large as the original text.
You do the calculation.
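
For what it's worth, the back-of-the-envelope arithmetic in the two notes above can be sketched in a few lines of Python. This is only a rough estimate under the stated assumptions (1 second per document, 30 KB of text per document, an index about 3 times the size of the text). The function names and the `workers` parameter are hypothetical, introduced here only to show how the estimate changes if documents can be indexed in parallel, and sizes use decimal units (1 GB = 1,000,000 KB).

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def indexing_days(num_docs, seconds_per_doc=1.0, workers=1):
    """Estimated wall-clock days to index num_docs documents.

    `workers` is a hypothetical knob (not from the original message):
    it assumes the work divides evenly across parallel indexers.
    """
    return num_docs * seconds_per_doc / (workers * SECONDS_PER_DAY)


def index_size_gb(num_docs, text_kb_per_doc=30, index_ratio=3.0):
    """Return (raw text size, estimated index size) in decimal GB."""
    text_gb = num_docs * text_kb_per_doc / 1_000_000
    return text_gb, text_gb * index_ratio


if __name__ == "__main__":
    docs = 8_000_000
    print(f"Sequential indexing: {indexing_days(docs):.1f} days")
    print(f"With 8 workers:      {indexing_days(docs, workers=8):.1f} days")
    text_gb, idx_gb = index_size_gb(docs)
    print(f"Pure text: {text_gb:.0f} GB, estimated index: {idx_gb:.0f} GB")
```

Changing `seconds_per_doc`, `text_kb_per_doc` or `index_ratio` to values measured on a sample of your own PDFs would give a more realistic picture than these round numbers.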

You might thus want to analyse this seriously, and not make a decision 
based purely on the cost of a license.
