lucene-general mailing list archives

From John Byrne <>
Subject Re: Which one is better - Lucene OR Google Search Appliance
Date Fri, 28 Nov 2008 09:47:19 GMT
Mike_SearchGuru wrote:
> 5) we need a facility whereby we can create multiple indexes so that we can
> keep the size of these indexes as small as possible, BUT when a query is
> fired we want to be able to pull information from all these multiple
> indexes.
Lucene can search across multiple indexes and merge the results 
correctly, as if they came from one index.
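
To illustrate the idea, here is a toy sketch in Python. This is not Lucene code and not how Lucene is implemented (in Lucene you would open the sub-indexes through a MultiReader and let it score and rank for real); the made-up `(doc_id, score)` pairs below only show what "merge the results as if they came from one index" means:

```python
# Toy sketch: query several small inverted indexes and merge the hits
# into one ranked list. Scores here are invented for illustration;
# Lucene computes real relevance scores internally.

def search_all(indexes, term):
    """Query each index for `term` and merge hits by descending score."""
    hits = []
    for index in indexes:
        hits.extend(index.get(term, []))  # each hit is (doc_id, score)
    return sorted(hits, key=lambda h: h[1], reverse=True)

# Two small indexes: term -> [(doc_id, score), ...]
index_a = {"lucene": [("a/doc1", 0.9), ("a/doc2", 0.4)]}
index_b = {"lucene": [("b/doc7", 0.7)]}

merged = search_all([index_a, index_b], "lucene")
print(merged)  # hits from both indexes, ranked together
```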
> 7) on the time factor - if it takes 1 sec to index a PDF file (assuming that the
> content to index is 30KB), then we will be screwed, as we can't wait 93
> days for everything to be indexed. So what we might do is split our docs into
> multiple parts and index them separately on separate servers (maybe 10
> servers), and that should cut the 93 days to 9 days. The question here is:
> can we then group all those indexes on one server later on when going live.
As well as searching across multiple indexes, Lucene also lets you 
merge several indexes into one if you want.
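
Again as a toy sketch only: in Lucene the merge is done by IndexWriter.addIndexes, which rewrites document ids and merges posting lists on disk; the dict-merging below just illustrates the concept of combining several term-to-postings maps into one:

```python
# Toy sketch: physically merge several inverted indexes into one map.
# This is NOT Lucene's implementation -- it only shows the idea of
# combining the per-server indexes built in step 7) into a single index.

def merge_indexes(indexes):
    """Combine several term -> postings maps into a single map."""
    merged = {}
    for index in indexes:
        for term, postings in index.items():
            merged.setdefault(term, []).extend(postings)
    return merged

part1 = {"lucene": ["doc1"], "google": ["doc2"]}
part2 = {"lucene": ["doc7"]}
print(merge_indexes([part1, part2]))
# {'lucene': ['doc1', 'doc7'], 'google': ['doc2']}
```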
> 8) currently our PDF file size for all 8 million adds up to 40 terabytes
> already.
> awarnier wrote:
>> Mike_SearchGuru wrote:
>>> OK, basically we have 8 million PDFs to index and we have good technical
>>> people in our company.
>>> The question is: is Lucene slower than GSA in terms of indexing PDFs?
>>> Are there any costs for licenses if used commercially? If yes, then what
>>> are
>>> the costs?
>>> What are the downsides of Lucene as opposed to GSA? These are my
>>> questions,
>>> and if you can answer them then it will be a great help.
>>> Thanks
>>> Ali
>>> Ian Holsman wrote:
>>>> Mike_SearchGuru wrote:
>>>>> We are evaluating Lucene at the moment and also considering Google Search
>>>>> Appliance. Is there anyone who can guide us on which one is better, apart
>>>>> from Google being expensive, as we have 8 million PDFs to index?
>>>>> Can someone help us by clearly identifying which one is better.
>>>> Hi Mike.
>>>> Firstly, GSA is so much more than just a search library, which is what
>>>> Lucene is. In your analysis you should be looking at things like Solr
>>>> (which will give you a web interface to the Lucene library), and Tika or
>>>> Nutch to actually put your documents into the index itself.
>>>> As for which is better, we have no idea what your requirements are
>>>> (besides wanting to avoid spending money) or what your
>>>> organization's technical capabilities are (are you willing to spend 1-3
>>>> getting up to speed with the open source tools, for example), so it will
>>>> be hard for us to judge.
>> Hi.
>> I am not an expert on either GSA or Lucene, but reading your description
>> above, I would ask myself a couple of questions first of all.
>> You have 8 million PDFs which you want to index.  That is, presumably, 
>> to make their content searchable later by some users.
>> Let's say that you go through the entire collection of PDFs, and index
>> every single word in them, no matter with which tool (both GSA and 
>> Lucene can do that).
>> Assuming that these 8 million PDFs are all in English, you have a good 
>> chance that just about any word of the English language will occur 
>> thousands of times. So, a user searching for something will find 
>> thousands of hits, just like when you search in Google.  Will that be 
>> useful to them?
>> In other words, the question is: do you want some control over how the
>> 8 million PDFs are going to be indexed, or not?
>> The second question is about access. When your documents are all
>> indexed, should any user then be able to access any item of the
>> collection, or do you want some form of access control, to determine
>> who gets access to what?
>> The answer to the above will already provide some elements to make
>> choices.
>> A couple more notes :
>> - assume it takes just 1 second to read and index one PDF document. You 
>> have 8,000,000 documents, and there are 86,400 seconds in a day. 
>> Assuming no delays at all in passing these documents over any kind of 
>> network, that means that it would take 93 days to index the collection.
>> - assume one PDF document contains on average 30 KB of pure text. A
>> reasonable average for full-text indexing will result in an index
>> that is approximately 3 times the size of the original text.
>> You make the calculation.
>> You might thus want to analyse this seriously, and not make a decision 
>> based purely on the cost of a license.
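
Working those numbers through (taking the assumed figures above at face value: 1 second per document, 30 KB of text per document, and an index roughly 3 times the text size):

```python
# Back-of-the-envelope check of the figures in the quoted message.
DOCS = 8_000_000
SECONDS_PER_DOC = 1        # assumed indexing cost per PDF
SECONDS_PER_DAY = 86_400
TEXT_PER_DOC_KB = 30       # assumed average extracted text per PDF
INDEX_FACTOR = 3           # assumed index-to-text size ratio

days_single_server = DOCS * SECONDS_PER_DOC / SECONDS_PER_DAY
index_size_gb = DOCS * TEXT_PER_DOC_KB * INDEX_FACTOR / 1_000_000

print(f"{days_single_server:.0f} days on one server")       # ~93 days
print(f"{days_single_server / 10:.1f} days on 10 servers")  # ~9.3 days
print(f"~{index_size_gb:.0f} GB index")                     # ~720 GB
```

So under these assumptions the "93 days" figure checks out, splitting across 10 servers brings it down to about 9 days as suggested in point 7), and the resulting index would be on the order of 720 GB.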