lucene-general mailing list archives

From John Byrne <john.by...@therogueprocess.net>
Subject Re: Which one is better - Lucene OR Google Search Appliance
Date Fri, 28 Nov 2008 09:47:19 GMT
Mike_SearchGuru wrote:
> 5) we need a facility whereby we can create multiple indexes so that we can
> keep the size of these indexes as small as possible BUT when a query is
> fired we want to be able to pull information from all these multiple
> indexes.
>   
Lucene can search across multiple indexes and merge the results 
correctly, as if they came from one index.
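To illustrate the idea, here is a toy Python sketch. It is not Lucene's API: the index layout (term -> {document: score}), the function names search_index and search_all, and the sample data are all made up for the example, and it assumes the scores from the different indexes are directly comparable. In Lucene 2.x the class to look at for the real thing is, as far as I know, MultiSearcher.

```python
import heapq
from itertools import islice


def search_index(index, term):
    """Return (score, doc_id) hits from one index, best score first."""
    hits = index.get(term, {})
    return sorted(((score, doc) for doc, score in hits.items()), reverse=True)


def search_all(indexes, term, limit=10):
    """Query every index and merge the ranked hit lists into one list."""
    per_index = [search_index(index, term) for index in indexes]
    # Each per-index list is already sorted best-first, so a k-way merge
    # yields one globally ordered stream of hits.
    merged = heapq.merge(*per_index, reverse=True)
    return [doc for _score, doc in islice(merged, limit)]


# Two small, separate indexes (hypothetical data).
index_a = {"lucene": {"a1.pdf": 0.9, "a2.pdf": 0.4}}
index_b = {"lucene": {"b1.pdf": 0.7}, "google": {"b2.pdf": 0.8}}

results = search_all([index_a, index_b], "lucene")
print(results)  # ['a1.pdf', 'b1.pdf', 'a2.pdf']
```

The caller sees a single ranked list and does not need to know how many indexes sit behind it.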
> 7) on time factor - if it takes 1 sec to index a pdf file (assuming that the
> content to index is 30KB), then we will be screwed as we can't wait 93
> days for everything to be indexed. So what we might do is split our docs into
> multiple parts and index them separately on separate servers (maybe 10
> servers) and so that should cut the 93 days to 9 days. The question here is
> can we then group all those indexes on one server later on when going live.
>   
As well as searching across multiple indexes, Lucene also lets you 
merge several indexes into one if you want.
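As a rough check on the numbers in the question, and a toy picture of what merging means, here is a small Python sketch. It assumes 1 second per PDF, a perfectly even split across servers, and no network or merge overhead. The index layout (term -> {document: score}) and the names indexing_days and merge_indexes are invented for the example and are not Lucene's API; in Lucene itself I believe the method to look at is IndexWriter.addIndexes.

```python
SECONDS_PER_DAY = 86_400


def indexing_days(num_docs, seconds_per_doc=1.0, servers=1):
    """Days needed to index num_docs when work is split evenly across servers."""
    return num_docs * seconds_per_doc / servers / SECONDS_PER_DAY


def merge_indexes(indexes):
    """Combine several term -> {doc: score} indexes into one new index.

    Assumes document names are unique across the partial indexes.
    """
    merged = {}
    for index in indexes:
        for term, postings in index.items():
            merged.setdefault(term, {}).update(postings)
    return merged


one_server = indexing_days(8_000_000)
ten_servers = indexing_days(8_000_000, servers=10)
print(f"1 server:   {one_server:.1f} days")   # 1 server:   92.6 days
print(f"10 servers: {ten_servers:.1f} days")  # 10 servers: 9.3 days

# Partial indexes built on separate servers (hypothetical data).
part_1 = {"lucene": {"a.pdf": 0.9}}
part_2 = {"lucene": {"b.pdf": 0.7}, "search": {"b.pdf": 0.5}}
combined = merge_indexes([part_1, part_2])
print(combined)
```

The time for the final merge itself is not included in the estimate, so the real figure for 10 servers would be somewhat more than 9 days.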
> 8) currently our pdf file size for all 8 million adds up to 40 terabytes
> already.
>
>
>
> awarnier wrote:
>   
>> Mike_SearchGuru wrote:
>>     
>>> OK basically we have 8 million pdf's to index and we have good technical
>>> people in our company.
>>>
>>> question is: is lucene slower than GSA in terms of indexing pdf's?
>>> are there any costs for licenses if used commercially? If yes then what
>>> are the costs?
>>> what are the downsides of Lucene as opposed to GSA? these are my
>>> questions and if you can answer them then it will be great help.
>>>
>>> Thanks
>>> Ali
>>>
>>>
>>>
>>> Ian Holsman wrote:
>>>       
>>>> Mike_SearchGuru wrote:
>>>>         
>>>>> We are evaluating Lucene at the moment and also considering Google Search
>>>>> Appliance. Is there anyone who can guide us on which one is better, apart
>>>>> from Google being expensive, as we have 8 million PDF's to index.
>>>>>
>>>>> Can someone help us by clearly identifying which one is better.
>>>>>   
>>>>>           
>>>> Hi Mike.
>>>>
>>>> Firstly GSA is so much more than just a search library, which is what 
>>>> lucene is. In your analysis you should be looking at things like Solr 
>>>> (which will give you a web interface to the lucene library), and Tika or
>>>> nutch to actually put your documents into the index itself.
>>>>
>>>> as for which is better, we have no idea what your requirements are 
>>>> (apart from wanting to avoid spending money) or what your 
>>>> organization's technical capabilities are (are you willing to spend 1-3 
>>>> getting up to speed with the open source tools for example) so it will 
>>>> be hard for us to judge.
>>>>  
>>>>
>>>>         
>> Hi.
>> I am not an expert on either GSA or Lucene, but reading your description 
>> above, I would ask myself a couple of questions first of all.
>>
>> You have 8 million PDFs which you want to index.  That is, presumably, 
>> to make their content searchable later by some users.
>> Let's say that you go through the entire collection of PDFs, and index 
>> every single word in them, no matter with which tool (both GSA and 
>> Lucene can do that).
>>
>> Assuming that these 8 million PDFs are all in English, you have a good 
>> chance that just about any word of the English language will occur 
>> thousands of times. So, a user searching for something will find 
>> thousands of hits, just like when you search in Google.  Will that be 
>> useful to them ?
>> In other words, the question is : do you want some control over how the 
>> 8 million PDFs are going to be indexed, or not ?
>>
>> The second question is about access.  When your documents are all 
>> indexed, should then any user be able to access any item of the 
>> collection ? or do you want some form of access-control, to determine 
>> who gets access to what ?
>>
>> The answer to the above will already provide some elements to make
>> choices.
>>
>> A couple more notes :
>> - assume it takes just 1 second to read and index one PDF document. You 
>> have 8,000,000 documents, and there are 86,400 seconds in a day. 
>> Assuming no delays at all in passing these documents over any kind of 
>> network, that means that it would take 93 days to index the collection.
>> - assume one PDF document contains on average 30 KB of pure text.  A 
>> reasonable average for a full-text indexing, will result in an index 
>> that is, in size, approximately 3 times as large as the original text.
>> You make the calculation.
>>
>> You might thus want to analyse this seriously, and not make a decision 
>> based purely on the cost of a license.
>>
>>
>>     
>
>   
>   
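For the index-size calculation left as an exercise in the quoted notes above, here is the arithmetic as a short Python sketch. It takes the quoted figures at face value (30 KB of text per PDF and an index about 3 times the size of the text) and uses decimal units; the real ratio depends on what you choose to store in the index, so treat both numbers as assumptions. The function name index_size_gb is made up for the example.

```python
def index_size_gb(num_docs, text_kb_per_doc=30, index_ratio=3):
    """Return (raw text size, estimated index size) in decimal gigabytes."""
    # Decimal units: 1 GB = 1,000,000 KB.
    text_gb = num_docs * text_kb_per_doc / 1_000_000
    return text_gb, text_gb * index_ratio


text_gb, index_gb = index_size_gb(8_000_000)
print(f"extracted text: {text_gb:.0f} GB")   # extracted text: 240 GB
print(f"estimated index: {index_gb:.0f} GB") # estimated index: 720 GB
```

Split across 10 indexing servers, that would be on the order of 72 GB of index per server before merging.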

