Date: Sun, 30 Nov 2008 09:10:29 -0800
From: "J. Delgado"
To: general@lucene.apache.org
Subject: Re: Which one is better - Lucene OR Google Search Appliance
In-Reply-To: <20731258.post@talk.nabble.com>
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="----=_Part_74066_2924532.1228065029596"
References: <20725398.post@talk.nabble.com> <492F1D1E.6010001@holsman.net> <20730335.post@talk.nabble.com> <492FB4A3.9090908@ice-sa.com> <20731258.post@talk.nabble.com>

------=_Part_74066_2924532.1228065029596
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
Content-Disposition: inline

On Fri, Nov 28, 2008 at 1:28 AM, Mike_SearchGuru wrote:

>
> Many thanks for your responses. Yes, you are right, we should not be
> considering license costs, and I agree with you.
>
> Let me further answer your questions:
> 1) Each PDF file is on average about 100 pages long and 4 MB in size.

Have you considered chopping each document into 100 separate pages and
indexing those, while storing the link to the complete doc in a field? That
way you get relevant hits at the page level and can navigate back to the
original doc.

If you need help with this, I think I have a PDF "chopper" script (Windows)
somewhere ... it uses PDFBox. Otherwise it should be relatively easy to do.

>
> However, we are not indexing the whole lot. We will only be indexing very
> few parts, i.e. the headlines in the PDF files. So I would say only some 5%
> of each document will ever be indexed.
> 2) All files are in English.
> 3) We don't need any control over how the PDFs are indexed.
> 4) Every week we have an increase of 5000 PDFs that need to be indexed.
> 5) We need a facility whereby we can create multiple indexes so that we can
> keep the size of these indexes as small as possible, BUT when a query is
> fired we want to be able to pull information from all of these indexes.
> 6) No need for any access controls.
> 7) On the time factor: if it takes 1 sec to index a PDF file (assuming that
> the content to index is 30 KB), then we will be screwed, as we can't wait 93
> days for everything to be indexed. So what we might do is split our docs
> into multiple parts and index them separately on separate servers (maybe 10
> servers), which should cut the 93 days to 9 days. The question here is:
> can we then group all those indexes on one server later on when going live?
> 8) Currently our PDF file size for all 8 million adds up to 40 terabytes
> already.
>
>
>
> awarnier wrote:
> >
> > Mike_SearchGuru wrote:
> >> OK, basically we have 8 million PDFs to index and we have good technical
> >> people in our company.
> >>
> >> The question is: is Lucene slower than GSA in terms of indexing PDFs?
> >> Are there any costs for licenses if used commercially? If yes, then what
> >> are the costs?
> >> What are the downsides of Lucene as opposed to GSA? These are my
> >> questions, and if you can answer them it will be a great help.
> >>
> >> Thanks
> >> Ali
> >>
> >>
> >>
> >> Ian Holsman wrote:
> >>> Mike_SearchGuru wrote:
> >>>> We are evaluating Lucene at the moment and also considering Google
> >>>> Search Appliance. Is there anyone who can guide us on which one is
> >>>> better, apart from Google being expensive, as we have 8 million PDFs
> >>>> to index?
> >>>>
> >>>> Can someone help us by clearly identifying which one is better?
> >>>>
> >>> Hi Mike.
> >>>
> >>> Firstly, GSA is so much more than just a search library, which is what
> >>> Lucene is.
> >>> In your analysis you should be looking at things like Solr
> >>> (which will give you a web interface to the Lucene library), and Tika
> >>> or Nutch to actually put your documents into the index itself.
> >>>
> >>> As for which is better, we have no idea what your requirements are
> >>> (besides wanting to avoid spending money) or what your
> >>> organization's technical capabilities are (are you willing to spend 1-3
> >>> getting up to speed with the open source tools, for example), so it will
> >>> be hard for us to judge.
> >>>
> >>>
> > Hi.
> > I am not an expert on either GSA or Lucene, but reading your description
> > above, I would ask myself a couple of questions first of all.
> >
> > You have 8 million PDFs which you want to index. That is, presumably,
> > to make their content searchable later by some users.
> > Let's say that you go through the entire collection of PDFs and index
> > every single word in them, no matter with which tool (both GSA and
> > Lucene can do that).
> >
> > Assuming that these 8 million PDFs are all in English, you have a good
> > chance that just about any word of the English language will occur
> > thousands of times. So a user searching for something will find
> > thousands of hits, just like when you search in Google. Will that be
> > useful to them?
> > In other words, the question is: do you want some control over how the
> > 8 million PDFs are going to be indexed, or not?
> >
> > The second question is about access. When your documents are all
> > indexed, should any user then be able to access any item of the
> > collection? Or do you want some form of access control, to determine
> > who gets access to what?
> >
> > The answers to the above will already provide some elements to make
> > choices.
> >
> > A couple more notes:
> > - Assume it takes just 1 second to read and index one PDF document. You
> > have 8,000,000 documents, and there are 86,400 seconds in a day.
> > Assuming no delays at all in passing these documents over any kind of
> > network, that means that it would take 93 days to index the collection.
> > - Assume one PDF document contains on average 30 KB of pure text. A
> > reasonable average for full-text indexing will result in an index
> > that is, in size, approximately 3 times as large as the original text.
> > You make the calculation.
> >
> > You might thus want to analyse this seriously, and not make a decision
> > based purely on the cost of a license.
> >
> >
>
> --
> View this message in context:
> http://www.nabble.com/Which-one-is-better---Lucene-OR-Google-Search-Appliance-tp20725398p20731258.html
> Sent from the Lucene - General mailing list archive at Nabble.com.
>
>

------=_Part_74066_2924532.1228065029596--
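
The estimates quoted in this thread (1 second per PDF, 8,000,000 PDFs, 10 servers, 30 KB of text per document, an index roughly 3 times the size of the text) can be checked with a few lines of arithmetic. What follows is only a back-of-the-envelope sketch in Python: the function names and parameters are made up for illustration, "30 KB" is assumed to mean kilobytes, and the work is assumed to split perfectly evenly across servers with no network or merge overhead.

```python
SECONDS_PER_DAY = 86_400  # figure quoted in the thread


def indexing_days(num_docs, seconds_per_doc=1.0, servers=1):
    """Wall-clock days to index num_docs.

    Assumption (not from the thread): the work divides perfectly
    evenly across servers, with no transfer or merge overhead.
    """
    return num_docs * seconds_per_doc / (SECONDS_PER_DAY * servers)


def index_size_gb(num_docs, text_kb_per_doc=30, expansion=3.0):
    """Estimated index size in decimal GB (1 GB = 1,000,000 KB).

    expansion is the index-to-text size ratio suggested in the thread.
    """
    return num_docs * text_kb_per_doc * expansion / 1_000_000


if __name__ == "__main__":
    docs = 8_000_000
    print(f"1 server:   {indexing_days(docs):.1f} days")
    print(f"10 servers: {indexing_days(docs, servers=10):.1f} days")
    print(f"index size: {index_size_gb(docs):.0f} GB")
    # Weekly increment of 5000 PDFs, expressed in hours on one server.
    print(f"weekly batch: {indexing_days(5_000) * 24:.1f} hours")
```

Under these assumptions the one-server figure is about 92.6 days, in line with the 93 days quoted above, ten servers bring it to a little over 9 days, and the full-text index estimate comes to 720 GB. If only about 5% of each document is indexed, the text-per-document parameter would shrink accordingly.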