From: Umesh Prasad <umesh.iitk@gmail.com>
Date: Wed, 13 Oct 2010 21:52:14 +0530
Subject: Re: Questions about Lucene usage recommendations
To: java-user@lucene.apache.org

One more suggestion: with Lucene 2.1 you might be using the Hits API for searching, which preloads the documents. See
https://issues.apache.org/jira/browse/LUCENE-954?focusedCommentId=12579258&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12579258

The performance hit is significant on a live server. Switch to a TopDocs-based search; see
http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/Hits.html
for details.

Thanks
Umesh Prasad

On Tue, Sep 28, 2010 at 1:36 PM, Pawlak Michel (DCTI) <michel.pawlak@etat.ge.ch> wrote:
> Hello,
>
> I *hope* they do it this way, I'll have to check it.
> The number of fields cannot be made smaller, as it has to respect an existing standard. With regard to what you explain, searching concatenated fields seems to be a smart way to do it. Thank you for the tip.
>
> Michel
>
> ____________________________________________________
> Michel Pawlak
> Solution Architect
>
> Centre des Technologies de l'Information (CTI)
> Service Architecture et Composants Transversaux (ACT)
> Case postale 2285 - 64-66, rue du Grand Pré - 1211 Genève 2
> Direct tel.: +41 (0)22 388 00 95
> michel.pawlak@etat.ge.ch
>
>
> -----Original Message-----
> From: Danil ŢORIN [mailto:torindan@gmail.com]
> Sent: Tuesday, 28 September 2010 07:57
> To: java-user@lucene.apache.org
> Subject: Re: Questions about Lucene usage recommendations
>
> You said you have 1000 fields... when performing a search, do you search in all 1000 fields? That could definitely be a major performance hit, as it translates to a BooleanQuery with 1000 TermQueries inside.
>
> Maybe it makes sense to concatenate data from those fields so that you search in fewer fields (preferably ONE). So basically your document would have 1000+1 fields, and the search would be performed on that additional field instead of on 1000 fields.
>
> But it depends on your use case: putting title, author, and book abstract together may make sense for the search use case. On the other hand, if you store the number of pages and the release year, it doesn't make sense to mix something like "349" or "1997" with title and author. It depends on how you are searching and what results you expect... but I hope you get the idea.
>
>
> On Mon, Sep 27, 2010 at 18:41, Pawlak Michel (DCTI) wrote:
> > Hello,
> >
> > Thank you for your quick reply. I'll do my best to answer your remarks and questions (I numbered them to make my mail more readable). Unfortunately, as I wrote, I have no access to the source code, and the company is not really willing to answer my questions (that's why we are investigating)... so I cannot be sure of what is really done.
> >
> > 1) Yes, 2.1.0 is really old; that's why I'm wondering why the company providing the application didn't upgrade to a newer version, especially if it's "almost a jar drop-in".
> > 2) I mean that the application creates an index based on data stored in a DB and not in files (the index is then used for the searches).
> > 3) Only documents that have changed are reindexed, not the entire document base. Around 100-150 "documents" per day need to be reindexed.
> > 4) No specific sorting is done by default (as far as I know...).
> > 5+6) Each "document" is small (each "document" is a technical description of a book (not the book's content itself), made of ~1000 database fields, each of which weighs a few kilobytes); no highlighting is done.
> > 7) IMHO the way Lucene is being used is the bottleneck, not Lucene itself. However, I have no proof, as I do not have access to the code. What I know is that, since we had awful performance issues and could not wait for a better solution, we put the application on a high-performance server with a USPV SAN, and performance improved dramatically (dropping from 2.5 minutes per search to around 10 seconds before optimization). But we cannot afford such a solution in the long run (using such a high-end server for such a small application is a kind of joke).
> > Currently, we observe read-access peaks on the index files (up to 280 MB/second) and performance improvements when optimizing the index (see below).
> > 8) We use no NFS; we're on a USPV SAN, but we were using physical HDDs before (slower than the SAN, but that's only part of the problem IMHO).
> > 9-10) Thank you for the information.
> > 11) On the high-end server, after we optimized the index the average search time dropped from 10 s to below 2 s; now (after 2.5 weeks) the average search time is 7 s. Optimization seems required :-/
> > 12) OK.
> >
> > Regards,
> >
> > Michel
> >
> > -----Original Message-----
> > From: Danil ŢORIN [mailto:torindan@gmail.com]
> > Sent: Monday, 27 September 2010 14:53
> > To: java-user@lucene.apache.org
> > Subject: Re: Questions about Lucene usage recommendations
> >
> > 1) Lucene 2.1 is really old... you should be able to migrate to Lucene 2.9 without changing your code (almost a jar drop-in, but be careful with analyzers), and there could be huge improvements if you use Lucene properly.
> >
> > A few questions:
> > 2) What does "all data to be indexed is stored in DB fields" mean? You should store in Lucene everything you need, so that at search time you don't need to hit the DB.
> > 3) What does "indexing is done right after every modification" mean? Do you index just the changed document, or reindex all 1.4M docs?
> > 4) Do you sort on some field, or just by basic relevance?
> > 5) How big is each document and what do you do with it? Maybe highlighting on large documents causes this?
> > 6) What's in the documents? If each is like a book... and almost every word matches every document... there could be some issues.
> > 7) Is Lucene the bottleneck? Maybe you are calling from a remote server and marshaling + network + unmarshaling is slow?
> >
> > Usual Lucene patterns are (assuming that you move to Lucene 2.9):
> > 8) Avoid using NFS (not necessarily a performance bottleneck, and definitely not something that would cause a 2-minute query, but just to be on the safe side).
> > 9) Keep the writer open and add documents to the index (no need to rebuild everything).
> > 10) Keep your readers open and use reopen() once in a while (you may even go for real-time search if you want to).
> > 11) In your case, I don't think optimize will do any good; the segments look good to me, so don't worry about cfs file size.
> > 12) There are ways to limit cfs files and to play with the setMaxXXX methods, but I don't think they are the cause of your 2-minute query.
> >
> > On Mon, Sep 27, 2010 at 14:35, Pawlak Michel (DCTI) wrote:
> >> Hello,
> >>
> >> We have an application which uses Lucene, and we have severe performance issues (on bad days, some searches take more than 2 minutes). I'm new to the Lucene component, thus I'm not sure Lucene is being used correctly and would like some information on Lucene usage recommendations. This would help locate the problem (application code / Lucene configuration / hardware / all of them). It would be great if a project committer / specialist could answer these questions.
> >>
> >> First, some facts about the application:
> >> - Lucene version being used: 2.1.0 (February 2007...)
> >> - around 1.4M "documents" to be indexed
> >> - DB size (all data to be indexed is stored in DB fields): 3.5 GB
> >> - index file size on disk: 1.6 GB (note that one cfs file is 780 MB, another one is 600 MB, and the rest consists of smaller files)
> >> - single indexer, multiple readers (6 readers)
> >> - around 150 documents are modified per day
> >> - indexing is done right after every modification
> >> - simple searches can take ages (for instance, searching for "chocolate" can take more than 2 minutes)
> >> - I do not have access to the source code (yes, that's the funny part)
> >>
> >> My questions:
> >> - Is this version of Lucene still supported?
> >> - What are the main reasons, if any, one should use the latest version of Lucene instead of 2.1.0? (For instance: performance, stability, critical fixes, support, etc. The answer may sound obvious, but I would like an official answer.)
> >> - Are there any recommendations concerning storage that any Lucene user should know? (Not benchmarks, but recommendations such as "better use physical HDDs", "do not use NFS if possible", "if your cfs files are larger than XYZ, better use this kind of storage", "if you have more than XYZ searches per second, better...", etc.)
> >> - Is there any recommendation concerning cfs file size?
> >> - Is there a way to limit the size of cfs files?
> >> - What is the impact on search performance if cfs file size is limited?
> >> - How often should optimization occur? (Every day, week, month?)
> >> - I saw that IndexWriter has methods such as setMaxFieldLength(), setMergeFactor(), setMaxBufferedDocs(), and setMaxMergeDocs(). Can you briefly explain how these can affect performance?
> >> - Are there any other recommendations "dummies" should be informed of and every expert has to know? For instance, a list of Lucene patterns / anti-patterns which may affect performance.
> >>
> >> If my questions are not precise enough, do not hesitate to ask for details. If you see an obvious problem, do not hesitate to tell me.
> >>
> >> A big thank you in advance for your help,
> >>
> >> Best regards,
> >>
> >> Michel

--
Thanks & Regards
Umesh Prasad
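
A minimal sketch of the two suggestions quoted above (a single catch-all field at index time, and a TopDocs-based search instead of the Hits API). This is an illustration only, assuming Lucene 2.9.x; the index path and the field names ("title", "author", "contents") are hypothetical:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

import java.io.File;

public class TopDocsSketch {
    public static void main(String[] args) throws Exception {
        Directory dir = FSDirectory.open(new File("/path/to/index"));
        StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_29);

        // Indexing: besides the individual fields, add one catch-all "contents"
        // field, so a query hits a single field instead of expanding into a
        // BooleanQuery with hundreds of TermQuery clauses.
        IndexWriter writer = new IndexWriter(dir, analyzer, IndexWriter.MaxFieldLength.UNLIMITED);
        Document doc = new Document();
        doc.add(new Field("title", "Chocolate recipes", Field.Store.YES, Field.Index.ANALYZED));
        doc.add(new Field("author", "Jane Doe", Field.Store.YES, Field.Index.ANALYZED));
        doc.add(new Field("contents", "Chocolate recipes Jane Doe", Field.Store.NO, Field.Index.ANALYZED));
        writer.addDocument(doc);
        writer.close();

        // Searching: the TopDocs API collects only the requested top N hits;
        // stored fields are loaded explicitly, one hit at a time.
        IndexSearcher searcher = new IndexSearcher(dir, true); // read-only reader
        Query query = new QueryParser(Version.LUCENE_29, "contents", analyzer).parse("chocolate");
        TopDocs topDocs = searcher.search(query, 10);          // top 10 results
        for (ScoreDoc sd : topDocs.scoreDocs) {
            Document hit = searcher.doc(sd.doc);               // load only what you display
            System.out.println(hit.get("title") + " (score=" + sd.score + ")");
        }
        searcher.close();
    }
}

The catch-all field keeps the query cheap, and search(query, n) returns only the top n hits rather than going through the deprecated Hits class, which preloads documents as described in the JIRA comment linked above.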