Subject: RE: Solr searching performance issues, using large documents
From: Markus Jelsma
To: solr-user@lucene.apache.org
Date: Tue, 17 Aug 2010 00:59:07 +0200

I've no idea if it's possible, but I'd at least try to return an ArrayList of rows instead of just a single row. And if that doesn't work, which is probably the case, how about filing an issue in Jira?

Reading the docs on the matter, I think it should be possible (or could be made possible) to return multiple rows in an ArrayList.

-----Original message-----
From: Peter Spam
Sent: Tue 17-08-2010 00:47
To: solr-user@lucene.apache.org
Subject: Re: Solr searching performance issues, using large documents

Still stuck on this - any hints on how to write the JavaScript to split a document?  Thanks!


-Pete

On Aug 5, 2010, at 8:10 PM, Lance Norskog wrote:

> You may have to write your own JavaScript to read in the giant field
> and split it up.
>
> On Thu, Aug 5, 2010 at 5:27 PM, Peter Spam wrote:
>> I've read through the DataImportHandler page a few times, and still can't figure out how to separate a large document into smaller documents.
>> Any hints? :-)  Thanks!
>>
>> -Peter
>>
>> On Aug 2, 2010, at 9:01 PM, Lance Norskog wrote:
>>
>>> Spanning won't work - you would have to make overlapping mini-documents
>>> if you want to support this.
>>>
>>> I don't know how big the chunks should be - you'll have to experiment.
>>>
>>> Lance
>>>
>>> On Mon, Aug 2, 2010 at 10:01 AM, Peter Spam wrote:
>>>> What would happen if the search query phrase spanned separate document chunks?
>>>>
>>>> Also, what would the optimal size of chunks be?
>>>>
>>>> Thanks!
>>>>
>>>> -Peter
>>>>
>>>> On Aug 1, 2010, at 7:21 PM, Lance Norskog wrote:
>>>>
>>>>> Not that I know of.
>>>>>
>>>>> The DataImportHandler has the ability to create multiple documents
>>>>> from one input stream. It is possible to create a DIH file that reads
>>>>> large log files and splits each one into N documents, with the file
>>>>> name as a common field. The DIH wiki page tells you in general how to
>>>>> make a DIH file.
>>>>>
>>>>> http://wiki.apache.org/solr/DataImportHandler
>>>>>
>>>>> From this, you should be able to make a DIH file that puts log files
>>>>> in as separate documents. As to splitting files up into
>>>>> mini-documents, you might have to write a bit of JavaScript to achieve
>>>>> this. There is no data structure or software that implements
>>>>> structured documents.
>>>>>
>>>>> On Sun, Aug 1, 2010 at 2:06 PM, Peter Spam wrote:
>>>>>> Thanks for the pointer, Lance!  Is there an example of this somewhere?
>>>>>>
>>>>>> -Peter
>>>>>>
>>>>>> On Jul 31, 2010, at 3:13 PM, Lance Norskog wrote:
>>>>>>
>>>>>>> Ah! You're not just highlighting, you're snippetizing.
>>>>>>> This makes it easier.
>>>>>>>
>>>>>>> Highlighting does not stream - it pulls the entire stored contents into
>>>>>>> one string and then pulls out the snippet.  If you want this to be
>>>>>>> fast, you have to split up the text into small pieces and only
>>>>>>> snippetize from the most relevant text. So, separate documents with a
>>>>>>> common group id for the document it came from. You might have to do 2
>>>>>>> queries to achieve what you want, but the second query for the same
>>>>>>> query will be blindingly fast. Often <1 ms.
>>>>>>>
>>>>>>> Good luck!
>>>>>>>
>>>>>>> Lance
>>>>>>>
>>>>>>> On Sat, Jul 31, 2010 at 1:12 PM, Peter Spam wrote:
>>>>>>>> However, I do need to search the entire document, or else the highlighting will sometimes be blank :-(
>>>>>>>> Thanks!
>>>>>>>>
>>>>>>>> - Peter
>>>>>>>>
>>>>>>>> ps. sorry for the many responses - I'm rushing around trying to get this working.
>>>>>>>>
>>>>>>>> On Jul 31, 2010, at 1:11 PM, Peter Spam wrote:
>>>>>>>>
>>>>>>>>> Correction - it went from 17 seconds to 10 seconds - I was changing hl.regex.maxAnalyzedChars the first time.
>>>>>>>>> Thanks!
>>>>>>>>>
>>>>>>>>> -Peter
>>>>>>>>>
>>>>>>>>> On Jul 31, 2010, at 1:06 PM, Peter Spam wrote:
>>>>>>>>>
>>>>>>>>>> On Jul 30, 2010, at 1:16 PM, Peter Karich wrote:
>>>>>>>>>>
>>>>>>>>>>> Did you already try other values for hl.maxAnalyzedChars=2147483647?
>>>>>>>>>>
>>>>>>>>>> Yes, I tried dropping it down to 21, but it didn't have much of an impact (one search I just tried went from 17 seconds to 15.8 seconds, and this is an 8-core Mac Pro with 6GB RAM - 4GB for Java).
>>>>>>>>>>
>>>>>>>>>>> Also, regular expression highlighting is more expensive, I think.
>>>>>>>>>>> What does the 'fuzzy' variable mean? If you use it to query via
>>>>>>>>>>> "~someTerm" instead of "someTerm", then you should try the trunk
>>>>>>>>>>> of Solr, which is a lot faster for fuzzy and other wildcard searches.
>>>>>>>>>>
>>>>>>>>>> "fuzzy" could be set to "*" but isn't right now.
>>>>>>>>>>
>>>>>>>>>> Thanks for the tips, Peter - this has been very frustrating!
>>>>>>>>>>
>>>>>>>>>> - Peter
>>>>>>>>>>
>>>>>>>>>>> Regards,
>>>>>>>>>>> Peter.
>>>>>>>>>>>
>>>>>>>>>>>> Data set: About 4,000 log files (will eventually grow to millions).  Average log file is 850k.  Largest log file (so far) is about 70MB.
>>>>>>>>>>>>
>>>>>>>>>>>> Problem: When I search for common terms, the query time goes from under 2-3 seconds to about 60 seconds.  TermVectors etc. are enabled.  When I disable highlighting, performance improves a lot, but is still slow for some queries (7 seconds).
>>>>>>>>>>>> Thanks in advance for any ideas!
>>>>>>>>>>>>
>>>>>>>>>>>> -Peter
>>>>>>>>>>>>
>>>>>>>>>>>> ------------------------------------------------------------
>>>>>>>>>>>>
>>>>>>>>>>>> 4GB RAM server
>>>>>>>>>>>> % java -Xms2048M -Xmx3072M -jar start.jar
>>>>>>>>>>>>
>>>>>>>>>>>> ------------------------------------------------------------
>>>>>>>>>>>>
>>>>>>>>>>>> schema.xml changes:
>>>>>>>>>>>>
>>>>>>>>>>>> [the XML element tags were stripped by the list archive; only the
>>>>>>>>>>>> text node "body" survives from the field/copyField section]
>>>>>>>>>>>>
>>>>>>>>>>>> ------------------------------------------------------------
>>>>>>>>>>>>
>>>>>>>>>>>> solrconfig.xml changes:
>>>>>>>>>>>>
>>>>>>>>>>>> [tags stripped by the archive; the surviving values are:]
>>>>>>>>>>>> 2147483647
>>>>>>>>>>>> 128
>>>>>>>>>>>>
>>>>>>>>>>>> ------------------------------------------------------------
>>>>>>>>>>>>
>>>>>>>>>>>> The query:
>>>>>>>>>>>>
>>>>>>>>>>>> rowStr = "&rows=10"
>>>>>>>>>>>> facet = "&facet=true&facet.limit=10&facet.field=device&facet.field=ckey&facet.field=version"
>>>>>>>>>>>> fields = "&fl=id,score,filename,version,device,first2md5,filesize,ckey"
>>>>>>>>>>>> termvectors = "&tv=true&qt=tvrh&tv.all=true"
>>>>>>>>>>>> hl = "&hl=true&hl.fl=body&hl.snippets=1&hl.fragsize=400"
>>>>>>>>>>>> regexv = "(?m)^.*\n.*\n.*$"
>>>>>>>>>>>> hl_regex = "&hl.regex.pattern=" + CGI::escape(regexv) + "&hl.regex.slop=1&hl.fragmenter=regex&hl.regex.maxAnalyzedChars=2147483647&hl.maxAnalyzedChars=2147483647"
>>>>>>>>>>>> justq = '&q=' + CGI::escape('body:' + fuzzy + p['q'].to_s.gsub(/\\/, '').gsub(/([:~!<>="])/,'\\\\\1') + fuzzy + minLogSizeStr)
>>>>>>>>>>>>
>>>>>>>>>>>> thequery = '/solr/select?timeAllowed=5000&wt=ruby' + (p['fq'].empty? ? '' : ('&fq='+p['fq'].to_s)) + justq + rowStr + facet + fields + termvectors + hl + hl_regex
>>>>>>>>>>>>
>>>>>>>>>>>> baseurl = '/cgi-bin/search.rb?q=' + CGI::escape(p['q'].to_s) + '&rows=' + p['rows'].to_s + '&minLogSize=' + p['minLogSize'].to_s
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> http://karussell.wordpress.com/
>
> --
> Lance Norskog
> goksron@gmail.com
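[Editor's note: the thread never lands on a concrete splitting recipe. Below is a minimal sketch in Ruby (the language of Peter's query script) of the approach Lance describes: cutting one large log into overlapping mini-documents that share a common group field ("filename"), so phrases spanning a chunk boundary can still match inside some chunk. The helper name `split_log` and the chunk/overlap sizes are invented for illustration; Lance explicitly says the right chunk size has to be found by experiment.]

```ruby
# Hypothetical pre-indexing splitter (not from the thread).
# Each returned hash is one mini-document to send to Solr's update
# handler; 'filename' is the common group id Lance suggests.
def split_log(filename, text, chunk_size = 50_000, overlap = 5_000)
  docs  = []
  start = 0
  part  = 0
  while start < text.length
    docs << {
      'id'       => "#{filename}-#{part}",  # unique id per chunk
      'filename' => filename,               # group id for "which file did this come from"
      'body'     => text[start, chunk_size]
    }
    break if start + chunk_size >= text.length
    start += chunk_size - overlap           # overlap so boundary-spanning phrases survive
    part  += 1
  end
  docs
end
```

A search would then highlight against the small 'body' of each chunk instead of a 70MB stored field, and a second query filtered by 'filename' can recover the chunks belonging to one original file.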