From: "Uwe Schindler" <uwe@thetaphi.de>
To: java-user@lucene.apache.org
Subject: RE: 500 millions document for loop.
Date: Sat, 14 Nov 2015 13:54:38 +0100

For performance reasons, I would also return "false" for "out of order" documents (acceptsDocsOutOfOrder()). This allows stored fields to be accessed more efficiently (otherwise it seeks too much). For this type of collector the I/O cost is higher than the small computing performance increase gained from out-of-order documents.

Kind regards,
Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de

> -----Original Message-----
> From: Valentin Popov [mailto:valentin.po@gmail.com]
> Sent: Saturday, November 14, 2015 1:51 PM
> To: java-user@lucene.apache.org
> Subject: Re: 500 millions document for loop.
>
> Thank you very much!
>
>
> > On 14 Nov 2015, at 15:49, Uwe Schindler wrote:
> >
> > Hi,
> >
> > This code is buggy! The collect() call of the collector does not get a
> > document ID relative to the top-level IndexSearcher; it only gets a document
> > ID relative to the reader reported in setNextReader (which is an atomic reader
> > responsible for a single Lucene index segment).
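An editorial aside, not part of the original mail: the mapping between a segment-relative doc ID and a top-level one is a plain offset addition. Every segment has a docBase (the sum of maxDoc over all earlier segments; Lucene 4.x exposes it as AtomicReaderContext.docBase), and the top-level ID is docBase plus the segment-relative ID. A self-contained sketch of that arithmetic, with made-up segment sizes and no Lucene dependency:

```java
public class DocBaseSketch {
    // maxDoc of three hypothetical index segments
    static final int[] SEGMENT_SIZES = {100, 250, 50};

    // docBase of segment i = sum of the sizes of segments 0..i-1
    static int docBase(int segment) {
        int base = 0;
        for (int i = 0; i < segment; i++) {
            base += SEGMENT_SIZES[i];
        }
        return base;
    }

    // top-level doc ID = docBase + segment-relative doc ID
    static int globalDocId(int segment, int segmentDocId) {
        return docBase(segment) + segmentDocId;
    }

    public static void main(String[] args) {
        // doc 0 of segment 1 comes right after the 100 docs of segment 0
        System.out.println(globalDocId(1, 0)); // 100
        // doc 7 of segment 2 sits after the 350 docs of segments 0 and 1
        System.out.println(globalDocId(2, 7)); // 357
    }
}
```

This is why the buggy code below "works" on a one-segment index (docBase is 0) and silently returns the wrong documents as soon as there is more than one segment.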
> > In setNextReader, save the reference to the "current" reader, and use this
> > "current" reader to get the stored fields:
> >
> > indexSearcher.search(query, queryFilter, new Collector() {
> >     AtomicReader current;
> >
> >     @Override
> >     public void setScorer(Scorer scorer) throws IOException { }
> >
> >     @Override
> >     public void setNextReader(AtomicReaderContext ctx) throws IOException {
> >         current = ctx.reader();
> >     }
> >
> >     @Override
> >     public void collect(int docID) throws IOException {
> >         Document doc = current.document(docID, loadFields);
> >         found.found(doc);
> >     }
> >
> >     @Override
> >     public boolean acceptsDocsOutOfOrder() {
> >         return true;
> >     }
> > });
> >
> > Otherwise you get wrong document IDs reported!!!
> >
> > Uwe
> >
> > -----
> > Uwe Schindler
> > H.-H.-Meier-Allee 63, D-28213 Bremen
> > http://www.thetaphi.de
> > eMail: uwe@thetaphi.de
> >
> >> -----Original Message-----
> >> From: Valentin Popov [mailto:valentin.po@gmail.com]
> >> Sent: Saturday, November 14, 2015 1:04 PM
> >> To: java-user@lucene.apache.org
> >> Subject: Re: 500 millions document for loop.
> >>
> >> Hi, Uwe.
> >>
> >> Thanks for your advice.
> >>
> >> After implementing your suggestion, our calculation time dropped from ~20
> >> days to 3.5 hours.
> >>
> >> /**
> >>  * DocumentFound - callback function for each document
> >>  */
> >> public void iterate(SearchOptions options, final DocumentFound found,
> >>         final Set<String> loadFields) throws Exception {
> >>     Query query = options.getQuery();
> >>     Filter queryFilter = options.getQueryFilter();
> >>     final IndexSearcher indexSearcher = new VolumeSearcher(options)
> >>             .newIndexSearcher(Executors.newSingleThreadExecutor());
> >>
> >>     indexSearcher.search(query, queryFilter, new Collector() {
> >>
> >>         @Override
> >>         public void setScorer(Scorer scorer) throws IOException { }
> >>
> >>         @Override
> >>         public void setNextReader(AtomicReaderContext ctx) throws IOException { }
> >>
> >>         @Override
> >>         public void collect(int docID) throws IOException {
> >>             // BUG (see Uwe's reply above): docID is segment-relative here,
> >>             // but IndexSearcher.doc() expects a top-level doc ID
> >>             Document doc = indexSearcher.doc(docID, loadFields);
> >>             found.found(doc);
> >>         }
> >>
> >>         @Override
> >>         public boolean acceptsDocsOutOfOrder() {
> >>             return true;
> >>         }
> >>     });
> >> }
> >>
> >>
> >>> On 12 Nov 2015, at 21:15, Uwe Schindler wrote:
> >>>
> >>> Hi,
> >>>
> >>>>> The big question is: Do you need the results paged at all?
> >>>>
> >>>> Yup, because if we return all results, we get OME.
> >>>
> >>> You get the OME because the paging collector cannot handle that, so this is
> >>> an XY problem. Would it not be better if your application just got the results
> >>> as a stream and processed them one after another? If this is the case (and
> >>> most statistics need it like that), you are much better off NOT USING
> >>> TOPDOCS!!!! Your requirement is diametrically opposed to getting top-scoring
> >>> documents! You want to get ALL results as a sequence.
> >>>
> >>>>> Do you need them sorted?
> >>>>
> >>>> Nope.
> >>>
> >>> OK, so unsorted streaming is the right approach.
> >>>
> >>>>> If not, the easiest approach is to use a custom Collector that does no
> >>>>> sorting and just consumes the results.
> >>>>
> >>>> The main bottleneck, as I see it, comes from the next-page search, which took ~2-4
> >>>> seconds.
> >>>
> >>> This is because, when paging, the collector has to re-execute the whole
> >>> query and sort all results again, just with a larger window. So if you have
> >>> result pages of 50,000 results and you want to get the second page, it will
> >>> internally sort 100,000 results, because the first page needs to be calculated,
> >>> too. If you go forward in the results, the window gets larger and larger, until it
> >>> finally collects all results.
> >>>
> >>> So just getting the results as a stream by implementing the Collector API is
> >>> the right way to do this.
> >>>
> >>>>> Uwe
> >>>>>
> >>>>> -----
> >>>>> Uwe Schindler
> >>>>> H.-H.-Meier-Allee 63, D-28213 Bremen
> >>>>> http://www.thetaphi.de
> >>>>> eMail: uwe@thetaphi.de
> >>>>>
> >>>>>> -----Original Message-----
> >>>>>> From: Valentin Popov [mailto:valentin.po@gmail.com]
> >>>>>> Sent: Thursday, November 12, 2015 6:48 PM
> >>>>>> To: java-user@lucene.apache.org
> >>>>>> Subject: Re: 500 millions document for loop.
> >>>>>>
> >>>>>> Toke, thanks!
> >>>>>>
> >>>>>> We will look at this solution; it looks like this is what we need.
> >>>>>>
> >>>>>>
> >>>>>>> On 12 Nov 2015, at 20:42, Toke Eskildsen wrote:
> >>>>>>>
> >>>>>>> Valentin Popov wrote:
> >>>>>>>
> >>>>>>>> We have ~10 indexes for 500M documents; each document
> >>>>>>>> has an «archive date» and a «to» address. One of our tasks is to
> >>>>>>>> calculate statistics on «to» for the last year. Right now we
> >>>>>>>> search archive_date:(current_date - 1 year) and paginate the
> >>>>>>>> results at 50k records per page. The bottleneck of that approach is
> >>>>>>>> pagination: it takes too long, and on a powerful server it takes
> >>>>>>>> ~20 days to execute, which is very long.
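An editorial aside on the arithmetic behind that slowdown (the figures below reuse the thread's round numbers and are an upper bound, since the one-year filter matches fewer than all 500M documents): serving page k means collecting a top-(k × pageSize) window, so walking every page in sequence collects pageSize × P(P+1)/2 documents in total. A self-contained sketch:

```java
public class DeepPagingCost {
    // hypothetical figures taken from the thread: 500M docs, 50k per page
    static final long RESULTS = 500_000_000L;
    static final long PAGE_SIZE = 50_000L;

    // documents the priority queue must collect to serve page k (1-based):
    // the window is k * PAGE_SIZE, because pages 1..k-1 are re-computed too
    static long collectedForPage(long k) {
        return k * PAGE_SIZE;
    }

    // total documents collected when walking every page in sequence
    static long totalCollected() {
        long pages = RESULTS / PAGE_SIZE;           // 10,000 pages
        return PAGE_SIZE * pages * (pages + 1) / 2; // sum of all windows
    }

    public static void main(String[] args) {
        System.out.println(collectedForPage(2)); // 100000
        System.out.println(totalCollected());    // 2500250000000
    }
}
```

With these figures, paging through everything performs about 2.5 × 10^12 collect-and-rank operations, roughly 5,000× the work of a single streaming pass that touches each of the 5 × 10^8 documents once.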
> >>>>>>>
> >>>>>>> Lucene does not like deep page requests due to the way the internal
> >>>>>>> Priority Queue works. Solr has CursorMark, which should be fairly simple to
> >>>>>>> emulate in your Lucene handling code:
> >>>>>>>
> >>>>>>> http://lucidworks.com/blog/2013/12/12/coming-soon-to-solr-efficient-cursor-based-iteration-of-large-result-sets/
> >>>>>>>
> >>>>>>> - Toke Eskildsen
> >>>>>>
> >>>>>> Regards,
> >>>>>> Valentin Popov
> >>>>
> >>>> Regards,
> >>>> Valentin Popov
> >>
> >> Regards,
> >> Valentin Popov
>
> Regards,
> Valentin Popov

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
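An editorial appendix, not part of the original thread: the cursor-based iteration Toke points to avoids the growing window by remembering the last hit's sort key and asking only for hits strictly after it, so every page costs a single pageSize-sized collection. The sketch below is a Lucene-free miniature of that idea over a sorted array of made-up doc keys; in Lucene 4.x the analogous API is IndexSearcher.searchAfter.

```java
import java.util.ArrayList;
import java.util.List;

public class CursorSketch {
    // hypothetical corpus of doc keys, already in sort-key order
    static final int[] DOCS = {3, 8, 15, 21, 42, 57, 63, 99};

    // return up to pageSize docs with sort key strictly after 'cursor';
    // pass Integer.MIN_VALUE for the first page
    static List<Integer> pageAfter(int cursor, int pageSize) {
        List<Integer> page = new ArrayList<>();
        for (int doc : DOCS) {
            if (doc > cursor) {
                page.add(doc);
                if (page.size() == pageSize) break;
            }
        }
        return page;
    }

    public static void main(String[] args) {
        int cursor = Integer.MIN_VALUE;
        List<Integer> page;
        while (!(page = pageAfter(cursor, 3)).isEmpty()) {
            System.out.println(page);
            // the cursor is the last key of the page, not an offset:
            // earlier hits are skipped, never collected and ranked again
            cursor = page.get(page.size() - 1);
        }
    }
}
```

For a pure "visit every result once, unsorted" workload like the statistics job above, the streaming Collector from the thread is still the cheaper choice; a cursor only pays off when resumable, ordered pages are actually needed.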