From: "Uwe Schindler"
Subject: RE: 500 millions document for loop.
Date: Sat, 14 Nov 2015 13:49:04 +0100

Hi,

This code is buggy! The collect() call of the collector does not get a
document ID relative to the top-level IndexSearcher; it only gets a
document ID relative to the reader reported in setNextReader (which is an
atomic reader responsible for a single Lucene index segment).

In setNextReader, save the reference to the "current" reader, and use
this "current" reader to get the stored fields:

    indexSearcher.search(query, queryFilter, new Collector() {
        // Reader of the segment currently being collected
        AtomicReader current;

        @Override
        public void setScorer(Scorer arg0) throws IOException { }

        @Override
        public void setNextReader(AtomicReaderContext ctx) throws IOException {
            current = ctx.reader();
        }

        @Override
        public void collect(int docID) throws IOException {
            // docID is relative to "current", not to indexSearcher
            Document doc = current.document(docID, loadFields);
            found.found(doc);
        }

        @Override
        public boolean acceptsDocsOutOfOrder() {
            return true;
        }
    });

Otherwise you get wrong document IDs reported!
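The difference between segment-relative and top-level document IDs can be sketched without Lucene at all. The following is only an illustration in plain Java: the Segment class, its fields, and the sample documents are hypothetical stand-ins, not Lucene classes. It shows why a local ID passed to a top-level lookup returns the wrong document, and that adding the segment's base offset (Lucene's AtomicReaderContext exposes a docBase for this purpose) gives the same result as asking the current segment directly.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java illustration only; Segment and its fields are hypothetical
// stand-ins, not Lucene classes.
class SegmentDocIdDemo {

    /** One index segment: documents are addressed by a local ID starting at 0. */
    static final class Segment {
        final int docBase;     // global ID of this segment's first document
        final String[] docs;   // stored "documents", indexed by local ID

        Segment(int docBase, String[] docs) {
            this.docBase = docBase;
            this.docs = docs;
        }
    }

    /** Lays segments out back to back so each one gets a running docBase. */
    static List<Segment> buildSegments(String[][] docsPerSegment) {
        List<Segment> segments = new ArrayList<>();
        int docBase = 0;
        for (String[] docs : docsPerSegment) {
            segments.add(new Segment(docBase, docs));
            docBase += docs.length;
        }
        return segments;
    }

    /** Top-level lookup: expects a global ID and finds the owning segment. */
    static String lookupGlobal(List<Segment> segments, int globalId) {
        for (Segment s : segments) {
            int localId = globalId - s.docBase;
            if (localId >= 0 && localId < s.docs.length) {
                return s.docs[localId];
            }
        }
        throw new IllegalArgumentException("no such document: " + globalId);
    }

    public static void main(String[] args) {
        List<Segment> segments = buildSegments(new String[][] {
            {"a", "b", "c"},   // segment 0: global IDs 0..2
            {"d", "e"}         // segment 1: global IDs 3..4
        });
        Segment current = segments.get(1);
        int localId = 0; // what a per-segment callback would be handed for "d"

        // Wrong: the local ID is interpreted as a global ID.
        System.out.println("local ID used as global: " + lookupGlobal(segments, localId));
        // Right: ask the segment that reported the ID.
        System.out.println("via the current segment: " + current.docs[localId]);
        // Also right: convert to a global ID first.
        System.out.println("via docBase + local ID:  "
                + lookupGlobal(segments, current.docBase + localId));
    }
}
```

Reading through the current segment avoids the search for the owning segment that a top-level lookup has to perform on every call.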
Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de

> -----Original Message-----
> From: Valentin Popov [mailto:valentin.po@gmail.com]
> Sent: Saturday, November 14, 2015 1:04 PM
> To: java-user@lucene.apache.org
> Subject: Re: 500 millions document for loop.
>
> Hi, Uwe.
>
> Thanks for your advice.
>
> After implementing your suggestion, our calculation time dropped from
> ~20 days to 3.5 hours.
>
> /**
>  * DocumentFound - callback function for each document
>  */
> public void iterate(SearchOptions options, final DocumentFound found,
>         final Set loadFields) throws Exception {
>     Query query = options.getQuery();
>     Filter queryFilter = options.getQueryFilter();
>     final IndexSearcher indexSearcher = new VolumeSearcher(options)
>             .newIndexSearcher(Executors.newSingleThreadExecutor());
>
>     indexSearcher.search(query, queryFilter, new Collector() {
>
>         @Override
>         public void setScorer(Scorer arg0) throws IOException { }
>
>         @Override
>         public void setNextReader(AtomicReaderContext arg0) throws IOException { }
>
>         @Override
>         public void collect(int docID) throws IOException {
>             Document doc = indexSearcher.doc(docID, loadFields);
>             found.found(doc);
>         }
>
>         @Override
>         public boolean acceptsDocsOutOfOrder() {
>             return true;
>         }
>     });
> }
>
>
> > On 12 Nov 2015, at 21:15, Uwe Schindler wrote:
> >
> > Hi,
> >
> >>> The big question is: Do you need the results paged at all?
> >>
> >> Yup, because if we return all results, we get OME.
> >
> > You get the OME because the paging collector cannot handle that, so
> > this is an XY problem. Would it not be better if your application
> > just gets the results as a stream and processes them one after
> > another? If this is the case (and most statistics need it like
> > that), you're much better off NOT USING TOPDOCS! Your requirement is
> > diametrically opposed to getting top-scoring documents!
> > You want to get ALL results as a sequence.
> >
> >>> Do you need them sorted?
> >>
> >> Nope.
> >
> > OK, so unsorted streaming is the right approach.
> >
> >>> If not, the easiest approach is to use a custom Collector that
> >>> does no sorting and just consumes the results.
> >>
> >> The main bottleneck, as I see it, comes from the next-page search,
> >> which took ~2-4 seconds.
> >
> > This is because when paging, the collector has to re-execute the
> > whole query and sort all results again, just with a larger window.
> > So if you have result pages of 50000 results and you want to get the
> > second page, it will internally sort 100000 results, because the
> > first page needs to be calculated, too. If you go forward in the
> > results, the window gets larger and larger, until it finally
> > collects all results.
> >
> > So just getting the results as a stream by implementing the
> > Collector API is the right way to do this.
> >
> >>>
> >>> Uwe
> >>>
> >>>> -----Original Message-----
> >>>> From: Valentin Popov [mailto:valentin.po@gmail.com]
> >>>> Sent: Thursday, November 12, 2015 6:48 PM
> >>>> To: java-user@lucene.apache.org
> >>>> Subject: Re: 500 millions document for loop.
> >>>>
> >>>> Toke, thanks!
> >>>>
> >>>> We will look at this solution; it looks like this is what we need.
> >>>>
> >>>>
> >>>>> On 12 Nov 2015, at 20:42, Toke Eskildsen wrote:
> >>>>>
> >>>>> Valentin Popov wrote:
> >>>>>
> >>>>>> We have ~10 indexes for 500M documents. Each document
> >>>>>> has an «archive date» and a «to» address. One of our tasks is
> >>>>>> to calculate statistics of «to» for the last year. Right now we
> >>>>>> are using the search archive_date:(current_date - 1 year) and
> >>>>>> paginating the results at 50k records per page.
> >>>>>> The bottleneck of that approach is that pagination takes too
> >>>>>> long: on a powerful server it takes ~20 days to execute.
> >>>>>
> >>>>> Lucene does not like deep page requests due to the way the
> >>>>> internal Priority Queue works. Solr has CursorMark, which should
> >>>>> be fairly simple to emulate in your Lucene handling code:
> >>>>>
> >>>>> http://lucidworks.com/blog/2013/12/12/coming-soon-to-solr-efficient-cursor-based-iteration-of-large-result-sets/
> >>>>>
> >>>>> - Toke Eskildsen
> >>>>
> >>>> Regards,
> >>>> Valentin Popov
> >>
> >> Best regards,
> >> Valentin Popov
>
> Best regards,
> Valentin Popov

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
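
The growing-window effect described in the quoted discussion (page two of 50000 results internally sorts 100000) can be put into rough numbers. The sketch below is a simplified cost model in plain Java, not a measurement and not Lucene code: it only sums the window sizes over all pages. The class and method names are made up for illustration, and the assumption that all 500M documents match the query is hypothetical, since the thread does not state the actual hit count.

```java
// Simplified cost model only: sums the sorted-window sizes over all pages.
// Names and the 500M-hit assumption are hypothetical, not from Lucene.
class DeepPagingCost {

    /**
     * Total entries held in the sorted window when every page re-runs the
     * query and page k needs a window covering pages 1..k.
     */
    static long entriesSortedWithPaging(long totalHits, long pageSize) {
        long sorted = 0;
        for (long start = 0; start < totalHits; start += pageSize) {
            // Window for this page: everything up to the end of the page.
            sorted += Math.min(start + pageSize, totalHits);
        }
        return sorted;
    }

    /** A streaming collector touches each hit once and sorts nothing. */
    static long entriesVisitedWithStreaming(long totalHits) {
        return totalHits;
    }

    public static void main(String[] args) {
        long totalHits = 500_000_000L; // assumption: every document matches
        long pageSize = 50_000L;       // page size mentioned in the thread

        long paged = entriesSortedWithPaging(totalHits, pageSize);
        long streamed = entriesVisitedWithStreaming(totalHits);

        System.out.println("paging:    " + paged);
        System.out.println("streaming: " + streamed);
        System.out.println("ratio:     " + (paged / streamed));
    }
}
```

Under these assumptions the paged approach handles about 2.5 trillion window entries against 500 million streamed hits, a factor of roughly 5,000. The model ignores that each paged query also has to visit every hit again, so it understates the real gap.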