Subject: Re: 500 millions document for loop.
From: Valentin Popov
Date: Tue, 26 Apr 2016 17:00:21 +0300
To: java-user@lucene.apache.org
Cc: Uwe Schindler

Uwe, hello.

Is it possible to use the same fast iterator, but with the results sorted by date?

Regards,
Valentin.

> On 14 Nov 2015, at 15:54, Uwe Schindler wrote:
>
> For performance reasons, I would also return "false" from
> acceptsDocsOutOfOrder(). This allows stored fields to be accessed
> more efficiently (otherwise it seeks too much). For this type of
> collector, the I/O cost is higher than the small gain in computing
> performance from out-of-order documents.
>
> Kind regards,
> Uwe
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: uwe@thetaphi.de
>
>> -----Original Message-----
>> From: Valentin Popov [mailto:valentin.po@gmail.com]
>> Sent: Saturday, November 14, 2015 1:51 PM
>> To: java-user@lucene.apache.org
>> Subject: Re: 500 millions document for loop.
>>
>> Thank you very much!
>>
>>> On 14 Nov 2015, at 15:49, Uwe Schindler wrote:
>>>
>>> Hi,
>>>
>>> This code is buggy! The collect() call of the collector does not get
>>> a document ID relative to the top-level IndexSearcher; it only gets a
>>> document ID relative to the reader reported in setNextReader (which
>>> is an atomic reader responsible for a single Lucene index segment).
>>>
>>> In setNextReader, save the reference to the "current" reader, and use
>>> this "current" reader to get the stored fields:
>>>
>>>     indexSearcher.search(query, queryFilter, new Collector() {
>>>         AtomicReader current;
>>>
>>>         @Override
>>>         public void setScorer(Scorer arg0) throws IOException { }
>>>
>>>         @Override
>>>         public void setNextReader(AtomicReaderContext ctx) throws IOException {
>>>             current = ctx.reader();
>>>         }
>>>
>>>         @Override
>>>         public void collect(int docID) throws IOException {
>>>             Document doc = current.document(docID, loadFields);
>>>             found.found(doc);
>>>         }
>>>
>>>         @Override
>>>         public boolean acceptsDocsOutOfOrder() {
>>>             return true;
>>>         }
>>>     });
>>>
>>> Otherwise you get wrong document IDs reported!
>>>
>>> Uwe
>>>
>>>> -----Original Message-----
>>>> From: Valentin Popov [mailto:valentin.po@gmail.com]
>>>> Sent: Saturday, November 14, 2015 1:04 PM
>>>> To: java-user@lucene.apache.org
>>>> Subject: Re: 500 millions document for loop.
>>>>
>>>> Hi, Uwe.
>>>>
>>>> Thanks for your advice.
>>>>
>>>> After implementing your suggestion, our calculation time dropped
>>>> from ~20 days to 3.5 hours.
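
The point quoted above about segment-relative document IDs can be sketched in plain Java, without Lucene. This is only an illustration under stated assumptions: the Segment class, its docBase field and the sample values are hypothetical stand-ins for an atomic reader and its stored fields. It shows that an ID handed to collect() matches the top-level ID only in the first segment.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for one index segment; this is not a Lucene class. */
class Segment {
    final int docBase;          // top-level ID of this segment's first document
    final String[] storedDocs;  // stored values, indexed by segment-relative ID

    Segment(int docBase, String... storedDocs) {
        this.docBase = docBase;
        this.storedDocs = storedDocs;
    }
}

class SegmentRelativeIds {
    /** Resolves a top-level ID by finding the owning segment and subtracting its docBase. */
    static String loadTopLevel(List<Segment> segments, int topLevelId) {
        for (Segment s : segments) {
            int local = topLevelId - s.docBase;
            if (local >= 0 && local < s.storedDocs.length) {
                return s.storedDocs[local];
            }
        }
        throw new IllegalArgumentException("no such document: " + topLevelId);
    }

    /** Correct: each segment-relative ID is resolved against the segment that reported it. */
    static List<String> collectPerSegment(List<Segment> segments) {
        List<String> out = new ArrayList<>();
        for (Segment current : segments) {            // plays the role of setNextReader
            for (int docID = 0; docID < current.storedDocs.length; docID++) {
                out.add(current.storedDocs[docID]);   // plays the role of collect(docID)
            }
        }
        return out;
    }

    /** Buggy: each segment-relative ID is treated as if it were a top-level ID. */
    static List<String> collectAsTopLevel(List<Segment> segments) {
        List<String> out = new ArrayList<>();
        for (Segment current : segments) {
            for (int docID = 0; docID < current.storedDocs.length; docID++) {
                out.add(loadTopLevel(segments, docID));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Segment> segments = Arrays.asList(
                new Segment(0, "a", "b", "c"),
                new Segment(3, "d", "e"));
        System.out.println(collectPerSegment(segments));  // [a, b, c, d, e]
        System.out.println(collectAsTopLevel(segments));  // [a, b, c, a, b]
    }
}
```

In the buggy variant the second segment silently returns documents from the first one, which is the kind of wrong result the warning above is about.
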
>>>>
>>>> /**
>>>>  * DocumentFound - callback function for each document
>>>>  */
>>>> public void iterate(SearchOptions options, final DocumentFound found,
>>>>         final Set<String> loadFields) throws Exception {
>>>>     Query query = options.getQuery();
>>>>     Filter queryFilter = options.getQueryFilter();
>>>>     final IndexSearcher indexSearcher = new VolumeSearcher(options)
>>>>             .newIndexSearcher(Executors.newSingleThreadExecutor());
>>>>
>>>>     indexSearcher.search(query, queryFilter, new Collector() {
>>>>
>>>>         @Override
>>>>         public void setScorer(Scorer arg0) throws IOException { }
>>>>
>>>>         @Override
>>>>         public void setNextReader(AtomicReaderContext arg0) throws IOException { }
>>>>
>>>>         @Override
>>>>         public void collect(int docID) throws IOException {
>>>>             Document doc = indexSearcher.doc(docID, loadFields);
>>>>             found.found(doc);
>>>>         }
>>>>
>>>>         @Override
>>>>         public boolean acceptsDocsOutOfOrder() {
>>>>             return true;
>>>>         }
>>>>     });
>>>> }
>>>>
>>>>> On 12 Nov 2015, at 21:15, Uwe Schindler wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>>>> The big question is: Do you need the results paged at all?
>>>>>>
>>>>>> Yup, because if we return all results, we get OME.
>>>>>
>>>>> You get the OME because the paging collector cannot handle that, so
>>>>> this is an XY problem. Would it not be better if your application
>>>>> just got the results as a stream and processed them one after
>>>>> another? If this is the case (and most statistics need it like
>>>>> that), you are much better off NOT USING TOPDOCS! Your requirement
>>>>> is diametrically opposed to getting top-scoring documents! You want
>>>>> to get ALL results as a sequence.
>>>>>
>>>>>>> Do you need them sorted?
>>>>>>
>>>>>> Nope.
>>>>>
>>>>> OK, so unsorted streaming is the right approach.
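
A minimal sketch of the streaming idea quoted above, in plain Java with no Lucene involved. Each hit is handed to a callback and folded into a running count, so memory grows with the number of distinct «to» addresses rather than with the number of hits. The ToStats class, the stream() helper and the sample addresses are hypothetical; in the real code the callback would be driven from collect().

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.function.Consumer;

/** Hypothetical per-address counter; consumes one «to» value per hit. */
class ToStats implements Consumer<String> {
    private final Map<String, Long> counts = new HashMap<>();

    @Override
    public void accept(String to) {
        // Increment the counter for this address, starting at 1 if it is new.
        counts.merge(to, 1L, Long::sum);
    }

    Map<String, Long> counts() {
        return counts;
    }

    /** Feeds hits to the callback one at a time; no list of results is ever built. */
    static void stream(Iterator<String> hits, Consumer<String> callback) {
        while (hits.hasNext()) {
            callback.accept(hits.next());
        }
    }

    public static void main(String[] args) {
        ToStats stats = new ToStats();
        stream(Arrays.asList("a@example.org", "b@example.org", "a@example.org").iterator(), stats);
        System.out.println(stats.counts());
    }
}
```
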
>>>>>
>>>>>>> If not, the easiest approach is to use a custom Collector that
>>>>>>> does no sorting and just consumes the results.
>>>>>>
>>>>>> The main bottleneck, as I see it, comes from the next-page search,
>>>>>> which took ~2-4 seconds.
>>>>>
>>>>> This is because, when paging, the collector has to re-execute the
>>>>> whole query and sort all results again, just with a larger window.
>>>>> So if you have result pages of 50000 results and you want to get
>>>>> the second page, it will internally sort 100000 results, because
>>>>> the first page needs to be calculated, too. As you move forward
>>>>> through the results, the window gets larger and larger, until it
>>>>> finally collects all results.
>>>>>
>>>>> So getting the results as a stream by implementing the Collector
>>>>> API is the right way to do this.
>>>>>
>>>>>>> Uwe
>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: Valentin Popov [mailto:valentin.po@gmail.com]
>>>>>>>> Sent: Thursday, November 12, 2015 6:48 PM
>>>>>>>> To: java-user@lucene.apache.org
>>>>>>>> Subject: Re: 500 millions document for loop.
>>>>>>>>
>>>>>>>> Toke, thanks!
>>>>>>>>
>>>>>>>> We will look at this solution; it looks like this is what we need.
>>>>>>>>
>>>>>>>>> On 12 Nov 2015, at 20:42, Toke Eskildsen wrote:
>>>>>>>>>
>>>>>>>>> Valentin Popov wrote:
>>>>>>>>>
>>>>>>>>>> We have ~10 indexes for 500M documents. Each document has an
>>>>>>>>>> «archive date» and a «to» address. One of our tasks is to
>>>>>>>>>> calculate statistics of «to» for the last year. Right now we
>>>>>>>>>> are using the search archive_date:(current_date - 1 year) and
>>>>>>>>>> paginating the results at 50k records per page.
>>>>>>>>>> The bottleneck of that approach is that pagination takes too
>>>>>>>>>> long: on a powerful server it takes ~20 days to execute, which
>>>>>>>>>> is far too long.
>>>>>>>>>
>>>>>>>>> Lucene does not like deep page requests due to the way the
>>>>>>>>> internal Priority Queue works. Solr has CursorMark, which should
>>>>>>>>> be fairly simple to emulate in your Lucene handling code:
>>>>>>>>>
>>>>>>>>> http://lucidworks.com/blog/2013/12/12/coming-soon-to-solr-efficient-cursor-based-iteration-of-large-result-sets/
>>>>>>>>>
>>>>>>>>> - Toke Eskildsen
>>>>>>>>>
>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>>>>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Valentin Popov
>>>>>>
>>>>>> Regards,
>>>>>> Valentin Popov
>>>>
>>>> Regards,
>>>> Valentin Popov
>>
>> Regards,
>> Valentin Popov

Regards,
Valentin Popov
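
The paging cost described in the quoted explanation above can be put into rough numbers. The sketch below is a back-of-the-envelope model in plain Java, not a measurement: it assumes that fetching page k re-collects a window of k * pageSize results, as described in the thread, and it uses 500M hits only as a hypothetical upper bound, because the real number of matches for one year is not stated. The DeepPagingCost class and its method names are made up for this illustration.

```java
/** Hypothetical cost model for deep paging versus a single streaming pass. */
class DeepPagingCost {

    /**
     * Total entries collected when every page re-runs the query with a window
     * of (page number * pageSize), capped at the total number of hits.
     * pageSize must be positive.
     */
    static long pagedWork(long totalHits, long pageSize) {
        long work = 0;
        for (long window = pageSize; ; window += pageSize) {
            work += Math.min(window, totalHits);
            if (window >= totalHits) {
                break;  // this window already covers the last page
            }
        }
        return work;
    }

    /** A streaming collector visits each hit exactly once. */
    static long streamedWork(long totalHits) {
        return totalHits;
    }

    public static void main(String[] args) {
        long hits = 500_000_000L;  // assumption: upper bound, not a figure from the thread
        long pageSize = 50_000L;   // page size mentioned in the thread
        System.out.println("paged:    " + pagedWork(hits, pageSize));
        System.out.println("streamed: " + streamedWork(hits));
    }
}
```

Under these assumptions the paged total grows roughly with the square of the number of pages, while the streamed total grows linearly with the number of hits.
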