Subject: Re: 500 millions document for loop.
From: Valentin Popov
Date: Thu, 12 Nov 2015 21:23:41 +0300
To: java-user@lucene.apache.org

Hi,

> On 12 Nov 2015, at 21:15, Uwe Schindler wrote:
>
> Hi,
>
>>> The big question is: Do you need the results paged at all?
>>
>> Yup, because if we return all results, we get OME.
>
> You get the OME because the paging collector cannot handle that, so this is an XY problem. Would it not be better if your application just gets the results as a stream and processes them one after another? If this is the case (and most statistics need it like that), you're much better off NOT USING TOPDOCS!!!! Your requirement is diametrically opposed to getting top-scoring documents! You want to get ALL results as a sequence.

Uwe, thanks. If it is possible, could you provide some code example?

>
>>> Do you need them sorted?
>>
>> Nope.
>
> OK, so unsorted streaming is the right approach.
>
>>> If not, the easiest approach is to use a custom Collector that does no sorting and just consumes the results.
>>
>> The main bottleneck, as I see it, comes from the next-page search, which took ~2-4 seconds.
>
> This is because when paging the collector has to re-execute the whole query and sort all results again, just with a larger window. So if you have result pages of 50000 results and you want to get the second page, it will internally sort 100000 results, because the first page needs to be calculated, too. If you go forward in results the window gets larger and larger, until it finally collects all results.

Does this mean we are not using cursor-based iteration?

>
> So just getting the results as a stream by implementing the Collector API is the right way to do this.

Thanks!

>
>>>
>>> Uwe
>>>
>>> -----
>>> Uwe Schindler
>>> H.-H.-Meier-Allee 63, D-28213 Bremen
>>> http://www.thetaphi.de
>>> eMail: uwe@thetaphi.de
>>>
>>>> -----Original Message-----
>>>> From: Valentin Popov [mailto:valentin.po@gmail.com]
>>>> Sent: Thursday, November 12, 2015 6:48 PM
>>>> To: java-user@lucene.apache.org
>>>> Subject: Re: 500 millions document for loop.
>>>>
>>>> Toke, thanks!
>>>>
>>>> We will look at this solution; it looks like this is what we need.
>>>>
>>>>> On 12 Nov 2015, at 20:42, Toke Eskildsen wrote:
>>>>>
>>>>> Valentin Popov wrote:
>>>>>
>>>>>> We have ~10 indexes for 500M documents. Each document
>>>>>> has an «archive date» and a «to» address. One of our tasks is
>>>>>> to calculate statistics of «to» for the last year. Right now we
>>>>>> search archive_date:(current_date - 1 year) and paginate the
>>>>>> results at 50k records per page. The bottleneck of that approach
>>>>>> is that pagination takes too long: even on a powerful server it
>>>>>> takes ~20 days to execute.
>>>>>
>>>>> Lucene does not like deep page requests due to the way the internal
>>>>> Priority Queue works. Solr has CursorMark, which should be fairly
>>>>> simple to emulate in your Lucene handling code:
>>>>>
>>>>> http://lucidworks.com/blog/2013/12/12/coming-soon-to-solr-efficient-cursor-based-iteration-of-large-result-sets/
>>>>>
>>>>> - Toke Eskildsen
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>>>
>>>>
>>>> Regards,
>>>> Valentin Popov
>>>
>>
>> Best regards,
>> Valentin Popov

Regards,
Valentin Popov

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
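
The contrast drawn in the thread between paging and streaming can be sketched in plain Java. This is a minimal model under stated assumptions, not Lucene code: `PagingCostSketch`, `pagedWork`, `streamedWork` and `countRecipients` are hypothetical names, and the figure of 1,000,000 matching documents is an assumed example value (only the 50k page size comes from the thread). It counts how many entries get collected when every page re-runs the query with a growing window, and it shows a one-pass tally of «to» addresses of the kind that the per-hit callback of a custom Collector could maintain.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java model of the two access patterns from the thread; not Lucene API. */
final class PagingCostSketch {

    /**
     * Entries collected when each page re-runs the query with a window
     * that also covers every earlier page (page k collects k * pageSize).
     */
    static long pagedWork(long totalHits, long pageSize) {
        if (pageSize <= 0) {
            throw new IllegalArgumentException("pageSize must be positive");
        }
        long work = 0;
        for (long window = pageSize; ; window += pageSize) {
            work += Math.min(window, totalHits);
            if (window >= totalHits) {
                return work; // the last window reached the end of the result set
            }
        }
    }

    /** Entries visited when each hit is consumed exactly once. */
    static long streamedWork(long totalHits) {
        return totalHits;
    }

    /**
     * Streaming aggregation: one counter update per hit, no sorting and no
     * retained result list. Memory grows with distinct addresses, not hits.
     */
    static Map<String, Long> countRecipients(Iterable<String> toAddresses) {
        Map<String, Long> counts = new HashMap<>();
        for (String to : toAddresses) {
            counts.merge(to, 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        long hits = 1_000_000L;  // assumed number of matching documents
        long pageSize = 50_000L; // page size mentioned in the thread
        System.out.println("paged:    " + pagedWork(hits, pageSize));
        System.out.println("streamed: " + streamedWork(hits));

        // Hypothetical sample of «to» values, one per hit.
        List<String> sample = Arrays.asList("a@example.org", "b@example.org", "a@example.org");
        System.out.println(countRecipients(sample));
    }
}
```

With these assumed numbers the paged loop collects 10,500,000 entries against 1,000,000 for a single pass, and the gap grows quadratically with the number of pages. The tally keeps memory proportional to the number of distinct addresses rather than to the number of hits, which is why returning all results at once is not needed for this kind of statistic.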