Mailing-List: contact java-user-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: java-user@lucene.apache.org
Content-Type: text/plain; charset=utf-8
Mime-Version: 1.0 (Mac OS X Mail 9.3 \(3124\))
Subject: Re: 500 millions document for loop. 
From: Valentin Popov <valentin.po@gmail.com>
In-Reply-To: <004f01d11edb$9f47f7e0$ddd7e7a0$@thetaphi.de>
Date: Tue, 26 Apr 2016 17:00:21 +0300
Cc: Uwe Schindler <uwe@thetaphi.de>
Content-Transfer-Encoding: quoted-printable
Message-Id: <CD462C89-C9E4-433C-8AA1-D40B427E4F6D@gmail.com>
References: <9F9F27BA-8912-423D-8ECC-B878713C606B@gmail.com>
 <1447350178757.58991@statsbiblioteket.dk>
 <0CE3CC53-9FB4-498C-8B5E-3CD4632CDF62@gmail.com>
 <004a01d11d72$a89fb090$f9df11b0$@thetaphi.de>
 <BDB08B27-1A42-4DB8-B6F0-C2D487831C0D@gmail.com>
 <004d01d11d76$107dee90$3179cbb0$@thetaphi.de>
 <4BAB1A2E-E22B-4722-9601-353C6D278259@gmail.com>
 <004e01d11eda$d8524870$88f6d950$@thetaphi.de>
 <C7E24B76-FA78-4991-A9FC-F6BB78DF41E7@gmail.com>
 <004f01d11edb$9f47f7e0$ddd7e7a0$@thetaphi.de>
To: java-user@lucene.apache.org

Uwe, hello.

Is it possible to use the same fast iterator, but with the results sorted by date?

Regards,
Valentin.

> On 14 Nov 2015, at 15:54, Uwe Schindler <uwe@thetaphi.de> wrote:
>
> For performance reasons, I would also return "false" from
> acceptsDocsOutOfOrder(). Collecting in document order lets the collector
> access stored fields more efficiently (otherwise it seeks too much). For
> this type of collector, the I/O cost is higher than the small gain in
> compute performance from accepting out-of-order documents.
>
> Kind regards,
> Uwe
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: uwe@thetaphi.de
>
>> -----Original Message-----
>> From: Valentin Popov [mailto:valentin.po@gmail.com]
>> Sent: Saturday, November 14, 2015 1:51 PM
>> To: java-user@lucene.apache.org
>> Subject: Re: 500 millions document for loop.
>>
>> Thank you very much!
>>
>>> On 14 Nov 2015, at 15:49, Uwe Schindler <uwe@thetaphi.de> wrote:
>>>
>>> Hi,
>>>
>>> This code is buggy! The collector's collect() call does not get a
>>> document ID relative to the top-level IndexSearcher; it only gets a
>>> document ID relative to the reader reported in setNextReader() (an
>>> atomic reader responsible for a single Lucene index segment).
>>>
>>> In setNextReader(), save a reference to the "current" reader, and use
>>> this "current" reader to get the stored fields:
>>>
>>>     indexSearcher.search(query, queryFilter, new Collector() {
>>>         // Reader for the segment that is currently being collected.
>>>         AtomicReader current;
>>>
>>>         @Override
>>>         public void setScorer(Scorer scorer) throws IOException { }
>>>
>>>         @Override
>>>         public void setNextReader(AtomicReaderContext ctx) throws IOException {
>>>             current = ctx.reader();
>>>         }
>>>
>>>         @Override
>>>         public void collect(int docID) throws IOException {
>>>             // docID is relative to the current segment reader.
>>>             Document doc = current.document(docID, loadFields);
>>>             found.found(doc);
>>>         }
>>>
>>>         @Override
>>>         public boolean acceptsDocsOutOfOrder() {
>>>             return true;
>>>         }
>>>     });
>>>
>>> Otherwise you get wrong document ids reported!!!
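The difference between segment-relative and top-level document IDs can be shown with a toy model. This is only a sketch in plain Java, not the Lucene API: the Segment class, buildIndex() and lookupGlobal() are hypothetical names invented for the illustration. The docBase field mirrors the idea behind AtomicReaderContext.docBase in Lucene 4.x, where each segment's IDs start at 0 and the top-level ID is docBase plus the local ID.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model (not the Lucene API): an index made of segments with local doc IDs. */
final class SegmentDocIds {

    /** Hypothetical segment: docBase is the top-level ID of its first document. */
    static final class Segment {
        final int docBase;
        final String[] docs;

        Segment(int docBase, String[] docs) {
            this.docBase = docBase;
            this.docs = docs;
        }
    }

    /** Builds segments whose docBase is the running total of earlier segment sizes. */
    static List<Segment> buildIndex(String[][] segmentContents) {
        List<Segment> segments = new ArrayList<>();
        int docBase = 0;
        for (String[] docs : segmentContents) {
            segments.add(new Segment(docBase, docs));
            docBase += docs.length;
        }
        return segments;
    }

    /** Top-level lookup: expects a top-level ID and finds the owning segment. */
    static String lookupGlobal(List<Segment> index, int globalId) {
        for (Segment s : index) {
            if (globalId >= s.docBase && globalId < s.docBase + s.docs.length) {
                return s.docs[globalId - s.docBase];
            }
        }
        throw new IllegalArgumentException("no such document: " + globalId);
    }

    public static void main(String[] args) {
        List<Segment> index = buildIndex(new String[][] {
            {"a0", "a1", "a2"},
            {"b0", "b1"}
        });
        Segment second = index.get(1);
        int localId = 1; // what a per-segment collect() would receive for "b1"

        // Wrong: treating the segment-relative ID as a top-level ID.
        System.out.println(lookupGlobal(index, localId));                  // a1
        // Right: read from the current segment directly...
        System.out.println(second.docs[localId]);                          // b1
        // ...or rebase the ID before a top-level lookup.
        System.out.println(lookupGlobal(index, second.docBase + localId)); // b1
    }
}
```

In this model the wrong lookup does not fail; it silently returns a document from another segment, which is why the bug is easy to miss.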
>>>
>>> Uwe
>>>
>>>> -----Original Message-----
>>>> From: Valentin Popov [mailto:valentin.po@gmail.com]
>>>> Sent: Saturday, November 14, 2015 1:04 PM
>>>> To: java-user@lucene.apache.org
>>>> Subject: Re: 500 millions document for loop.
>>>>
>>>> Hi, Uwe.
>>>>
>>>> Thanks for your advice.
>>>>
>>>> After implementing your suggestion, our calculation time dropped from
>>>> ~20 days to 3.5 hours.
>>>>
>>>> /**
>>>>  * DocumentFound - callback function for each document
>>>>  */
>>>> public void iterate(SearchOptions options, final DocumentFound found,
>>>>         final Set<String> loadFields) throws Exception {
>>>>     Query query = options.getQuery();
>>>>     Filter queryFilter = options.getQueryFilter();
>>>>     final IndexSearcher indexSearcher = new VolumeSearcher(options)
>>>>             .newIndexSearcher(Executors.newSingleThreadExecutor());
>>>>
>>>>     indexSearcher.search(query, queryFilter, new Collector() {
>>>>
>>>>         @Override
>>>>         public void setScorer(Scorer arg0) throws IOException { }
>>>>
>>>>         @Override
>>>>         public void setNextReader(AtomicReaderContext arg0) throws IOException { }
>>>>
>>>>         @Override
>>>>         public void collect(int docID) throws IOException {
>>>>             Document doc = indexSearcher.doc(docID, loadFields);
>>>>             found.found(doc);
>>>>         }
>>>>
>>>>         @Override
>>>>         public boolean acceptsDocsOutOfOrder() {
>>>>             return true;
>>>>         }
>>>>     });
>>>> }
>>>>
>>>>> On 12 Nov 2015, at 21:15, Uwe Schindler <uwe@thetaphi.de> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>>>> The big question is: Do you need the results paged at all?
>>>>>>
>>>>>> Yup, because if we return all results, we get an OME
>>>>>> (OutOfMemoryError).
>>>>>
>>>>> You get the OME because the paging collector cannot handle that, so
>>>>> this is an XY problem. Would it not be better if your application
>>>>> just got the results as a stream and processed them one after
>>>>> another? If this is the case (and most statistics need it like
>>>>> that), you are much better off NOT USING TOPDOCS! Your requirement
>>>>> is diametrically opposed to getting top-scoring documents: you want
>>>>> to get ALL results as a sequence.
>>>>>
>>>>>>> Do you need them sorted?
>>>>>>
>>>>>> Nope.
>>>>>
>>>>> OK, so unsorted streaming is the right approach.
>>>>>
>>>>>>> If not, the easiest approach is to use a custom Collector that
>>>>>>> does no sorting and just consumes the results.
>>>>>>
>>>>>> The main bottleneck, as I see it, comes from the next-page search,
>>>>>> which takes ~2-4 seconds.
>>>>>
>>>>> This is because, when paging, the collector has to re-execute the
>>>>> whole query and sort all results again, just with a larger window.
>>>>> So if you have result pages of 50000 results and you want to get the
>>>>> second page, it will internally sort 100000 results, because the
>>>>> first page needs to be calculated, too. As you move forward through
>>>>> the results, the window gets larger and larger, until it finally
>>>>> collects all results.
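The growing window can be put into numbers with a little arithmetic. This is a minimal sketch, assuming every page request re-collects the whole window up to that page, as described above. The page size of 50000 is from the thread; the total hit count is a hypothetical figure chosen only for illustration, and the class and method names are made up.

```java
/** Back-of-the-envelope cost of deep paging versus one streaming pass. */
final class PagingCost {

    /** Results collected and sorted to serve the given 1-based page: the whole window up to it. */
    static long windowSize(long page, long pageSize) {
        return page * pageSize;
    }

    /** Total results sorted across all pages when every page re-runs the query. */
    static long totalSorted(long totalHits, long pageSize) {
        long pages = (totalHits + pageSize - 1) / pageSize; // ceiling division
        long total = 0;
        for (long page = 1; page <= pages; page++) {
            // The window cannot exceed the number of hits that exist.
            total += Math.min(windowSize(page, pageSize), totalHits);
        }
        return total;
    }

    public static void main(String[] args) {
        long pageSize = 50_000;       // page size mentioned in the thread
        long totalHits = 100_000_000; // hypothetical hit count, illustration only

        System.out.println("window for page 2: " + windowSize(2, pageSize)); // 100000
        System.out.println("paged total:       " + totalSorted(totalHits, pageSize));
        System.out.println("streaming total:   " + totalHits);
    }
}
```

The paged total grows roughly with the square of the number of pages, while a single streaming pass touches each hit once.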
>>>>>
>>>>> So getting the results as a stream, by implementing the Collector
>>>>> API, is the right way to do this.
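For the statistics use case, streaming means each hit is handed to a callback and then forgotten. The following is a plain-Java sketch of that shape, not Lucene code: the DocumentFound interface here is a simplified, hypothetical stand-in for the callback used in the thread (it receives only the "to" address), and iterate() stands in for the search.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

/** Sketch: aggregate "to" statistics from a stream of hits without holding the hits. */
final class StreamingToStats {

    /** Hypothetical per-document callback, simplified to receive only the "to" address. */
    interface DocumentFound {
        void found(String toAddress);
    }

    /** Counts each address; memory grows with distinct addresses, not with hit count. */
    static final class ToCounter implements DocumentFound {
        final Map<String, Long> counts = new HashMap<>();

        @Override
        public void found(String toAddress) {
            counts.merge(toAddress, 1L, Long::sum);
        }
    }

    /** Stand-in for the search: pushes each hit to the callback as soon as it is seen. */
    static void iterate(Iterator<String> hits, DocumentFound callback) {
        while (hits.hasNext()) {
            callback.found(hits.next());
        }
    }

    public static void main(String[] args) {
        ToCounter counter = new ToCounter();
        iterate(Arrays.asList("a@example.org", "b@example.org", "a@example.org").iterator(),
                counter);
        System.out.println(counter.counts.get("a@example.org")); // 2
        System.out.println(counter.counts.get("b@example.org")); // 1
    }
}
```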
>>>>>
>>>>>>> Uwe
>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: Valentin Popov [mailto:valentin.po@gmail.com]
>>>>>>>> Sent: Thursday, November 12, 2015 6:48 PM
>>>>>>>> To: java-user@lucene.apache.org
>>>>>>>> Subject: Re: 500 millions document for loop.
>>>>>>>>
>>>>>>>> Toke, thanks!
>>>>>>>>
>>>>>>>> We will look at this solution; it looks like this is what we need.
>>>>>>>>
>>>>>>>>> On 12 Nov 2015, at 20:42, Toke Eskildsen <te@statsbiblioteket.dk> wrote:
>>>>>>>>>
>>>>>>>>> Valentin Popov <valentin.po@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> We have ~10 indexes holding 500M documents. Each document has an
>>>>>>>>>> «archive date» and a «to» address, and one of our tasks is to
>>>>>>>>>> calculate statistics on «to» for the last year. Right now we
>>>>>>>>>> search for archive_date:(current_date - 1 year) and paginate
>>>>>>>>>> the results at 50k records per page. The bottleneck of that
>>>>>>>>>> approach is pagination: even on a powerful server it takes
>>>>>>>>>> ~20 days to execute, which is far too long.
>>>>>>>>>
>>>>>>>>> Lucene does not like deep page requests because of the way its
>>>>>>>>> internal priority queue works. Solr has CursorMark, which should
>>>>>>>>> be fairly simple to emulate in your Lucene handling code:
>>>>>>>>>
>>>>>>>>> http://lucidworks.com/blog/2013/12/12/coming-soon-to-solr-efficient-cursor-based-iteration-of-large-result-sets/
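The idea behind a cursor can be sketched in plain Java. This is a toy model, not the Solr or Lucene API: nextPage() and the TreeMap of sort keys are hypothetical, and keys are assumed unique (a real cursor needs a tie-breaker such as a document ID). Each request resumes strictly after the last key of the previous page instead of recomputing everything before it; Lucene's IndexSearcher.searchAfter(...) follows the same principle.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeMap;

/** Toy cursor-based paging over a sorted key space; not the Solr or Lucene API. */
final class CursorPaging {

    /** Returns up to pageSize keys strictly greater than the cursor (null means start). */
    static List<Long> nextPage(TreeMap<Long, String> sorted, Long cursor, int pageSize) {
        Set<Long> keys = (cursor == null)
                ? sorted.keySet()
                : sorted.tailMap(cursor, false).keySet(); // resume after the cursor
        List<Long> page = new ArrayList<>();
        for (Long key : keys) {
            if (page.size() == pageSize) {
                break;
            }
            page.add(key);
        }
        return page;
    }

    public static void main(String[] args) {
        // Hypothetical sort keys (e.g. timestamps) mapped to document labels.
        TreeMap<Long, String> byDate = new TreeMap<>();
        for (long ts = 1; ts <= 7; ts++) {
            byDate.put(ts, "doc-" + ts);
        }

        Long cursor = null;
        while (true) {
            List<Long> page = nextPage(byDate, cursor, 3);
            if (page.isEmpty()) {
                break;
            }
            System.out.println(page);           // [1, 2, 3] then [4, 5, 6] then [7]
            cursor = page.get(page.size() - 1); // last key seen becomes the new cursor
        }
    }
}
```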
>>>>>>>>>
>>>>>>>>> - Toke Eskildsen
>>>>>>>>>
>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>>>>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>>>>>>>
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Valentin Popov
>>>>>>
>>>>>> Regards,
>>>>>> Valentin Popov
>>>>
>>>> Regards,
>>>> Valentin Popov
>>
>> Regards,
>> Valentin Popov
>

Regards,
Valentin Popov


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org