From: "Uwe Schindler" <uwe@thetaphi.de>
To: <java-user@lucene.apache.org>
Subject: RE: 500 millions document for loop. 
Date: Sat, 14 Nov 2015 13:49:04 +0100

Hi,

This code is buggy! The collect() call of the collector does not get a
document ID relative to the top-level IndexSearcher; it only gets a
document ID relative to the reader reported in setNextReader (which is
an atomic reader responsible for a single Lucene index segment).

In setNextReader, save the reference to the "current" reader, and use
this "current" reader to get the stored fields:

		indexSearcher.search(query, queryFilter, new Collector() {
			// Reader of the segment that is currently being collected
			AtomicReader current;

			@Override
			public void setScorer(Scorer arg0) throws IOException { }

			@Override
			public void setNextReader(AtomicReaderContext ctx) throws IOException {
				current = ctx.reader();
			}

			@Override
			public void collect(int docID) throws IOException {
				// docID is relative to "current", not to the top-level searcher
				Document doc = current.document(docID, loadFields);
				found.found(doc);
			}

			@Override
			public boolean acceptsDocsOutOfOrder() {
				return true;
			}
		});

Otherwise the segment-relative document IDs are resolved against the
top-level searcher and you load the wrong documents!
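
To illustrate why the IDs do not match, here is a minimal, Lucene-free
sketch of the arithmetic. The Segment class and all method names are
hypothetical stand-ins, not Lucene API. The only thing taken from Lucene
is the idea that each segment has a docBase offset (exposed as
AtomicReaderContext.docBase in Lucene 4.x), so that
top-level ID = docBase + segment-relative ID.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class SegmentDocIds {

    /** Hypothetical stand-in for one index segment (not a Lucene class). */
    static final class Segment {
        final int docBase;   // top-level ID of this segment's first document
        final String[] docs; // documents, indexed by segment-relative ID

        Segment(int docBase, String... docs) {
            this.docBase = docBase;
            this.docs = docs;
        }
    }

    /** Two segments holding five documents in total. */
    static List<Segment> sampleSegments() {
        return Arrays.asList(new Segment(0, "a", "b", "c"), new Segment(3, "d", "e"));
    }

    /** Resolves a top-level ID, the way a top-level searcher would. */
    static String topLevelDocument(List<Segment> segments, int topLevelId) {
        for (Segment s : segments) {
            int local = topLevelId - s.docBase;
            if (local >= 0 && local < s.docs.length) {
                return s.docs[local];
            }
        }
        throw new IllegalArgumentException("no such document: " + topLevelId);
    }

    /** Buggy variant: passes segment-relative IDs to the top-level lookup. */
    static List<String> collectViaTopLevel(List<Segment> segments) {
        List<String> out = new ArrayList<>();
        for (Segment s : segments) {
            for (int local = 0; local < s.docs.length; local++) {
                out.add(topLevelDocument(segments, local));
            }
        }
        return out;
    }

    /** Correct variant: resolves each ID against the segment that reported it. */
    static List<String> collectViaCurrentSegment(List<Segment> segments) {
        List<String> out = new ArrayList<>();
        for (Segment current : segments) {
            for (int local = 0; local < current.docs.length; local++) {
                out.add(current.docs[local]);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Segment> segments = sampleSegments();
        System.out.println("via top level:       " + collectViaTopLevel(segments));
        System.out.println("via current segment: " + collectViaCurrentSegment(segments));
    }
}
```

In this toy model the buggy variant returns the first segment's
documents again for the second segment and never reaches "d" and "e".
Adding docBase to the segment-relative ID before the top-level lookup
would also give the right document.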

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de

> -----Original Message-----
> From: Valentin Popov [mailto:valentin.po@gmail.com]
> Sent: Saturday, November 14, 2015 1:04 PM
> To: java-user@lucene.apache.org
> Subject: Re: 500 millions document for loop.
>
> Hi, Uwe.
>
> Thanks for your advice.
>
> After implementing your suggestion, our calculation time dropped from
> ~20 days to 3.5 hours.
>
> /**
> *
> * DocumentFound - callback function for each document
> */
> public void iterate(SearchOptions options, final DocumentFound found,
> 		final Set<String> loadFields) throws Exception {
> 		Query query = options.getQuery();
> 		Filter queryFilter = options.getQueryFilter();
> 		final IndexSearcher indexSearcher = new VolumeSearcher(options)
> 				.newIndexSearcher(Executors.newSingleThreadExecutor());
>
> 		indexSearcher.search(query, queryFilter, new Collector() {
>
> 			@Override
> 			public void setScorer(Scorer arg0) throws IOException { }
>
> 			@Override
> 			public void setNextReader(AtomicReaderContext arg0) throws IOException { }
>
> 			@Override
> 			public void collect(int docID) throws IOException {
> 				Document doc = indexSearcher.doc(docID, loadFields);
> 				found.found(doc);
> 			}
>
> 			@Override
> 			public boolean acceptsDocsOutOfOrder() {
> 				return true;
> 			}
> 		});
>
> 	}
>
> > On 12 Nov 2015, at 21:15, Uwe Schindler <uwe@thetaphi.de> wrote:
> >
> > Hi,
> >
> >>> The big question is: Do you need the results paged at all?
> >>
> >> Yup, because if we return all results, we get an OutOfMemoryError.
> >
> > You get the OutOfMemoryError because the paging collector cannot
> > handle that, so this is an XY problem. Would it not be better if your
> > application just got the results as a stream and processed them one
> > after another? If this is the case (and most statistics need it like
> > that), you're much better off NOT USING TOPDOCS! Your requirement is
> > diametrically opposed to getting top-scoring documents! You want to
> > get ALL results as a sequence.
> >
> >>> Do you need them sorted?
> >>
> >> Nope.
> >
> > OK, so unsorted streaming is the right approach.
> >
> >>> If not, the easiest approach is to use a custom Collector that does
> >>> no sorting and just consumes the results.
> >>
> >> The main bottleneck, as I see it, comes from the next-page search,
> >> which took ~2-4 seconds.
> >
> > This is because when paging, the collector has to re-execute the whole
> > query and sort all results again, just with a larger window. So if you
> > have result pages of 50000 results and you want to get the second
> > page, it will internally sort 100000 results, because the first page
> > needs to be calculated, too. If you go forward in the results, the
> > window gets larger and larger, until it finally collects all results.
> >
> > So getting the results as a stream by implementing the Collector API
> > is the right way to do this.
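
As a rough back-of-the-envelope check of the paging cost described in
the quoted text, the sketch below sums the window sizes over all pages.
The total of 1,000,000 hits is a made-up number for illustration; only
the page size of 50000 comes from this thread. All names are
hypothetical, and the model counts window entries only, ignoring the
cost of re-executing the query for every page.

```java
final class PagingCost {

    /** Size of the sorted window needed to serve the given 1-based page. */
    static long windowForPage(long page, long pageSize, long totalHits) {
        return Math.min(page * pageSize, totalHits);
    }

    /** Sum of all window sizes when walking through every page in order. */
    static long pagedWindowTotal(long totalHits, long pageSize) {
        long pages = (totalHits + pageSize - 1) / pageSize; // ceiling division
        long total = 0;
        for (long page = 1; page <= pages; page++) {
            total += windowForPage(page, pageSize, totalHits);
        }
        return total;
    }

    /** A streaming collector touches every hit exactly once. */
    static long streamedTotal(long totalHits) {
        return totalHits;
    }

    public static void main(String[] args) {
        long totalHits = 1_000_000L; // hypothetical, not from the thread
        long pageSize = 50_000L;     // page size mentioned in the thread
        System.out.println("second page window: " + windowForPage(2, pageSize, totalHits));
        System.out.println("paged, all pages:   " + pagedWindowTotal(totalHits, pageSize));
        System.out.println("streamed:           " + streamedTotal(totalHits));
    }
}
```

Under these assumptions the paged total grows roughly with the square
of the number of pages, while the streamed total grows linearly with
the number of hits.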
> >
> >>>
> >>> Uwe
> >>>
> >>> -----
> >>> Uwe Schindler
> >>> H.-H.-Meier-Allee 63, D-28213 Bremen
> >>> http://www.thetaphi.de
> >>> eMail: uwe@thetaphi.de
> >>>
> >>>> -----Original Message-----
> >>>> From: Valentin Popov [mailto:valentin.po@gmail.com]
> >>>> Sent: Thursday, November 12, 2015 6:48 PM
> >>>> To: java-user@lucene.apache.org
> >>>> Subject: Re: 500 millions document for loop.
> >>>>
> >>>> Toke, thanks!
> >>>>
> >>>> We will look at this solution; it looks like this is what we need.
> >>>>
> >>>>
> >>>>> On 12 Nov 2015, at 20:42, Toke Eskildsen <te@statsbiblioteket.dk> wrote:
> >>>>>
> >>>>> Valentin Popov <valentin.po@gmail.com> wrote:
> >>>>>
> >>>>>> We have ~10 indexes for 500M documents. Each document has an
> >>>>>> «archive date» and a «to» address. One of our tasks is to
> >>>>>> calculate statistics of «to» for the last year. Right now we
> >>>>>> search archive_date:(current_date - 1 year) and paginate the
> >>>>>> results at 50k records per page. The bottleneck of that approach
> >>>>>> is that pagination takes too long: on a powerful server it takes
> >>>>>> ~20 days to execute.
> >>>>>
> >>>>> Lucene does not like deep page requests due to the way the internal
> >>>>> Priority Queue works. Solr has CursorMark, which should be fairly
> >>>>> simple to emulate in your Lucene handling code:
> >>>>>
> >>>>> http://lucidworks.com/blog/2013/12/12/coming-soon-to-solr-efficient-cursor-based-iteration-of-large-result-sets/
> >>>>>
> >>>>> - Toke Eskildsen
> >>>>>
> >>>>> ---------------------------------------------------------------------
> >>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> >>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
> >>>>>
> >>>>
> >>>> Regards,
> >>>> Valentin Popov
> >>>>
> >>>
> >>>
> >>>
> >>
> >>
> >> Regards,
> >> Valentin Popov
> >>
> >
> >
>
>
> Regards,
> Valentin Popov

