From: "Uwe Schindler" <uwe@thetaphi.de>
To: java-user@lucene.apache.org
Subject: RE: 500 millions document for loop.
Date: Sat, 14 Nov 2015 13:54:38 +0100

For performance reasons, I would also return "false" for "out of order" documents (acceptsDocsOutOfOrder()). This allows stored fields to be accessed more efficiently (otherwise it seeks too much). For this type of collector the I/O cost is higher than the small computing performance increase gained from out-of-order documents.

Kind regards,
Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de

> -----Original Message-----
> From: Valentin Popov [mailto:valentin.po@gmail.com]
> Sent: Saturday, November 14, 2015 1:51 PM
> To: java-user@lucene.apache.org
> Subject: Re: 500 millions document for loop.
>
> Thank you very much!
>
>
> > On 14 Nov 2015, at 15:49, Uwe Schindler wrote:
> >
> > Hi,
> >
> > This code is buggy! The collect() call of the collector does not get a
> > document ID relative to the top-level IndexSearcher; it only gets a document
> > ID relative to the reader reported in setNextReader (which is an atomic reader
> > responsible for a single Lucene index segment).
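An editorial aside, not part of the original mail: the mapping between a segment-relative doc ID and a top-level one is a plain offset addition. Every segment has a docBase (the sum of maxDoc over all earlier segments; Lucene 4.x exposes it as AtomicReaderContext.docBase), and the top-level ID is docBase plus the segment-relative ID. A self-contained sketch of that arithmetic, with made-up segment sizes and no Lucene dependency:

```java
public class DocBaseSketch {
    // maxDoc of three hypothetical index segments
    static final int[] SEGMENT_SIZES = {100, 250, 50};

    // docBase of segment i = sum of the sizes of segments 0..i-1
    static int docBase(int segment) {
        int base = 0;
        for (int i = 0; i < segment; i++) {
            base += SEGMENT_SIZES[i];
        }
        return base;
    }

    // top-level doc ID = docBase + segment-relative doc ID
    static int globalDocId(int segment, int segmentDocId) {
        return docBase(segment) + segmentDocId;
    }

    public static void main(String[] args) {
        // doc 0 of segment 1 comes right after the 100 docs of segment 0
        System.out.println(globalDocId(1, 0)); // 100
        // doc 7 of segment 2 sits after the 350 docs of segments 0 and 1
        System.out.println(globalDocId(2, 7)); // 357
    }
}
```

This is why the buggy code below "works" on a one-segment index (docBase is 0) and silently returns the wrong documents as soon as there is more than one segment.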
> > In setNextReader, save the reference to the "current" reader, and use this
> > "current" reader to get the stored fields:
> >
> > indexSearcher.search(query, queryFilter, new Collector() {
> >     AtomicReader current;
> >
> >     @Override
> >     public void setScorer(Scorer scorer) throws IOException { }
> >
> >     @Override
> >     public void setNextReader(AtomicReaderContext ctx) throws IOException {
> >         current = ctx.reader();
> >     }
> >
> >     @Override
> >     public void collect(int docID) throws IOException {
> >         Document doc = current.document(docID, loadFields);
> >         found.found(doc);
> >     }
> >
> >     @Override
> >     public boolean acceptsDocsOutOfOrder() {
> >         return true;
> >     }
> > });
> >
> > Otherwise you get wrong document IDs reported!!!
> >
> > Uwe
> >
> > -----
> > Uwe Schindler
> > H.-H.-Meier-Allee 63, D-28213 Bremen
> > http://www.thetaphi.de
> > eMail: uwe@thetaphi.de
> >
> >> -----Original Message-----
> >> From: Valentin Popov [mailto:valentin.po@gmail.com]
> >> Sent: Saturday, November 14, 2015 1:04 PM
> >> To: java-user@lucene.apache.org
> >> Subject: Re: 500 millions document for loop.
> >>
> >> Hi, Uwe.
> >>
> >> Thanks for your advice.
> >>
> >> After implementing your suggestion, our calculation time dropped from ~20
> >> days to 3.5 hours.
> >>
> >> /**
> >>  * DocumentFound - callback function for each document
> >>  */
> >> public void iterate(SearchOptions options, final DocumentFound found,
> >>         final Set<String> loadFields) throws Exception {
> >>     Query query = options.getQuery();
> >>     Filter queryFilter = options.getQueryFilter();
> >>     final IndexSearcher indexSearcher = new VolumeSearcher(options)
> >>             .newIndexSearcher(Executors.newSingleThreadExecutor());
> >>
> >>     indexSearcher.search(query, queryFilter, new Collector() {
> >>
> >>         @Override
> >>         public void setScorer(Scorer scorer) throws IOException { }
> >>
> >>         @Override
> >>         public void setNextReader(AtomicReaderContext ctx) throws IOException { }
> >>
> >>         @Override
> >>         public void collect(int docID) throws IOException {
> >>             // BUG (see Uwe's reply above): docID is segment-relative here,
> >>             // but IndexSearcher.doc() expects a top-level doc ID
> >>             Document doc = indexSearcher.doc(docID, loadFields);
> >>             found.found(doc);
> >>         }
> >>
> >>         @Override
> >>         public boolean acceptsDocsOutOfOrder() {
> >>             return true;
> >>         }
> >>     });
> >> }
> >>
> >>
> >>> On 12 Nov 2015, at 21:15, Uwe Schindler wrote:
> >>>
> >>> Hi,
> >>>
> >>>>> The big question is: Do you need the results paged at all?
> >>>>
> >>>> Yup, because if we return all results, we get OME.
> >>>
> >>> You get the OME because the paging collector cannot handle that, so this is
> >>> an XY problem. Would it not be better if your application just got the results
> >>> as a stream and processed them one after another? If this is the case (and
> >>> most statistics need it like that), you are much better off NOT USING
> >>> TOPDOCS!!!! Your requirement is diametrically opposed to getting top-scoring
> >>> documents! You want to get ALL results as a sequence.
> >>>
> >>>>> Do you need them sorted?
> >>>>
> >>>> Nope.
> >>>
> >>> OK, so unsorted streaming is the right approach.
> >>>
> >>>>> If not, the easiest approach is to use a custom Collector that does no
> >>>>> sorting and just consumes the results.
> >>>>
> >>>> The main bottleneck, as I see it, comes from the next-page search, which took ~2-4
> >>>> seconds.
> >>>
> >>> This is because, when paging, the collector has to re-execute the whole
> >>> query and sort all results again, just with a larger window. So if you have
> >>> result pages of 50,000 results and you want to get the second page, it will
> >>> internally sort 100,000 results, because the first page needs to be calculated,
> >>> too. If you go forward in the results, the window gets larger and larger, until it
> >>> finally collects all results.
> >>>
> >>> So just getting the results as a stream by implementing the Collector API is
> >>> the right way to do this.
> >>>
> >>>>> Uwe
> >>>>>
> >>>>> -----
> >>>>> Uwe Schindler
> >>>>> H.-H.-Meier-Allee 63, D-28213 Bremen
> >>>>> http://www.thetaphi.de
> >>>>> eMail: uwe@thetaphi.de
> >>>>>
> >>>>>> -----Original Message-----
> >>>>>> From: Valentin Popov [mailto:valentin.po@gmail.com]
> >>>>>> Sent: Thursday, November 12, 2015 6:48 PM
> >>>>>> To: java-user@lucene.apache.org
> >>>>>> Subject: Re: 500 millions document for loop.
> >>>>>>
> >>>>>> Toke, thanks!
> >>>>>>
> >>>>>> We will look at this solution; it looks like this is what we need.
> >>>>>>
> >>>>>>
> >>>>>>> On 12 Nov 2015, at 20:42, Toke Eskildsen wrote:
> >>>>>>>
> >>>>>>> Valentin Popov wrote:
> >>>>>>>
> >>>>>>>> We have ~10 indexes for 500M documents; each document
> >>>>>>>> has an «archive date» and a «to» address. One of our tasks is to
> >>>>>>>> calculate statistics on «to» for the last year. Right now we
> >>>>>>>> search archive_date:(current_date - 1 year) and paginate the
> >>>>>>>> results at 50k records per page. The bottleneck of that approach is
> >>>>>>>> pagination: it takes too long, and on a powerful server it takes
> >>>>>>>> ~20 days to execute, which is very long.
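An editorial aside on the arithmetic behind that slowdown (the figures below reuse the thread's round numbers and are an upper bound, since the one-year filter matches fewer than all 500M documents): serving page k means collecting a top-(k × pageSize) window, so walking every page in sequence collects pageSize × P(P+1)/2 documents in total. A self-contained sketch:

```java
public class DeepPagingCost {
    // hypothetical figures taken from the thread: 500M docs, 50k per page
    static final long RESULTS = 500_000_000L;
    static final long PAGE_SIZE = 50_000L;

    // documents the priority queue must collect to serve page k (1-based):
    // the window is k * PAGE_SIZE, because pages 1..k-1 are re-computed too
    static long collectedForPage(long k) {
        return k * PAGE_SIZE;
    }

    // total documents collected when walking every page in sequence
    static long totalCollected() {
        long pages = RESULTS / PAGE_SIZE;           // 10,000 pages
        return PAGE_SIZE * pages * (pages + 1) / 2; // sum of all windows
    }

    public static void main(String[] args) {
        System.out.println(collectedForPage(2)); // 100000
        System.out.println(totalCollected());    // 2500250000000
    }
}
```

With these figures, paging through everything performs about 2.5 × 10^12 collect-and-rank operations, roughly 5,000× the work of a single streaming pass that touches each of the 5 × 10^8 documents once.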
> >>>>>>>
> >>>>>>> Lucene does not like deep page requests due to the way the internal
> >>>>>>> Priority Queue works. Solr has CursorMark, which should be fairly simple to
> >>>>>>> emulate in your Lucene handling code:
> >>>>>>>
> >>>>>>> http://lucidworks.com/blog/2013/12/12/coming-soon-to-solr-efficient-cursor-based-iteration-of-large-result-sets/
> >>>>>>>
> >>>>>>> - Toke Eskildsen
> >>>>>>
> >>>>>> Regards,
> >>>>>> Valentin Popov
> >>>>
> >>>> Regards,
> >>>> Valentin Popov
> >>
> >> Regards,
> >> Valentin Popov
>
> Regards,
> Valentin Popov

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
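An editorial appendix, not part of the original thread: the cursor-based iteration Toke points to avoids the growing window by remembering the last hit's sort key and asking only for hits strictly after it, so every page costs a single pageSize-sized collection. The sketch below is a Lucene-free miniature of that idea over a sorted array of made-up doc keys; in Lucene 4.x the analogous API is IndexSearcher.searchAfter.

```java
import java.util.ArrayList;
import java.util.List;

public class CursorSketch {
    // hypothetical corpus of doc keys, already in sort-key order
    static final int[] DOCS = {3, 8, 15, 21, 42, 57, 63, 99};

    // return up to pageSize docs with sort key strictly after 'cursor';
    // pass Integer.MIN_VALUE for the first page
    static List<Integer> pageAfter(int cursor, int pageSize) {
        List<Integer> page = new ArrayList<>();
        for (int doc : DOCS) {
            if (doc > cursor) {
                page.add(doc);
                if (page.size() == pageSize) break;
            }
        }
        return page;
    }

    public static void main(String[] args) {
        int cursor = Integer.MIN_VALUE;
        List<Integer> page;
        while (!(page = pageAfter(cursor, 3)).isEmpty()) {
            System.out.println(page);
            // the cursor is the last key of the page, not an offset:
            // earlier hits are skipped, never collected and ranked again
            cursor = page.get(page.size() - 1);
        }
    }
}
```

For a pure "visit every result once, unsorted" workload like the statistics job above, the streaming Collector from the thread is still the cheaper choice; a cursor only pays off when resumable, ordered pages are actually needed.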