lucene-java-user mailing list archives

From "Uwe Schindler" <...@thetaphi.de>
Subject RE: 500 millions document for loop.
Date Sat, 14 Nov 2015 12:54:38 GMT
For performance reasons, I would also return "false" from acceptsDocsOutOfOrder(). This lets
the stored fields be read more efficiently (otherwise the reader seeks too much). For this
type of collector, the I/O cost is higher than the small CPU gain from collecting documents
out of order.

Kind regards,
Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de

> -----Original Message-----
> From: Valentin Popov [mailto:valentin.po@gmail.com]
> Sent: Saturday, November 14, 2015 1:51 PM
> To: java-user@lucene.apache.org
> Subject: Re: 500 millions document for loop.
> 
> Thank you very much!
> 
> 
> > On 14 Nov 2015, at 15:49, Uwe Schindler <uwe@thetaphi.de> wrote:
> >
> > Hi,
> >
> > This code is buggy! The collect() call of the collector does not get a document ID
> > relative to the top-level IndexSearcher; it only gets a document ID relative to the
> > reader reported in setNextReader (which is an atomic reader responsible for a single
> > Lucene index segment).
> >
> > In setNextReader, save the reference to the "current" reader, and use this "current"
> > reader to get the stored fields:
> >
> > 		indexSearcher.search(query, queryFilter, new Collector() {
> > 			AtomicReader current;
> >
> > 			@Override
> > 			public void setScorer(Scorer arg0) throws IOException { }
> >
> > 			@Override
> > 			public void setNextReader(AtomicReaderContext ctx) throws IOException {
> > 				current = ctx.reader();
> > 			}
> >
> > 			@Override
> > 			public void collect(int docID) throws IOException {
> > 				Document doc = current.document(docID, loadFields);
> > 				found.found(doc);
> > 			}
> >
> > 			@Override
> > 			public boolean acceptsDocsOutOfOrder() {
> > 				return true;
> > 			}
> > 		});
> >
> > Otherwise you get wrong document ids reported!!!
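
To see why a segment-relative ID goes wrong at the top level, here is a minimal, self-contained sketch. It is a toy model, not Lucene code: the Segment class, buildIndex, and topLevelDoc are hypothetical stand-ins. The only Lucene fact assumed is that each segment has a docBase (exposed as AtomicReaderContext.docBase in Lucene 4.x) equal to the combined size of the segments before it.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Toy model (NOT the Lucene API) of how segment-relative document IDs
 * relate to top-level IDs. Each segment has a docBase: the combined
 * size of all segments before it.
 */
public class SegmentDocIds {

    /** Hypothetical stand-in for one index segment holding stored values. */
    static final class Segment {
        final int docBase;
        final String[] docs;

        Segment(int docBase, String[] docs) {
            this.docBase = docBase;
            this.docs = docs;
        }
    }

    /** Builds the segments and assigns each one its docBase. */
    static List<Segment> buildIndex(String[][] segmentDocs) {
        List<Segment> segments = new ArrayList<>();
        int docBase = 0;
        for (String[] docs : segmentDocs) {
            segments.add(new Segment(docBase, docs));
            docBase += docs.length;
        }
        return segments;
    }

    /** Top-level lookup: expects a global ID, like a searcher-level call. */
    static String topLevelDoc(List<Segment> segments, int globalId) {
        for (Segment s : segments) {
            if (globalId < s.docBase + s.docs.length) {
                return s.docs[globalId - s.docBase];
            }
        }
        throw new IllegalArgumentException("no such document: " + globalId);
    }

    public static void main(String[] args) {
        List<Segment> index = buildIndex(new String[][] {
            {"a0", "a1", "a2"},
            {"b0", "b1"}
        });
        Segment second = index.get(1);
        int localId = 1; // what a per-segment collect(docID) would receive

        // Wrong: treating the segment-relative ID as a top-level ID.
        System.out.println(topLevelDoc(index, localId));
        // Right: read from the current segment directly ...
        System.out.println(second.docs[localId]);
        // ... or rebase the ID before a top-level lookup.
        System.out.println(topLevelDoc(index, second.docBase + localId));
    }
}
```

In this model the first lookup returns a1, a document from the wrong segment, while the other two return b1. Reading from the saved "current" reader corresponds to the second variant.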
> >
> > Uwe
> >
> >> -----Original Message-----
> >> From: Valentin Popov [mailto:valentin.po@gmail.com]
> >> Sent: Saturday, November 14, 2015 1:04 PM
> >> To: java-user@lucene.apache.org
> >> Subject: Re: 500 millions document for loop.
> >>
> >> Hi, Uwe.
> >>
> >> Thanks for your advice.
> >>
> >> After implementing your suggestion, our calculation time dropped from ~20 days to
> >> 3.5 hours.
> >>
> >> /**
> >>  * DocumentFound - callback function for each document
> >>  */
> >> public void iterate(SearchOptions options, final DocumentFound found,
> >> 		final Set<String> loadFields) throws Exception {
> >> 	Query query = options.getQuery();
> >> 	Filter queryFilter = options.getQueryFilter();
> >> 	final IndexSearcher indexSearcher = new VolumeSearcher(options)
> >> 			.newIndexSearcher(Executors.newSingleThreadExecutor());
> >>
> >> 	indexSearcher.search(query, queryFilter, new Collector() {
> >>
> >> 		@Override
> >> 		public void setScorer(Scorer arg0) throws IOException { }
> >>
> >> 		@Override
> >> 		public void setNextReader(AtomicReaderContext arg0) throws IOException { }
> >>
> >> 		@Override
> >> 		public void collect(int docID) throws IOException {
> >> 			Document doc = indexSearcher.doc(docID, loadFields);
> >> 			found.found(doc);
> >> 		}
> >>
> >> 		@Override
> >> 		public boolean acceptsDocsOutOfOrder() {
> >> 			return true;
> >> 		}
> >> 	});
> >> }
> >>
> >>
> >>> On 12 Nov 2015, at 21:15, Uwe Schindler <uwe@thetaphi.de> wrote:
> >>>
> >>> Hi,
> >>>
> >>>>> The big question is: Do you need the results paged at all?
> >>>>
> >>>> Yup, because if we return all results, we get an OutOfMemoryError (OOME).
> >>>
> >>> You get the OOME because the paging collector cannot handle that, so this is an XY
> >>> problem. Would it not be better if your application just got the results as a stream
> >>> and processed them one after another? If this is the case (and most statistics need
> >>> it like that), you are much better off NOT USING TOPDOCS! Your requirement is
> >>> diametrically opposed to getting top-scoring documents: you want to get ALL results
> >>> as a sequence.
> >>>
> >>>>> Do you need them sorted?
> >>>>
> >>>> Nope.
> >>>
> >>> OK, so unsorted streaming is the right approach.
> >>>
> >>>>> If not, the easiest approach is to use a custom Collector that does no
> >>>>> sorting and just consumes the results.
> >>>>
> >>>> The main bottleneck, as I see it, comes from the next-page search, which takes
> >>>> ~2-4 seconds.
> >>>
> >>> This is because, when paging, the collector has to re-execute the whole query and
> >>> sort all results again, just with a larger window. So if you have result pages of
> >>> 50000 results and you want to get the second page, it will internally sort 100000
> >>> results, because the first page needs to be calculated, too. As you move forward
> >>> through the results, the window gets larger and larger, until it finally collects
> >>> all results.
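
The growth described above can be put into numbers with a small back-of-the-envelope sketch. It assumes, as stated in the explanation, that fetching page k with page size p makes the collector gather the top k * p hits. The class and method names are made up for illustration, and the total hit count is a hypothetical figure, not one from this thread.

```java
/**
 * Back-of-the-envelope model of deep paging versus streaming.
 * Assumption: fetching page k with page size p makes the collector
 * gather the top k * p hits.
 */
public class DeepPagingCost {

    /** Hits gathered across all pages when every page is fetched in turn. */
    static long pagedWork(long totalHits, long pageSize) {
        long pages = (totalHits + pageSize - 1) / pageSize; // ceiling division
        long work = 0;
        for (long k = 1; k <= pages; k++) {
            // The window for page k cannot exceed the number of hits.
            work += Math.min(k * pageSize, totalHits);
        }
        return work;
    }

    /** A streaming collector visits each hit exactly once. */
    static long streamedWork(long totalHits) {
        return totalHits;
    }

    public static void main(String[] args) {
        long totalHits = 1_000_000L; // hypothetical, not from the thread
        long pageSize = 50_000L;     // page size mentioned in the thread
        System.out.println("paged:    " + pagedWork(totalHits, pageSize));
        System.out.println("streamed: " + streamedWork(totalHits));
    }
}
```

With these example numbers, paging through 20 pages gathers 10,500,000 hits in total, against 1,000,000 for a single streaming pass. The paged total grows roughly with the square of the number of pages, which is why the gap widens quickly on larger result sets.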
> >>>
> >>> So getting the results as a stream by implementing the Collector API is the right
> >>> way to do this.
> >>>
> >>>>>
> >>>>> Uwe
> >>>>>
> >>>>>> -----Original Message-----
> >>>>>> From: Valentin Popov [mailto:valentin.po@gmail.com]
> >>>>>> Sent: Thursday, November 12, 2015 6:48 PM
> >>>>>> To: java-user@lucene.apache.org
> >>>>>> Subject: Re: 500 millions document for loop.
> >>>>>>
> >>>>>> Toke, thanks!
> >>>>>>
> >>>>>> We will look at this solution; it looks like this is what we need.
> >>>>>>
> >>>>>>
> >>>>>>> On 12 Nov 2015, at 20:42, Toke Eskildsen <te@statsbiblioteket.dk> wrote:
> >>>>>>>
> >>>>>>> Valentin Popov <valentin.po@gmail.com> wrote:
> >>>>>>>
> >>>>>>>> We have ~10 indexes holding 500M documents. Each document has an
> >>>>>>>> «archive date» and a «to» address, and one of our tasks is to
> >>>>>>>> calculate statistics on «to» for the last year. Right now we
> >>>>>>>> search for archive_date:(current_date - 1 year) and paginate the
> >>>>>>>> results at 50k records per page. The bottleneck of that approach
> >>>>>>>> is that pagination takes too long: on a powerful server it takes
> >>>>>>>> ~20 days to execute, which is far too long.
> >>>>>>>
> >>>>>>> Lucene does not like deep page requests due to the way the internal Priority Queue
> >>>>>>> works. Solr has CursorMark, which should be fairly simple to emulate in your Lucene
> >>>>>>> handling code:
> >>>>>>>
> >>>>>>> http://lucidworks.com/blog/2013/12/12/coming-soon-to-solr-efficient-cursor-based-iteration-of-large-result-sets/
> >>>>>>>
> >>>>>>> - Toke Eskildsen
> >>>>>>>
> >>>>>>> ---------------------------------------------------------------------
> >>>>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> >>>>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
> >>>>>>>
> >>>>>>
> >>>>>> Regards,
> >>>>>> Valentin Popov
> >>>> Best regards,
> >>>> Valentin Popov
> >>
> >>
> >> Best regards,
> >> Valentin Popov
> >
> 
> 
> Best regards,
> Valentin Popov



