lucene-java-user mailing list archives
From "Uwe Schindler" <...@thetaphi.de>
Subject RE: 500 millions document for loop.
Date Sat, 14 Nov 2015 12:49:04 GMT
Hi,

This code is buggy! The collect() call of the collector does not get a document ID relative
to the top-level IndexSearcher; it only gets a document ID relative to the reader reported
in setNextReader (which is an atomic reader responsible for a single Lucene index segment).

In setNextReader, save a reference to the "current" reader, and use this "current" reader
to get the stored fields:

		indexSearcher.search(query, queryFilter, new Collector() {
			AtomicReader current;

			@Override
			public void setScorer(Scorer scorer) throws IOException { }

			@Override
			public void setNextReader(AtomicReaderContext ctx) throws IOException {
				current = ctx.reader();
			}

			@Override
			public void collect(int docID) throws IOException {
				// docID is relative to the current segment reader
				Document doc = current.document(docID, loadFields);
				found.found(doc);
			}

			@Override
			public boolean acceptsDocsOutOfOrder() {
				return true;
			}
		});

Otherwise you get wrong document IDs reported!
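
Put differently, the top-level docID is docBase + the segment-relative docID (docBase is a public field on AtomicReaderContext). A minimal standalone sketch of that arithmetic, with made-up segment sizes and no Lucene dependency:

```java
// Sketch only: illustrates the docBase arithmetic with hypothetical
// segment sizes; real code reads docBase from AtomicReaderContext.
public class DocBaseDemo {

    /** Top-level docID = sum of sizes of all earlier segments + local docID. */
    static int globalDocId(int[] segmentSizes, int segment, int localDocId) {
        int docBase = 0;
        for (int i = 0; i < segment; i++) {
            docBase += segmentSizes[i]; // documents in preceding segments
        }
        return docBase + localDocId;
    }

    public static void main(String[] args) {
        int[] sizes = {3, 5, 2}; // three hypothetical segments
        // Local doc 2 in segment 1 is preceded by the 3 docs of segment 0.
        System.out.println(globalDocId(sizes, 1, 2)); // prints 5
        // Local doc 0 in segment 2 is preceded by 3 + 5 = 8 docs.
        System.out.println(globalDocId(sizes, 2, 0)); // prints 8
    }
}
```

This is exactly why collecting docIDs in one segment and resolving them against the top-level searcher returns the wrong documents: the offset is missing.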

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de

> -----Original Message-----
> From: Valentin Popov [mailto:valentin.po@gmail.com]
> Sent: Saturday, November 14, 2015 1:04 PM
> To: java-user@lucene.apache.org
> Subject: Re: 500 millions document for loop.
> 
> Hi, Uwe.
> 
> Thanks for your advice.
> 
> After implementing your suggestion, our calculation time dropped from ~20
> days to 3.5 hours.
> 
> /**
>  * DocumentFound - callback function for each document
>  */
> public void iterate(SearchOptions options, final DocumentFound found,
> 		final Set<String> loadFields) throws Exception {
> 	Query query = options.getQuery();
> 	Filter queryFilter = options.getQueryFilter();
> 	final IndexSearcher indexSearcher = new VolumeSearcher(options)
> 			.newIndexSearcher(Executors.newSingleThreadExecutor());
> 
> 	indexSearcher.search(query, queryFilter, new Collector() {
> 
> 		@Override
> 		public void setScorer(Scorer arg0) throws IOException { }
> 
> 		@Override
> 		public void setNextReader(AtomicReaderContext arg0) throws IOException { }
> 
> 		@Override
> 		public void collect(int docID) throws IOException {
> 			Document doc = indexSearcher.doc(docID, loadFields);
> 			found.found(doc);
> 		}
> 
> 		@Override
> 		public boolean acceptsDocsOutOfOrder() {
> 			return true;
> 		}
> 	});
> }
> 
> 
> > On 12 Nov 2015, at 21:15, Uwe Schindler <uwe@thetaphi.de> wrote:
> >
> > Hi,
> >
> >>> The big question is: Do you need the results paged at all?
> >>
> >> Yup, because if we return all results, we get OME.
> >
> > You get the OME because the paging collector cannot handle that, so this is
> > an XY problem. Would it not be better if your application just got the
> > results as a stream and processed them one after another? If so (and most
> > statistics jobs work like that), you are much better off NOT using TopDocs!
> > Your requirement is diametrically opposed to getting top-scoring documents:
> > you want ALL results as a sequence.
> >
> >>> Do you need them sorted?
> >>
> >> Nope.
> >
> > OK, so unsorted streaming is the right approach.
> >
> >>> If not, the easiest approach is to use a custom Collector that does no
> >> sorting and just consumes the results.
> >>
> >> The main bottleneck, as I see it, comes from the next-page search, which
> >> takes ~2-4 seconds.
> >
> > This is because, when paging, the collector has to re-execute the whole
> > query and sort all results again, just with a larger window. So if your
> > result pages hold 50,000 results and you want the second page, it will
> > internally sort 100,000 results, because the first page needs to be
> > calculated too. As you go forward in the results, the window gets larger
> > and larger, until it finally collects all results.
> >
> > So getting the results as a stream by implementing the Collector API is
> > the right way to do this.
> >
> >>>
> >>> Uwe
> >>>
> >>> -----
> >>> Uwe Schindler
> >>> H.-H.-Meier-Allee 63, D-28213 Bremen
> >>> http://www.thetaphi.de
> >>> eMail: uwe@thetaphi.de
> >>>
> >>>> -----Original Message-----
> >>>> From: Valentin Popov [mailto:valentin.po@gmail.com]
> >>>> Sent: Thursday, November 12, 2015 6:48 PM
> >>>> To: java-user@lucene.apache.org
> >>>> Subject: Re: 500 millions document for loop.
> >>>>
> >>>> Toke, thanks!
> >>>>
> >>>> We will look at this solution, looks like this is that what we need.
> >>>>
> >>>>
> >>>>> On 12 Nov 2015, at 20:42, Toke Eskildsen <te@statsbiblioteket.dk> wrote:
> >>>>>
> >>>>> Valentin Popov <valentin.po@gmail.com> wrote:
> >>>>>
> >>>>>> We have ~10 indexes holding 500M documents; each document
> >>>>>> has an «archive date» and a «to» address. One of our tasks
> >>>>>> is to calculate statistics of «to» for the last year. Right
> >>>>>> now we search archive_date:(current_date - 1 year) and
> >>>>>> paginate the results at 50k records per page. The bottleneck
> >>>>>> of that approach is the pagination: it takes too long, and
> >>>>>> even on a powerful server the whole run takes ~20 days.
> >>>>>
> >>>>> Lucene does not like deep page requests due to the way the internal
> >>>>> Priority Queue works. Solr has CursorMark, which should be fairly
> >>>>> simple to emulate in your Lucene handling code:
> >>>>>
> >>>>> http://lucidworks.com/blog/2013/12/12/coming-soon-to-solr-efficient-cursor-based-iteration-of-large-result-sets/
> >>>>>
> >>>>> - Toke Eskildsen
> >>>>>
> >>>>> ---------------------------------------------------------------------
> >>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> >>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
> >>>>>
> >>>>
> >>>> Regards,
> >>>> Valentin Popov
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>
> >>>
> >>>
> >>
> >>
> >> Best regards,
> >> Valentin Popov
> >>
> >>
> >>
> >>
> >>
> >>
> >
> >
> 
> 
> Best regards,
> Valentin Popov
> 
> 
> 
> 
> 
> 



