lucene-java-user mailing list archives

From Valentin Popov <valentin...@gmail.com>
Subject Re: 500 millions document for loop.
Date Sat, 14 Nov 2015 13:12:36 GMT
Returning "false" for "out of order" saves about 1 sec per 1M records; over 500M records that is ~500 sec, or roughly 8 minutes!

Thank you! 

> On 14 Nov 2015, at 15:54, Uwe Schindler <uwe@thetaphi.de> wrote:
> 
> For performance reasons, I would also return "false" for "out of order" documents. This
> allows stored fields to be accessed in a more effective way (otherwise it seeks too much).
> For this type of collector, the I/O cost is higher than the small computational gain from
> collecting documents out of order.
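A quick way to see why in-order collection helps with stored fields: they are laid out on disk by docID and read front-to-back, so every time the next requested docID is smaller than the previous one, the reader has to jump backwards. A toy sketch in plain Java (no Lucene classes; the docID sequences are invented):

```java
// Toy illustration: count backward jumps a stored-fields reader would
// have to make for a given sequence of requested docIDs.
public class SeekDemo {
    static int backwardSeeks(int[] docIDs) {
        int seeks = 0;
        for (int i = 1; i < docIDs.length; i++) {
            if (docIDs[i] < docIDs[i - 1]) seeks++; // must seek backwards in the file
        }
        return seeks;
    }

    public static void main(String[] args) {
        int[] outOfOrder = {40, 3, 55, 11, 60, 7};
        int[] inOrder    = {3, 7, 11, 40, 55, 60};
        // prints "3 vs 0": ascending docIDs never force a backward seek
        System.out.println(backwardSeeks(outOfOrder) + " vs " + backwardSeeks(inOrder));
    }
}
```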
> 
> Kind regards,
> Uwe
> 
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: uwe@thetaphi.de
> 
>> -----Original Message-----
>> From: Valentin Popov [mailto:valentin.po@gmail.com]
>> Sent: Saturday, November 14, 2015 1:51 PM
>> To: java-user@lucene.apache.org
>> Subject: Re: 500 millions document for loop.
>> 
>> Thank you very much!
>> 
>> 
>>> On 14 Nov 2015, at 15:49, Uwe Schindler <uwe@thetaphi.de> wrote:
>>> 
>>> Hi,
>>> 
>>> This code is buggy! The collect() call of the collector does not get a document ID
>>> relative to the top-level IndexSearcher; it only gets a document ID relative to the
>>> reader reported in setNextReader (which is an atomic reader responsible for a single
>>> Lucene index segment).
>>> 
>>> In setNextReader, save a reference to the "current" reader, and use this "current"
>>> reader to get the stored fields:
>>> 
>>>     indexSearcher.search(query, queryFilter, new Collector() {
>>>         AtomicReader current;  // reader for the segment currently being collected
>>> 
>>>         @Override
>>>         public void setScorer(Scorer scorer) throws IOException { }
>>> 
>>>         @Override
>>>         public void setNextReader(AtomicReaderContext ctx) throws IOException {
>>>             current = ctx.reader();
>>>         }
>>> 
>>>         @Override
>>>         public void collect(int docID) throws IOException {
>>>             // docID is relative to the current segment reader
>>>             Document doc = current.document(docID, loadFields);
>>>             found.found(doc);
>>>         }
>>> 
>>>         @Override
>>>         public boolean acceptsDocsOutOfOrder() {
>>>             return true;
>>>         }
>>>     });
>>> 
>>> Otherwise you get wrong document ids reported!!!
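The per-segment numbering can be illustrated without Lucene at all: each segment numbers its documents from 0, and only adding the segment's docBase (the sum of all earlier segments' sizes) yields a top-level docID. A minimal sketch with made-up segment sizes:

```java
// Made-up segment sizes; a segment's docBase is the total size of all
// earlier segments, so top-level ID = docBase + segment-local ID.
public class DocBaseDemo {
    static int topLevelDocID(int[] segmentSizes, int segment, int localID) {
        int docBase = 0;
        for (int i = 0; i < segment; i++) {
            docBase += segmentSizes[i];
        }
        return docBase + localID;
    }

    public static void main(String[] args) {
        int[] segmentSizes = {1000, 500, 2500};
        // collect(5) on segment 2 refers to top-level doc 1505, not doc 5 --
        // which is why stored fields must be loaded via the segment reader.
        System.out.println(topLevelDocID(segmentSizes, 2, 5)); // prints 1505
    }
}
```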
>>> 
>>> Uwe
>>> 
>>> -----
>>> Uwe Schindler
>>> H.-H.-Meier-Allee 63, D-28213 Bremen
>>> http://www.thetaphi.de
>>> eMail: uwe@thetaphi.de
>>> 
>>>> -----Original Message-----
>>>> From: Valentin Popov [mailto:valentin.po@gmail.com]
>>>> Sent: Saturday, November 14, 2015 1:04 PM
>>>> To: java-user@lucene.apache.org
>>>> Subject: Re: 500 millions document for loop.
>>>> 
>>>> Hi, Uwe.
>>>> 
>>>> Thanks for your advice.
>>>> 
>>>> After implementing your suggestion, our calculation time dropped from ~20 days
>>>> to 3.5 hours.
>>>> 
>>>> /**
>>>>  * DocumentFound - callback function for each document
>>>>  */
>>>> public void iterate(SearchOptions options, final DocumentFound found,
>>>>         final Set<String> loadFields) throws Exception {
>>>>     Query query = options.getQuery();
>>>>     Filter queryFilter = options.getQueryFilter();
>>>>     final IndexSearcher indexSearcher = new VolumeSearcher(options)
>>>>             .newIndexSearcher(Executors.newSingleThreadExecutor());
>>>> 
>>>>     indexSearcher.search(query, queryFilter, new Collector() {
>>>> 
>>>>         @Override
>>>>         public void setScorer(Scorer arg0) throws IOException { }
>>>> 
>>>>         @Override
>>>>         public void setNextReader(AtomicReaderContext arg0) throws IOException { }
>>>> 
>>>>         @Override
>>>>         public void collect(int docID) throws IOException {
>>>>             Document doc = indexSearcher.doc(docID, loadFields);
>>>>             found.found(doc);
>>>>         }
>>>> 
>>>>         @Override
>>>>         public boolean acceptsDocsOutOfOrder() {
>>>>             return true;
>>>>         }
>>>>     });
>>>> }
>>>> 
>>>> 
>>>>> On 12 Nov 2015, at 21:15, Uwe Schindler <uwe@thetaphi.de> wrote:
>>>>> 
>>>>> Hi,
>>>>> 
>>>>>>> The big question is: Do you need the results paged at all?
>>>>>> 
>>>>>> Yup, because if we return all results, we get OME.
>>>>> 
>>>>> You get the OME because the paging collector cannot handle that, so this is an XY
>>>>> problem. Would it not be better if your application just got the results as a stream
>>>>> and processed them one after another? If this is the case (and most statistics need
>>>>> it like that), you are much better off NOT USING TOPDOCS! Your requirement is
>>>>> diametrically opposed to getting top-scoring documents: you want to get ALL results
>>>>> as a sequence.
>>>>> 
>>>>>>> Do you need them sorted?
>>>>>> 
>>>>>> Nope.
>>>>> 
>>>>> OK, so unsorted streaming is the right approach.
>>>>> 
>>>>>>> If not, the easiest approach is to use a custom Collector that does no
>>>>>>> sorting and just consumes the results.
>>>>>> 
>>>>>> The main bottleneck, as I see it, comes from the next-page search, which takes
>>>>>> ~2-4 seconds.
>>>>> 
>>>>> This is because, when paging, the collector has to re-execute the whole query and
>>>>> sort all results again, just with a larger window. So if you have result pages of
>>>>> 50,000 results and you want to get the second page, it will internally sort 100,000
>>>>> results, because the first page needs to be calculated, too. As you go forward in
>>>>> the results, the window gets larger and larger, until it finally collects all results.
>>>>> 
>>>>> So just getting the results as a stream by implementing the Collector API is the
>>>>> right way to do this.
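The growing window makes the total work quadratic in the number of pages, which a back-of-the-envelope sketch shows in plain numbers (the hit count and page size here are hypothetical):

```java
// Fetching page p re-collects and re-sorts p * pageSize documents
// internally, since all earlier pages must be recomputed each time.
public class PagingCostDemo {
    static long totalCollected(long pageSize, long pages) {
        long collected = 0;
        for (long p = 1; p <= pages; p++) {
            collected += p * pageSize; // window covers every earlier page too
        }
        return collected;
    }

    public static void main(String[] args) {
        // Hypothetical: 5M hits read in 100 pages of 50,000 each.
        // 252,500,000 docs get collected to stream 5,000,000 -- ~50x overhead.
        System.out.println(totalCollected(50_000, 100)); // prints 252500000
    }
}
```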
>>>>> 
>>>>>>> 
>>>>>>> Uwe
>>>>>>> 
>>>>>>> -----
>>>>>>> Uwe Schindler
>>>>>>> H.-H.-Meier-Allee 63, D-28213 Bremen
>>>>>>> http://www.thetaphi.de
>>>>>>> eMail: uwe@thetaphi.de
>>>>>>> 
>>>>>>>> -----Original Message-----
>>>>>>>> From: Valentin Popov [mailto:valentin.po@gmail.com]
>>>>>>>> Sent: Thursday, November 12, 2015 6:48 PM
>>>>>>>> To: java-user@lucene.apache.org
>>>>>>>> Subject: Re: 500 millions document for loop.
>>>>>>>> 
>>>>>>>> Toke, thanks!
>>>>>>>> 
>>>>>>>> We will look at this solution; it looks like exactly what we need.
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> On 12 Nov 2015, at 20:42, Toke Eskildsen <te@statsbiblioteket.dk> wrote:
>>>>>>>>> 
>>>>>>>>> Valentin Popov <valentin.po@gmail.com> wrote:
>>>>>>>>> 
>>>>>>>>>> We have ~10 indexes for 500M documents; each document has an «archive date»
>>>>>>>>>> and a «to» address. One of our tasks is to calculate statistics of «to» for
>>>>>>>>>> the last year. Right now we search archive_date:(current_date - 1 year) and
>>>>>>>>>> paginate the results at 50k records per page. The bottleneck of that approach
>>>>>>>>>> is that pagination takes too long: on a powerful server it takes ~20 days to
>>>>>>>>>> execute, which is far too long.
>>>>>>>>> 
>>>>>>>>> Lucene does not like deep page requests due to the way the internal Priority
>>>>>>>>> Queue works. Solr has CursorMark, which should be fairly simple to emulate in
>>>>>>>>> your Lucene handling code:
>>>>>>>>> 
>>>>>>>>> http://lucidworks.com/blog/2013/12/12/coming-soon-to-solr-efficient-cursor-based-iteration-of-large-result-sets/
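The idea behind CursorMark (and Lucene's IndexSearcher.searchAfter) is keyed pagination: instead of asking for "page N", you resume strictly after the last hit of the previous batch, so every batch costs the same. A toy emulation over a sorted array, with no Lucene classes and made-up key values:

```java
import java.util.ArrayList;
import java.util.List;

public class CursorDemo {
    // Visit every key in fixed-size batches, each batch resuming strictly
    // after the previous batch's last key -- the CursorMark/searchAfter trick.
    static List<Integer> visitAll(int[] sortedKeys, int batchSize) {
        List<Integer> seen = new ArrayList<>();
        Integer cursor = null; // null = start at the beginning
        while (true) {
            List<Integer> batch = new ArrayList<>();
            for (int key : sortedKeys) {
                if ((cursor == null || key > cursor) && batch.size() < batchSize) {
                    batch.add(key);
                }
            }
            if (batch.isEmpty()) break;       // no keys left after the cursor
            seen.addAll(batch);
            cursor = batch.get(batch.size() - 1); // next batch starts after this
        }
        return seen;
    }

    public static void main(String[] args) {
        int[] keys = {3, 7, 11, 15, 22, 31, 40, 41, 55, 60};
        System.out.println(visitAll(keys, 4).size()); // prints 10: each key visited once
    }
}
```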
>>>>>>>>> 
>>>>>>>>> - Toke Eskildsen
>>>>>>>>> 
>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>>>>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> Regards,
>>>>>>>> Valentin Popov
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> Best regards,
>>>>>> Valentin Popov
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> Best regards,
>>>> Valentin Popov
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>> 
>>> 
>>> 
>> 
>> 
>> Best regards,
>> Valentin Popov
>> 
>> 
>> 
>> 
>> 
>> 
> 
> 
> 


Best regards,
Valentin Popov







