lucene-dev mailing list archives

From "Atri Sharma (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (LUCENE-8757) Better Segment To Thread Mapping Algorithm
Date Tue, 07 May 2019 07:50:00 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-8757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16834463#comment-16834463 ]

Atri Sharma edited comment on LUCENE-8757 at 5/7/19 7:49 AM:
-------------------------------------------------------------

{quote}I don't think we should push this if we already know we wanna do something different.
That said, I am not convinced the numbers are good defaults. At the same time I don't have
any numbers here do you have anything to back these defaults up?
{quote}
 

Sure. The reason I suggested pushing this patch on its own is that the other approach
we are advancing would require a couple of new semantics to be introduced, so we might
want to give users the option to opt in to either of the two. That said, I believe the cost-based
algorithm would also require some hard defaults – to ensure that small
segments do not get independent threads even if the system has the capacity.
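
As a rough illustration of what such hard defaults could look like, here is a minimal sketch. It is not the code in LUCENE-8757.patch: the class name, the method name, and both threshold values are hypothetical, and segments are represented only by their document counts. Segments above the document threshold get a slice (thread) of their own, and smaller ones are packed together until a slice is full.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: a segment is modelled only by its document count.
final class SliceSketch {
    // Illustrative thresholds only; these are assumptions, not the patch's values.
    static final int MAX_DOCS_PER_SLICE = 250_000;
    static final int MAX_SEGMENTS_PER_SLICE = 5;

    /** Groups segment doc counts into slices; each slice would be served by one thread. */
    static List<List<Integer>> slices(int[] segmentDocCounts, int maxDocs, int maxSegments) {
        int[] sorted = segmentDocCounts.clone();
        Arrays.sort(sorted); // ascending; walk from the end to visit largest first

        List<List<Integer>> result = new ArrayList<>();
        List<Integer> group = null; // the slice currently being filled, if any
        long docSum = 0;

        for (int i = sorted.length - 1; i >= 0; i--) {
            int docs = sorted[i];
            if (docs > maxDocs) {
                // Large segment: gets a slice of its own.
                List<Integer> own = new ArrayList<>();
                own.add(docs);
                result.add(own);
                continue;
            }
            if (group == null) {
                group = new ArrayList<>();
                result.add(group);
                docSum = 0;
            }
            group.add(docs);
            docSum += docs;
            // Close the slice once it holds enough documents or enough segments.
            if (docSum > maxDocs || group.size() >= maxSegments) {
                group = null;
            }
        }
        return result;
    }
}
```

With these example thresholds, two tiny segments would share one slice instead of each occupying a thread, regardless of how many threads are free.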

 

Re: the default constant values: these numbers are derived from empirical testing across different
datasets in ESRally (nyc_taxis, logging) and from looking at the default segment size distribution
of the wikipedia10M dataset in luceneutil. However, they might not be good default sizes to split
on.

 

One thing we could do (albeit expensive) is to take the mean number of documents in the corresponding
LeafReaderContexts for a query as the split point. Would that be a better, dynamic alternative?
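
To make the mean-based idea concrete, here is a minimal sketch under stated assumptions. The class and method names are hypothetical, and plain document counts stand in for the per-segment values, which in Lucene could be read from `LeafReaderContext.reader().maxDoc()`.

```java
import java.util.Arrays;

// Hypothetical sketch: derives a per-query split point from segment sizes.
final class MeanSplitPoint {
    /** Mean document count over the segments a query would visit; 0 if there are none. */
    static long meanDocs(int[] segmentDocCounts) {
        if (segmentDocCounts.length == 0) {
            return 0;
        }
        long total = 0;
        for (int docs : segmentDocCounts) {
            total += docs;
        }
        return total / segmentDocCounts.length; // integer division, rounds down
    }

    /** Counts the segments strictly larger than the split point. */
    static long countAbove(int[] segmentDocCounts, long splitPoint) {
        return Arrays.stream(segmentDocCounts).filter(d -> d > splitPoint).count();
    }
}
```

Note that a mean is itself sensitive to skew: a single very large segment pulls the split point up, so most of the remaining segments would fall below it.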


was (Author: atris):
:bq  I don't think we should push this if we already know we wanna do something different.
That said, I am not convinced the numbers are good defaults. At the same time I don't have
any numbers here do you have anything to back these defaults up?

 

Sure. The reason I was suggesting pushing this patch per se is because the other approach
we are advancing would require a couple of new semantics to be introduced, so we could potentially
want users to have an option to opt-in for either of the two. That said, I believe
the cost based algorithm would also require some hard defaults to be present – to ensure
that small segments do not get independent threads even if system had the capacity.

 

RE: The default constant values, these numbers are derived from empirical testing across different
datasets in ESRally (nyc_taxis, logging) and looking at the default segment size distribution
of wikipedia10M dataset in luceneutil. However, this might not be a good default size to split
on.

 

One thing we could do (albeit expensive) is to take the mean number of documents in the corresponding
LeafReaderContexts for a query as the split point. Would that be a better dynamic way?

> Better Segment To Thread Mapping Algorithm
> ------------------------------------------
>
>                 Key: LUCENE-8757
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8757
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Atri Sharma
>            Priority: Major
>         Attachments: LUCENE-8757.patch
>
>
> The current segment-to-thread allocation algorithm always allocates one thread per
> segment. This is detrimental to performance when segment sizes are skewed, since small segments
> also get a dedicated thread. This can lead to performance degradation due to context-switching
> overhead.
>  
> An algorithm that is cognizant of size skew would perform better in
> realistic scenarios.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

