lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Michael McCandless (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-7579) Sorting on flushed segment
Date Mon, 05 Dec 2016 15:58:58 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-7579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15722610#comment-15722610
] 

Michael McCandless commented on LUCENE-7579:
--------------------------------------------

This is a nice approach!  Basically, the codec remains unaware index
sorting is happening, which is a the right way to do it.  Instead, the
indexing chain takes care of it.  And to build the doc comparators you take
advantage of the in-heap buffered doc values.

I like that to sort stored fields, you are still just using the codec
APIs, writing to temp files, then using the codec to read the stored
fields back for sorting.

I also like how you were able to re-use the {{SortingXXX}} from
{{SortingLeafReader}}.  Later on we can maybe optimize some of these;
e.g. {{SortingFields}} and {{CachedXXXDVs}} should be able to take
advantage of the fact that the things they are sorting are all already
in heap (the indexing buffer), the way you did with
{{MutableSortingPointValues}} (cool).

Can we rename {{freezed}} to {{frozen}} in {{BinaryDocValuesWriter}}?
But: why would {{freezed}} ever be true when we call {{flush}}?
Shouldn't it only be called once, even in the sorting case?

I think the 6.x back port here is going to be especially tricky :)

Can we block creating a {{SortingLeafReader}} now (make its
constructor private)?  We only now ever use its inner classes I think?
And it is a dangerous class in the first place...  if we can do that,
maybe we rename it {{SortingCodecUtils}} or something, just for its
inner classes.

Do any of the exceptions tests for IndexWriter get angry?  Seems like
if we hit an {{IOException}} e.g. during the renaming that
{{SortingStoredFieldsConsumer.flush}} does we may leave undeleted
files?  Hmm or perhaps IW takes care of that by wrapping the directory
itself...

Can't you just pass {{sortMap::newToOld}} directly (method reference)
instead of making the lambda here?:

{noformat}
      writer.sort(state.segmentInfo.maxDoc(), mergeReader, state.fieldInfos,
          (docID) -> (sortMap.newToOld(docID)));
{noformat}


> Sorting on flushed segment
> --------------------------
>
>                 Key: LUCENE-7579
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7579
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Ferenczi Jim
>
> Today flushed segments built by an index writer with an index sort specified are not
sorted. The merge is responsible of sorting these segments potentially with others that are
already sorted (resulted from another merge). 
> I'd like to investigate the cost of sorting the segment directly during the flush. This
could make the merge faster since they are some cheap optimizations that can be done only
if all segments to be merged are sorted.
>  For instance the merge of the points could use the bulk merge instead of rebuilding
the points from scratch.
> I made a small prototype which sort the segment on flush here:
> https://github.com/apache/lucene-solr/compare/master...jimczi:flush_sort
> The idea is simple, for points, norms, docvalues and terms I use the SortingLeafReader
implementation to translate the values that we have in RAM in a sorted enumeration for the
writers.
> For stored fields I use a two pass scheme where the documents are first written to disk
unsorted and then copied to another file with the correct sorting. I use the same stored field
format for the two steps and just remove the file produced by the first pass at the end of
the process.
> This prototype has no implementation for index sorting that use term vectors yet. I'll
add this later if the tests are good enough.
> Speaking of testing, I tried this branch on [~mikemccand] benchmark scripts and compared
master with index sorting against my branch with index sorting on flush. I tried with sparsetaxis
and wikipedia and the first results are weird. When I use the SerialScheduler and only one
thread to write the docs,  index sorting on flush is slower. But when I use two threads the
sorting on flush is much faster even with the SerialScheduler. I'll continue to run the tests
in order to be able to share something more meaningful.
> The tests are passing except one about concurrent DV updates. I don't know this part
at all so I did not fix the test yet. I don't even know if we can make it work with index
sorting ;).
>  [~mikemccand] I would love to have your feedback about the prototype. Could you please
take a look ? I am sure there are plenty of bugs, ... but I think it's a good start to evaluate
the feasibility of this feature.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Mime
View raw message