From: "Ferenczi Jim (JIRA)"
To: dev@lucene.apache.org
Date: Fri, 16 Dec 2016 10:01:58 +0000 (UTC)
Subject: [jira] [Commented] (LUCENE-7579) Sorting on flushed segment

[ https://issues.apache.org/jira/browse/LUCENE-7579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15753993#comment-15753993 ]

Ferenczi Jim commented on LUCENE-7579:
--------------------------------------

This new API may be a premature optimization that should not be part of this change.
What about removing the API and rolling back to a non-optimized copy that "visits" each doc and copies it, like the StoredFieldsReader does? This way the function would be private to the StoredFieldsConsumer. We can still add the optimization you're describing later, but it could be confusing if the writes of the index writer are not compressed the same way as the other writes for stored fields.

> Sorting on flushed segment
> --------------------------
>
>                 Key: LUCENE-7579
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7579
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Ferenczi Jim
>
> Today flushed segments built by an index writer with an index sort specified are not sorted. The merge is responsible for sorting these segments, potentially together with others that are already sorted (resulting from another merge).
> I'd like to investigate the cost of sorting the segment directly during the flush. This could make the merge faster, since there are some cheap optimizations that can be done only if all segments to be merged are sorted.
> For instance, the merge of the points could use the bulk merge instead of rebuilding the points from scratch.
> I made a small prototype which sorts the segment on flush here:
> https://github.com/apache/lucene-solr/compare/master...jimczi:flush_sort
> The idea is simple: for points, norms, docvalues and terms I use the SortingLeafReader implementation to translate the values that we have in RAM into a sorted enumeration for the writers.
> For stored fields I use a two-pass scheme where the documents are first written to disk unsorted and then copied to another file with the correct sorting. I use the same stored field format for the two steps and just remove the file produced by the first pass at the end of the process.
> This prototype has no implementation yet for index sorting that uses term vectors. I'll add this later if the tests are good enough.
> Speaking of testing, I tried this branch with [~mikemccand]'s benchmark scripts and compared master with index sorting against my branch with index sorting on flush. I tried with sparsetaxis and wikipedia, and the first results are weird. When I use the SerialScheduler and only one thread to write the docs, index sorting on flush is slower. But when I use two threads, sorting on flush is much faster, even with the SerialScheduler. I'll continue to run the tests in order to be able to share something more meaningful.
> The tests are passing except one about concurrent DV updates. I don't know this part at all, so I did not fix the test yet. I don't even know if we can make it work with index sorting ;).
> [~mikemccand] I would love to have your feedback about the prototype. Could you please take a look? I am sure there are plenty of bugs, ... but I think it's a good start to evaluate the feasibility of this feature.
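
The two-pass scheme for stored fields described in the quoted issue can be illustrated with a small sketch. This is plain Java, not the prototype's code and not Lucene's API: the class FlushSortSketch, its methods, and the use of one long sort key per document are all hypothetical, and in-memory lists stand in for the two on-disk files.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Hypothetical illustration only; none of these names exist in Lucene. */
final class FlushSortSketch {

    /**
     * Computes the sorted order of the buffered docs.
     * Entry i of the result is the original doc id that ends up at position i.
     */
    public static int[] sortedOrder(long[] sortKeys) {
        Integer[] docs = new Integer[sortKeys.length];
        for (int i = 0; i < docs.length; i++) {
            docs[i] = i;
        }
        // Arrays.sort on object arrays is stable, so docs with equal keys
        // keep their original relative order.
        Arrays.sort(docs, Comparator.comparingLong(d -> sortKeys[d]));
        int[] newToOld = new int[docs.length];
        for (int i = 0; i < docs.length; i++) {
            newToOld[i] = docs[i];
        }
        return newToOld;
    }

    /**
     * Second pass: visits the first-pass (unsorted) output in sorted order
     * and copies each doc to the final output.
     */
    public static List<String> copySorted(List<String> unsorted, int[] newToOld) {
        List<String> sorted = new ArrayList<>(newToOld.length);
        for (int oldDoc : newToOld) {
            sorted.add(unsorted.get(oldDoc));
        }
        return sorted;
    }

    public static void main(String[] args) {
        // First pass: docs are "written" in arrival order.
        List<String> firstPass = Arrays.asList("a", "b", "c");
        long[] sortKeys = {30L, 10L, 20L};

        // Second pass: copy in index-sort order; the first-pass output
        // would then be deleted.
        List<String> secondPass = copySorted(firstPass, sortedOrder(sortKeys));
        System.out.println(secondPass); // prints [b, c, a]
    }
}
```

The same doc-id permutation is what a sorting view over the in-RAM values would use to enumerate points, norms, docvalues and terms in sorted order; only stored fields need the extra copy, because they have already been written out by the time the order is known.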