lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Uwe Schindler (JIRA)" <j...@apache.org>
Subject [jira] Updated: (LUCENE-2719) Re-add SorterTemplate and use it to provide fast ArraySorting and replace BytesRefHash sorting
Date Tue, 26 Oct 2010 23:08:23 GMT

     [ https://issues.apache.org/jira/browse/LUCENE-2719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Uwe Schindler updated LUCENE-2719:
----------------------------------

    Attachment: LUCENE-2719-CollSupport.patch

After spending the evening with performance tests on BytesRefHash and Fuzzy automatons I cam
to the following conclusion, finalized in this hopefully last patch:

- Automatons use very short, mostly presorted arrays. Quicksort is ineffective for them. Initial
tests showed that even insertionSort is enough for most of the Transition arrays. As some
automatons also contain very large Transition arrays, it showed, that then insertionSAort
gets very slow. Quicksort gets better, but as the array is already sorted, mergesort beats
them all. SorterTemplate.mergeSort contains a limit, so when array size is < 12 entries,
it uses insertion sort for the sorting (also in later merge steps if the partitioned array
gets < 12 entries).
schindlerMinimize and mccandlessDeterminize are now using mergesort.
- BytesRefHash gets about 10% speed improvement by the recent extension to SorterTemplate
with setPivot/comparePivot abstract methods. This beats the old algorithm which is currently
in trunk, as for the quicksort algorithm used, the swapping of entries in the mail loop always
compares to the pivot value. If BytesRefHash needs to resolve this values every time, it gets
slow. The new patch improves a modified TestBytesRefHash.testSort for perf testing by 11%
(runs with -Xbatch -server -Xmx1024m, Java 1.5 on my computer in 12.5 secs on trunk, 11.1
secs with this patch):

{code}
public void testSortPerf() {
  long start = System.currentTimeMillis();
  BytesRef ref = new BytesRef();
  for (int j = 0; j < 200 * RANDOM_MULTIPLIER; j++) {
    for (int i = 0; i < 1797; i++) {
      String str;
      do {
        str = _TestUtil.randomRealisticUnicodeString(random, 1000);
      } while (str.length() == 0);
      ref.copy(str);
      hash.add(ref);
    }
    hash.sort(BytesRef.getUTF8SortedAsUTF16Comparator());
    hash.clear();
    hash.reinit();
  }
  System.out.println("time: "+(System.currentTimeMillis()-start));
}
{code}

I will commit this patch, which now also makes insertionSort public in SorterTemplate, ArrayUtil
and CollectionUtil tomorrow. I tend to also commit this to 3.x (merged to BytesRefHash-similar
class from 3.x). This is why the CHANGES.txt removes the SorterTemplate removal message (may
need to be modified, because SorterTemplate changed API). If we will only commit to trunk,
CHANGES would keep unchanged.

> Re-add SorterTemplate and use it to provide fast ArraySorting and replace BytesRefHash
sorting
> ----------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-2719
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2719
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: 3.1
>            Reporter: Uwe Schindler
>            Assignee: Uwe Schindler
>             Fix For: 3.1, 4.0
>
>         Attachments: LUCENE-2719-CollSupport.patch, LUCENE-2719-CollSupport.patch, LUCENE-2719-CollSupport.patch,
LUCENE-2719.patch, LUCENE-2719.patch, LUCENE-2719.patch, LUCENE-2719.patch, LUCENE-2719.patch
>
>
> This patch adds back an optimized and rewritten SorterTemplate back to Lucene (removed
after release of 3.0). It is of use for several components:
> - Automaton: Automaton needs to sort States and other things. Using Arrays.sort() is
slow, because it clones internally to ensure stable search. This component is much faster.
This patch adds Arrays.sort() replacements in ArrayUtil that work with natural order or using
a Comparator<?>. You can choose between quickSort and mergeSort.
> - BytesRefHash uses another QuickSort algorithm without insertionSort for very short
ord arrays. This class uses SorterTemplate to provide the same with insertionSort fallback
in a very elegant way. Ideally this class can be used everywhere, where the sort algorithm
needs to be separated from the underlying data and you can implement a swap() and compare()
function (that get slot numbers instead of real values). This also applies to Solr (Yonik?).
> SorterTemplate provides quickSort and mergeSort algorithms. Internally for short arrays,
it automatically chooses insertionSort (like JDK's Arrays). The quickSort algorith was copied
modified from old BytesRefHash. This new class only shares MergeSort with the original CGLIB
SorterTemplate, which is no longer maintained.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Mime
View raw message