asterixdb-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Mike Carey <dtab...@gmail.com>
Subject Re: Adapting TimSort into AsterixDB/Hyracks
Date Sat, 28 Oct 2017 22:31:01 GMT
Excellent thoughts!  We need to find someone in our community with a 
sorting wish.....


On 10/28/17 2:26 PM, Chen Luo wrote:
> Another sort performance issue to notice is the "normalized key".  When we
> sort frames of tuples, we build a directory of pointers to avoid memory
> movements. Each record pointer has four integers, i.e., *frame Id, tuple
> start, tuple end, normalized key*. Normalized key is critical to speed up
> comparisons by avoiding random memory accesses. On 1GB random records,
> using normalized key can improve sort performance by *3-4x*. However, this
> normalized key idea is under utilized in AsterixDB layer:
>
>     - The current length of normalized key is only 4 bytes, which limits its
>     applicability of longer (e.g., time/long/double) but fixed length types.
>     - Normalized key only skips memory accesses when two keys are different.
>     However, when records have a lot of duplicate keys, e.g., order on some
>     stateID or countyID, current mechanism won't help too much.
>     - Only a limited number of types have implemented this normalized key
>     computer (see [1]). For example, *UUID does not support it*, which makes
>     it much slower to sort UUID keys. Ideally, every sortable type should have
>     a corresponding mechanism to compute normalized key.
>
> These issues need to be resolved (maybe later) to improve sort performance.
> Otherwise, even the best cache-friendly sorting algorithm won't help too
> much, since each comparison would incur multiple cache misses.
>
> [1]
> https://github.com/apache/asterixdb/blob/b0595873432ccfb298d780da92151bc9dd2f6840/asterixdb/asterix-om/src/main/java/org/apache/asterix/formats/nontagged/NormalizedKeyComputerFactoryProvider.java
>
>
> On Sat, Oct 28, 2017 at 1:50 PM, Chen Luo <cluo8@uci.edu> wrote:
>
>> That can be done in Hyracks layer, but enabling Quicksort in AsterixDB has
>> other issues. Right now AsterixDB only uses merge sort because of its
>> stableness. Once switching to Quicksort, we have to make sure all the other
>> operators are happy with this change (some operators are OK with it, but
>> some are not).
>>
>> On Sat, Oct 28, 2017 at 12:05 PM, Mike Carey <dtabass@gmail.com> wrote:
>>
>>> So the word on the web seems maybe to be that Quicksort is generally
>>> superior and cache-friendly (cache oblivious). Wondering if we should just
>>> get our Quicksort code under control?
>>>
>>>
>>>
>>> On 10/28/17 11:36 AM, Chen Luo wrote:
>>>
>>>> Not exactly sure about this right now... But since TimSort is
>>>> essentially a
>>>> combination of insertion sort and merge sort, it's cache-friendliness
>>>> won't
>>>> be worse than our merge sort.
>>>>
>>>> This TimSort could be served as a short-term plugin-and-play improvement
>>>> of
>>>> our sorting algorithm. It is still stable, the same as our current merge
>>>> sort, but faster, especially on partially ordered dataset.
>>>>
>>>> Best regards,
>>>> Chen Luo
>>>>
>>>> On Sat, Oct 28, 2017 at 10:58 AM, Mike Carey <dtabass@gmail.com> wrote:
>>>>
>>>> How is it on cache-friendliness?
>>>>>
>>>>>
>>>>> On 10/27/17 11:38 PM, abdullah alamoudi wrote:
>>>>>
>>>>> While I have no answer to the question of legality, this sounds great.
>>>>>> ~Abdullah.
>>>>>>
>>>>>> On Oct 27, 2017, at 9:20 PM, Chen Luo <cluo8@uci.edu> wrote:
>>>>>>
>>>>>>> Hi devs,
>>>>>>>
>>>>>>> I have adapted the TimSort algorithm used in JDK (java.util.TimSort)
>>>>>>> into
>>>>>>> Hyracks, which gives 10-20% performance improvements on random
data.
>>>>>>> It
>>>>>>> will be more useful if the input data is partially sorted, e.g.,
>>>>>>> primary
>>>>>>> keys fetched from secondary index scan, which I haven't got time
to
>>>>>>> experiment with.
>>>>>>>
>>>>>>> *Before going any further, is it legal to adapt some algorithm
>>>>>>> implementation from JDK into our codebase? *I saw the JDK
>>>>>>> implementation
>>>>>>> itself is adopted from
>>>>>>> http://svn.python.org/projects/python/trunk/Objects/listsort.txt
as
>>>>>>> well.
>>>>>>>
>>>>>>> Best regards,
>>>>>>> Chen Luo
>>>>>>>
>>>>>>>


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message