tez-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Gopal V (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TEZ-2606) Cache-friendly data structure for sorting
Date Thu, 09 Jul 2015 18:54:07 GMT

    [ https://issues.apache.org/jira/browse/TEZ-2606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14621065#comment-14621065
] 

Gopal V commented on TEZ-2606:
------------------------------

bq. My idea is kvmeta's including new prefix of key's binary.

Yes, that's the partition LSB is the prefix of the key's binary, when you run Hive on Tez.


The full integer in partition is split into two groups - bits for the partition id and the
remaining bits used for the prefix of the key.

{code}
    if(hasher != null) {
      prefix = hasher.getProxy(key);
    }

    prefix = (partition << (32 - partitionBits)) | (prefix >>> partitionBits);
{code}

The original naive implementation of reading the bytes post-serialization in Tez didn't seem
to help at all (due to writables) and broke reverse ordering.

In the TezBytesComparator, we read out the key prefix (always as a positive integer to avoid
complement bit masks) and special-case it so that PIG or Cascading which needs ordering be
different from the byte ordinal comparisons won't be broken.

https://github.com/apache/tez/blob/master/tez-runtime-library/src/main/java/org/apache/tez/runtime/library/common/comparator/TezBytesComparator.java#L59

And even that doesn't really work unless you use TEZ-1288, since for most Writables the first
few bytes are the size of the structure, not part of the key.

This performance boost has less to do with the cache coherence part of the problem and more
to do with the fact that the RawComparator interface thunk/trampoline is more expensive -
the partition-a > partition-b check now short-circuits more of those, which is the real
win (particularly for Cascading, which does tuple comparisons - not byte).

> Cache-friendly data structure for sorting
> -----------------------------------------
>
>                 Key: TEZ-2606
>                 URL: https://issues.apache.org/jira/browse/TEZ-2606
>             Project: Apache Tez
>          Issue Type: Sub-task
>            Reporter: Tsuyoshi Ozawa
>            Assignee: Tsuyoshi Ozawa
>
> Alphasort[1]  mentions prefix key sort is effective way. I'd like to suggest to change
a layout of ring buffer to include prefix of key in meta data. This can improve the cache
hit rate when sorting.
> [1] Alphasort: http://dl.acm.org/citation.cfm?id=615237



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message