crunch-dev mailing list archives

From "Chao Shi (JIRA)" <>
Subject [jira] [Commented] (CRUNCH-368) TupleWritable.Comparator
Date Sun, 23 Mar 2014 14:59:42 GMT


Chao Shi commented on CRUNCH-368:

bq. Any idea on the actual performance improvement of this patch?

I tested it on a modified version of the SecondarySortExample program, which I changed from
using Avro to using Writables.

The input data is generated by "". The length of the primary key is changed for each
round. Three variants are tested: the original program (which uses Avro), and the Writable
version with and without this patch. All runs are on a single-node hadoop2 cluster. The
results show that this patch saves ~15% of the running time in this case.

More interestingly, the Avro version is much faster than both Writable versions.
I noticed in the MR log that it spills half as many times as the Writable versions. I guess
this is because Avro encodes data more compactly (e.g. variable-length integers in Avro vs.
64-bit longs in Writables). My sorting buffer is set to 100 MB (the default).
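
To put rough numbers on the size difference guessed at above, here is a minimal sketch that counts the bytes a long takes under zig-zag plus 7-bit var-int coding (the scheme the Avro specification describes for int and long) against a fixed 8-byte long. The class and method names are made up for illustration, and this only shows encoded size; it does not measure spills.

```java
/** Illustration only: names are hypothetical, not part of Avro, Hadoop, or Crunch. */
final class VarLongSizeSketch {

    /** Size of a fixed-width long, as written by DataOutput.writeLong. */
    static final int FIXED_LONG_SIZE = Long.BYTES;

    /** Number of bytes a long occupies under zig-zag + 7-bit var-int coding. */
    static int zigZagVarLongSize(long value) {
        // Zig-zag maps small magnitudes (positive or negative) to small unsigned values.
        long z = (value << 1) ^ (value >> 63);
        int size = 1;
        // Each output byte carries 7 payload bits; keep going while higher bits remain.
        while ((z & ~0x7FL) != 0) {
            z >>>= 7;
            size++;
        }
        return size;
    }
}
```

Small values need one to three bytes here instead of eight, while values near Long.MAX_VALUE need ten, so the saving depends on the data.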

> TupleWritable.Comparator
> ------------------------
>                 Key: CRUNCH-368
>                 URL:
>             Project: Crunch
>          Issue Type: Improvement
>          Components: Core
>    Affects Versions: 0.10.0, 0.8.3
>            Reporter: Chao Shi
>            Assignee: Chao Shi
>         Attachments: crunch-368 benchmark.pdf, crunch-368.patch,
> This patch should improve comparison performance on TupleWritables. It saves the deserialization
> overhead. It is particularly useful when the input tuples are large, e.g. contain long strings.
> Please note that this changes the binary format of TupleWritable. It adds a var-int indicating
> the size of the field after each type code. This is needed because of a limitation of the
> writable system: we do not know the size of each field until it is fully deserialized.
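
The idea in the description can be sketched roughly as follows. This is not the code from crunch-368.patch and not the real TupleWritable layout: it assumes a hypothetical format of [type code][var-int size][payload] per field, with every field stored as UTF-8 text under a made-up type code of 1. The point is that the size prefix lets the comparator bound each field and step to the next one while working directly on the serialized bytes.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

/** Sketch only: a hypothetical layout, not Crunch's actual TupleWritable format. */
final class LengthPrefixedTupleSketch {

    /** Writes an unsigned var-int: 7 bits per byte, high bit set when more bytes follow. */
    static void writeVarInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    /** Reads a var-int at pos[0] and advances pos[0] past it. */
    static int readVarInt(byte[] buf, int[] pos) {
        int value = 0, shift = 0, b;
        do {
            b = buf[pos[0]++] & 0xFF;
            value |= (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return value;
    }

    /** Serializes each field as [typeCode][var-int size][payload]. */
    static byte[] encode(String... fields) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (String field : fields) {
            byte[] payload = field.getBytes(StandardCharsets.UTF_8);
            out.write(1); // hypothetical type code for "text"
            writeVarInt(out, payload.length);
            out.write(payload, 0, payload.length);
        }
        return out.toByteArray();
    }

    /** Compares two encoded tuples field by field without building any objects. */
    static int compareRaw(byte[] a, byte[] b) {
        int[] pa = {0}, pb = {0};
        while (pa[0] < a.length && pb[0] < b.length) {
            int typeCmp = Integer.compare(a[pa[0]++] & 0xFF, b[pb[0]++] & 0xFF);
            if (typeCmp != 0) return typeCmp;
            // The size prefix tells us where each field ends before we look at its bytes.
            int lenA = readVarInt(a, pa), lenB = readVarInt(b, pb);
            int common = Math.min(lenA, lenB);
            for (int k = 0; k < common; k++) {
                int c = Integer.compare(a[pa[0] + k] & 0xFF, b[pb[0] + k] & 0xFF);
                if (c != 0) return c;
            }
            if (lenA != lenB) return Integer.compare(lenA, lenB);
            pa[0] += lenA; // equal fields: jump straight to the next one
            pb[0] += lenB;
        }
        // All shared fields are equal; the tuple with fewer fields sorts first.
        return Integer.compare(a.length - pa[0], b.length - pb[0]);
    }
}
```

For example, `compareRaw(encode("k", "2"), encode("k", "10"))` decides the order on the second field after skipping the equal first field by its size. The comparison here is plain unsigned byte order, which is an assumption of the sketch; a real comparator would dispatch on the type code.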

This message was sent by Atlassian JIRA
