hbase-issues mailing list archives

From "ryan rawson (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HBASE-3480) Reduce the size of Result serialization
Date Mon, 07 Mar 2011 04:56:59 GMT

    [ https://issues.apache.org/jira/browse/HBASE-3480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13003260#comment-13003260 ]

ryan rawson commented on HBASE-3480:
------------------------------------

In my test, the cost of serialization was larger than the time saved
transmitting the data over the wire.
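
A rough way to check this trade-off is to add the serialization time to the
wire time for each sample. The sketch below uses the four (size, time) samples
from the issue description quoted below and its 120 MB/sec line rate. It
assumes sizes are in bytes, that wire time is simply size divided by line rate
(no latency or protocol overhead), and the helper names are made up for
illustration. The old and new samples are not necessarily the same requests,
so the comparison is only indicative.

```python
# Assumed payload rate for gigabit Ethernet, taken from the issue text.
LINE_RATE_BYTES_PER_SEC = 120e6


def wire_ms(size_bytes, rate=LINE_RATE_BYTES_PER_SEC):
    """Milliseconds needed to push size_bytes across the link."""
    return size_bytes / rate * 1000.0


def total_ms(size_bytes, serialize_ns):
    """Serialization time plus wire time, in milliseconds."""
    return serialize_ns / 1e6 + wire_ms(size_bytes)


# (serialized size in bytes, serialization time in ns), copied from the issue.
old_samples = [(3874599, 10685836), (5582725, 11525888)]
new_samples = [(1898788, 118504672), (1630058, 91133003)]

for label, samples in (("old", old_samples), ("new", new_samples)):
    for size, ns in samples:
        print(f"{label}: {size} bytes, "
              f"serialize {ns / 1e6:.1f} ms, "
              f"wire {wire_ms(size):.1f} ms, "
              f"total {total_ms(size, ns):.1f} ms")
```

Under these assumptions the smaller payload saves a few tens of milliseconds
on the wire at most, while the new encoder spends roughly 80-110 ms more per
reply on serialization.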

I think we are going to put a freeze on RPC changes soon; we need to
be thinking about the next generation.


> Reduce the size of Result serialization
> ---------------------------------------
>
>                 Key: HBASE-3480
>                 URL: https://issues.apache.org/jira/browse/HBASE-3480
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.90.0
>            Reporter: ryan rawson
>         Attachments: HBASE-3480-lzf.txt, HBASE-3480.txt
>
>
> When faced with a gigabit Ethernet network connection, things are actually pretty slow.
> For example, take a 2 MB reply: at a 120 MB/sec line rate, it takes about 17 ms to
> transfer that data across a GigE line.  This is a pretty significant amount of time.
> So this JIRA is about reducing the size of the Result[] serialization.  By exploiting
> duplication of the family, qualifier, and row key, I created a simple encoding scheme that
> uses a dictionary instead of literal strings.
> In my testing, I am seeing some success with the sizes.  The average serialized size is
> about 1/2 of the previous size, but the time to serialize on the regionserver side is way
> up, by roughly a factor of 10.  This might be due to the simplistic first implementation, however.
> Here is the post-change size:
> grep 'Serialized size' * |
>   perl -ne '/Serialized size: (\d+?) in (\d+?) ns/ ; print $1, " ", $2, "\n" if $1 > 10000;' |
>   cut -f1 -d' ' |
>   perl -ne '$sum += $_; $count++; END {print $sum/$count, "\n"}'
> 377047.1125
> Here is the pre-change size:
> grep 'Serialized size' * |
>   perl -ne '/Serialized size: (\d+?) in (\d+?) ns/ ; print $1, " ", $2, "\n" if $1 > 10000;' |
>   cut -f1 -d' ' |
>   perl -ne '$sum += $_; $count++; END {print $sum/$count, "\n"}'
> 601078.505882353
> That is about a 37% reduction in size (the new average is roughly 63% of the old one).
> But the times are not so good.  Here are some samples from the old implementation, as (size in bytes) (time in ns):
> 3874599 10685836
> 5582725 11525888
> So that is about 11 ms to serialize 3.9-5.6 MB of data.
> In the new implementation:
> 1898788 118504672
> 1630058 91133003
> This is 91-118 ms for serialized sizes of 1.6-1.9 MB.
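
For readers unfamiliar with the idea, here is a minimal sketch of the kind of
dictionary encoding the description mentions: each distinct row key, family,
and qualifier is written once, and every cell then refers to it by index. This
is not the code from the attached patches. The wire layout, the function
names, and the simplified (row, family, qualifier, value) cell shape (no
timestamp or type) are all assumptions made for illustration.

```python
import struct


def encode_literal(cells):
    """Baseline: write every field of every cell as a length-prefixed string."""
    out = bytearray(struct.pack(">I", len(cells)))
    for cell in cells:
        for part in cell:
            out += struct.pack(">I", len(part)) + part
    return bytes(out)


def encode_dictionary(cells):
    """Write distinct row/family/qualifier strings once; cells hold indexes."""
    index_of = {}   # byte string -> dictionary index
    entries = []    # dictionary index -> byte string

    def ref(b):
        if b not in index_of:
            index_of[b] = len(entries)
            entries.append(b)
        return index_of[b]

    body = bytearray()
    for row, family, qualifier, value in cells:
        body += struct.pack(">III", ref(row), ref(family), ref(qualifier))
        body += struct.pack(">I", len(value)) + value

    out = bytearray(struct.pack(">I", len(entries)))
    for entry in entries:
        out += struct.pack(">I", len(entry)) + entry
    out += struct.pack(">I", len(cells)) + body
    return bytes(out)


def decode_dictionary(buf):
    """Inverse of encode_dictionary; returns the list of cell tuples."""
    pos = 0

    def u32():
        nonlocal pos
        (v,) = struct.unpack_from(">I", buf, pos)
        pos += 4
        return v

    def blob():
        nonlocal pos
        n = u32()
        b = buf[pos:pos + n]
        pos += n
        return b

    entries = [blob() for _ in range(u32())]
    cells = []
    for _ in range(u32()):
        r, f, q = u32(), u32(), u32()
        cells.append((entries[r], entries[f], entries[q], blob()))
    return cells


# Hypothetical data: one row, one family, three qualifiers repeated many times.
cells = [(b"row-0001", b"family", b"qual-%d" % (i % 3), b"v%d" % i)
         for i in range(100)]
print(len(encode_literal(cells)), len(encode_dictionary(cells)))
```

The size win depends entirely on how much duplication the result set contains,
and the per-cell dictionary lookups are extra work on the serializing side.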

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
