cassandra-commits mailing list archives

From "Muga Nishizawa (JIRA)" <j...@apache.org>
Subject [jira] Issue Comment Edited: (CASSANDRA-1735) Using MessagePack for reducing data size
Date Wed, 17 Nov 2010 05:42:17 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-1735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12932822#action_12932822
] 

Muga Nishizawa edited comment on CASSANDRA-1735 at 11/17/10 12:40 AM:
----------------------------------------------------------------------

Jonathan,

Thanks for your response.

>What kind of performance improvement do you see with this patch?

The performance improvements available with this patch are the following:
* Reduced serialization cost and data size
* Increased throughput between clients and a Cassandra node

I have also measured the performance of MessagePack from the viewpoints of serialization
cost and throughput.  I will discuss the details below.

== Reduction of serialization cost and the data size ==

(Summary)
MessagePack proved better at reducing serialization cost and data size than the other
serialization libraries in the test below.

(Test environment)
I used "jvm-serializers", a well-known benchmark, to compare MessagePack with Protocol
Buffers, Thrift, and Avro.  The machine used for this benchmark has a Core2 Duo 2GHz CPU
with 1GB RAM.

(Results)
             create   ser  +same  deser  +shal  +deep  total  size  +dfl
 protobuf       683  6016   2973   3338   3454   3759   9775   239   149
 thrift         572  6287   5565   3479   3616   3770  10057   349   197
 msgpack        291  4935   4750   3468   3545   3708   8748   236   150
 avro          2698  6409   3623   7480   9301  10481  16890   221   133

(Comments)
It might be better to compare serialization cost using Cassandra objects such as a Column
object.  But such objects and their sizes vary by user, so they are not suitable for comparing
the serialization cost of varied data.  According to the results above, MessagePack's serialized
data is slightly larger than Avro's, but MessagePack has a significantly lower serialization
cost than Avro and Thrift.
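To make the size reduction concrete, here is a from-scratch sketch of a few cases of the
MessagePack wire format (positive fixint, fixstr, and fixmap, following the published format
design).  This is illustrative toy code, not the msgpack library, and the `column` value is a
made-up example; it compares the encoded size of a small map against compact JSON:

```python
import json

def msgpack_encode(obj):
    # Minimal MessagePack encoder for small ints, short strings, and
    # small dicts -- just enough to illustrate the wire format.
    if isinstance(obj, int) and 0 <= obj <= 0x7f:
        return bytes([obj])                    # positive fixint: 1 byte total
    if isinstance(obj, str) and len(obj.encode("utf-8")) <= 31:
        b = obj.encode("utf-8")
        return bytes([0xa0 | len(b)]) + b      # fixstr: 1-byte header + bytes
    if isinstance(obj, dict) and len(obj) <= 15:
        out = bytes([0x80 | len(obj)])         # fixmap: 1-byte header
        for k, v in obj.items():
            out += msgpack_encode(k) + msgpack_encode(v)
        return out
    raise ValueError("type not supported in this sketch")

column = {"name": "age", "value": 27}          # hypothetical tiny record
packed = msgpack_encode(column)
as_json = json.dumps(column, separators=(",", ":")).encode()
print(len(packed), len(as_json))  # 17 25
```

Every header here is a single byte, which is why small values pack so tightly compared to
a text encoding like JSON or Thrift's field-ID-plus-type framing.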

== Increasing throughput ==

(Summary)
I compared the MessagePack-based RPC of Cassandra with the Thrift-based one.  Random read
throughput of the MessagePack-based RPC is 15% higher than Thrift's, and random write
throughput is 21% higher.

(Test environment)
In this evaluation, a Cassandra node ran standalone on a machine with a Core2 Duo 2GHz CPU
and 1GB RAM.  Client programs ran on two machines, each also with a Core2 Duo 2GHz CPU and
1GB RAM.  The client program was based on the ring cache.  It created 100 threads per JVM
on each machine, and each thread accessed the Cassandra node through the ring cache.
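The shape of that load generator can be sketched roughly as follows.  This is a hypothetical,
scaled-down sketch: `issue_request` is a placeholder for the real ring-cache-routed read or
write RPC, and the thread and request counts are much smaller than in the actual test:

```python
import threading

REQUESTS_PER_THREAD = 100
NUM_THREADS = 10          # the real test used 100 threads per JVM

completed = 0
lock = threading.Lock()

def issue_request():
    pass  # placeholder: the real client sends a read or write RPC here

def worker():
    global completed
    for _ in range(REQUESTS_PER_THREAD):
        issue_request()
        with lock:              # shared counter for throughput accounting
            completed += 1

threads = [threading.Thread(target=worker) for _ in range(NUM_THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(completed)  # 1000 requests issued in total
```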

(Results)
* Thrift-based RPC part of Cassandra (read: 5,200 queries/sec., write: 11,200 queries/sec.)
* MessagePack-based RPC part of Cassandra (read: 6,000 queries/sec., write: 13,600 queries/sec.)
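The quoted 15% and 21% improvements follow directly from these raw numbers:

```python
# Derive the improvement percentages from the measured throughputs.
thrift_read, thrift_write = 5200, 11200
msgpack_read, msgpack_write = 6000, 13600

read_gain = (msgpack_read / thrift_read - 1) * 100
write_gain = (msgpack_write / thrift_write - 1) * 100
print(round(read_gain), round(write_gain))  # 15 21
```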

(Comments)
I measured the maximum throughput of random access (reads and writes) after 100 small items
were stored in the Cassandra node.  I kept the data set small in order to make the Cassandra
node CPU-bound: if the node were disk-I/O-bound, I could not properly evaluate the maximum
throughput of the RPC part.
 

I did not directly measure the amount of data transferred over the network during the
evaluation.  But from the jvm-serializers benchmark results, I believe the amount of data
transferred by MessagePack-based Cassandra would be smaller than that of Thrift.


> Using MessagePack for reducing data size
> ----------------------------------------
>
>                 Key: CASSANDRA-1735
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1735
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: API
>    Affects Versions: 0.7 beta 3
>         Environment: Fedora11,  JDK1.6.0_20
>            Reporter: Muga Nishizawa
>         Attachments: 0001-implement-a-Cassandra-RPC-part-with-MessagePack.patch, dependency_libs.zip
>
>
> To improve Cassandra performance, I implemented a Cassandra RPC part with MessagePack.
> The implementation details are attached as a patch.  The patch works on Cassandra 0.7.0-beta3.
> Please check it.
> MessagePack is a cross-language object serialization library, like Thrift and Protocol
> Buffers, but it is much faster, smaller, and easier to implement.  MessagePack reduces
> serialization cost and data size on the network and on disk.
> MessagePack websites are
>     * website: http://msgpack.org/
>         This website compares MessagePack, Thrift and JSON.  
>     * design details: http://redmine.msgpack.org/projects/msgpack/wiki/FormatDesign
>     * source code: https://github.com/msgpack/msgpack/
> The performance of the data serialization library is one of the most important issues in
> developing a distributed database in Java.  If that performance is bad, it significantly
> reduces the overall database performance, and Java's GC also runs many times.  Cassandra
> has this problem as well.
> To reduce the data size on the network between a client and Cassandra, I prototyped an
> implementation of the Cassandra RPC part with MessagePack and MessagePack-RPC.  The
> implementation is very simple.  MessagePack-RPC can reuse the existing Thrift-based
> CassandraServer (org.apache.cassandra.thrift.CassandraServer) while adapting MessagePack's
> communication protocol and data serialization.
> Major features of MessagePack-RPC are 
>     * Asynchronous RPC
>     * Parallel Pipelining
>     * Connection pooling
>     * Delayed return
>     * Event-driven I/O
>     * more details: http://redmine.msgpack.org/projects/msgpack/wiki/RPCDesign
>     * source code: https://github.com/msgpack/msgpack-rpc/
> The attached patch includes a ring cache program for MessagePack and its test program.
> With it you can check the behavior of the Cassandra RPC with MessagePack.
> Thanks in advance, 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

