cassandra-commits mailing list archives

From "Benoit Perroud (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-3122) SSTableSimpleUnsortedWriter take long time when inserting big rows
Date Sun, 04 Sep 2011 12:54:09 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-3122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13096867#comment-13096867 ]

Benoit Perroud commented on CASSANDRA-3122:
-------------------------------------------

Digging further into SSTableSimpleUnsortedWriter, I found another issue:

every time newRow is called, serializedSize iterates through all the columns to compute the
size.

In my use case, I have lines with hourly values (data:h0|h1|h2|...|h23). For every line,
I use the date of the day concatenated with the hour as the row key ("dateoftheday|hour"), and
a composite of the value and the data as the column name ([value,data]=null). More
concretely, my data looks like:
abc:1|2|1|2|1|2|1|2|1|2|1|2|1|2|1|2|1|2|1|2|1|2|1|2
bcd:3|4|3|4|3|4|3|4|3|4|3|4|3|4|3|4|3|4|3|4|3|4|3|4

and then for every line I call

writer.newRow("20110804|0"), writer.addColumn(Composite(1, "abc"), empty_array),
writer.newRow("20110804|1"), writer.addColumn(Composite(2, "abc"), empty_array),
writer.newRow("20110804|2"), writer.addColumn(Composite(1, "abc"), empty_array),
writer.newRow("20110804|3"), writer.addColumn(Composite(2, "abc"), empty_array),
...

So writer.newRow() is called 24 times for every line.
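
The mapping described above can be sketched as follows. This is a minimal, self-contained illustration that does not use the Cassandra API: HourlyLineMapper and toCells are hypothetical names, and the composite column name is approximated by a plain "value:data" string.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only (not part of Cassandra).
class HourlyLineMapper {
    // Turns one input line ("data:h0|h1|...|h23") into {rowKey, columnName} pairs,
    // one pair per hourly value. Each pair corresponds to one newRow + addColumn call.
    static List<String[]> toCells(String day, String line) {
        String[] parts = line.split(":", 2);
        String data = parts[0];
        String[] hourly = parts[1].split("\\|");
        List<String[]> cells = new ArrayList<String[]>();
        for (int hour = 0; hour < hourly.length; hour++) {
            String rowKey = day + "|" + hour;                 // "dateoftheday|hour"
            String columnName = hourly[hour] + ":" + data;    // stand-in for Composite(value, data)
            cells.add(new String[] { rowKey, columnName });
        }
        return cells;
    }
}
```

With 24 hourly values per line, toCells returns 24 pairs, which is why newRow ends up being called 24 times per line.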

One solution could be a local class "CachedSizeColumnFamily" extending ColumnFamily
that increases the serialized size on every addColumn and returns it directly when serializedSize()
is called.
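
A rough sketch of that idea, under stated assumptions: SimpleColumnFamily below is a simplified stand-in and not Cassandra's actual ColumnFamily, and the per-column size (name bytes plus value bytes) is an assumption made for illustration, not the real serialization format.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;

// Simplified stand-in for a column family (hypothetical, not Cassandra's class).
class SimpleColumnFamily {
    protected final TreeMap<String, byte[]> columns = new TreeMap<String, byte[]>();

    // Assumed size of one column: name bytes + value bytes.
    static long sizeOf(String name, byte[] value) {
        return name.getBytes(StandardCharsets.UTF_8).length + value.length;
    }

    void addColumn(String name, byte[] value) {
        columns.put(name, value);
    }

    // Iterates over every column on each call, so the cost grows with the row.
    long serializedSize() {
        long size = 0;
        for (Map.Entry<String, byte[]> e : columns.entrySet()) {
            size += sizeOf(e.getKey(), e.getValue());
        }
        return size;
    }
}

// Keeps a running total so serializedSize() no longer has to iterate.
class CachedSizeColumnFamily extends SimpleColumnFamily {
    private long cachedSize = 0;

    @Override
    void addColumn(String name, byte[] value) {
        byte[] previous = columns.put(name, value);
        if (previous != null) {
            cachedSize -= sizeOf(name, previous); // overwritten column no longer counts
        }
        cachedSize += sizeOf(name, value);
    }

    @Override
    long serializedSize() {
        return cachedSize;
    }
}
```

The trade-off is a little bookkeeping on every addColumn in exchange for a constant-time serializedSize().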

On the same topic, even though ConcurrentSkipListMap claims good performance (which is
the case in multi-threaded environments), I got much better results using a TreeMap in
ColumnFamily (which also avoids the putIfAbsent call on the ConcurrentSkipListMap). In bulk
loading, SSTableSimpleUnsortedWriter is single-threaded anyway, so there is no need for
a complex and slower data structure like ConcurrentSkipListMap. One improvement for bulk
loading would be to use a "single-threaded" ColumnFamily. This could be part
of another Jira.
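
One way to picture the "single-threaded" variant is a container whose backing sorted map is chosen at construction time. This is only a sketch: ColumnContainer and its factory methods are hypothetical names, not existing Cassandra classes, and no performance figures are implied.

```java
import java.util.NavigableMap;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentSkipListMap;

// Hypothetical column container; the backing map decides the concurrency guarantees.
class ColumnContainer {
    private final NavigableMap<String, byte[]> columns;

    private ColumnContainer(NavigableMap<String, byte[]> backing) {
        this.columns = backing;
    }

    // Safe for concurrent writers, at the cost of a more complex structure.
    static ColumnContainer concurrent() {
        return new ColumnContainer(new ConcurrentSkipListMap<String, byte[]>());
    }

    // Not thread-safe; intended for single-threaded use such as bulk loading.
    static ColumnContainer singleThreaded() {
        return new ColumnContainer(new TreeMap<String, byte[]>());
    }

    void addColumn(String name, byte[] value) {
        columns.put(name, value);
    }

    int columnCount() {
        return columns.size();
    }

    String firstColumnName() {
        return columns.firstKey();
    }
}
```

Both variants keep columns sorted by name, so callers see the same ordering; only the thread-safety guarantee differs.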



> SSTableSimpleUnsortedWriter take long time when inserting big rows
> ------------------------------------------------------------------
>
>                 Key: CASSANDRA-3122
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3122
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>    Affects Versions: 0.8.3
>            Reporter: Benoit Perroud
>            Assignee: Sylvain Lebresne
>            Priority: Minor
>             Fix For: 0.8.5
>
>         Attachments: 3122.patch, SSTableSimpleUnsortedWriter-v2.patch, SSTableSimpleUnsortedWriter.patch
>
>
> In SSTableSimpleUnsortedWriter, when dealing with rows having a lot of columns, if we
> call newRow several times (to flush data as soon as possible), the time taken by the newRow()
> call increases non-linearly. This is because when newRow is called, we merge the ever-growing
> existing CF with the new one.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
