accumulo-notifications mailing list archives

From "Josh Elser (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (ACCUMULO-1083) add concurrency to HDFS write-ahead log
Date Sat, 09 Mar 2013 05:51:12 GMT

    [ https://issues.apache.org/jira/browse/ACCUMULO-1083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13597852#comment-13597852 ]

Josh Elser commented on ACCUMULO-1083:
--------------------------------------

bq. [~ecn] Concurrent, as in, a sub-set of tablets use a log, or concurrent, as in, there
are multiple log files available to log to, as described in the Big Table paper?

I think this got lost in the chatter.

bq. [~vines] Hadoop sync/appends have gone through several iterations since they were introduced
into the core release. Does anyone have any info on the performance of appends in the various
hadoop releases?

I would be curious about the results from running the same test against 0.23 or 2.0.

bq. [~billie.rinaldi] If people think it's important for everyone to be able to edit their
own comments, we can ask.

I'd agree it would be nice for everyone to be able to edit their own messages, unless there's a reason not to.

bq. [~brassard] It's also worth noting that the default Accumulo behavior is to use the default
of Hadoop, so it's not necessarily set to 3 automatically.

Given we used to run with a replication factor of 2 by default (and we didn't have widespread data-loss issues), we could have Accumulo do the setrep 2 by default for the WAL dir. But I think that merits a discussion separate from the one we're already having.

bq. [~bills] For 1.5, that table seems to indicate that adding n replicas of a log file results
in a 1/(n + 1) effect on aggregate throughput.

Sounds about right.

But I suspect the picture is a bit larger than what we're looking at here. [~brassard],
can you post the relevant HDFS configurations (HDFS block size, datanode handler count, datanode
xcievers, etc.)? Given the beefiness of your nodes, performance seems low to me. Doing some
hand-wavy math: 15 drives × 8 nodes = 120 drives, at ~100 MB/s write per drive, gives ~12 GB/s
of write bandwidth across your cluster. Even knocking that back to, say, 1.2 GB/s of "actual"
write bandwidth, that's still far above 5.7 MB/s × 8 nodes × 3 replicas = 136.8 MB/s.
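
For the curious, that back-of-envelope math can be written out explicitly (a sketch only; the node count, drive count, and per-drive rate are the figures assumed in the thread, not measurements):

```python
# Back-of-envelope WAL throughput check using the numbers quoted above.
# Assumptions from the thread: 8 tablet-server nodes, 15 drives per node,
# ~100 MB/s sequential write per drive, 3x replication.
nodes = 8
drives_per_node = 15
drive_write_mb_s = 100

# Theoretical aggregate write bandwidth across the cluster.
raw_cluster_write = nodes * drives_per_node * drive_write_mb_s  # 12,000 MB/s

# Knock it back to 10% as a pessimistic "actual" figure.
derated_write = raw_cluster_write * 0.10  # 1,200 MB/s

# Observed ingest: ~5.7 MB/s per node, times 8 nodes, times 3 replica writes.
observed_with_replicas = round(5.7 * nodes * 3, 1)  # 136.8 MB/s

print(raw_cluster_write, derated_write, observed_with_replicas)
```

Even the heavily derated estimate is roughly an order of magnitude above the observed replica-inclusive write rate, which is what makes the bottleneck look like something other than raw disk bandwidth.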

With 32 cores, I wouldn't think you're CPU bound, yet it doesn't seem like you should be I/O
bound either. I assume you're also memory-heavy? Do you have stats from Ganglia or iostat that
show I/O performance? What about network performance? What filesystem are you using for the
partitions hosting the data dirs?

Without doing the research, I also don't know how the TServer "finishing" a mutation write
aligns with the replica writes. Does each datanode writing a replica need to complete its write
before the TServer returns to the client? Does only one datanode need to finish? Lots
of questions :D
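
As a toy illustration of why the answer matters (this is a made-up latency model, not HDFS internals): if the client unblocks only after every replica write finishes, latency tracks the slowest replica; if any one replica suffices, it tracks the fastest.

```python
import random

# Toy model of the two acknowledgment policies asked about above.
# Replica write times are synthetic samples, not measurements.
random.seed(1083)

def ack_latency(replica_latencies, policy):
    # "all": client unblocks after the slowest replica finishes
    # "one": client unblocks after the fastest replica finishes
    return max(replica_latencies) if policy == "all" else min(replica_latencies)

# 1000 simulated writes, 3 replicas each, 5-20 ms per replica write.
samples = [[random.uniform(5, 20) for _ in range(3)] for _ in range(1000)]
avg_all = sum(ack_latency(s, "all") for s in samples) / len(samples)
avg_one = sum(ack_latency(s, "one") for s in samples) / len(samples)

# Waiting on all replicas is always at least as slow as waiting on one.
print(avg_all > avg_one)
```

Whichever policy HDFS actually uses for the WAL pipeline would directly shape how much extra replication costs per mutation.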
                
> add concurrency to HDFS write-ahead log
> ---------------------------------------
>
>                 Key: ACCUMULO-1083
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-1083
>             Project: Accumulo
>          Issue Type: Improvement
>          Components: tserver
>            Reporter: Adam Fuchs
>             Fix For: 1.6.0
>
>         Attachments: walog-performance.jpg, walog-replication-factor-performance.jpg
>
>
> When running tablet servers on beefy nodes (lots of disks), the write-ahead log can be
a serious bottleneck. Today we ran a continuous ingest test of 1.5-SNAPSHOT on an 8-node (plus
a master node) cluster in which the nodes had 32 cores and 15 drives each. Running with write-ahead
log off resulted in a >4x performance improvement sustained over a long period.
> I believe the culprit is that the WAL is only using one file at a time per tablet server,
which means HDFS is only appending to one drive (plus replicas). If we increase the number
of concurrent WAL files supported on a tablet server we could probably drastically improve
the performance on systems with many disks. As it stands, I believe Accumulo is significantly
more optimized for a larger number of smaller nodes (3-4 drives).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
