cassandra-commits mailing list archives

From "Scott Fines (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-2388) ColumnFamilyRecordReader fails for a given split because a host is down, even if records could reasonably be read from other replica.
Date Wed, 21 Nov 2012 14:43:59 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-2388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13502016#comment-13502016 ]

Scott Fines commented on CASSANDRA-2388:
----------------------------------------

I have two distinct use-cases where running TaskTrackers alongside Cassandra nodes does not
accomplish our goals:

1. Joining data. We have a large data set in Cassandra, true, but we have a *much* larger
data set held in Hadoop itself (around 4 orders of magnitude larger in Hadoop than in Cassandra).
We need to join the two datasets together, and use the output from that join to feed multiple
systems, none of which are Cassandra. Since the data in Hadoop is so much larger than that
in Cassandra, we have to bring the Cassandra data to Hadoop, not the other way around. Because
of security concerns, we can't spread our Hadoop data onto our Cassandra nodes (even if that
didn't screw with our capacity planning), so we have no other choice but to move the Cassandra
data (in small chunks) onto Hadoop. Why not use HBase, you say? We needed Cassandra for its
write performance for other problems than this one. 

2. Offline, incremental backups. We have a large volume of time-series data held in Cassandra,
and taking nightly snapshots and moving them to our archival center is prohibitively slow--it
turns out that moving RF copies of our entire dataset over a leased line every night is a
pretty bad idea. Instead, I use MapReduce to take an incremental backup of a much smaller
subset of the data, then move that. That way, not only are we not moving the entire data set,
but we are also using Cassandra's consistency mechanisms to resolve all the replicas. The
only efficient way I've found to do this is via MapReduce (we use the Random Partitioner),
and since it's an offline backup, we need to move it over the network anyway--may as well
use the optimized network connecting Hadoop and Cassandra instead of the tiny pipe connecting
Cassandra to our archival center. 

Both of these reasons dictate that we *not* run a TaskTracker alongside our Cassandra nodes, no matter
what the *recommended* approach is. In this case, we need a strong, fault-tolerant CFIF
(ColumnFamilyInputFormat) to serve our purposes.


                
> ColumnFamilyRecordReader fails for a given split because a host is down, even if records could reasonably be read from other replica.
> -------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-2388
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2388
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Hadoop
>    Affects Versions: 0.6
>            Reporter: Eldon Stegall
>            Assignee: Mck SembWever
>            Priority: Minor
>              Labels: hadoop, inputformat
>             Fix For: 1.1.7
>
>         Attachments: 0002_On_TException_try_next_split.patch, CASSANDRA-2388-addition1.patch, CASSANDRA-2388-extended.patch, CASSANDRA-2388.patch, CASSANDRA-2388.patch, CASSANDRA-2388.patch, CASSANDRA-2388.patch
>
>
> ColumnFamilyRecordReader only tries the first location for a given split. We should try multiple locations for a given split.
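
For illustration only, the behaviour requested in the issue description (try each location of a split in turn rather than only the first) could look roughly like the sketch below. This is not the attached patch and does not use Cassandra's or Hadoop's actual classes: ReplicaFailover, Connector and connectToFirstAvailable are hypothetical names, and the Connector stands in for whatever opens a client connection to a single host.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

public class ReplicaFailover {

    /** Hypothetical stand-in for opening a client connection to one host. */
    public interface Connector<T> {
        T connect(String host) throws IOException;
    }

    /**
     * Tries each location of a split in order and returns the first
     * connection that succeeds. Fails only if every location fails.
     */
    public static <T> T connectToFirstAvailable(List<String> locations,
                                                Connector<T> connector) throws IOException {
        IOException lastFailure = null;
        for (String host : locations) {
            try {
                return connector.connect(host);
            } catch (IOException e) {
                // This host is unreachable; remember why and try the next one.
                lastFailure = e;
            }
        }
        throw new IOException("no reachable location for split: " + locations, lastFailure);
    }

    public static void main(String[] args) throws IOException {
        List<String> locations = Arrays.asList("replica-1", "replica-2", "replica-3");
        // Simulate the first replica being down.
        String connected = connectToFirstAvailable(locations, host -> {
            if (host.equals("replica-1")) {
                throw new IOException(host + " is down");
            }
            return "connection to " + host;
        });
        System.out.println(connected); // prints "connection to replica-2"
    }
}
```

A real reader would also have to decide which exceptions count as "host down" and whether a failure partway through a split should restart the read on another replica; the sketch only covers the initial connection.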

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
