cassandra-commits mailing list archives

From "Ilya Maykov (JIRA)"
Subject [jira] [Commented] (CASSANDRA-2527) Add ability to snapshot data as input to hadoop jobs
Date Tue, 15 May 2012 19:59:08 GMT


Ilya Maykov commented on CASSANDRA-2527:

We wrote a Hadoop InputFormat class that reads SSTable files directly, completely bypassing
the Cassandra server. That is not hard to do, as the SSTable file format is pretty simple. We
then exported the snapshot directories over NFS to our Hadoop workers and ran the MR job that
way. Obviously this is only useful if you want to iterate through all of the data in your
Cassandra cluster. It also has a lot of overhead: the job reads stale versions of data that
haven't been compacted away yet, and it reads all RF replicas of each row. Exposing snapshots
in special snapshot keyspaces, so that they could be mapped using stock Hadoop mappers, may be
a better way to go.
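
To illustrate the overhead described above: a job that reads raw SSTable snapshots sees every uncompacted version of a column plus one copy per replica, so it would presumably need to reconcile them itself. The following is only a minimal sketch of that reconciliation, assuming last-write-wins by timestamp. The `Cell` record and the `reconcile` function are hypothetical names for illustration, not part of Cassandra, Hadoop, or the InputFormat mentioned in the comment, and tombstones (deletions) are ignored.

```python
from collections import namedtuple

# Hypothetical record shape for one column version read from an SSTable.
# This is an assumption for illustration, not an actual Cassandra type.
Cell = namedtuple("Cell", ["row_key", "column", "value", "timestamp"])


def reconcile(cells):
    """Keep only the newest version of each (row_key, column) pair.

    Stale versions and copies read from other replicas are discarded.
    Tombstones are not modelled in this sketch.
    """
    newest = {}
    for cell in cells:
        key = (cell.row_key, cell.column)
        current = newest.get(key)
        if current is None or cell.timestamp > current.timestamp:
            newest[key] = cell
    return sorted(newest.values(), key=lambda c: (c.row_key, c.column))


# Example input: one stale version, one duplicate from a second replica.
raw_cells = [
    Cell("user1", "email", "old@example.com", 100),  # stale, not yet compacted
    Cell("user1", "email", "new@example.com", 200),  # current version
    Cell("user1", "email", "new@example.com", 200),  # same data, other replica
    Cell("user2", "email", "b@example.com", 150),
]

live_cells = reconcile(raw_cells)
for cell in live_cells:
    print(cell.row_key, cell.column, cell.value)
```

In a MapReduce job this logic would naturally sit in the reduce step, keyed on the row key, which is part of why the extra reads translate into extra shuffle and compute cost.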
> Add ability to snapshot data as input to hadoop jobs
> ----------------------------------------------------
>                 Key: CASSANDRA-2527
>                 URL:
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Jeremy Hanna
>              Labels: hadoop
> It is desirable to have immutable inputs to Hadoop jobs for the duration of the job.
> That way, re-execution of individual tasks does not alter the output. One way to accomplish
> this would be to snapshot the data that is used as input to a job.
