cassandra-user mailing list archives

From Mark Schnitzius <>
Subject Re: Updating (as opposed to just setting) Cassandra data via Hadoop
Date Thu, 06 May 2010 03:44:29 GMT
Apologies, Hadoop recently deprecated a whole bunch of classes and I
misunderstood how the new ones work.

What I'll be doing is creating an InputFormat class that
uses ColumnFamilyInputFormat to get splits from the existing Cassandra data,
and merges them with splits from a SequenceFileInputFormat.
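For what it's worth, here's a rough, Hadoop-free sketch of that split-union idea (the class and function names below are made up for illustration; the real thing would extend Hadoop's InputFormat and return InputSplit objects): getSplits() just concatenates the splits produced by the two delegate formats, tagging each split so the record-reader side knows which delegate to hand it to.

```python
def get_splits(cassandra_format, sequence_file_format, conf):
    """Union the splits of two delegate input formats.

    Each element is a (source_tag, split) pair, so the record-reader
    dispatch logic can route each split back to the right delegate.
    """
    splits = []
    for split in cassandra_format.get_splits(conf):
        splits.append(("existing", split))   # rows already in Cassandra
    for split in sequence_file_format.get_splits(conf):
        splits.append(("new", split))        # newly arrived data on HDFS
    return splits


class FakeFormat:
    """Stand-in for a real InputFormat; just hands back canned splits."""
    def __init__(self, splits):
        self._splits = splits

    def get_splits(self, conf):
        return list(self._splits)


# Example: two Cassandra splits unioned with one SequenceFile split.
existing_fmt = FakeFormat(["cass-split-0", "cass-split-1"])
incoming_fmt = FakeFormat(["seq-split-0"])
all_splits = get_splits(existing_fmt, incoming_fmt, conf={})
```

The tagging matters because the two delegates need different record readers; without it the framework has no way to tell a Cassandra split from an HDFS one.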

Is this a reasonable approach, or is there a better, more standard way to
update Cassandra data with new Hadoop data?  It may just boil down to a
design decision, but it would seem to me that this would be a problem that
would've been encountered many times before...
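To make the intended data flow concrete, here's a minimal, Hadoop-free simulation of the merge itself (the keys and values are invented for illustration): records from both the existing Cassandra data and the newly incoming data pass through the same keyed reduce, which sums them into the accumulated values that would then be written back.

```python
from collections import defaultdict

def map_phase(records):
    """records: iterable of (key, value) pairs drawn from either the
    existing Cassandra data or the newly incoming data."""
    for key, value in records:
        yield key, value

def reduce_phase(mapped):
    """Group by key and sum: existing value plus all new increments."""
    totals = defaultdict(int)
    for key, value in mapped:
        totals[key] += value
    return dict(totals)

# Existing values (as read via ColumnFamilyInputFormat) and new increments:
existing = [("user:42", 10), ("user:7", 3)]
incoming = [("user:42", 5), ("user:42", 1), ("user:99", 2)]
merged = reduce_phase(map_phase(existing + incoming))
# merged now holds the accumulated values to write back out to Cassandra
```

This is the reduce-side merge in miniature; as long as both input sources emit the same key type, the reducer doesn't care which source a value came from.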


On Thu, May 6, 2010 at 12:23 AM, Jonathan Ellis <> wrote:

> I'm a little confused.  CombineFileInputFormat is designed to combine
> multiple small input splits into one larger one.  It's not for merging
> data (that needs to be part of the reduce phase).  Maybe I'm
> misunderstanding what you're saying.
> On Tue, May 4, 2010 at 10:53 PM, Mark Schnitzius
> <> wrote:
> > I have a situation where I need to accumulate values in Cassandra on an
> > ongoing basis.  Atomic increments are still in the works apparently (see
> > the linked issue), so for the time being I'll be using Hadoop, and
> > attempting to feed both the existing values and the new values into a
> > M/R process where they can be combined and written back out to Cassandra.
> >
> > The approach I'm taking is to use Hadoop's CombineFileInputFormat to
> > blend the existing data (using Cassandra's ColumnFamilyInputFormat) with
> > the newly incoming data (using something like Hadoop's
> > SequenceFileInputFormat).
> >
> > I was just wondering, has anyone here tried this, and were there issues?
> > I'm worried because CombineFileInputFormat has restrictions around splits
> > being from different pools, so I don't know how this will play out with
> > data from both Cassandra and HDFS.  The other option, I suppose, is to
> > use a separate M/R process to replicate the data onto HDFS first, but
> > I'd rather avoid the extra step and the duplication of storage.
> >
> > Also, if you've tackled a similar situation in the past using a
> > different approach, I'd be keen to hear about it...
> >
> > Thanks
> > Mark
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of Riptano, the source for professional Cassandra support
