cassandra-commits mailing list archives

From "Robbie Strickland (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families
Date Wed, 02 May 2012 15:18:50 GMT


Robbie Strickland commented on CASSANDRA-4208:

I spent a good bit of time analyzing the changes needed to make this work using MultipleOutputs,
and it would involve:

1. Removing hard-coded references to WritableComparable and Writable in MultipleOutputs.getNamedOutputKeyClass()
and getNamedOutputValueClass().
2. Removing the hard-coded call to FileOutputFormat.setOutputName() in getRecordWriter().
3. Adding an abstract setOutputName() to OutputFormat so the call in #2 can be made generic.
An alternative is a default no-op implementation, so that it doesn't break existing output
formats that don't care about this.
4. Implementing setOutputName() in ColumnFamilyOutputFormat, which would set the config property
for the column family (where the "name" corresponds to the CF).
5. Separating CFOF.setColumnFamily() and setKeyspace(), where setColumnFamily() is just a
pass-through to setOutputName() (or vice versa).
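
For illustration, here is a rough sketch of how steps 2 through 5 might fit together. It does
not use the real Hadoop or Cassandra classes: OutputFormatSketch, FileLikeOutputFormatSketch,
ColumnFamilyOutputFormatSketch, configureNamedOutput() and both property keys are hypothetical
stand-ins, and a plain Map stands in for the job configuration. It assumes the no-op default
variant of step 3.

```java
import java.util.HashMap;
import java.util.Map;

public class NamedOutputSketch {
    // Hypothetical property keys, used only by this sketch.
    static final String KEYSPACE_KEY = "sketch.output.keyspace";
    static final String COLUMN_FAMILY_KEY = "sketch.output.columnfamily";

    // Stand-in for OutputFormat with the proposed hook (step 3).
    static abstract class OutputFormatSketch {
        // Default no-op, so formats that don't care about names are unaffected.
        void setOutputName(Map<String, String> conf, String name) { }
    }

    // Stand-in for an existing format that ignores the hook.
    static class FileLikeOutputFormatSketch extends OutputFormatSketch { }

    // Stand-in for ColumnFamilyOutputFormat (steps 4 and 5).
    static class ColumnFamilyOutputFormatSketch extends OutputFormatSketch {
        @Override
        void setOutputName(Map<String, String> conf, String name) {
            // The output "name" is treated as the column family.
            conf.put(COLUMN_FAMILY_KEY, name);
        }

        // Keyspace is configured separately from the column family.
        void setKeyspace(Map<String, String> conf, String keyspace) {
            conf.put(KEYSPACE_KEY, keyspace);
        }

        // Pass-through to setOutputName().
        void setColumnFamily(Map<String, String> conf, String columnFamily) {
            setOutputName(conf, columnFamily);
        }
    }

    // Stand-in for the generic call of step 2: copy the job config for one
    // named output, then let the format record the name however it likes.
    static Map<String, String> configureNamedOutput(OutputFormatSketch format,
                                                    Map<String, String> jobConf,
                                                    String namedOutput) {
        Map<String, String> copy = new HashMap<>(jobConf);
        format.setOutputName(copy, namedOutput);
        return copy;
    }

    public static void main(String[] args) {
        Map<String, String> jobConf = new HashMap<>();
        ColumnFamilyOutputFormatSketch cfof = new ColumnFamilyOutputFormatSketch();
        cfof.setKeyspace(jobConf, "demo_ks");
        // Each named output gets its own column family; the keyspace is shared.
        System.out.println(configureNamedOutput(cfof, jobConf, "users"));
        System.out.println(configureNamedOutput(cfof, jobConf, "users_by_name"));
        // A format that ignores the hook leaves the config unchanged.
        System.out.println(configureNamedOutput(new FileLikeOutputFormatSketch(), jobConf, "part"));
    }
}
```

The point of the no-op default is that only formats with a meaningful notion of "name" need to
override anything.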

This solution would allow MultipleOutputs support in conformance with the existing API, and
it should not break any existing reducer code.  I don't personally love the boilerplate it
adds to my reducer, and I think it's much less obvious than handling it at the write() call,
but I can get over that if I have to. :)  I am willing to do the work on both sides if this
is where the consensus is, though I don't know what the response will be in the Hadoop community.

> ColumnFamilyOutputFormat should support writing to multiple column families
> ---------------------------------------------------------------------------
>                 Key: CASSANDRA-4208
>                 URL:
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Hadoop
>    Affects Versions: 1.1.0
>            Reporter: Robbie Strickland
>         Attachments: trunk-4208.txt
> It is not currently possible to output records to more than one column family in a single
> reducer.  Considering that writing values to Cassandra often involves multiple column families
> (e.g. updating your index when you insert a new value), this seems overly restrictive.  I
> am submitting a patch that moves the specification of column family from the job configuration
> to the write() call in ColumnFamilyRecordWriter.
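
As a rough illustration of the idea in the description above (naming the column family at
write time rather than in the job configuration), a writer could take the column family as an
argument on each call. This is only a sketch: RecordWriterSketch, its write() signature, and
the string-based mutations are hypothetical and are not taken from the attached patch.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class PerWriteColumnFamilySketch {
    // Hypothetical stand-in for a record writer; not the real ColumnFamilyRecordWriter.
    static class RecordWriterSketch {
        private final String keyspace;
        // Pending mutations grouped by column family, in first-seen order.
        private final Map<String, List<String>> pending = new LinkedHashMap<>();

        RecordWriterSketch(String keyspace) {
            this.keyspace = keyspace;
        }

        // The column family is chosen per call instead of once per job.
        void write(String columnFamily, String rowKey, String mutation) {
            pending.computeIfAbsent(columnFamily, cf -> new ArrayList<>())
                   .add(rowKey + ":" + mutation);
        }

        String keyspace() {
            return keyspace;
        }

        Map<String, List<String>> pending() {
            return pending;
        }
    }

    public static void main(String[] args) {
        RecordWriterSketch writer = new RecordWriterSketch("demo_ks");
        // Insert a value and update its index from the same reducer.
        writer.write("users", "u1", "name=alice");
        writer.write("users_by_name", "alice", "id=u1");
        System.out.println(writer.keyspace() + " " + writer.pending());
    }
}
```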


