cassandra-commits mailing list archives

From "Michael Kjellman (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-4208) ColumnFamilyOutputFormat should support writing to multiple column families
Date Fri, 21 Sep 2012 19:45:07 GMT


Michael Kjellman commented on CASSANDRA-4208:

This affects both ColumnFamilyOutputFormat and BulkOutputFormat: addNamedOutput never seems
to set the column family.

I would assume:

ConfigHelper.setOutputKeyspace(job.getConfiguration(), KEYSPACE);
MultipleOutputs.addNamedOutput(job, OUTPUT_COLUMN_FAMILY1, ColumnFamilyOutputFormat.class,
        ByteBuffer.class, List.class);
MultipleOutputs.addNamedOutput(job, OUTPUT_COLUMN_FAMILY2, ColumnFamilyOutputFormat.class,
        ByteBuffer.class, List.class);

is all that is needed. If I don't set up the job with job.setOutputFormatClass(ColumnFamilyOutputFormat.class),
FileOutputFormat throws an exception:

Exception in thread "main" org.apache.hadoop.mapred.InvalidJobConfException: Output directory not set.
	at org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(

If I do specify that at the job level, the job name never seems to set the column family
name on that job.

Additionally, using the job name as the column family name is slightly inconvenient: we
use '_' in our column family names, which is not a valid character in MultipleOutputs names.
It looks like _# is how they internally keep track of counters when that is enabled.
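
One possible workaround for the '_' restriction, sketched below, is to register each column family under an alias that MultipleOutputs will accept and map the alias back to the real name when writing. This is only a sketch: NamedOutputAliases is a hypothetical helper and not part of Cassandra or Hadoop, and the validity rule it encodes (ASCII letters and digits only) is my understanding of what MultipleOutputs checks, not something confirmed in this ticket.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Cassandra or Hadoop.
final class NamedOutputAliases {
    // alias (assumed safe for MultipleOutputs) -> real column family name
    private final Map<String, String> aliasToColumnFamily = new LinkedHashMap<>();

    // Assumption: a named output may contain only ASCII letters and digits.
    static boolean isValidNamedOutput(String name) {
        if (name == null || name.isEmpty()) {
            return false;
        }
        for (char ch : name.toCharArray()) {
            boolean ok = (ch >= 'A' && ch <= 'Z')
                    || (ch >= 'a' && ch <= 'z')
                    || (ch >= '0' && ch <= '9');
            if (!ok) {
                return false;
            }
        }
        return true;
    }

    // Drops every character outside [A-Za-z0-9] and remembers the mapping.
    // Fails loudly if two different column families collapse to the same alias.
    String register(String columnFamily) {
        StringBuilder sb = new StringBuilder();
        for (char ch : columnFamily.toCharArray()) {
            if (isValidNamedOutput(String.valueOf(ch))) {
                sb.append(ch);
            }
        }
        String alias = sb.toString();
        if (alias.isEmpty()) {
            throw new IllegalArgumentException("No usable characters in: " + columnFamily);
        }
        String existing = aliasToColumnFamily.get(alias);
        if (existing != null && !existing.equals(columnFamily)) {
            throw new IllegalArgumentException(
                    "Alias '" + alias + "' already maps to " + existing);
        }
        aliasToColumnFamily.put(alias, columnFamily);
        return alias;
    }

    // Returns the real column family name for an alias, or null if unknown.
    String columnFamilyFor(String alias) {
        return aliasToColumnFamily.get(alias);
    }
}
```

The alias returned by register() would be what gets passed to addNamedOutput, and columnFamilyFor() would be consulted wherever the real column family name is needed. Whether the patch could do this lookup internally is a separate question.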
> ColumnFamilyOutputFormat should support writing to multiple column families
> ---------------------------------------------------------------------------
>                 Key: CASSANDRA-4208
>                 URL:
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Hadoop
>    Affects Versions: 1.1.0
>            Reporter: Robbie Strickland
>         Attachments: cassandra-1.1-4208.txt, cassandra-1.1-4208-v2.txt, cassandra-1.1-4208-v3.txt,
>                      trunk-4208.txt, trunk-4208-v2.txt
> It is not currently possible to output records to more than one column family in a single
> reducer.  Considering that writing values to Cassandra often involves multiple column families
> (i.e. updating your index when you insert a new value), this seems overly restrictive.  I
> am submitting a patch that moves the specification of column family from the job configuration
> to the write() call in ColumnFamilyRecordWriter.
