cassandra-commits mailing list archives

From "Jonathan Ellis (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (CASSANDRA-6793) NPE in Hadoop Word count example
Date Tue, 11 Mar 2014 17:51:49 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-6793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13930645#comment-13930645 ]

Jonathan Ellis edited comment on CASSANDRA-6793 at 3/11/14 5:51 PM:
--------------------------------------------------------------------

I confess that I'm mystified by the schema introduced in CASSANDRA-4421:

{noformat}
/**
 * This counts the occurrences of words in ColumnFamily
 *   cql3_worldcount ( user_id text,
 *                   category_id text,
 *                   sub_category_id text,
 *                   title  text,
 *                   body  text,
 *                   PRIMARY KEY (user_id, category_id, sub_category_id))
 *
 * For each word, we output the total number of occurrences across all body texts.
 *
 * When outputting to Cassandra, we write the word counts to column family
 *  output_words ( row_id1 text,
 *                 row_id2 text,
 *                 word text,
 *                 count_num text,
 *                 PRIMARY KEY ((row_id1, row_id2), word))
 * as a {word, count} to columns: word, count_num with a row key of "word sum"
 */
{noformat}

Both the input and output tables look far more complex than necessary.  

My preferred solution would be to just strip the output down to {{(word text primary key,
count int)}}, and make a similar simplification for the input.
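
For illustration only, here is a minimal sketch of what a reducer might emit under the simplified {{(word text primary key, count int)}} output table. The class {{SimplifiedWordCountRow}} and its fields are hypothetical names and are not part of the example or of CASSANDRA-4421. The sketch assumes the writer takes a map of key columns plus a list of bind variables, with {{text}} serialized as UTF-8 and {{int}} as a 4-byte big-endian value.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/**
 * Hypothetical row for a simplified output table:
 *   output_words ( word text PRIMARY KEY, count int )
 * The only key column is "word"; the only bind variable is the count.
 */
final class SimplifiedWordCountRow {
    final Map<String, ByteBuffer> keys;
    final List<ByteBuffer> bindVariables;

    SimplifiedWordCountRow(String word, int count) {
        // Single partition key column, serialized as UTF-8 text.
        Map<String, ByteBuffer> k = new LinkedHashMap<>();
        k.put("word", ByteBuffer.wrap(word.getBytes(StandardCharsets.UTF_8)));
        this.keys = Collections.unmodifiableMap(k);

        // Single non-key column, serialized as a 4-byte big-endian int.
        ByteBuffer c = ByteBuffer.allocate(4);
        c.putInt(count);
        c.flip();
        this.bindVariables = Collections.singletonList(c);
    }

    /** Sums the per-occurrence counts that a reducer receives for one word. */
    static int sum(Iterable<Integer> counts) {
        int total = 0;
        for (int value : counts) {
            total += value;
        }
        return total;
    }
}
```

With a single key column there is no composite partition key left for the job configuration and the table definition to disagree on.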

Can you shed any light [~alexliu68]?


> NPE in Hadoop Word count example
> --------------------------------
>
>                 Key: CASSANDRA-6793
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6793
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Examples
>            Reporter: Chander S Pechetty
>            Assignee: Chander S Pechetty
>            Priority: Minor
>              Labels: hadoop
>         Attachments: trunk-6793.txt
>
>
> The partition keys requested in WordCount.java do not match the primary key set up in
> the table output_words. It looks like the change from [CASSANDRA-5622|https://issues.apache.org/jira/browse/CASSANDRA-5622]
> was not merged properly. The attached patch addresses the NPE and uses the correct keys
> defined in CASSANDRA-5622.
> I am assuming there is no need to fix the NPE itself, for example by throwing an
> InvalidRequestException back to the user so that they can fix the partition keys, since
> it is trivial to get the correct keys from the TableMetadata using the driver API.
> java.lang.NullPointerException
> 	at org.apache.cassandra.dht.Murmur3Partitioner.getToken(Murmur3Partitioner.java:92)
> 	at org.apache.cassandra.dht.Murmur3Partitioner.getToken(Murmur3Partitioner.java:40)
> 	at org.apache.cassandra.client.RingCache.getRange(RingCache.java:117)
> 	at org.apache.cassandra.hadoop.cql3.CqlRecordWriter.write(CqlRecordWriter.java:163)
> 	at org.apache.cassandra.hadoop.cql3.CqlRecordWriter.write(CqlRecordWriter.java:63)
> 	at org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.write(ReduceTask.java:587)
> 	at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
> 	at WordCount$ReducerToCassandra.reduce(Unknown Source)
> 	at WordCount$ReducerToCassandra.reduce(Unknown Source)
> 	at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
> 	at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:649)
> 	at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:417)
> 	at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:260)
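
The trace above shows the NPE surfacing in Murmur3Partitioner.getToken during CqlRecordWriter.write. As a rough sketch of the kind of up-front check the description mentions (and decides is unnecessary), the following compares the key columns a job supplies against the partition key columns a table declares and fails with a descriptive message. The class PartitionKeyCheck and its methods are hypothetical and are not Cassandra APIs. The column names come from the output_words definition quoted in the comment. A job supplying only "word" is an assumed scenario, not the confirmed cause of this bug.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper; not part of Cassandra or the Hadoop example. */
final class PartitionKeyCheck {

    /** Returns the partition key columns that are absent (or null) in the supplied keys. */
    static List<String> missingKeys(List<String> partitionKeyColumns,
                                    Map<String, ByteBuffer> suppliedKeys) {
        List<String> missing = new ArrayList<>();
        for (String column : partitionKeyColumns) {
            if (suppliedKeys.get(column) == null) {
                missing.add(column);
            }
        }
        return missing;
    }

    /** Fails fast with a readable message when a partition key column is missing. */
    static void validate(List<String> partitionKeyColumns,
                         Map<String, ByteBuffer> suppliedKeys) {
        List<String> missing = missingKeys(partitionKeyColumns, suppliedKeys);
        if (!missing.isEmpty()) {
            throw new IllegalArgumentException(
                "Missing partition key column(s) " + missing
                + "; table expects " + partitionKeyColumns);
        }
    }

    public static void main(String[] args) {
        // Partition key of output_words: PRIMARY KEY ((row_id1, row_id2), word)
        List<String> partitionKey = Arrays.asList("row_id1", "row_id2");

        // Assumed scenario: the job supplies only "word" as a key column.
        Map<String, ByteBuffer> supplied = new HashMap<>();
        supplied.put("word", ByteBuffer.wrap("hello".getBytes(StandardCharsets.UTF_8)));

        try {
            validate(partitionKey, supplied);
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

In a real job, the expected column list would come from the table's metadata, not from a hard-coded list.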



--
This message was sent by Atlassian JIRA
(v6.2#6252)
