cassandra-commits mailing list archives

From "Erik Forsberg (Issue Comment Edited) (JIRA)" <j...@apache.org>
Subject [jira] [Issue Comment Edited] (CASSANDRA-3859) Add Progress Reporting to Cassandra OutputFormats
Date Wed, 22 Feb 2012 08:28:49 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-3859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13213452#comment-13213452 ]

Erik Forsberg edited comment on CASSANDRA-3859 at 2/22/12 8:27 AM:
-------------------------------------------------------------------

bq. I am not seeing this on our end. Our job is running 50 reducers on our end, and it certainly
takes > timeout seconds (600 for us). It's progressing ...

Just to make sure we're measuring the same thing - are your reducers taking more than 600
seconds *after* the creation of sstables has finished?

For us, the creation of sstables takes ~10 minutes, and during that period the job is consuming
input, so Hadoop knows it's active. It's the loading phase that takes much longer, and it
gets killed if I don't set mapred.task.timeout to a very high value.
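
To illustrate the failure mode: while the loader waits on streaming, nothing tells Hadoop the task is still alive. A minimal sketch of one way to avoid that is to wait in short intervals and invoke a progress callback between them. The class name `ProgressHeartbeat`, the method `awaitWithProgress`, and the interval are hypothetical and are not taken from the attached patches; in a real job the callback would presumably be something like Hadoop's `Progressable.progress()`.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper, not part of Cassandra or Hadoop.
final class ProgressHeartbeat {

    // Waits for a long-running operation (e.g. the streaming phase) to finish,
    // running the progress callback each time the wait interval elapses.
    static <T> T awaitWithProgress(Future<T> future, Runnable progress, long intervalMillis)
            throws InterruptedException, ExecutionException {
        while (true) {
            try {
                return future.get(intervalMillis, TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                // Still working: tell the framework the task is alive.
                progress.run();
            }
        }
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            // Stand-in for the loading phase; the sleep simulates slow streaming.
            Future<String> load = pool.submit(() -> {
                Thread.sleep(300);
                return "loaded";
            });
            // Stand-in for the Hadoop progress call (assumption).
            AtomicInteger ticks = new AtomicInteger();
            String result = awaitWithProgress(load, ticks::incrementAndGet, 50);
            System.out.println(result + " after " + ticks.get() + " progress calls");
        } finally {
            pool.shutdown();
        }
    }
}
```

With something along these lines, the task would keep reporting during loading without needing a very high mapred.task.timeout, assuming the timeout is in fact triggered by missing progress calls.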

bq. Brandon, one thing I could think of, is if they are adding a lot of batches, we don't
actually call progress until the loop is over.

Hmm.. what is "a batch" in this context?

Samarth points out that this **may** be a bug in our Hadoop version. We're a bit behind, running
Cloudera's CDH2 (Hadoop 0.20.1+169.89) on our production system. One suspect could be
https://issues.apache.org/jira/browse/MAPREDUCE-1905, but I'm unsure whether that affects the
version we're running. We'll try to figure that out by running some tests on different
versions of Hadoop.

                
      was (Author: forsberg):
    bq. I am not seeing this on our end. Our job is running 50 reducers on our end, and it
certainly takes > timeout seconds (600 for us). It's progressing ...

Just to make sure we're measuring the same thing - are your reducers taking more than 600
seconds *after* the creation of sstables have finished? 

For us, the creation of sstables take ~10 minutes - and during that period the job is consuming
input, so Hadoop knows it's active, and then it's the loading phase that takes much longer,
and gets killed if I don't set mapred.task.timeout seconds to a very high value.

bq. Brandon, one thing I could think of, is if they are adding a lot of batches, we don't
actually call progress until the loop is over.

Hmm.. what is "a batch" in this context?

                  
> Add Progress Reporting to Cassandra OutputFormats
> -------------------------------------------------
>
>                 Key: CASSANDRA-3859
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3859
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Hadoop, Tools
>    Affects Versions: 1.1.0
>            Reporter: Samarth Gahire
>            Assignee: Brandon Williams
>            Priority: Minor
>              Labels: bulkloader, hadoop, mapreduce, sstableloader
>             Fix For: 1.1.0
>
>         Attachments: 0001-add-progress-reporting-to-BOF.txt, 0002-Add-progress-to-CFOF.txt
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> When we use the BulkOutputFormat to load data into Cassandra, we should report progress
> to the Hadoop job from within the sstable loader. Otherwise, if streaming for a particular
> task takes a long time and no progress is reported to the job, Hadoop may kill the task
> with a timeout exception.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
