hadoop-common-user mailing list archives

From jiang licht <licht_ji...@yahoo.com>
Subject Re: Unexpected empty result problem (zero-sized part-### files)?
Date Mon, 22 Feb 2010 04:30:24 GMT
Thanks, Amogh. Actually, I noticed the following 2 settings in mapred-site.xml but haven't
tested them yet.

mapred.skip.attempts.to.start.skipping
mapred.skip.map.max.skip.records

I will try these settings to see if they help.
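
For reference, here is a sketch of how those two properties might look in mapred-site.xml. The values are illustrative only, and note that skip mode targets individual records that crash a task attempt, so it may not help with a truncated compressed stream that kills the whole split:

```xml
<!-- Illustrative values, not recommendations -->
<property>
  <name>mapred.skip.attempts.to.start.skipping</name>
  <!-- number of failed task attempts before skip mode kicks in -->
  <value>2</value>
</property>
<property>
  <name>mapred.skip.map.max.skip.records</name>
  <!-- acceptable number of records skipped around a bad record -->
  <value>1</value>
</property>
```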

BTW, I started another thread on the pig-user mailing list, and based on what Ashutosh pointed
out, my understanding is that even one corrupted gz file will fail the whole map job, even if
many other map tasks succeed and produce map output, which explains why I got 0-byte part-###
files from the pig script.
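
Since one bad archive can sink the whole job, a workaround is to check each archive with gzip -t before loading. A minimal sketch, assuming the files sit on a local disk (file and directory names here are hypothetical; for data already on HDFS you would have to copy it out first):

```shell
# Demo setup: one good and one truncated .gz (hypothetical names)
mkdir -p foldera broken
echo "a sample record" | gzip > foldera/good.gz
head -c 10 foldera/good.gz > foldera/bad.gz   # gzip header only, stream cut off

# Move any archive that fails the integrity check aside before LOADing
for f in foldera/*.gz; do
  if ! gzip -t "$f" 2>/dev/null; then
    echo "corrupted: $f"
    mv "$f" broken/
  fi
done
```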

Thanks,

Michael

--- On Sun, 2/21/10, Amogh Vasekar <amogh@yahoo-inc.com> wrote:

From: Amogh Vasekar <amogh@yahoo-inc.com>
Subject: Re: Unexpected empty result problem (zero-sized part-### files)?
To: "common-user@hadoop.apache.org" <common-user@hadoop.apache.org>
Date: Sunday, February 21, 2010, 9:49 PM


>> So, given this situation of loading mixed good and corrupted ".gz" files, how can I still
get the expected results?
Try setting mapred.max.map.failures.percent to the percentage of files you expect to be
corrupted, i.e., an acceptable data-skip percentage.
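
In mapred-site.xml (or set per-job), that might look like this sketch, where the value is just an example percentage:

```xml
<!-- Allow up to 10% of map tasks to fail without failing the job (example value) -->
<property>
  <name>mapred.max.map.failures.percent</name>
  <value>10</value>
</property>
```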

Amogh

On 2/21/10 7:17 AM, "jiang licht" <licht_jiang@yahoo.com> wrote:

I think I found what caused the problem. The folder loaded into 'a' contains all ".gz" files,
and some of them are corrupted. As a result, "java.io.EOFException: Unexpected end of ZLIB
input stream" was thrown.

I did the following test: I truncated a ".gz" file and named it "corrupted.gz". Then I loaded
only this file into 'a' and executed the rest of the script. This caused exactly the same error
message as in the 1st post. The same thing happens when loading both this file and other good
gz files.
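
The truncation test described above can be reproduced from a shell (file names follow the post; the content is arbitrary):

```shell
# Make a valid archive, then cut it off mid-stream
echo "some data" | gzip > original.gz
head -c 10 original.gz > corrupted.gz   # keeps only the 10-byte gzip header

gzip -t original.gz && echo "original.gz: OK"
gzip -t corrupted.gz || echo "corrupted.gz fails the integrity check"
```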

My guess is that such corrupted files are not loaded (since the above exception is thrown), but
data from the good .gz files still gets loaded. Then why is an empty result (0-sized part-####)
generated? So, given this situation of loading mixed good and corrupted ".gz" files, how can I
still get the expected results?

Thanks!


Michael

--- On Sat, 2/20/10, Ashutosh Chauhan <ashutosh.chauhan@gmail.com> wrote:

From: Ashutosh Chauhan <ashutosh.chauhan@gmail.com>
Subject: Re: Unexpected empty result problem (zero-sized part-### files)?
To: common-user@hadoop.apache.org
Date: Saturday, February 20, 2010, 7:29 PM

A log file with a name like pig_1234567890.log should be sitting in the directory from which
you launched your pig script. Can you send its contents?

Ashutosh

On Sat, Feb 20, 2010 at 16:41, jiang licht <licht_jiang@yahoo.com> wrote:

> I have a pig script as follows (see far below). It loads 2 data sets, performs some
> filtering, then joins the two sets. Lastly it counts occurrences of a combination of fields
> and writes the results to HDFS.
>
> --load raw data
> a = LOAD 'foldera/*';
> b = LOAD 'somefile';
>
> --choose rows and columns
> a_filtered = FILTER a BY somecondition;
> a_filtered_shortened = FOREACH a_filtered GENERATE somefields;
> a_filtered_shortened_unique = DISTINCT a_filtered_shortened PARALLEL #;
>
> --join a & b and count occurrences of a combination of fields
> ab = JOIN a_filtered_shortened_unique BY somefield, b BY somefield PARALLEL #;
> ab_shortened = FOREACH ab GENERATE somefields;
> ab_shortened_grouped = GROUP ab_shortened BY ($0, $1) PARALLEL #;
>
> --c will contain: fields, counts
> c = FOREACH ab_shortened_grouped GENERATE FLATTEN($0), COUNT(ab_shortened);
>
> --save results
> STORE c INTO 'MYRESULTS' USING PigStorage();
>
> PROBLEM is that empty results (empty part-#### files) were generated, but a non-empty result
> is expected. For example, if I choose to load only one file (instead of all files in the
> folder) into 'a', quite a number of tuples are created (non-empty part-### files).
>
> It seems to me the logic in the script is good, and it generates correct results for a
> randomly selected file anyway. So I am wondering what could cause this empty-result problem?
>
> FYI, I ran the same script multiple times and all runs gave me empty part-### files. In the
> output I did repeatedly see error messages similar to the following, which show that one
> result file failed to be produced (these are the last lines of the job output). Could this
> be the problem? How can I locate the cause? Thanks!
>
> ...
> 2010-02-20 16:21:37,737 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
> - 86% complete
> 2010-02-20 16:21:38,239 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
> - 87% complete
> 2010-02-20 16:21:39,265 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
> - 88% complete
> 2010-02-20 16:21:44,286 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
> - 93% complete
> 2010-02-20 16:21:46,931 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
> - 95% complete
> 2010-02-20 16:21:47,432 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
> - 99% complete
> 2010-02-20 16:21:54,005 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
> - 100% complete
> 2010-02-20 16:21:54,005 [main] ERROR
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
> - 1 map reduce job(s) failed!
> 2010-02-20 16:21:54,008 [main] ERROR
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
> - Failed to produce result in:
> "hdfs://hostA:50001/tmp/temp829697187/tmp-531977953"
> 2010-02-20 16:21:54,008 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
> - Successfully stored result in:
> "hdfs://hostA:50001/tmp/temp829697187/tmp504533728"
> 2010-02-20 16:21:54,023 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
> - Successfully stored result in: "hdfs://hostA:50001/user/root/MYRESULTS"
> 2010-02-20 16:21:54,056 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
> - Records written : 0
> 2010-02-20 16:21:54,056 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
> - Bytes written : 0
> 2010-02-20 16:21:54,056 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
> - Some jobs have failed!
>