hadoop-pig-dev mailing list archives

From Mridul <mrid...@yahoo-inc.com>
Subject Re: [jira] Updated: (PIG-570) Large BZip files Seem to lose data in Pig
Date Tue, 30 Dec 2008 08:06:11 GMT

Something similar existed with PigStorage, IIRC (at least the last time I 
checked it a while back, unless I missed something):
if a record boundary aligned with an HDFS block boundary, the subsequent 
record would get dropped by Pig.

To illustrate (sketched in code below):
map1 reads until the end of its block or the last record boundary, 
whichever happens last.
map2 assumes map1 did a partial read past the split point, and scans 
forward to the first record delimiter in its block before reading from there on.
Hence, if map1's last record boundary coincides with the end of the HDFS 
block, map2 ends up skipping the first record of its block.
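
A rough sketch of that slice-reading logic, with made-up class and method 
names (illustrative only, not Pig's actual reader):

    import java.io.IOException;

    // Minimal interface a slice reader needs from the underlying
    // stream; purely illustrative.
    interface RecordStream {
        void seek(long pos) throws IOException;
        long getPos() throws IOException;
        // Advance past the next record delimiter (e.g. '\n').
        void skipToNextDelimiter() throws IOException;
        // Read one record; may read past the slice end to finish it.
        boolean readRecord(StringBuilder out) throws IOException;
    }

    class SliceReader {
        private final RecordStream in;
        private final long end;

        SliceReader(RecordStream in, long start, long end) throws IOException {
            this.in = in;
            this.end = end;
            in.seek(start);
            if (start != 0) {
                // Assume the previous mapper read past its boundary to
                // finish a partial record, so skip to the next delimiter.
                // BUG: if the previous slice happened to end exactly on a
                // record boundary, this throws away one complete record.
                in.skipToNextDelimiter();
            }
        }

        boolean next(StringBuilder record) throws IOException {
            // Only start records within this slice; the last record may
            // spill over into the next slice.
            if (in.getPos() > end) {
                return false;
            }
            return in.readRecord(record);
        }
    }

The two mappers' assumptions only compose correctly when the slice boundary 
is guaranteed to fall strictly inside a record; when it lands exactly on a 
delimiter, one side of the hand-off drops a record.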

Not sure if a similar thing is happening here.

Regards,
Mridul

Benjamin Reed (JIRA) wrote:
>      [ https://issues.apache.org/jira/browse/PIG-570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
>
> Benjamin Reed updated PIG-570:
> ------------------------------
>
>     Attachment: PIG-570.patch
>
> I believe the problem is due to bad position tracking. In the current version of the code, we chop up the input into blocks, but unfortunately when using bzip there are bzip block boundaries, HDFS block boundaries, and record boundaries. If the bzip block boundaries line up too closely with the others, a record could get skipped or possibly corrupted.
>
> I was able to reproduce a problem in the attached test case; hopefully it is the same as yours.
>
> The root cause turns out to be improper tracking of "position": if we blindly use the position of the underlying stream, and a bzip block and an HDFS block line up, we may think that we have read the first record of the next slice when in fact we have only read the bzip block header.
>
> The attached patch fixes the problem by defining the position of the stream as the position of the start of the current block header in the underlying stream.
>
>   
>> Large BZip files Seem to lose data in Pig
>> -------------------------------------------
>>
>>                 Key: PIG-570
>>                 URL: https://issues.apache.org/jira/browse/PIG-570
>>             Project: Pig
>>          Issue Type: Bug
>>    Affects Versions: types_branch, 0.0.0, 0.1.0, site
>>         Environment: Pig 0.1.1/Linux / 8 Nodes hadoop 0.18.2
>>            Reporter: Alex Newman
>>             Fix For: types_branch, 0.0.0, 0.1.0, site
>>
>>         Attachments: PIG-570.patch
>>
>>
>> So I don't believe bzip2 input to Pig is working, at least not with large files. It seems as though map files are getting cut off: the maps complete way too quickly, and the actual row of data that Pig tries to process often gets randomly cut and becomes incomplete. Here are my symptoms:
>> - Maps seem to be completing at an unbelievably fast rate
>> With uncompressed data
>> Status: Succeeded
>> Started at: Wed Dec 17 21:31:10 EST 2008
>> Finished at: Wed Dec 17 22:42:09 EST 2008
>> Finished in: 1hrs, 10mins, 59sec
>> Kind	% Complete	Num Tasks	Pending	Running	Complete	Killed	Failed/Killed Task Attempts
>> map	100.00%	4670	0	0	4670	0	0 / 21
>> reduce	57.72%	13	0	0	13	0	0 / 4
>> With bzip compressed data
>> Started at: Wed Dec 17 21:17:28 EST 2008
>> Failed at: Wed Dec 17 21:17:52 EST 2008
>> Failed in: 24sec
>> Black-listed TaskTrackers: 2
>> Kind	% Complete	Num Tasks	Pending	Running	Complete	Killed	Failed/Killed Task Attempts
>> map	100.00%	183	0	0	15	168	54 / 22
>> reduce	100.00%	13	0	0	0	13	0 / 0
>> The errors we get:
>> java.lang.IndexOutOfBoundsException: Requested index 11 from tuple (rec	A, 0HAW, CHIX, )
>> 	at org.apache.pig.data.Tuple.getField(Tuple.java:176)
>> 	at org.apache.pig.impl.eval.ProjectSpec.eval(ProjectSpec.java:84)
>> 	at org.apache.pig.impl.eval.SimpleEvalSpec$1.add(SimpleEvalSpec.java:38)
>> 	at org.apache.pig.impl.eval.EvalSpec.simpleEval(EvalSpec.java:223)
>> 	at org.apache.pig.impl.eval.cond.CompCond.eval(CompCond.java:58)
>> 	at org.apache.pig.impl.eval.FilterSpec$1.add(FilterSpec.java:60)
>> 	at org.apache.pig.backend.hadoop.executionengine.mapreduceExec.PigMapReduce.run(PigMapReduce.java:117)
>> 	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227)
>> 	at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2207)
>> Last 4KB
>> attempt_200812161759_0045_m_000007_0	task_200812161759_0045_m_000007	tsdhb06.factset.com	FAILED
>> java.lang.IndexOutOfBoundsException: Requested index 11 from tuple (rec	A, CSGN, VTX, VTX, 0, 20080303, 90919, 380, 1543, 206002)
>> 	at org.apache.pig.data.Tuple.getField(Tuple.java:176)
>> 	at org.apache.pig.impl.eval.ProjectSpec.eval(ProjectSpec.java:84)
>> 	at org.apache.pig.impl.eval.SimpleEvalSpec$1.add(SimpleEvalSpec.java:38)
>> 	at org.apache.pig.impl.eval.EvalSpec.simpleEval(EvalSpec.java:223)
>> 	at org.apache.pig.impl.eval.cond.CompCond.eval(CompCond.java:58)
>> 	at org.apache.pig.impl.eval.FilterSpec$1.add(FilterSpec.java:60)
>> 	at org.apache.pig.backend.hadoop.executionengine.mapreduceExec.PigMapReduce.run(PigMapReduce.java:117)
>> 	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227)
>> 	at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2207)
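
For completeness, here is a minimal sketch of the position-tracking fix Ben 
describes above: report the raw offset of the start of the current bzip 
block header rather than the raw read position. The class name and the 
onBlockHeader() hook are assumptions for illustration, not the actual 
contents of PIG-570.patch:

    import java.io.IOException;
    import java.io.InputStream;

    // Wraps the raw (compressed) HDFS stream. A cooperating bzip
    // decompressor is assumed to call onBlockHeader() each time it
    // starts decoding a new bzip block.
    class BlockAnchoredPositionStream extends InputStream {
        private final InputStream raw;
        private long rawPos;             // bytes consumed from the raw stream
        private long currentBlockStart;  // raw offset of the current block header

        BlockAnchoredPositionStream(InputStream raw) {
            this.raw = raw;
        }

        // Anchor the reported position to the block header, so a slice
        // reader that has only consumed a header (and no records) does
        // not appear to have advanced into the next slice.
        void onBlockHeader() {
            currentBlockStart = rawPos;
        }

        @Override
        public int read() throws IOException {
            int b = raw.read();
            if (b >= 0) {
                rawPos++;
            }
            return b;
        }

        // The position a slice reader should compare against its end offset.
        long getReportedPos() {
            return currentBlockStart;
        }
    }

Defined this way, a bzip block boundary coinciding with an HDFS block 
boundary no longer looks like "we already read a record from the next slice."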

