crunch-user mailing list archives

From Gabriel Reid <gabriel.r...@gmail.com>
Subject Re: Issue with AvroPathPerKeyTarget in Crunch while writing data to multiple files for each of the keys of the PTable
Date Thu, 29 May 2014 18:43:10 GMT
Hey Som,

No, no need for a custom partitioner or special GroupByOptions when
you're using the AvroPathPerKeyTarget. As you probably know, it's
definitely a good idea to have all values under the same key next to
each other in the PTable that is being output.
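
Just to spell it out, the plain form should be all you need; here's a
rough sketch (the value type and output path are placeholders, not taken
from your pipeline):

    import org.apache.crunch.PTable;
    import org.apache.crunch.io.avro.AvroPathPerKeyTarget;

    public class PerKeyWrite {
      // groupByKey() brings all values for a key together, so the grouped
      // table can be handed straight to the target; no custom Partitioner
      // or grouping options are needed.
      public static <V> void writePerKey(PTable<String, V> table, String out) {
        table.groupByKey().write(new AvroPathPerKeyTarget(out));
      }
    }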

Any chance you could try this with a build from the current head of
the 0.8 branch? It's named apache-crunch-0.8 in git. This really
sounds like it's related to CRUNCH-316, so it would be good if we
could check if that fix corrects this issue or not.

- Gabriel


On Thu, May 29, 2014 at 7:46 PM, Som Satpathy <somsatpathy@gmail.com> wrote:
> Hi Josh/Gabriel,
>
> This problem has been confounding us for a while. Do we need to pass a
> custom Partitioner or pass specific GroupByOptions into the groupBy to make
> it work with the AvroPathPerKeyTarget? I assume there is no need for that.
>
> Thanks,
> Som
>
>
> On Wed, May 28, 2014 at 7:46 AM, Suraj Satishkumar Sheth
> <surajsat@adobe.com> wrote:
>>
>> Hi Josh,
>>
>> Thanks for the quick response
>>
>>
>>
>> Here are the logs :
>>
>> org.apache.avro.AvroRuntimeException: java.io.IOException: Invalid sync!
>>   at org.apache.avro.file.DataFileStream.hasNext(DataFileStream.java:210)
>>   at org.apache.crunch.types.avro.AvroRecordReader.nextKeyValue(AvroRecordReader.java:66)
>>   at org.apache.crunch.impl.mr.run.CrunchRecordReader.nextKeyValue(CrunchRecordReader.java:157)
>>   at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:483)
>>   at org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue(MapContextImpl.java:76)
>>   at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue(WrappedMapper.java:85)
>>   at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:139)
>>   at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:672)
>>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:330)
>>   at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
>>   at java.security.AccessController.doPrivileged(Native Method)
>>   at javax.security.auth.Subject.doAs(Subject.java:415)
>>   at org.apache.hadoop.security.UserGroupInformation.d
>>
>>
>>
>> Even when we read the output of AvroPathPerKeyTarget into a PCollection
>> and try to count the number of records in the PCollection, we get the same
>> error.
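>>
>> (The count check is essentially the following sketch; the record class,
>> pipeline object, key, and paths here are placeholders:)
>>
>> import org.apache.crunch.PCollection;
>> import org.apache.crunch.io.From;
>> import org.apache.crunch.types.avro.Avros;
>>
>> // Read one key's output directory back and count the records; the
>> // "Invalid sync!" error shows up while iterating over a corrupted file.
>> PCollection<MyRecord> out = pipeline.read(
>>     From.avroFile(outPath + "/" + key, Avros.records(MyRecord.class)));
>> long count = 0;
>> for (MyRecord record : out.materialize()) {
>>   count++;
>> }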
>>
>> The strange thing is that this occurs only occasionally (once in 3-4 runs),
>> even when we run the job on the same data multiple times.
>>
>>
>>
>>
>>
>> The versions being used :
>>
>> Avro – 1.7.5
>>
>> Crunch - 0.8.2-hadoop2
>>
>>
>>
>> Thanks and Regards,
>>
>> Suraj Sheth
>>
>>
>>
>> From: Josh Wills [mailto:jwills@cloudera.com]
>> Sent: Wednesday, May 28, 2014 7:56 PM
>> To: user@crunch.apache.org
>> Subject: Re: Issue with AvroPathPerKeyTarget in Crunch while writing data
>> to multiple files for each of the keys of the PTable
>>
>>
>>
>> That sounds super annoying. Which version are you using? There was this
>> issue that is fixed in master, but not in any release yet. (I'm trying to
>> get one out this week if at all possible.)
>>
>>
>>
>> https://issues.apache.org/jira/browse/CRUNCH-316
>>
>>
>>
>> Can you check your logs for that in-memory buffer error?
>>
>>
>>
>> On Wed, May 28, 2014 at 7:11 AM, Suraj Satishkumar Sheth
>> <surajsat@adobe.com> wrote:
>>
>> Hi,
>>
>> We have a use case with a PTable that consists of 30 keys and millions of
>> values per key. We want to write the values for each key into a separate
>> set of files.
>>
>> Although creating 30 different PTables using filter and then writing each
>> of them to HDFS works for us, it is highly inefficient.
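>>
>> (The filter-based workaround looks roughly like the sketch below; the key
>> list, record class, and paths are placeholders:)
>>
>> import org.apache.crunch.FilterFn;
>> import org.apache.crunch.Pair;
>> import org.apache.crunch.io.To;
>>
>> // One filter + write per key, as described above; correct, but far more
>> // work than a single grouped write for 30 keys.
>> for (final String key : keys) {
>>   finalRecords
>>       .filter(new FilterFn<Pair<String, MyRecord>>() {
>>         @Override
>>         public boolean accept(Pair<String, MyRecord> input) {
>>           return key.equals(input.first());
>>         }
>>       })
>>       .values()
>>       .write(To.avroFile(outPath + "/" + key));
>> }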
>>
>>
>>
>> I have been trying to use AvroPathPerKeyTarget to write the data from a
>> PTable into a separate set of files for each key.
>>
>>
>>
>> So, the usage is something like this :
>>
>> finalRecords.groupByKey().write(new AvroPathPerKeyTarget(outPath));
>>
>>
>>
>> where finalRecords is a PTable whose keys are Strings and whose values are
>> Avro records.
>>
>>
>>
>> We have verified that the data contains exactly 30 unique keys. Some keys
>> have a few million records, while others have only a few thousand.
>>
>>
>>
>> Expectation : It will divide the data into 30 parts and write them to the
>> specified location in HDFS, creating a directory for each key. We will then
>> be able to read the data back as a PCollection<Avro> in our next job.
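>>
>> (Reading one key's directory back in the next job would look roughly like
>> this; the job class, record class, key name, and paths are placeholders:)
>>
>> import org.apache.crunch.PCollection;
>> import org.apache.crunch.Pipeline;
>> import org.apache.crunch.impl.mr.MRPipeline;
>> import org.apache.crunch.io.From;
>> import org.apache.crunch.types.avro.Avros;
>> import org.apache.hadoop.conf.Configuration;
>>
>> // The target lays the output out as one subdirectory per key, so the next
>> // job reads a single key's subdirectory as a PCollection of Avro records.
>> Pipeline next = new MRPipeline(NextJob.class, new Configuration());
>> PCollection<MyRecord> perKey = next.read(
>>     From.avroFile(outPath + "/someKey", Avros.records(MyRecord.class)));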
>>
>>
>>
>> Issue : It creates 30 different directories for the keys, and all of the
>> directories contain data of non-zero size.
>>
>>        But, occasionally, a few files get corrupted. When we try to read
>> one of them into a PCollection<Avro> and use it, it throws an error :
>>
>>        Caused by: java.io.IOException: Invalid sync!
>>
>>
>>
>> Symptoms : The issue occurs intermittently, roughly once in every 3-4 runs,
>> and only one or two files among the 30 get corrupted in a given run.
>>
>>            The file size of a corrupted Avro file is either much higher or
>> much lower than expected. E.g. where we expect a file of 100MB, we may get
>> a file of 30MB or 250MB when it is corrupted due to
>> AvroPathPerKeyTarget.
>>
>>
>>
>> We increased the number of reducers to 500 so that no two of the 30 keys
>> would go to the same reducer. In spite of this change, we still see the
>> error.
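>>
>> (For reference, one way to raise the reduce parallelism in Crunch; a
>> sketch, not necessarily the exact call we used:)
>>
>> // Passing a partition count to groupByKey sets the number of reducers;
>> // GroupingOptions.builder().numReducers(500) is the equivalent explicit form.
>> finalRecords.groupByKey(500).write(new AvroPathPerKeyTarget(outPath));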
>>
>>
>>
>> Any ideas or suggestions to fix this issue, or an explanation of what is
>> going on, would be helpful.
>>
>>
>>
>>
>>
>> Thanks and Regards,
>>
>> Suraj Sheth
>>
>>
>>
>>
>>
>> --
>>
>> Director of Data Science
>>
>> Cloudera
>>
>> Twitter: @josh_wills
>
>
