crunch-user mailing list archives

From Som Satpathy <somsatpa...@gmail.com>
Subject Re: Issue with AvroPathPerKeyTarget in crunch while writing data to multiple files for each of the keys of the PTable
Date Thu, 29 May 2014 17:46:45 GMT
Hi Josh/Gabriel,

This problem has been confounding us for a while. Do we need to pass a
custom Partitioner or specific GroupingOptions into the groupByKey to make
it work with AvroPathPerKeyTarget? I assume that should not be necessary.
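
For concreteness, this is the variant we would try if that turns out to be
needed (a sketch only; HashPartitioner is just a stand-in for a custom
partitioner, and the reducer count is arbitrary):

    import org.apache.crunch.GroupingOptions;
    import org.apache.crunch.io.avro.AvroPathPerKeyTarget;
    import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

    // Sketch: explicit grouping options on the groupByKey that feeds the target.
    GroupingOptions opts = GroupingOptions.builder()
        .numReducers(30)                          // e.g. one reducer per key
        .partitionerClass(HashPartitioner.class)  // stand-in for a custom Partitioner
        .build();
    finalRecords.groupByKey(opts).write(new AvroPathPerKeyTarget(outPath));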

Thanks,
Som


On Wed, May 28, 2014 at 7:46 AM, Suraj Satishkumar Sheth <surajsat@adobe.com> wrote:

>  Hi Josh,
>
> Thanks for the quick response
>
>
>
> Here are the logs:
>
> org.apache.avro.AvroRuntimeException: java.io.IOException: Invalid sync!
>     at org.apache.avro.file.DataFileStream.hasNext(DataFileStream.java:210)
>     at org.apache.crunch.types.avro.AvroRecordReader.nextKeyValue(AvroRecordReader.java:66)
>     at org.apache.crunch.impl.mr.run.CrunchRecordReader.nextKeyValue(CrunchRecordReader.java:157)
>     at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:483)
>     at org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue(MapContextImpl.java:76)
>     at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue(WrappedMapper.java:85)
>     at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:139)
>     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:672)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:330)
>     at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:415)
>     at org.apache.hadoop.security.UserGroupInformation.d
>
>
>
> Even when we read the output of AvroPathPerKeyTarget into a PCollection
> and try to count the number of records in the PCollection, we get the same
> error.
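>
> For reference, the read-back and count we do is roughly of this form (a
> sketch only; MyRecord and the concrete paths are placeholders):
>
>     import org.apache.crunch.PCollection;
>     import org.apache.crunch.io.From;
>     import org.apache.crunch.lib.Aggregate;
>     import org.apache.crunch.types.avro.Avros;
>     import org.apache.hadoop.fs.Path;
>
>     // Read one key's output directory back and count its records.
>     PCollection<MyRecord> records = pipeline.read(
>         From.avroFile(new Path(outPath + "/someKey"), Avros.records(MyRecord.class)));
>     long count = Aggregate.length(records).getValue();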
>
> The strange thing is that it occurs only intermittently (roughly once every
> 3-4 runs), even when we run the same job on the same data multiple times.
>
>
>
>
>
> The versions being used:
>
> Avro - 1.7.5
>
> Crunch - 0.8.2-hadoop2
>
>
>
> Thanks and Regards,
>
> Suraj Sheth
>
>
>
> From: Josh Wills [mailto:jwills@cloudera.com]
> Sent: Wednesday, May 28, 2014 7:56 PM
> To: user@crunch.apache.org
> Subject: Re: Issue with AvroPathPerKeyTarget in crunch while writing
> data to multiple files for each of the keys of the PTable
>
>
>
> That sounds super annoying. Which version are you using? There was this
> issue that is fixed in master, but not in any release yet. (I'm trying to
> get one out this week if at all possible.)
>
>
>
> https://issues.apache.org/jira/browse/CRUNCH-316
>
>
>
> Can you check your logs for that in-memory buffer error?
>
>
>
> On Wed, May 28, 2014 at 7:11 AM, Suraj Satishkumar Sheth <surajsat@adobe.com> wrote:
>
> Hi,
>
> We have a use case with a PTable that contains 30 keys and millions of
> values per key. We want to write the values for each key into a separate
> set of files.
>
> Creating 30 different PTables using filter and then writing each of them to
> HDFS works for us, but it is highly inefficient.
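>
> A rough sketch of that filter-based approach, for reference (the key list
> and the MyRecord type are placeholders; the anonymous FilterFn is shown
> only for brevity and should be a static class in practice so it serializes
> cleanly):
>
>     import org.apache.crunch.FilterFn;
>     import org.apache.crunch.Pair;
>     import org.apache.crunch.io.To;
>
>     // One filter + write per key; the 30 keys have to be known up front.
>     for (final String key : keys) {
>       finalRecords
>           .filter(new FilterFn<Pair<String, MyRecord>>() {
>             @Override
>             public boolean accept(Pair<String, MyRecord> input) {
>               return key.equals(input.first());
>             }
>           })
>           .values()
>           .write(To.avroFile(outPath + "/" + key));
>     }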
>
>
>
> I have been trying to write the data from a PTable into multiple files,
> one set per key, using AvroPathPerKeyTarget.
>
>
>
> So, the usage is something like this:
>
> finalRecords.groupByKey().write(new AvroPathPerKeyTarget(outPath));
>
> where finalRecords is a PTable whose keys are Strings and whose values are
> Avro records.
>
>
>
> We have verified that the data contains exactly 30 unique keys. A few of
> the keys have a few million records each, while a few others have only a
> few thousand.
>
>
>
> Expectation: It will divide the data into 30 parts and write them to the
> specified location in HDFS, creating a directory for each key. We will then
> be able to read the data back as a PCollection of Avro records for our next job.
>
>
>
> Issue: It does create 30 different directories, one per key, and all of the
> directories contain data of non-zero size.
>
> However, occasionally a few files get corrupted. When we try to read them
> back into a PCollection of Avro records and use them, we get:
>
>        Caused by: java.io.IOException: Invalid sync!
>
>
>
> Symptoms: The issue occurs intermittently, roughly once in every 3-4 runs,
> and only one or two of the 30 files get corrupted in that run.
>
> The size of a corrupted Avro file is either much higher or much lower than
> expected. For example, where we expect a file of around 100MB, we get a file
> of 30MB or 250MB when it is corrupted by AvroPathPerKeyTarget.
>
>
>
> We increased the number of reducers to 500 so that no two of the 30 keys
> would go to the same reducer. In spite of this change, we still see the
> error.
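>
> That change amounted to something of this form (a sketch; the groupByKey
> overload that takes an explicit partition count):
>
>     finalRecords.groupByKey(500).write(new AvroPathPerKeyTarget(outPath));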
>
>
>
> Any ideas or suggestions for fixing this, or an explanation of the issue,
> would be very helpful.
>
>
>
>
>
> Thanks and Regards,
>
> Suraj Sheth
>
>
>
>
>
> --
>
> Director of Data Science
>
> Cloudera <http://www.cloudera.com>
>
> Twitter: @josh_wills <http://twitter.com/josh_wills>
>
