crunch-user mailing list archives

From Josh Wills <jwi...@cloudera.com>
Subject Re: Issue with AvroPathperKeyTarget in crunch while writing data to multiple files for each of the keys of the PTable
Date Wed, 28 May 2014 14:25:31 GMT
That sounds super annoying. Which version are you using? There's this
issue, which is fixed in master but not in any release yet. (I'm trying to
get one out this week if at all possible.)

https://issues.apache.org/jira/browse/CRUNCH-316

Can you check your logs for that in-memory buffer error?


On Wed, May 28, 2014 at 7:11 AM, Suraj Satishkumar Sheth <surajsat@adobe.com> wrote:

>  Hi,
>
> We have a use case with a PTable that consists of 30 keys and millions of
> values per key. We want to write the values for each of the keys into
> separate files.
>
> Although creating 30 different PTables using filter and then writing each
> of them to HDFS works for us, it is highly inefficient.
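
For reference, a rough sketch of that filter-per-key workaround in Crunch; MyRecord,
keys, and outPath are placeholders rather than the actual job's names:

    import org.apache.crunch.FilterFn;
    import org.apache.crunch.PTable;
    import org.apache.crunch.Pair;
    import org.apache.crunch.io.To;

    // One filter + write per key. Keeping this in a static method means the
    // anonymous FilterFn has no enclosing instance to drag along when it is
    // serialized out to the tasks.
    static void writePerKey(PTable<String, MyRecord> table,
                            Iterable<String> keys,       // the 30 known keys
                            String outPath) {
      for (final String key : keys) {
        table
            .filter(new FilterFn<Pair<String, MyRecord>>() {
              @Override
              public boolean accept(Pair<String, MyRecord> input) {
                return key.equals(input.first());
              }
            })
            .values()
            .write(To.avroFile(outPath + "/" + key));    // one output directory per key
      }
    }

Each filter is a full pass over the data, which is why doing this 30 times is so
much more expensive than a single grouped write.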
>
>
>
> I have been trying to write the data from a PTable into multiple files, one
> per key, using AvroPathPerKeyTarget.
>
>
>
> So, the usage is something like this:
>
>     finalRecords.groupByKey().write(new AvroPathPerKeyTarget(outPath));
>
> where finalRecords is a PCollection of Avro records
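
In context, with the pipeline boilerplate around it, that would look roughly like
this (a sketch; MyJob, MyRecord, buildFinalRecords, and outPath are placeholders,
and I'm assuming the AvroPathPerKeyTarget in org.apache.crunch.io.avro):

    import org.apache.crunch.PTable;
    import org.apache.crunch.Pipeline;
    import org.apache.crunch.impl.mr.MRPipeline;
    import org.apache.crunch.io.avro.AvroPathPerKeyTarget;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;

    Pipeline pipeline = new MRPipeline(MyJob.class, new Configuration());
    PTable<String, MyRecord> finalRecords = buildFinalRecords(pipeline);  // however the table is built

    finalRecords
        .groupByKey()
        .write(new AvroPathPerKeyTarget(new Path(outPath)));  // one sub-directory per key under outPath

    pipeline.done();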
>
>
>
> We have verified that the data contains exactly 30 unique keys. Some keys
> have a few million values, while a few others have only a few thousand.
>
>
>
> Expectation: It will divide the data into 30 parts and write them to the
> specified location in HDFS, creating a directory for each key. We will be
> able to read the data back as a PCollection<Avro> later for our next job.
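
Reading one key's directory back for the next job would presumably look something
like this (a sketch; MyRecord stands in for the specific Avro class, and the
per-key sub-directory is assumed to be named after the key):

    import org.apache.crunch.PCollection;
    import org.apache.crunch.io.From;
    import org.apache.crunch.types.avro.Avros;

    // Read the Avro files written under <outPath>/<key> back into a PCollection
    // for the next job.
    PCollection<MyRecord> keyRecords =
        pipeline.read(From.avroFile(outPath + "/" + key, Avros.specifics(MyRecord.class)));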
>
>
>
> Issue: It is able to create 30 different directories for the keys, and all
> of the directories contain data of non-zero size.
>
> But occasionally a few files get corrupted. When we try to read such a file
> into a PCollection<Avro> and use it, it throws an error:
>
>     Caused by: java.io.IOException: Invalid sync!
>
>
>
> Symptoms: The issue occurs intermittently, roughly once in 3-4 runs, and
> only one or two files among the 30 get corrupted in that run.
>
> The size of a corrupted Avro file is either much larger or much smaller than
> expected. E.g. where we expect a file of 100MB, we get a file of 30MB or
> 250MB when it is corrupted by AvroPathPerKeyTarget.
>
>
>
> We increased the number of reducers to 500 so that no two of the 30 keys go
> to the same reducer. In spite of this change, we still see the error.
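
For reference, the usual way to request that in Crunch is to pass the partition
count to groupByKey, presumably something along these lines (a sketch, reusing
the placeholders from above):

    // Ask for 500 reduce partitions; with only 30 keys, the intent is that
    // no two keys hash to the same reducer (barring hash collisions).
    finalRecords
        .groupByKey(500)
        .write(new AvroPathPerKeyTarget(new Path(outPath)));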
>
>
>
> Any ideas or suggestions to fix this issue, or an explanation of why it
> happens, would be helpful.
>
>
>
>
>
> Thanks and Regards,
>
> Suraj Sheth
>



-- 
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>
