crunch-user mailing list archives

From Josh Wills <jwi...@cloudera.com>
Subject Re: Reading Avro to GenericRecord
Date Tue, 28 Jan 2014 14:31:10 GMT
Committed this as CRUNCH-334. Thanks Magnus!


On Tue, Jan 28, 2014 at 1:07 AM, Magnus Runesson <magru@linuxalert.org> wrote:

>  Thanks! Looks like it works for me.
>
> Here is a patch to expose it to scrunch:
>
> diff --git a/crunch-scrunch/src/main/scala/org/apache/crunch/scrunch/IO.scala b/crunch-scrunch/src/main/scala/org/apache/crunch/scrunch/IO.scala
> index 89b331b..b77b042 100644
> --- a/crunch-scrunch/src/main/scala/org/apache/crunch/scrunch/IO.scala
> +++ b/crunch-scrunch/src/main/scala/org/apache/crunch/scrunch/IO.scala
> @@ -19,11 +19,14 @@ package org.apache.crunch.scrunch
>
>  import org.apache.crunch.io.{From => from, To => to, At => at}
>  import org.apache.crunch.types.avro.AvroType
> -import org.apache.hadoop.fs.Path;
> +import org.apache.hadoop.fs.Path
> +import org.apache.hadoop.conf.Configuration
> +;
>
>  trait From {
>    def avroFile[T](path: String, atype: AvroType[T]) = from.avroFile(path, atype)
>    def avroFile[T](path: Path, atype: AvroType[T]) = from.avroFile(path, atype)
> +  def avroFile[T](path: Path, conf: Configuration) = from.avroFile(path, conf)
>    def textFile(path: String) = from.textFile(path)
>    def textFile(path: Path) = from.textFile(path)
>
>  }
>
>
> On 1/28/14 2:04 AM, Josh Wills wrote:
>
> Patch is here: https://issues.apache.org/jira/browse/CRUNCH-333
>
>
> On Mon, Jan 27, 2014 at 10:08 AM, Josh Wills <josh.wills@gmail.com> wrote:
>
>> Of course. I wrote up a little patch that adds a method to From.java to
>> open the Avro file and pull out the schema and return a Source of
>> GenericData.Record, but I had to roll to some meetings before I got a
>> chance to test it. I'll post something later this evening ET.
>>  On Jan 27, 2014 11:56 AM, "Magnus Runesson" <magru@linuxalert.org>
>> wrote:
>>
>>> Thanks for the quick answer.
>>>
>>> It is totally OK and reasonable to take one file in a directory and
>>> assume all the others have the same schema.
>>>
>>>
>>> On 2014-01-27 18:27, Josh Wills wrote:
>>>
>>> No, I haven't written a way to do that yet, and I feel bad about it-- a
>>> Clouderan asked me for just such a feature a couple of weeks ago and it
>>> slipped my mind. I don't think it's hard to do, just a little tedious and
>>> will require refreshing my memory of the Avro APIs. There's also the
>>> potential issue that multiple Avro files in the same input directory can
>>> have different schemas, so the one we would end up reading might be
>>> somewhat arbitrary (e.g., based on the timestamp of the files in the
>>> directory, or some such thing)-- is that ok?
>>>
>>>
>>> On Mon, Jan 27, 2014 at 9:12 AM, Magnus Runesson <magru@linuxalert.org> wrote:
>>>
>>>> Can I read an Avro file into a GenericRecord in (s)crunch without
>>>> providing the schema? I want crunch to get the schema from the Avro
>>>> file it reads. How do I do that?
>>>>
>>>> /Magnus
>>>>
>>>
>>>
>>>
>
>
>  --
> Director of Data Science
> Cloudera <http://www.cloudera.com>
> Twitter: @josh_wills <http://twitter.com/josh_wills>
>
>
>


-- 
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>
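
The approach discussed in this thread works because an Avro data file stores its writer schema in the file header, under the metadata key "avro.schema". The following is a minimal, dependency-free sketch of pulling that schema out by hand. It is not the code from CRUNCH-333, which relies on the Avro and Crunch libraries; the class name AvroHeaderSchema and the method readSchemaJson are hypothetical names chosen for illustration.

```java
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: extracts the embedded schema JSON from an Avro data file header. */
final class AvroHeaderSchema {

    /** Reads one Avro long (variable-length, zigzag-encoded). */
    static long readLong(InputStream in) throws IOException {
        long raw = 0;
        int shift = 0;
        while (true) {
            int b = in.read();
            if (b < 0) throw new EOFException("truncated varint");
            raw |= (long) (b & 0x7f) << shift;
            if ((b & 0x80) == 0) break;      // high bit clear: last byte
            shift += 7;
            if (shift > 63) throw new IOException("varint too long");
        }
        return (raw >>> 1) ^ -(raw & 1);     // undo zigzag encoding
    }

    /** Reads a length-prefixed byte sequence (Avro "bytes" and "string" share this layout). */
    static byte[] readBytes(InputStream in) throws IOException {
        long len = readLong(in);
        if (len < 0 || len > Integer.MAX_VALUE) throw new IOException("bad length " + len);
        byte[] buf = new byte[(int) len];
        new DataInputStream(in).readFully(buf);
        return buf;
    }

    /** Returns the "avro.schema" metadata value, or null if the header has none. */
    static String readSchemaJson(InputStream in) throws IOException {
        byte[] magic = new byte[4];
        new DataInputStream(in).readFully(magic);
        if (magic[0] != 'O' || magic[1] != 'b' || magic[2] != 'j' || magic[3] != 1) {
            throw new IOException("not an Avro data file");
        }
        String schema = null;
        // The metadata map is written as blocks of entries, terminated by a zero count.
        for (long count = readLong(in); count != 0; count = readLong(in)) {
            if (count < 0) {
                count = -count;
                readLong(in);                // negative count is followed by the block's byte size
            }
            for (long i = 0; i < count; i++) {
                String key = new String(readBytes(in), StandardCharsets.UTF_8);
                byte[] value = readBytes(in);
                if (key.equals("avro.schema")) {
                    schema = new String(value, StandardCharsets.UTF_8);
                }
            }
        }
        return schema;
    }
}
```

With Avro on the classpath there is no need to parse the header manually: opening the file with a DataFileReader and calling getSchema() returns the same information as a parsed Schema object. The sketch only reads the header of a single file, so for a directory of files it inherits the caveat raised earlier in the thread: whichever file is opened determines the schema that gets used.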
