crunch-user mailing list archives

From Magnus Runesson <ma...@linuxalert.org>
Subject Re: Reading Avro to GenericRecord
Date Tue, 28 Jan 2014 09:07:03 GMT
Thanks! Looks like it works for me.

Here is a patch to expose it to scrunch:

diff --git a/crunch-scrunch/src/main/scala/org/apache/crunch/scrunch/IO.scala b/crunch-scrunch/src/main/scala/org/apache/crunch/scrunch/IO.scala
index 89b331b..b77b042 100644
--- a/crunch-scrunch/src/main/scala/org/apache/crunch/scrunch/IO.scala
+++ b/crunch-scrunch/src/main/scala/org/apache/crunch/scrunch/IO.scala
@@ -19,11 +19,14 @@ package org.apache.crunch.scrunch

  import org.apache.crunch.io.{From => from, To => to, At => at}
  import org.apache.crunch.types.avro.AvroType
-import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.fs.Path
+import org.apache.hadoop.conf.Configuration
+;

  trait From {
    def avroFile[T](path: String, atype: AvroType[T]) = from.avroFile(path, atype)
    def avroFile[T](path: Path, atype: AvroType[T]) = from.avroFile(path, atype)
+  def avroFile[T](path: Path, conf: Configuration) = from.avroFile(path, conf)
    def textFile(path: String) = from.textFile(path)
    def textFile(path: Path) = from.textFile(path)
  }
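
For anyone following the thread, here is a minimal sketch of how the new overload might be called from Scrunch once the patch is applied. It has not been run against a cluster. The object name ReadGenericAvro and the use of args(0)/args(1) as input and output paths are made up for illustration, and the sketch assumes that Pipeline.mapReduce(Class, Configuration), read, map, write, To.textFile and done() are available in the Scrunch version in use.

```scala
import org.apache.crunch.scrunch.{From, Pipeline, To}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Hypothetical driver object; the name and argument layout are illustrative only.
object ReadGenericAvro {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    // Assumption: Scrunch exposes a Pipeline.mapReduce(Class, Configuration) factory.
    val pipeline = Pipeline.mapReduce(ReadGenericAvro.getClass, conf)

    // The overload added by the patch: no AvroType argument is passed, so the
    // schema is expected to come from the Avro file(s) under the input path.
    val records = pipeline.read(From.avroFile(new Path(args(0)), conf))

    // GenericData.Record.toString renders each record as JSON text.
    records.map(_.toString).write(To.textFile(args(1)))

    pipeline.done()
  }
}
```

As discussed further down the thread, when the input directory holds files with different schemas, the schema that gets picked may be somewhat arbitrary, so this only makes sense when all files share one schema.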


On 1/28/14 2:04 AM, Josh Wills wrote:
> Patch is here: https://issues.apache.org/jira/browse/CRUNCH-333
>
>
> On Mon, Jan 27, 2014 at 10:08 AM, Josh Wills <josh.wills@gmail.com> wrote:
>
>     Of course. I wrote up a little patch that adds a method to
>     From.java to open the Avro file and pull out the schema and return
>     a Source of GenericData.Record, but I had to roll to some meetings
>     before I got a chance to test it. I'll post something later this
>     evening ET.
>
>     On Jan 27, 2014 11:56 AM, "Magnus Runesson" <magru@linuxalert.org> wrote:
>
>         Thanks for the quick answer.
>
>         It is totally OK and reasonable to take one file in a
>         directory and assume that all the others have the same schema.
>
>
>         On 2014-01-27 18:27, Josh Wills wrote:
>>         No, I haven't written a way to do that yet, and I feel bad
>>         about it-- a Clouderan asked me for just such a feature a
>>         couple of weeks ago and it slipped my mind. I don't think
>>         it's hard to do, just a little tedious and will require
>>         refreshing my memory of the Avro APIs. There's also the
>>         potential issue that multiple Avro files in the same input
>>         directory can have different schemas, so the one we would end
>>         up reading might be somewhat arbitrary (e.g., based on the
>>         timestamp of the files in the directory, or some such
>>         thing)-- is that ok?
>>
>>
>>         On Mon, Jan 27, 2014 at 9:12 AM, Magnus Runesson
>>         <magru@linuxalert.org> wrote:
>>
>>             Can I read an Avro file into GenericRecord in (s)crunch
>>             without providing the schema? I want crunch to get the
>>             schema from the Avro file it reads. How do I do it?
>>
>>             /Magnus
>>
>>
>
>
>
>
> -- 
> Director of Data Science
> Cloudera <http://www.cloudera.com>
> Twitter: @josh_wills <http://twitter.com/josh_wills>

