flume-user mailing list archives

From Hari Shreedharan <hshreedha...@cloudera.com>
Subject Re: streaming Avro to HDFS
Date Wed, 06 Feb 2013 18:58:45 GMT
Here you are: http://flume.apache.org/FlumeDeveloperGuide.html#client  
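
The gist, along the lines of the example in that guide (the host and port below match your agent config; everything else is a placeholder, and this is an untested sketch rather than a drop-in program):

  import java.nio.charset.Charset;

  import org.apache.flume.Event;
  import org.apache.flume.EventDeliveryException;
  import org.apache.flume.api.RpcClient;
  import org.apache.flume.api.RpcClientFactory;
  import org.apache.flume.event.EventBuilder;

  public class RpcClientExample {
    public static void main(String[] args) {
      // Get the default (Avro) RPC client, pointed at the agent's Avro source.
      RpcClient client = RpcClientFactory.getDefaultInstance("10.10.10.10", 41414);
      try {
        // Build one Flume event; the body is an arbitrary byte array.
        Event event = EventBuilder.withBody("hello flume", Charset.forName("UTF-8"));
        client.append(event);  // blocks until the source acknowledges the event
      } catch (EventDeliveryException e) {
        // The event was not delivered; rebuild the client and/or retry.
        e.printStackTrace();
      } finally {
        client.close();
      }
    }
  }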


Hari  

--  
Hari Shreedharan


On Wednesday, February 6, 2013 at 10:20 AM, Alan Miller wrote:

> Thanks Hari,
>
> Are there any links to examples of how to use the RpcClient?
>
> Alan
>
> From: Hari Shreedharan [mailto:hshreedharan@cloudera.com]
> Sent: Wednesday, February 06, 2013 7:16 PM
> To: user@flume.apache.org
> Subject: Re: streaming Avro to HDFS
>
> Alan,
>
> I think this is probably because the AvroClient is not really very "smart." It is mainly
> useful for testing the AvroSource. The AvroClient reads the file passed in and sends one
> line per event (in 1.2.0; in 1.3.0+ there is an option to send all the files in a directory).
> So the events are not really sent as Avro files, and since you are using the text serializer
> they are dumped as-is. Since events can arrive out of order, your data is likely to be
> invalid Avro. Also, the newline character used to split the events may actually have been
> part of the real Avro serialization, so removing it simply made it invalid Avro.
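>
> In effect, with -F it does something like this (a simplified sketch, not the actual
> avro-client code; "client" is an org.apache.flume.api.RpcClient connected to the source):
>
>   import java.io.BufferedReader;
>   import java.io.FileReader;
>   import org.apache.flume.api.RpcClient;
>   import org.apache.flume.event.EventBuilder;
>
>   static void sendLines(RpcClient client, String path) throws Exception {
>     BufferedReader reader = new BufferedReader(new FileReader(path));
>     String line;
>     while ((line = reader.readLine()) != null) {
>       // Each "line" of your binary Avro file becomes one event body; the
>       // newline bytes that delimited it are dropped on the floor.
>       client.append(EventBuilder.withBody(line.getBytes()));
>     }
>     reader.close();
>   }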
>
> My advice would be to use the RpcClient to read the file and send the data in a valid
> format, by making sure one Avro "container" goes into one event.
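>
> For example, the simplest valid thing is to ship the whole container file as the body
> of a single event (again a sketch, assuming the file fits comfortably in memory):
>
>   import java.nio.file.Files;
>   import java.nio.file.Paths;
>   import org.apache.flume.api.RpcClient;
>   import org.apache.flume.event.EventBuilder;
>
>   // One complete Avro container file -> one Flume event.
>   static void sendAvroFile(RpcClient client, String path) throws Exception {
>     byte[] container = Files.readAllBytes(Paths.get(path));
>     client.append(EventBuilder.withBody(container));
>   }
>
> (You may also need to keep the HDFS sink's serializer from appending anything after the
> event body; check the serializer options in the user guide.)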
>
> Hari
>
> --
> Hari Shreedharan
>
> On Wednesday, February 6, 2013 at 9:58 AM, Alan Miller wrote:
> > Hi, I'm just getting started with Flume and trying to understand the flow of things.
> >
> > I have avro binary data files being generated on remote nodes, and I want to use
> > Flume (1.2.0) to stream them to my HDFS cluster at a central location. It seems I can
> > stream the data, but the resulting files on HDFS seem corrupt. Here's what I did:
> >
> > For my "master" (on the NameNode of my Hadoop cluster) I started this:
> >
> >   flume-ng agent -f agent.conf -Dflume.root.logger=DEBUG,console -n agent
> >
> > With this config:
> >
> >   agent.channels = memory-channel
> >   agent.sources = avro-source
> >   agent.sinks = hdfs-sink
> >
> >   agent.channels.memory-channel.type = memory
> >   agent.channels.memory-channel.capacity = 1000
> >   agent.channels.memory-channel.transactionCapacity = 100
> >
> >   agent.sources.avro-source.channels = memory-channel
> >   agent.sources.avro-source.type = avro
> >   agent.sources.avro-source.bind = 10.10.10.10
> >   agent.sources.avro-source.port = 41414
> >
> >   agent.sinks.hdfs-sink.type = hdfs
> >   agent.sinks.hdfs-sink.channel = memory-channel
> >   agent.sinks.hdfs-sink.hdfs.path = hdfs://namenode1:9000/flume
> >
> > On a remote node I streamed a test file like this:
> >
> >   flume-ng avro-client -H 10.10.10.10 -p 41414 -F /tmp/test.avro
> >
> > I can see the master is writing to HDFS:
> >
> >   ...
> >   13/02/06 09:37:55 INFO hdfs.BucketWriter: Creating hdfs://namenode1:9000/flume/FlumeData.1360172273684.tmp
> >   13/02/06 09:38:25 INFO hdfs.BucketWriter: Renaming hdfs://namenode1:9000/flume/FlumeData.1360172273684.tmp to hdfs://namenode1:9000/flume/FlumeData.1360172273684
> >
> > But the data doesn't seem right. The original file is 4551 bytes; the file written to
> > HDFS was only 219 bytes.
> >
> >   [localhost] $ ls -l FlumeData.1360172273684 /tmp/test.avro
> >   -rwxr-xr-x 1 amiller amiller  219 Feb  6 18:51 FlumeData.1360172273684
> >   -rwxr-xr-x 1 amiller amiller 4551 Feb 6 12:00 /tmp/test.avro
> >
> >   [localhost] $ avro cat /tmp/test.avro
> >   {"system_model": null, "nfsv4": null, "ip": null, "site": null, "nfsv3": null,
> >   "export": null, "ifnet": [{"send_bps": 1234, "recv_bps": 5678, "name": "eth0"},
> >   {"send_bps": 100, "recv_bps": 200, "name": "eth1"}, {"send_bps": 0, "recv_bps": 0,
> >   "name": "eth2"}], "disk": null, "hostname": "localhost", "total_mem": null,
> >   "ontapi_version": null, "serial_number": null, "cifs": null, "cpu_model": null,
> >   "volume": null, "time_stamp": 1357639723, "aggregate": null, "num_cpu": null,
> >   "cpu_speed_mhz": null, "hostid": null, "kernel_version": null, "qtree": null,
> >   "processor": null}
> >
> >   [localhost] $ hadoop fs -copyToLocal /flume/FlumeData.1360172273684 .
> >   [localhost] $ avro cat FlumeData.1360172273684
> >   panic: ord() expected a character, but string of length 0 found
> >
> > Alan


