avro-user mailing list archives

From Harsh J <ha...@cloudera.com>
Subject Re: Getting started with Avro + Reading from an Avro formatted file
Date Tue, 24 Jan 2012 20:44:20 GMT
If you want to try out the Python API for Avro data files, I wrote a
short blog post on reading and writing them at
http://www.harshj.com/2010/04/25/writing-and-reading-avro-data-files-using-python/
which I think still holds good. Hope this helps.
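
For the filtering question asked further down this thread, here is a minimal sketch of reading records, keeping only those that match a condition, and writing selected fields as CSV. It assumes the 'avro' Python package is installed and that each record is returned as a dict. The helper names (read_avro_records, records_to_csv) and the field names "station" and "temp" are made up for illustration and are not part of Avro or of any real schema.

```python
import csv
import io


def read_avro_records(path):
    """Yield each record of an Avro data file as a dict.

    Assumes the 'avro' package is installed. The import is local so the
    rest of this sketch can be tried without it.
    """
    from avro.datafile import DataFileReader
    from avro.io import DatumReader

    reader = DataFileReader(open(path, "rb"), DatumReader())
    try:
        for record in reader:
            yield record
    finally:
        reader.close()


def records_to_csv(records, fields, out, keep=lambda record: True):
    """Write the chosen fields of every record accepted by keep() as CSV rows.

    Returns the number of rows written.
    """
    writer = csv.writer(out)
    written = 0
    for record in records:
        if keep(record):
            writer.writerow([record.get(name) for name in fields])
            written += 1
    return written


# Stand-in records so the filtering step can be tried without an Avro file.
# The field names are hypothetical.
sample = [
    {"station": "011990-99999", "temp": 0},
    {"station": "011990-99999", "temp": 22},
    {"station": "012650-99999", "temp": 111},
]
buf = io.StringIO()
count = records_to_csv(sample, ["station", "temp"], buf,
                       keep=lambda r: r["temp"] > 10)
print(buf.getvalue())
```

With a real file, the same function could be fed from the reader, for example records_to_csv(read_avro_records("sample_input.avro"), fields, out_file, keep=...), where the field list and the condition depend on the schema of that file.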

On Wed, Jan 25, 2012 at 1:50 AM, selvi k <gridsngators@gmail.com> wrote:
> I found out what the issue was:
> I first needed to install snappy downloaded from here:
> http://code.google.com/p/snappy/
>
> After a simple ./configure, make and make install, 'easy_install avro'
> completed successfully.
>
> I will try out both the CSV conversion options and update this thread in a
> bit.
>
> -Selvi
>
>
>
> On Tue, Jan 24, 2012 at 2:37 PM, selvi k <gridsngators@gmail.com> wrote:
>>
>> Douglas and Harsh - Thanks a lot for the immediate and detailed replies!
>> Looks like both of these would work well for me.
>>
>>
>> In order to start trying these, I have tried a few things to get started
>> with Avro, but this is where I am stuck:
>>
>>
>> 1. I first downloaded the stable version in the form of
>> "avro-1.6.1.tar.gz". (I am working out all this on a Ubuntu 10.04 machine).
>>
>> I don't find a readme file and am not familiar with installing a python
>> package, so I am not sure if what I am doing is correct. After some basic
>> googling, I did:
>>
>> avro-1.6.1$ ./setup.py build
>>
>> This appears to complete successfully. Then when I do this:
>>
>> ...avro-1.6.1$ sudo ./setup.py install
>>
>> I get an error message. (pasted at the end of this mail [1])
>>
>>
>> 2. I tried the technique suggested by Harsh, but it ends with a similar
>> error as pasted below in [2]
>>
>> /avro$ sudo easy_install avro
>>
>> Then I tried to install snappy by itself:
>>
>> /avro$ sudo easy_install python-snappy
>>
>> I get the same error.
>>
>> Also I read that this might help with this type of error, so I tried:
>>
>> avro$ sudo apt-get install python2.6-dev
>>
>> I ensured I have gcc and installed g++ too (because I wasn't sure what was
>> needed).
>>
>> I did see a similar error message reported here for Avro and OS X:
>> https://issues.apache.org/jira/browse/AVRO-981
>>
>> Before installing g++ and python-dev, the error message I was seeing from
>> easy_install python_snappy was different and shorter (attached below) [3].
>>
>>
>>
>>
>> Sorry if I should just be reading up on general Python development or
>> packages or installs (and/or other things), before I should even be
>> attempting to do this.  I'll be doing that now to move further.  But in case
>> anyone might have suggestions for the errors I am seeing, that would be
>> great.
>>
>>
>> I did find this Quick Start Guide from the main Avro wiki page, but when I
>> look through the Python example it is once again focused on client/server
>> and RPC communication between them:
>>
>> https://github.com/phunt/avro-rpc-quickstart
>>
>>
>> Also my understanding is that I must 'install' or deploy Avro before I can
>> try out the C bindings suggested by Douglas. I am stating this since I am
>> not exactly clear on what this meant: "especially since the C bindings
>> don't have any library dependencies to install". I am assuming it means I
>> don't need anything beyond a basic install of Avro.
>>
>>
>>
>> 3. With regard to the two suggested ways, would either of these
>> techniques allow me to filter my data records using some sort of a condition
>> on a field (or a few fields)? If not, it seems like I would have to resort to
>> first grepping the log file with the condition I want, and then using either
>> of these two techniques to convert to a CSV file. This would still be much
>> better than what I am doing now, which is through not-so-pretty awk
>> invocations to retrieve the fields I need (after the initial grep). But if
>> the existing API allows me to scan through the log file and specify
>> conditions for fields, it might be much more efficient. I can imagine that I
>> might have to use the low-level API and write a program to do this, but I am
>> not sure at this point how to get started on this.
>>
>>
>> Any pointers would be really helpful!
>>
>>
>> Thank you,
>>
>> Selvi
>>
>>
>>
>>
>>
>> [1]
>>
>>
>> /avro-1.6.1$ sudo ./setup.py install
>>
>> running install
>>
>> Checking .pth file support in /usr/local/lib/python2.6/dist-packages/
>>
>> /usr/bin/python -E -c pass
>>
>> TEST PASSED: /usr/local/lib/python2.6/dist-packages/ appears to support
>> .pth files
>>
>> running bdist_egg
>>
>> running egg_info
>>
>> writing requirements to avro.egg-info/requires.txt
>>
>> writing avro.egg-info/PKG-INFO
>>
>> writing top-level names to avro.egg-info/top_level.txt
>>
>> writing dependency_links to avro.egg-info/dependency_links.txt
>>
>> reading manifest file 'avro.egg-info/SOURCES.txt'
>>
>> writing manifest file 'avro.egg-info/SOURCES.txt'
>>
>> installing library code to build/bdist.linux-x86_64/egg
>>
>> running install_lib
>>
>> running build_py
>>
>> creating build/bdist.linux-x86_64
>>
>> creating build/bdist.linux-x86_64/egg
>>
>> creating build/bdist.linux-x86_64/egg/avro
>>
>> copying build/lib.linux-x86_64-2.6/avro/io.py ->
>> build/bdist.linux-x86_64/egg/avro
>>
>> copying build/lib.linux-x86_64-2.6/avro/datafile.py ->
>> build/bdist.linux-x86_64/egg/avro
>>
>> copying build/lib.linux-x86_64-2.6/avro/tool.py ->
>> build/bdist.linux-x86_64/egg/avro
>>
>> copying build/lib.linux-x86_64-2.6/avro/txipc.py ->
>> build/bdist.linux-x86_64/egg/avro
>>
>> copying build/lib.linux-x86_64-2.6/avro/ipc.py ->
>> build/bdist.linux-x86_64/egg/avro
>>
>> copying build/lib.linux-x86_64-2.6/avro/protocol.py ->
>> build/bdist.linux-x86_64/egg/avro
>>
>> copying build/lib.linux-x86_64-2.6/avro/__init__.py ->
>> build/bdist.linux-x86_64/egg/avro
>>
>> copying build/lib.linux-x86_64-2.6/avro/schema.py ->
>> build/bdist.linux-x86_64/egg/avro
>>
>> byte-compiling build/bdist.linux-x86_64/egg/avro/io.py to io.pyc
>>
>> byte-compiling build/bdist.linux-x86_64/egg/avro/datafile.py to
>> datafile.pyc
>>
>> byte-compiling build/bdist.linux-x86_64/egg/avro/tool.py to tool.pyc
>>
>> byte-compiling build/bdist.linux-x86_64/egg/avro/txipc.py to txipc.pyc
>>
>> byte-compiling build/bdist.linux-x86_64/egg/avro/ipc.py to ipc.pyc
>>
>> byte-compiling build/bdist.linux-x86_64/egg/avro/protocol.py to
>> protocol.pyc
>>
>> byte-compiling build/bdist.linux-x86_64/egg/avro/__init__.py to
>> __init__.pyc
>>
>> byte-compiling build/bdist.linux-x86_64/egg/avro/schema.py to schema.pyc
>>
>> creating build/bdist.linux-x86_64/egg/EGG-INFO
>>
>> installing scripts to build/bdist.linux-x86_64/egg/EGG-INFO/scripts
>>
>> running install_scripts
>>
>> running build_scripts
>>
>> creating build/bdist.linux-x86_64/egg/EGG-INFO/scripts
>>
>> copying build/scripts-2.6/avro ->
>> build/bdist.linux-x86_64/egg/EGG-INFO/scripts
>>
>> changing mode of build/bdist.linux-x86_64/egg/EGG-INFO/scripts/avro to 755
>>
>> copying avro.egg-info/PKG-INFO -> build/bdist.linux-x86_64/egg/EGG-INFO
>>
>> copying avro.egg-info/SOURCES.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
>>
>> copying avro.egg-info/dependency_links.txt ->
>> build/bdist.linux-x86_64/egg/EGG-INFO
>>
>> copying avro.egg-info/requires.txt ->
>> build/bdist.linux-x86_64/egg/EGG-INFO
>>
>> copying avro.egg-info/top_level.txt ->
>> build/bdist.linux-x86_64/egg/EGG-INFO
>>
>> zip_safe flag not set; analyzing archive contents...
>>
>>
>> creating dist
>>
>> creating 'dist/avro-1.6.1-py2.6.egg' and adding
>> 'build/bdist.linux-x86_64/egg' to it
>>
>> removing 'build/bdist.linux-x86_64/egg' (and everything under it)
>>
>> Processing avro-1.6.1-py2.6.egg
>>
>> Removing /usr/local/lib/python2.6/dist-packages/avro-1.6.1-py2.6.egg
>>
>> Copying avro-1.6.1-py2.6.egg to /usr/local/lib/python2.6/dist-packages
>>
>> avro 1.6.1 is already the active version in easy-install.pth
>>
>> Installing avro script to /usr/local/bin
>>
>>
>> Installed /usr/local/lib/python2.6/dist-packages/avro-1.6.1-py2.6.egg
>>
>> Processing dependencies for avro==1.6.1
>>
>> Searching for python-snappy
>>
>> Reading http://pypi.python.org/simple/python-snappy/
>>
>> Reading http://github.com/andrix/python-snappy
>>
>> Best match: python-snappy 0.3.2
>>
>> Downloading
>> http://pypi.python.org/packages/source/p/python-snappy/python-snappy-0.3.2.tar.gz#md5=94ec3eb54a780fac3b15a6c141af973f
>>
>> Processing python-snappy-0.3.2.tar.gz
>>
>> Running python-snappy-0.3.2/setup.py -q bdist_egg --dist-dir
>> /tmp/easy_install-1J0R1s/python-snappy-0.3.2/egg-dist-tmp-luBG6u
>>
>> cc1plus: warning: command line option "-Wstrict-prototypes" is valid for
>> Ada/C/ObjC but not for C++
>>
>> snappymodule.cc:31:22: error: snappy-c.h: No such file or directory
>>
>> snappymodule.cc: In function ‘PyObject* snappy__compress(PyObject*,
>> PyObject*)’:
>>
>> snappymodule.cc:62: error: ‘snappy_status’ was not declared in this scope
>>
>> snappymodule.cc:62: error: expected ‘;’ before ‘status’
>>
>> snappymodule.cc:75: error: ‘snappy_max_compressed_length’ was not declared
>> in this scope
>>
>> snappymodule.cc:79: error: ‘status’ was not declared in this scope
>>
>> snappymodule.cc:79: error: ‘snappy_compress’ was not declared in this
>> scope
>>
>> snappymodule.cc:81: error: ‘SNAPPY_OK’ was not declared in this scope
>>
>> snappymodule.cc: In function ‘PyObject* snappy__uncompress(PyObject*,
>> PyObject*)’:
>>
>> snappymodule.cc:107: error: ‘snappy_status’ was not declared in this scope
>>
>> snappymodule.cc:107: error: expected ‘;’ before ‘status’
>>
>> snappymodule.cc:120: error: ‘status’ was not declared in this scope
>>
>> snappymodule.cc:120: error: ‘snappy_uncompressed_length’ was not declared
>> in this scope
>>
>> snappymodule.cc:121: error: ‘SNAPPY_OK’ was not declared in this scope
>>
>> snappymodule.cc:128: error: ‘snappy_uncompress’ was not declared in this
>> scope
>>
>> snappymodule.cc:129: error: ‘SNAPPY_OK’ was not declared in this scope
>>
>> snappymodule.cc: In function ‘PyObject*
>> snappy__is_valid_compressed_buffer(PyObject*, PyObject*)’:
>>
>> snappymodule.cc:151: error: ‘snappy_status’ was not declared in this scope
>>
>> snappymodule.cc:151: error: expected ‘;’ before ‘status’
>>
>> snappymodule.cc:156: error: ‘status’ was not declared in this scope
>>
>> snappymodule.cc:156: error: ‘snappy_validate_compressed_buffer’ was not
>> declared in this scope
>>
>> snappymodule.cc:157: error: ‘SNAPPY_OK’ was not declared in this scope
>>
>> snappymodule.cc: At global scope:
>>
>> snappymodule.cc:41: warning: ‘_state’ defined but not used
>>
>> error: Setup script exited with error: command 'gcc' failed with exit
>> status 1
>>
>> ...avro/avro-1.6.1$ avro --help
>>
>>
>> ************************************************************************
>>
>>
>> [2] /avro$ sudo easy_install avro
>>
>> Searching for avro
>>
>> Best match: avro 1.6.1
>>
>> Processing avro-1.6.1-py2.6.egg
>>
>> avro 1.6.1 is already the active version in easy-install.pth
>>
>> Installing avro script to /usr/local/bin
>>
>>
>> Using /usr/local/lib/python2.6/dist-packages/avro-1.6.1-py2.6.egg
>>
>> Processing dependencies for avro
>>
>> Searching for python-snappy
>>
>> Reading http://pypi.python.org/simple/python-snappy/
>>
>> Reading http://github.com/andrix/python-snappy
>>
>> Best match: python-snappy 0.3.2
>>
>> Downloading
>> http://pypi.python.org/packages/source/p/python-snappy/python-snappy-0.3.2.tar.gz#md5=94ec3eb54a780fac3b15a6c141af973f
>>
>> Processing python-snappy-0.3.2.tar.gz
>>
>> Running python-snappy-0.3.2/setup.py -q bdist_egg --dist-dir
>> /tmp/easy_install-c6jLm0/python-snappy-0.3.2/egg-dist-tmp-TTWQBN
>>
>> cc1plus: warning: command line option "-Wstrict-prototypes" is valid for
>> Ada/C/ObjC but not for C++
>>
>> snappymodule.cc:31:22: error: snappy-c.h: No such file or directory
>>
>> snappymodule.cc: In function ‘PyObject* snappy__compress(PyObject*,
>> PyObject*)’:
>>
>> snappymodule.cc:62: error: ‘snappy_status’ was not declared in this scope
>>
>> snappymodule.cc:62: error: expected ‘;’ before ‘status’
>>
>> snappymodule.cc:75: error: ‘snappy_max_compressed_length’ was not declared
>> in this scope
>>
>> snappymodule.cc:79: error: ‘status’ was not declared in this scope
>>
>> snappymodule.cc:79: error: ‘snappy_compress’ was not declared in this
>> scope
>>
>> snappymodule.cc:81: error: ‘SNAPPY_OK’ was not declared in this scope
>>
>> snappymodule.cc: In function ‘PyObject* snappy__uncompress(PyObject*,
>> PyObject*)’:
>>
>> snappymodule.cc:107: error: ‘snappy_status’ was not declared in this scope
>>
>> snappymodule.cc:107: error: expected ‘;’ before ‘status’
>>
>> snappymodule.cc:120: error: ‘status’ was not declared in this scope
>>
>> snappymodule.cc:120: error: ‘snappy_uncompressed_length’ was not declared
>> in this scope
>>
>> snappymodule.cc:121: error: ‘SNAPPY_OK’ was not declared in this scope
>>
>> snappymodule.cc:128: error: ‘snappy_uncompress’ was not declared in this
>> scope
>>
>> snappymodule.cc:129: error: ‘SNAPPY_OK’ was not declared in this scope
>>
>> snappymodule.cc: In function ‘PyObject*
>> snappy__is_valid_compressed_buffer(PyObject*, PyObject*)’:
>>
>> snappymodule.cc:151: error: ‘snappy_status’ was not declared in this scope
>>
>> snappymodule.cc:151: error: expected ‘;’ before ‘status’
>>
>> snappymodule.cc:156: error: ‘status’ was not declared in this scope
>>
>> snappymodule.cc:156: error: ‘snappy_validate_compressed_buffer’ was not
>> declared in this scope
>>
>> snappymodule.cc:157: error: ‘SNAPPY_OK’ was not declared in this scope
>>
>> snappymodule.cc: At global scope:
>>
>> snappymodule.cc:41: warning: ‘_state’ defined but not used
>>
>> error: Setup script exited with error: command 'gcc' failed with exit
>> status 1
>>
>>
>> ************************************************************************
>>
>>
>> [3]
>>
>> python$ sudo easy_install python-snappy
>>
>> Searching for python-snappy
>>
>> Reading http://pypi.python.org/simple/python-snappy/
>>
>> Reading http://github.com/andrix/python-snappy
>>
>> Best match: python-snappy 0.3.2
>>
>> Downloading
>> http://pypi.python.org/packages/source/p/python-snappy/python-snappy-0.3.2.tar.gz#md5=94ec3eb54a780fac3b15a6c141af973f
>>
>> Processing python-snappy-0.3.2.tar.gz
>>
>> Running python-snappy-0.3.2/setup.py -q bdist_egg --dist-dir
>> /tmp/easy_install-Hpzssm/python-snappy-0.3.2/egg-dist-tmp-UStJPW
>>
>> gcc: error trying to exec 'cc1plus': execvp: No such file or directory
>>
>> error: Setup script exited with error: command 'gcc' failed with exit
>> status 1
>>
>>
>>
>>
>>
>> On Tue, Jan 24, 2012 at 11:01 AM, Harsh J <harsh@cloudera.com> wrote:
>>>
>>> Selvi,
>>>
>>> Expanding on Douglas' response: once you have installed Avro's Python
>>> libraries (the simplest way to get the latest stable release is
>>> "easy_install avro", or you can install from the distribution; post back
>>> if you need help with this), you can simply use the now-installed 'avro'
>>> executable:
>>>
>>> $ ls
>>> sample_input.avro
>>>
>>> $ avro cat sample_input.avro --format csv
>>> 011990-99999,0,-619524000000
>>> 011990-99999,22,-619506000000
>>> 011990-99999,-11,-619484400000
>>> 012650-99999,111,-655531200000
>>> 012650-99999,78,-655509600000
>>>
>>> Or, write to a resultant file, as you would regularly in a shell:
>>>
>>> $ avro cat sample_input.avro --format csv > sample_input.csv
>>>
>>> For more on avro's cat and write options:
>>>
>>> $ avro --help
>>>
>>> On Tue, Jan 24, 2012 at 9:01 PM, selvi k <gridsngators@gmail.com> wrote:
>>> > Hello All,
>>> >
>>> >
>>> > I would like some suggestions on where I can start in the Avro project.
>>> >
>>> >
>>> > I want to be able to read from an Avro formatted log file (specifically
>>> > the History Log file created at the end of a Hadoop job) and create a
>>> > Comma Separated file of certain log entries. I need a csv file because
>>> > this is the format that is accepted by post processing software I am
>>> > working with (eg: Matlab).
>>> >
>>> >
>>> > Initially I was using a BASH script to grep and awk from this file and
>>> > create my CSV file because I needed a very few values from it, and a
>>> > quick script just worked. I didn't try to get to know what format the
>>> > log file was in and utilize that. (my bad!) Now that I need to be
>>> > scaling up and want to have a reliable way to parse, I would like to
>>> > try and do it the right way.
>>> >
>>> >
>>> > My question is this: For the above goal, could you please guide me with
>>> > steps I can follow - such as reading material and libraries I could try
>>> > to use. As I go through the Quick Start Guide and FAQ, I see that a lot
>>> > of the information here is geared to someone who wants to use the data
>>> > serialization and RPC functionality provided by Avro. Given that I only
>>> > want to be able to "read", where may I start?
>>> >
>>> >
>>> > I can comfortably script with BASH and Perl. Given that I only see
>>> > support for Java, Python and Ruby, I think I can take this as an
>>> > opportunity to learn Python and get up to speed.
>>> >
>>> >
>>> > Thanks a lot.
>>> >
>>> >
>>> > -Selvi
>>> >
>>> >
>>>
>>>
>>>
>>> --
>>> Harsh J
>>> Customer Ops. Engineer, Cloudera
>>
>>
>



-- 
Harsh J
Customer Ops. Engineer, Cloudera
