hadoop-general mailing list archives

From Paul Tarjan <p...@paulisageek.com>
Subject Python CSV record reader
Date Wed, 04 Nov 2009 20:07:53 GMT
I wrote a CSV Jute record parser in Python and thought some people on
the list might be interested.

http://github.com/ptarjan/hadoop_record

You can use it in your streaming jobs with

-inputformat SequenceFileAsTextInputFormat -file hadoop_record.py
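
For anyone wiring this into a mapper, here is a rough sketch of the
streaming side. It assumes each input line arrives as key, tab, value
with the value being the CSV-serialized record. The name map_records and
the parse argument are made up for illustration; in a real job you would
pass the csv function from hadoop_record as parse.

```python
def map_records(lines, parse):
    """Yield (key, parsed_record) for each 'key<TAB>value' line.

    `parse` is whatever turns the serialized value into a Python object.
    In a real job that would be hadoop_record.csv (an assumption of this
    sketch; it is injected so the sketch runs without the module).
    """
    for line in lines:
        key, sep, value = line.rstrip("\n").partition("\t")
        if not sep:
            # No tab on the line, so there is no value to parse; skip it.
            continue
        yield key, parse(value)


# Typical use inside a mapper script (shown as comments only):
#   import sys
#   from hadoop_record import csv
#   for key, record in map_records(sys.stdin, csv):
#       sys.stdout.write("%s\t%s\n" % (key, record))
```

Passing the parser in keeps the line handling separate from the record
decoding, which also makes it easy to test with a stub.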

Here are some of its features:

>>> from hadoop_record import csv
>>> csv("T")
True
>>> csv(";-1234")
-1234
>>> csv("1.0E-10")
1e-10
>>> csv("s{T,F}")
[True, False]
>>> csv("v{T,F}")
[True, False]
>>> csv("v{s{T,F}}")
[[True, False]]
>>> csv("m{'don't,#73746f70}")
{LazyString("don't"): LazyString('stop')}
>>> csv("'\xe2\x98\x83")
LazyString('\xe2\x98\x83')
>>> str(csv("'\xe2\x98\x83"))
'\xe2\x98\x83'
>>> unicode(csv("'\xe2\x98\x83"))
u'\u2603'
>>> csv("'%00%0a%25%2c")
LazyString('\x00\n%,')
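
To make the string and buffer examples above concrete, here is a small
sketch of the two decodings involved: strings start with a single quote
and use %XX escapes, buffers start with # and are hex-encoded. This is
not the code from hadoop_record, just an illustration written for
Python 3, and the function names are made up.

```python
import re

# Matches one %XX escape, where XX is two hex digits.
_ESCAPE = re.compile(r"%([0-9A-Fa-f]{2})")


def decode_csv_string(field):
    """Decode a string field such as "'%00%0a%25%2c" (hypothetical helper)."""
    if not field.startswith("'"):
        raise ValueError("string fields start with a single quote")
    # A single regex pass, so the '%' produced by %25 is never re-decoded.
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), field[1:])


def decode_csv_buffer(field):
    """Decode a buffer field such as "#73746f70" (hypothetical helper)."""
    if not field.startswith("#"):
        raise ValueError("buffer fields start with '#'")
    return bytes.fromhex(field[1:])
```

With these, decode_csv_string("'%00%0a%25%2c") gives '\x00\n%,' and
decode_csv_buffer("#73746f70") gives b'stop', matching the values in the
session above.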

The LazyString was needed because I was spending most of my CPU time
decoding data from the Jute record that I didn't care about. It
shouldn't get in your way much, as long as you cast it to a str
before using it.
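
The idea behind a lazy string can be sketched roughly like this: keep
the raw field and only run the decoder the first time someone asks for
the text. This is an illustration in Python 3, not the actual LazyString
class; the name LazyText and its decoder argument are assumptions.

```python
class LazyText(object):
    """Hypothetical stand-in for LazyString: decode only on first use."""

    def __init__(self, raw, decoder):
        self._raw = raw
        self._decoder = decoder
        self._decoded = False
        self._value = None

    def __str__(self):
        # Decode once, on demand, then reuse the cached result.
        if not self._decoded:
            self._value = self._decoder(self._raw)
            self._decoded = True
        return self._value

    def __repr__(self):
        return "LazyText(%r)" % str(self)
```

Fields that are never converted never pay the decoding cost. In this
sketch a LazyText does not compare equal to a plain string, which is one
reason to convert with str() before using the value.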

Let me know what you think. For bugs, fork, fix, and then send me a
pull request (or use the issue tracker).

Paul
