hadoop-hdfs-user mailing list archives

From Stuart Smith <stu24m...@yahoo.com>
Subject Re: Best way to write files to hdfs (from a Python app)
Date Thu, 12 Aug 2010 18:46:56 GMT
Hello Bjoern,
  
  Loading binary data into HBase isn't terribly different than loading other data. In fact - just
a warning - what gives me the most trouble is strings, because I have to deal with Windows
clients, which don't use UTF-8 by default. So you have to be careful when loading and retrieving
strings on Windows clients. Caveats aside, here is some example code (not in Python, but I think
the relevant data types should map easily).

Open a connection:

//mailing lists say TBuffered is important! (vs just TSocket).
TBufferedTransport transport = new TBufferedTransport(new TSocket(host, port));
TProtocol protocol = new TBinaryProtocol(transport, true, true);
Hbase.Client client = new Hbase.Client(protocol);
transport.Open();

//Get your file buffer (buf), and ..
//encoder: assumed here to be a System.Text.Encoding such as Encoding.UTF8
Mutation mut = new Mutation();
mut.Column = encoder.GetBytes(/*some column*/);
mut.Value = buf;

List<Mutation> newRow = new List<Mutation>();
newRow.Add(mut);

client.mutateRow(encoder.GetBytes(/*tablename*/),encoder.GetBytes(/*row key*/), newRow);

transport.Close();

So it's not so bad once you've figured it out - the Thrift documentation is a bit sparse ;)
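
On the string caveat: a minimal Python 3 sketch of what I mean, assuming you encode and decode explicitly as UTF-8 at the boundary instead of relying on the client's default encoding. The helper names (to_hbase_bytes, from_hbase_bytes) are made up for illustration and are not part of any Thrift or HBase API.

```python
def to_hbase_bytes(text):
    """Encode a str to UTF-8 bytes explicitly (hypothetical helper)."""
    if isinstance(text, bytes):
        return text  # already raw bytes, e.g. a file buffer
    return text.encode("utf-8")


def from_hbase_bytes(raw):
    """Decode UTF-8 bytes read back from a cell into a str (hypothetical helper)."""
    return raw.decode("utf-8")


# Example: a row key with a non-ASCII character survives the round trip.
row_key = to_hbase_bytes("Björn/report.pdf")
round_trip = from_hbase_bytes(row_key)
```

The point is only that table names, row keys and column names go over the wire as bytes, so the encoding should be chosen in one place rather than left to each client's default.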

I've stored files up to 500 MB in HBase, but I wouldn't recommend it. HBase handles it fine,
but I just had an M/R task throw an OOME (OutOfMemoryError) when processing a large cell. My rule
of thumb has been to store anything up to 64 MB in HBase - basically up to the block size of the
HDFS file system. In other words, the *lower* limit for HDFS is my upper limit for HBase.

That said, the HBase FAQ says 10 MB, as you mentioned. But that's an average size. My average
size is closer to 300 KB, but there's a lot of variance around that number. For
your average I would definitely follow the HBase FAQ's advice of 10 MB. For the max -
64 MB?
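
The rule of thumb above, as a small Python sketch. The 64 MB cutoff is just my own limit from this mail, not an HBase or HDFS setting, and the function name and return values are made up for illustration.

```python
MAX_HBASE_CELL_BYTES = 64 * 1024 * 1024  # 64 MB, the rule-of-thumb cap from this mail


def choose_store(size_bytes):
    """Return "hbase" for files up to the cap, "hdfs" for anything larger."""
    if size_bytes < 0:
        raise ValueError("size_bytes must be non-negative")
    return "hbase" if size_bytes <= MAX_HBASE_CELL_BYTES else "hdfs"
```

So a 300 KB file would go to HBase, while a 500 MB file would be written to HDFS directly.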

Take care,
  -stu




--- On Thu, 8/12/10, Bjoern Schiessle <bjoern@schiessle.org> wrote:

> From: Bjoern Schiessle <bjoern@schiessle.org>
> Subject: Re: Best way to write files to hdfs (from a Python app)
> To: hdfs-user@hadoop.apache.org
> Date: Thursday, August 12, 2010, 8:04 AM
> On Tue, 10 Aug 2010 16:02:04 +0000 stu24mail@yahoo.com wrote:
> >  Thrift works with binary data - at least with hbase. I have a C# app
> > that people can use to put binary (and get) files in hbase via Thrift.
> > I'll send example code later. I also have java apps that upload files
> > to hdfs directly and are not on the server - but they do need access to
> > the copies of the config files. But they just use the standard java
> > hdfs api.
> 
> could be interesting. Can you send me some example code?
> 
> How large are the binary files you store on hbase? I have read that for
> large files (> 10MB) hdfs is the better place to store binary data.
> 
> best wishes,
> Björn
> 


      
