hadoop-common-user mailing list archives

From "Taeho Kang" <tka...@gmail.com>
Subject Re: Question for HBase users
Date Tue, 08 Jan 2008 01:38:19 GMT
Hi Lars,

The test results showing the effect of the optimization can be found at the
bottom of this link: http://issues.apache.org/jira/browse/HADOOP-1687
However, if you are using a recent version of Hadoop (0.15 and up), then
all of the namenode optimizations are already built in.

Using a ZIP archive is another approach. If storing data in HDFS is the only
thing you want, this approach should be good enough.
However, if you want some data processing done, keep in mind that a zip file
on HDFS cannot be processed directly with Map Reduce. It first has to be
downloaded to a local filesystem and then processed there locally (which is
slow and no good!)

If you want to parallel-process the data in a zip file on HDFS with Hadoop
Map Reduce...
(1) download zip to local FS
(2) unzip it
(3) upload the unzipped data to HDFS
(4) run map reduce.

That's quite a bit of work (and it's going to take a lot of time, too.)
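
To make the four steps concrete, here is a rough sketch in Python that only
builds the shell commands in order. It is an illustration, not a tested
recipe: the function name, all paths, the job jar and the job class are
made-up placeholders, and the arguments your job takes after the class name
depend entirely on your own job.

```python
import shlex


def zip_to_mapreduce_commands(hdfs_zip, local_dir, hdfs_unzipped_dir,
                              job_jar, job_class, job_output):
    """Return the four commands, in order, as argument lists.

    Every name here is a placeholder chosen for illustration.
    """
    local_zip = f"{local_dir}/archive.zip"
    local_unzipped = f"{local_dir}/unzipped"
    return [
        # (1) download zip to local FS
        ["hadoop", "fs", "-get", hdfs_zip, local_zip],
        # (2) unzip it
        ["unzip", "-q", local_zip, "-d", local_unzipped],
        # (3) upload the unzipped data to HDFS
        ["hadoop", "fs", "-put", local_unzipped, hdfs_unzipped_dir],
        # (4) run map reduce (input/output arguments are job-specific)
        ["hadoop", "jar", job_jar, job_class, hdfs_unzipped_dir, job_output],
    ]


if __name__ == "__main__":
    # Print the commands instead of running them, so they can be reviewed.
    for cmd in zip_to_mapreduce_commands(
            "/archives/batch-0001.zip", "/tmp/work",
            "/unzipped/batch-0001", "myjob.jar", "MyJob",
            "/results/batch-0001"):
        print(shlex.join(cmd))
```

Each list could be passed to subprocess.run(cmd, check=True) to actually
execute it. Steps (1) to (3) all go through a single machine, which is where
most of the time is lost.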

Let me know if I can be of more help.


On Jan 7, 2008 1:26 PM, Lars George <lars@worldlingo.com> wrote:

> Hi Taeho,
> > Fortunately for us, we don't have a need for storing millions of files in
> > HDFS just yet. We are adding only a few thousand files a day, so that gives
> > us a handful of days. And we've been using Hadoop more than a year, and its
> > reliability has been superb.
> >
> Sounds great.
> > This is just a rough estimation, but we see that 1GB of RAM is required in
> > namenode for every 1 million files. Newer versions of Hadoop have more
> > optimized namenode, hence it could host more files. But to be conservative,
> > we see 6-7 million files is the limit for a 8GB namenode machine.
> >
> Ah, that would explain why my first attempt failed, I have a namenode
> with 1GB of RAM running. That worked OK up to about 3m files, then it
> died - completely. I am using now a nightly build of Hadoop/Hbase, does
> that mean I am in better shape now? How much better does it perform?
> > I don't think adding the "consolidation" feature into Hadoop is a good
> > idea.
> > As I said, you may have to add an "layer" that does the consolidation work,
> > and use that layer only when necessary.
> >
> Yes of course, that is what I meant, we have to handle the creation of
> the slaps on our end. But that is where I think we have to reinvent the
> wheel so to speak.
> > As far as the performance is concerned, I don't think it's much of an issue.
> > The only cost I can think of is the time taken to make a query to a DB plus
> > some time to find the desired file from a given "slap."
> >
> OK, my concern is more the size of each slap. Doing some quick math
> (correct me if I am wrong), 80TB total storage divided by say a max of
> 1m slaps means 83MB per slap. That is quite a chunk to load. Unless I
> can do a positioned load of the chunk out of a slap. Does Hadoop have a
> seek load feature?
> > Also, you may also create a slap in a way no one file can overlap more than
> > one slap.
> >
> Yes, that makes sense. I could think of for example simply add files
> together, like an mbox. Or use a ZIP archive. First I would cache enough
> files in a scratch directory in Hadoop and then archive them as one
> slap. (Again that sounds similar to what Hbase is doing?)
> > Updates... woo.. here we go again. Hadoop is not designed to handle this
> > need. Basically, its HDFS is designed for large files that rarely change -
> >
> Yes, understood. I could think of replacing whole slaps, or delete slaps
> once all contained files are obsolete.
> > Let us know how your situation goes.
> >
> Will do.
> Lars

Taeho Kang [tkang.blogspot.com]
Software Engineer, NHN Corporation, Korea
