Mailing-List: contact hbase-user-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: hbase-user@hadoop.apache.org
MIME-Version: 1.0
Sender: saint.ack@gmail.com
In-Reply-To: <5D66A842901F8E41815AF6D27A28EC490A84DF4229@Mail-Ab02.rmg-ny.com>
References: <5D66A842901F8E41815AF6D27A28EC490A84DF41A0@Mail-Ab02.rmg-ny.com>
	 <31a243e70910210755h2f5bc6e6ib504c515b0006272@mail.gmail.com>
	 <5D66A842901F8E41815AF6D27A28EC490A84DF41BE@Mail-Ab02.rmg-ny.com>
	 <31a243e70910210804u59d7efb8p920490ab8f564986@mail.gmail.com>
	 <5D66A842901F8E41815AF6D27A28EC490A84DF4229@Mail-Ab02.rmg-ny.com>
Date: Wed, 21 Oct 2009 08:43:10 -0700
Message-ID: <7c962aed0910210843s201651c8i55ddd2f268f62cb8@mail.gmail.com>
Subject: Re: Table Upload Optimization
From: stack <stack@duboce.net>
To: hbase-user@hadoop.apache.org
Content-Type: multipart/alternative; boundary=00032557f4ba08b9a6047673d63a

--00032557f4ba08b9a6047673d63a
Content-Type: text/plain; charset=ISO-8859-1

On Wed, Oct 21, 2009 at 8:22 AM, Mark Vigeant
<mark.vigeant@riskmetrics.com> wrote:

> Ok, so first in response to St. Ack, nothing fishy appears to be happening
> in the logs: data is being written to all regionservers.
>
> And it's not hovering around 100% done; it has just sent about 118 map
> jobs, or "Task attempts"
>
>
I saw this in your first posting: 10/21/09 10:22:52 INFO mapred.JobClient:
map 100% reduce 0%.

Is your job writing to HBase in the map task or in the reducer?  Are you using
TableOutputFormat?


> I'm using Hadoop 0.20.1 and HBase 0.20.0
>
> Each node is a virtual machine with 2 CPU, 4 GB host memory and 100 GB
> storage.
>
>
Are you running the DataNode, TaskTracker, HBase, and ZooKeeper all on the
above?  Is one disk shared by all of them?


> I don't know what you meant by slots per TT...
>

Slots are the child tasks running at any one time on a TaskTracker.  You should
start with only one, since you have such an anemic platform.
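
To put rough numbers on why a single slot is a sensible starting point, here is
a back-of-the-envelope sketch. Every figure in it is an assumption rather than
a measurement from your cluster: I am assuming each daemon (DataNode,
TaskTracker, region server, ZooKeeper) keeps a stock 1000 MB max heap, that
each child task JVM gets 200 MB, and that a TaskTracker offers the same number
of map and reduce slots. The class and method names are made up for
illustration.

```java
// Rough worst-case heap budget for one node. All constants are assumed
// stock defaults, not values read from a real configuration.
class NodeMemoryBudget {
    static final int DAEMON_HEAP_MB = 1000; // assumed max heap per daemon
    static final int CHILD_HEAP_MB = 200;   // assumed max heap per child task JVM

    // Sum of the maximum heaps the JVMs on one node could claim at once.
    static int worstCaseHeapMb(int daemons, int mapSlots, int reduceSlots) {
        return daemons * DAEMON_HEAP_MB + (mapSlots + reduceSlots) * CHILD_HEAP_MB;
    }

    public static void main(String[] args) {
        int hostMb = 4 * 1024; // 4 GB per VM, as stated in the thread
        int daemons = 4;       // DN, TT, region server, ZK (quorum nodes only)
        for (int slots = 2; slots >= 1; slots--) {
            int needMb = worstCaseHeapMb(daemons, slots, slots);
            System.out.println(slots + " map + " + slots + " reduce slot(s): up to "
                    + needMb + " MB of heap on a " + hostMb + " MB host");
        }
    }
}
```

These are heap ceilings, not resident memory, so treat the result only as a
hint that the boxes have little headroom once the OS and JVM overhead are
counted.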


>
> And the heapsize is the default of 1000 MB. That is probably a huge
> problem, now that I think about it, heh.
>
> And there is absolutely no special configuration that I'm using. I have
> HBase running my zookeeper quorum on 2 machines, but that's about it.
>


Have you upped file descriptors and xceivers, all the stuff in 'Getting
Started'?
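
If you are not sure what file descriptor limit your daemons actually get, a
small check like the one below can tell you what a JVM started from the same
shell sees. It is only a sketch: the 32768 threshold is an assumed target (use
whatever 'Getting Started' says for your version), the class name is made up,
and it relies on the com.sun.management extension that Sun/OpenJDK JVMs ship
on Unix. It says nothing about xceivers, which is a separate DataNode-side
setting.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

// Reports the open-file limit visible to this JVM and flags it if it is
// below an assumed target. Hypothetical helper, not part of HBase or Hadoop.
class FdLimitCheck {
    static final long SUGGESTED_MIN_FDS = 32768; // assumed target, adjust as needed

    // Returns the max file descriptor count, or -1 if the JVM cannot report it.
    static long maxOpenFiles() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof com.sun.management.UnixOperatingSystemMXBean) {
            return ((com.sun.management.UnixOperatingSystemMXBean) os)
                    .getMaxFileDescriptorCount();
        }
        return -1;
    }

    // True only when the limit is known and below the assumed target.
    static boolean looksTooLow(long limit) {
        return limit >= 0 && limit < SUGGESTED_MIN_FDS;
    }

    public static void main(String[] args) {
        long limit = maxOpenFiles();
        if (limit < 0) {
            System.out.println("Could not read the file descriptor limit on this JVM");
        } else {
            System.out.println("Max open files: " + limit
                    + (looksTooLow(limit) ? " (probably too low)" : ""));
        }
    }
}
```

Run it as the same user, from the same shell, that starts the DataNode and
region server; a limit raised for one account does not apply to another.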

St>Ack


>
> -----Original Message-----
> From: jdcryans@gmail.com [mailto:jdcryans@gmail.com] On Behalf Of
> Jean-Daniel Cryans
> Sent: Wednesday, October 21, 2009 11:04 AM
> To: hbase-user@hadoop.apache.org
> Subject: Re: Table Upload Optimization
>
> Well the XMLStreamingInputFormat lets you map XML files which is neat
> but it has a problem and always needs to be patched. I wondered if
> that was missing but in your case it's not the problem.
>
> Did you check the logs of the master and region servers? Also I'd like to
> know:
>
> - Version of Hadoop and HBase
> - Nodes' hardware
> - How many map slots per TT
> - HBASE_HEAPSIZE from conf/hbase-env.sh
> - Special configuration you use
>
> Thx,
>
> J-D
>
> On Wed, Oct 21, 2009 at 7:57 AM, Mark Vigeant
> <mark.vigeant@riskmetrics.com> wrote:
> > No. Should I?
> >
> > -----Original Message-----
> > From: jdcryans@gmail.com [mailto:jdcryans@gmail.com] On Behalf Of
> > Jean-Daniel Cryans
> > Sent: Wednesday, October 21, 2009 10:55 AM
> > To: hbase-user@hadoop.apache.org
> > Subject: Re: Table Upload Optimization
> >
> > Are you using the Hadoop Streaming API?
> >
> > J-D
> >
> > On Wed, Oct 21, 2009 at 7:52 AM, Mark Vigeant
> > <mark.vigeant@riskmetrics.com> wrote:
> >> Hey
> >>
> >> So I want to upload a lot of XML data into an HTable. I have a class
> >> that successfully maps up to about 500 MB of data or so (on one
> >> regionserver) into a table, but if I go for much bigger than that it takes
> >> forever and eventually just stops. I tried uploading a big XML file into my
> >> 4 regionserver cluster (about 7 GB) and it's been a day and it's still going
> >> at it.
> >>
> >> What I get when I run the job on the 4 node cluster is:
> >> 10/21/09 10:22:35 INFO mapred.LocalJobRunner:
> >> 10/21/09 10:22:38 INFO mapred.LocalJobRunner:
> >> (then it does that for a while until...)
> >> 10/21/09 10:22:52 INFO mapred.TaskRunner: Task attempt_local_0001_m_000117_0 is done. And is in the process of committing
> >> 10/21/09 10:22:52 INFO mapred.LocalJobRunner:
> >> 10/21/09 10:22:52 mapred.TaskRunner: Task 'attempt_local_0001_m_000117_0' is done.
> >> 10/21/09 10:22:52 INFO mapred.JobClient:   map 100% reduce 0%
> >> 10/21/09 10:22:58 INFO mapred.LocalJobRunner:
> >> 10/21/09 10:22:59 INFO mapred.JobClient: map 99% reduce 0%
> >>
> >>
> >> I'm convinced I'm not configuring hbase or hadoop correctly. Any
> >> suggestions?
> >>
> >> Mark Vigeant
> >> RiskMetrics Group, Inc.
> >>
> >
>

--00032557f4ba08b9a6047673d63a--