Mailing-List: contact user-help@hbase.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@hbase.apache.org
Date: Tue, 12 Apr 2016 17:18:47 +0200 (CEST)
From: Ivan Cores gonzalez <ivan.cores@inria.fr>
To: user@hbase.apache.org
Message-ID: <1813175950.23398815.1460474327980.JavaMail.zimbra@inria.fr>
In-Reply-To: 
 <CALte62zX=OdPiM6uP7F-u4RvwXfzW0G97js+h4mZe1kG5S+jJg@mail.gmail.com>
References: <1567071767.23042954.1460378683769.JavaMail.zimbra@inria.fr>
 <1139842460.23047353.1460378931207.JavaMail.zimbra@inria.fr>
 <CALte62zX=OdPiM6uP7F-u4RvwXfzW0G97js+h4mZe1kG5S+jJg@mail.gmail.com>
Subject: Re: Processing rows in parallel with MapReduce jobs.
MIME-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
Thread-Topic: Processing rows in parallel with MapReduce jobs.
Thread-Index: 7FeM7tMpGe4e/Noxd4CMJwZLYW5ckQ==

Hi Ted,
Yes, I meant the same region.

I wasn't using the getSplits() function. I'm trying to add it to my code,
but I'm not sure how to do it. Is there an example on the website?
I cannot find anything. (By the way, I'm using TableInputFormat, not InputFormat.)

But just to confirm: with the getSplits() function, are mappers that process
rows in the same region executed in parallel (assuming that there are idle
processors/cores)?
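
To make the question concrete, this is a rough sketch, in plain Java with no
HBase classes, of the kind of subdivision I imagine a custom getSplits() would
have to do: cut the row keys of one region into several [start, stop) ranges so
that each range could go to its own mapper. SubRangeSketch, KeyRange and
subdivide are names I made up for illustration, and I am assuming the row keys
are already known and sorted, which a real job would not get for free.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustration only: no HBase API is used here, all names are hypothetical.
class SubRangeSketch {

    /** Hypothetical holder for a [startRow, stopRow) key range. */
    static final class KeyRange {
        final String startRow; // inclusive
        final String stopRow;  // exclusive; "" means "until the end of the region"

        KeyRange(String startRow, String stopRow) {
            this.startRow = startRow;
            this.stopRow = stopRow;
        }

        @Override
        public String toString() {
            return "[" + startRow + ", " + (stopRow.isEmpty() ? "<end>" : stopRow) + ")";
        }
    }

    /**
     * Divides the sorted row keys of one region into at most `parts`
     * contiguous ranges of roughly equal size.
     */
    static List<KeyRange> subdivide(List<String> sortedKeys, int parts) {
        if (parts < 1) {
            throw new IllegalArgumentException("parts must be >= 1");
        }
        List<KeyRange> ranges = new ArrayList<>();
        int n = sortedKeys.size();
        if (n == 0) {
            return ranges;
        }
        int p = Math.min(parts, n); // never more ranges than rows
        for (int i = 0; i < p; i++) {
            int from = (int) ((long) i * n / p);
            int to = (int) ((long) (i + 1) * n / p);
            String start = sortedKeys.get(from);
            // The last range is left open-ended.
            String stop = (to < n) ? sortedKeys.get(to) : "";
            ranges.add(new KeyRange(start, stop));
        }
        return ranges;
    }

    public static void main(String[] args) {
        // The 4 rows of one region from my test table (key names assumed).
        List<String> keys = Arrays.asList("row1", "row2", "row3", "row4");
        for (KeyRange r : subdivide(keys, 2)) {
            System.out.println(r);
        }
    }
}
```

With 4 rows and 2 parts this gives the ranges [row1, row3) and [row3, end of
region). What I don't know is whether, and how, ranges like these should be
returned from getSplits() when using TableInputFormat.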

Thanks,
Ivan.


----- Original Message -----
> From: "Ted Yu" <yuzhihong@gmail.com>
> To: user@hbase.apache.org
> Sent: Monday, April 11, 2016 15:10:29
> Subject: Re: Processing rows in parallel with MapReduce jobs.
> 
> bq. if they are located in the same split?
> 
> Probably you meant same region.
> 
> Can you show the getSplits() for the InputFormat of your MapReduce job?
> 
> Thanks
> 
> On Mon, Apr 11, 2016 at 5:48 AM, Ivan Cores gonzalez <ivan.cores@inria.fr>
> wrote:
> 
> > Hi all,
> >
> > I have a small question regarding the MapReduce jobs behaviour with HBase.
> >
> > I have an HBase test table with only 8 rows. I split the table into
> > 2 splits with the hbase shell split command, so now there are 4 rows
> > in every split.
> >
> > I created a MapReduce job that only prints the row key in the log files.
> > When I run the MapReduce job, every row is processed by 1 mapper. But the
> > mappers in the same split are executed sequentially (inside the same
> > container). That means the first four rows are processed sequentially by
> > 4 mappers. The system has free cores, so is it possible to process rows
> > in parallel if they are located in the same split?
> >
> > The only way I found to have 8 mappers executed in parallel is to split
> > the table into 8 splits (1 split per row). But obviously this is not the
> > best solution for big tables ...
> >
> > Thanks,
> > Ivan.
> >
>