Date: Thu, 21 Apr 2016 16:31:54 +0200 (CEST)
From: Ivan Cores gonzalez <ivan.cores@inria.fr>
To: user@hbase.apache.org
Subject: Re: Processing rows in parallel with MapReduce jobs.

Thanks Ted,

Finally I found the real mistake: the class had to be declared static.

Best,
Iván.

----- Original Message -----
> From: "Ted Yu"
> To: user@hbase.apache.org
> Sent: Tuesday, 19 April 2016 15:56:56
> Subject: Re: Processing rows in parallel with MapReduce jobs.
>
> From the error, you need to provide an argumentless ctor for
> MyTableInputFormat.
>
> On Tue, Apr 19, 2016 at 12:12 AM, Ivan Cores gonzalez
> wrote:
>
> >
> > Hi Ted,
> >
> > Sorry, I forgot to include the error. At runtime I get the following
> > exception:
> >
> > Exception in thread "main" java.lang.RuntimeException:
> > java.lang.NoSuchMethodException:
> > simplerowcounter.SimpleRowCounter$MyTableInputFormat.<init>()
> >
> > The program works fine if I don't use "MyTableInputFormat", changing the
> > call to initTableMapperJob to:
> >
> > TableMapReduceUtil.initTableMapperJob(tableName, scan,
> >     RowCounterMapper.class,
> >     ImmutableBytesWritable.class, Result.class, job);  // --> works
> >     fine without MyTableInputFormat
> >
> > That's why I asked if you see any problem in the code.
> > Because maybe I forgot to override some method, or something is
> > missing.
> >
> > Best,
> > Iván.
> >
> >
> > ----- Original Message -----
> > > From: "Ted Yu"
> > > To: user@hbase.apache.org
> > > Sent: Tuesday, 19 April 2016 0:22:05
> > > Subject: Re: Processing rows in parallel with MapReduce jobs.
> > >
> > > Did you see the " Message to log?" log ?
> > >
> > > Can you pastebin the error / exception you got ?
> > >
> > > On Mon, Apr 18, 2016 at 1:54 AM, Ivan Cores gonzalez <
> > ivan.cores@inria.fr>
> > > wrote:
> > >
> > > >
> > > > Hi Ted,
> > > > So, if I understand the behaviour of getSplits(), I can create
> > > > "virtual" splits by overriding the getSplits function.
> > > > I was performing some tests, but my code crashes at runtime and I
> > > > cannot find the problem.
> > > > Any help? I didn't find examples.
> > > >
> > > >
> > > > public class SimpleRowCounter extends Configured implements Tool {
> > > >
> > > >   static class RowCounterMapper extends
> > > >       TableMapper<ImmutableBytesWritable, Result> {
> > > >     public static enum Counters { ROWS }
> > > >     @Override
> > > >     public void map(ImmutableBytesWritable row, Result value,
> > > >         Context context) {
> > > >       context.getCounter(Counters.ROWS).increment(1);
> > > >       try {
> > > >         Thread.sleep(3000); // Simulates work
> > > >       } catch (InterruptedException name) { }
> > > >     }
> > > >   }
> > > >
> > > >   public class MyTableInputFormat extends TableInputFormat {
> > > >     @Override
> > > >     public List<InputSplit> getSplits(JobContext context) throws
> > > >         IOException {
> > > >       // Just to detect if this method is being called ...
> > > >       List<InputSplit> splits = super.getSplits(context);
> > > >       System.out.printf(" Message to log? \n");
> > > >       return splits;
> > > >     }
> > > >   }
> > > >
> > > >   @Override
> > > >   public int run(String[] args) throws Exception {
> > > >     if (args.length != 1) {
> > > >       System.err.println("Usage: SimpleRowCounter <tablename>");
> > > >       return -1;
> > > >     }
> > > >     String tableName = args[0];
> > > >
> > > >     Scan scan = new Scan();
> > > >     scan.setFilter(new FirstKeyOnlyFilter());
> > > >     scan.setCaching(500);
> > > >     scan.setCacheBlocks(false);
> > > >
> > > >     Job job = new Job(getConf(), getClass().getSimpleName());
> > > >     job.setJarByClass(getClass());
> > > >
> > > >     TableMapReduceUtil.initTableMapperJob(tableName, scan,
> > > >         RowCounterMapper.class,
> > > >         ImmutableBytesWritable.class, Result.class, job, true,
> > > >         MyTableInputFormat.class);
> > > >
> > > >     job.setNumReduceTasks(0);
> > > >     job.setOutputFormatClass(NullOutputFormat.class);
> > > >     return job.waitForCompletion(true) ? 0 : 1;
> > > >   }
> > > >
> > > >   public static void main(String[] args) throws Exception {
> > > >     int exitCode = ToolRunner.run(HBaseConfiguration.create(),
> > > >         new SimpleRowCounter(), args);
> > > >     System.exit(exitCode);
> > > >   }
> > > > }
> > > >
> > > > Thanks so much,
> > > > Iván.
> > > >
> > > >
> > > > ----- Original Message -----
> > > > > From: "Ted Yu"
> > > > > To: user@hbase.apache.org
> > > > > Sent: Tuesday, 12 April 2016 17:29:52
> > > > > Subject: Re: Processing rows in parallel with MapReduce jobs.
> > > > >
> > > > > Please take a look at TableInputFormatBase#getSplits() :
> > > > >
> > > > >  * Calculates the splits that will serve as input for the map tasks.
> > > > >  * The number of splits matches the number of regions in a table.
> > > > >
> > > > > Each mapper would be reading one of the regions.
> > > > >
> > > > > On Tue, Apr 12, 2016 at 8:18 AM, Ivan Cores gonzalez <
> > > > ivan.cores@inria.fr>
> > > > > wrote:
> > > > >
> > > > > > Hi Ted,
> > > > > > Yes, I mean same region.
> > > > > >
> > > > > > I wasn't using the getSplits() function. I'm trying to add it to
> > > > > > my code but I'm not sure how I have to do it. Is there any example
> > > > > > on the website? I cannot find anything. (By the way, I'm using
> > > > > > TableInputFormat, not InputFormat.)
> > > > > >
> > > > > > But just to confirm: with the getSplits() function, are mappers
> > > > > > processing rows in the same region executed in parallel?
> > > > > > (Assuming that there are empty processors/cores.)
> > > > > >
> > > > > > Thanks,
> > > > > > Ivan.
> > > > > >
> > > > > >
> > > > > > ----- Original Message -----
> > > > > > > From: "Ted Yu"
> > > > > > > To: user@hbase.apache.org
> > > > > > > Sent: Monday, 11 April 2016 15:10:29
> > > > > > > Subject: Re: Processing rows in parallel with MapReduce jobs.
> > > > > > >
> > > > > > > bq. if they are located in the same split?
> > > > > > >
> > > > > > > Probably you meant same region.
> > > > > > >
> > > > > > > Can you show the getSplits() for the InputFormat of your
> > > > > > > MapReduce job ?
> > > > > > >
> > > > > > > Thanks
> > > > > > >
> > > > > > > On Mon, Apr 11, 2016 at 5:48 AM, Ivan Cores gonzalez <
> > > > > > ivan.cores@inria.fr>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Hi all,
> > > > > > > >
> > > > > > > > I have a small question regarding the behaviour of MapReduce
> > > > > > > > jobs with HBase.
> > > > > > > >
> > > > > > > > I have an HBase test table with only 8 rows. I split the table
> > > > > > > > with the hbase shell split command into 2 splits, so now there
> > > > > > > > are 4 rows in every split.
> > > > > > > >
> > > > > > > > I created a MapReduce job that only prints the row key in the
> > > > > > > > log files. When I run the MapReduce job, every row is processed
> > > > > > > > by 1 mapper.
> > > > > > > > But the mappers in the same split are executed sequentially
> > > > > > > > (inside the same container). That means the first four rows
> > > > > > > > are processed sequentially by 4 mappers. The system has cores
> > > > > > > > that are free, so is it possible to process rows in parallel
> > > > > > > > if they are located in the same split?
> > > > > > > >
> > > > > > > > The only way I found to have 8 mappers executed in parallel is
> > > > > > > > to split the table into 8 splits (1 split per row). But
> > > > > > > > obviously this is not the best solution for big tables ...
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Ivan.
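[Editor's note on the resolution at the top of the thread.] Declaring the inner class static fixes the NoSuchMethodException because Hadoop instantiates InputFormat classes reflectively through a no-argument constructor, and a non-static inner class has no such constructor: its compiler-generated constructor takes the enclosing instance as a hidden parameter. A minimal, HBase-free sketch of that difference (class names here are illustrative, not from the original code):

```java
public class CtorReflectionDemo {

    // Non-static inner class: its implicit constructor takes the outer
    // instance as a hidden argument, so looking up a zero-argument
    // constructor reflectively throws NoSuchMethodException -- the same
    // failure Hadoop hits when instantiating the InputFormat.
    class Inner {}

    // Static nested class: a genuine no-argument constructor exists, so
    // reflective instantiation succeeds.
    static class Nested {}

    public static void main(String[] args) throws Exception {
        boolean innerHasNoArgCtor = true;
        try {
            Inner.class.getDeclaredConstructor();
        } catch (NoSuchMethodException e) {
            innerHasNoArgCtor = false;
        }
        Object nested = Nested.class.getDeclaredConstructor().newInstance();

        System.out.println("inner has no-arg ctor: " + innerHasNoArgCtor);
        // prints: inner has no-arg ctor: false
        System.out.println("nested instantiated: " + (nested != null));
        // prints: nested instantiated: true
    }
}
```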
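[Editor's note on the original question.] The getSplits() override in the thread only logs and returns super.getSplits(context) unchanged, so it still yields one mapper per region. To get intra-region parallelism, the override would have to cut each region's split into several sub-ranges. A hypothetical, HBase-free sketch of that idea, using long keys in place of byte[] row keys (in real code you would subdivide each TableSplit's start/end row keys the same way):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of subdividing one region's key range [start, end) into n
// sub-ranges, so that n mappers can scan the same region in parallel.
public class SplitSubdivider {

    static List<long[]> subdivide(long start, long end, int n) {
        List<long[]> subSplits = new ArrayList<>();
        long width = (end - start) / n;
        for (int i = 0; i < n; i++) {
            long s = start + (long) i * width;
            // The last sub-split absorbs any remainder so the whole
            // range stays covered.
            long e = (i == n - 1) ? end : s + width;
            subSplits.add(new long[] {s, e});
        }
        return subSplits;
    }

    public static void main(String[] args) {
        // One "region" covering keys [0, 8), cut into 4 sub-splits:
        for (long[] s : subdivide(0, 8, 4)) {
            System.out.println(s[0] + " - " + s[1]);
        }
        // prints:
        // 0 - 2
        // 2 - 4
        // 4 - 6
        // 6 - 8
    }
}
```

In the HBase version, each sub-range would become its own InputSplit returned from getSplits(), which is exactly what gives the job more mappers than regions without physically re-splitting the table.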