mahout-user mailing list archives

From Sean Owen <sro...@gmail.com>
Subject Re: MatrixMultiplicationJob runs with 1 mapper only ?
Date Wed, 23 Jan 2013 13:12:27 GMT
Mappers are usually extremely fast since they start themselves on top of
the data and their job is usually just parsing and emitting key value
pairs. Hadoop's choices are usually fine.

If not it is usually because the mapper is emitting far more data than it
ingests. Are you computing some kind of Cartesian product of input?

That's slow no matter what. More mappers may increase parallelism but it's
still a lot of I/O. Avoid it if you can by sampling or pruning unimportant
values. Otherwise, try to implement a Combiner.
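
To make the Combiner suggestion concrete, here is a minimal plain-Java sketch of combiner-style local aggregation. It does not use Hadoop or Mahout classes, and the thread does not show how MatrixMultiplicationJob emits its records, so the outer-product emission below is an assumption made purely to have something to combine. The class and method names (`LocalCombineSketch`, `naiveEmitCount`, `combinedEmit`) are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy illustration of combiner-style local aggregation; not Mahout code. */
final class LocalCombineSketch {

    /** Records a naive mapper would emit: one per product row[j] * row[k]. */
    static long naiveEmitCount(double[][] a) {
        long count = 0;
        for (double[] row : a) {
            count += (long) row.length * row.length;
        }
        return count;
    }

    /**
     * Sums products per output cell before "emitting", the way a combiner
     * (or in-mapper combining) would. The key for cell (j, k) is j * cols + k.
     * Assumes every row has at least `cols` entries.
     */
    static Map<Long, Double> combinedEmit(double[][] a, int cols) {
        Map<Long, Double> partialSums = new HashMap<>();
        for (double[] row : a) {
            for (int j = 0; j < cols; j++) {
                for (int k = 0; k < cols; k++) {
                    long key = (long) j * cols + k;
                    partialSums.merge(key, row[j] * row[k], Double::sum);
                }
            }
        }
        return partialSums;
    }

    public static void main(String[] args) {
        double[][] a = {
            {1.0, 2.0, 3.0},
            {4.0, 5.0, 6.0}
        };
        Map<Long, Double> combined = combinedEmit(a, 3);
        System.out.println("naive records:    " + naiveEmitCount(a));
        System.out.println("combined records: " + combined.size());
        System.out.println("cell (0,1):       " + combined.get(1L));
    }
}
```

In this toy case the naive count grows with the number of input rows, while the combined count is bounded by the number of distinct output cells. The combined output is still quadratic in the column count, which is why sampling or pruning comes first in the advice above.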
On Jan 23, 2013 12:04 PM, "Jonas Grote" <jfgrote@gmail.com> wrote:

> I'd play with the mapred.map.tasks option. Setting it to something bigger
> than 1 gave me performance improvements for various hadoop jobs on my
> cluster.
>
>
> 2013/1/16 Ashish <paliwalashish@gmail.com>
>
> > I am afraid I don't know the answer. I need to experiment a bit more. I have
> > not used CompositeInputFormat, so I cannot comment.
> >
> > Probably someone else on the mailing list (ML) will be able to guide here.
> >
> >
> > On Wed, Jan 16, 2013 at 6:01 PM, Stuti Awasthi <stutiawasthi@hcl.com>
> > wrote:
> >
> > > Thanks Ashish,
> > >
> > > So according to the link, if one is using CompositeInputFormat then it will
> > > take the entire file as input to a mapper without considering
> > > InputSplits/block size.
> > > If I understand it correctly, it is asking to break [Original
> > > Input File] -> [file1, file2, ....].
> > >
> > > So my file would become [/test/MatrixA] --> [/test/smallfiles/file1,
> > > /test/smallfiles/file2, /test/smallfiles/file3 ...............]
> > >
> > > Will the input path in MatrixMultiplicationJob then be the directory path
> > > /test/smallfiles?
> > >
> > > Will breaking the file in such a manner cause problems in the algorithmic
> > > execution of the MR job? I'm not sure if the output will be correct.
> > >
> > > -----Original Message-----
> > > From: Ashish [mailto:paliwalashish@gmail.com]
> > > Sent: Wednesday, January 16, 2013 5:44 PM
> > > To: user@mahout.apache.org
> > > Subject: Re: MatrixMultiplicationJob runs with 1 mapper only ?
> > >
> > > MatrixMultiplicationJob internally sets the InputFormat to
> > > CompositeInputFormat:
> > >
> > > JobConf conf = new JobConf(initialConf, MatrixMultiplicationJob.class);
> > > conf.setInputFormat(CompositeInputFormat.class);
> > >
> > > and AFAIK, CompositeInputFormat ignores the splits. See this:
> > >
> > > http://stackoverflow.com/questions/8654200/hadoop-file-splits-compositeinputformat-inner-join
> > >
> > > Unfortunately, I don't know any other alternative as of now.
> > >
> > >
> > > On Wed, Jan 16, 2013 at 5:05 PM, Stuti Awasthi <stutiawasthi@hcl.com>
> > > wrote:
> > >
> > > > The issue is that currently my matrix is of dimension (100 x 100k);
> > > > later it can be (1M x 10M) or bigger.
> > > >
> > > > Even now my job runs with a single mapper for (100 x 100k)
> > > > and is not able to complete. As I mentioned, the map task just
> > > > proceeds to 0.99% and starts spilling the map output. Hence I want
> > > > to tune my job so that Mahout is able to complete it and I can
> > > > utilize my cluster resources.
> > > >
> > > > As MatrixMultiplicationJob is an MR job, it should be able to handle
> > > > parallel map tasks. I am not sure if there is any algorithmic
> > > > constraint due to which it runs with only a single mapper?
> > > > I followed the thread below so that I could set the Configuration
> > > > myself rather than getting it with getConf(), but did not have any
> > > > success:
> > > >
> > > > http://lucene.472066.n3.nabble.com/Setting-Number-of-Mappers-and-Reducers-in-DistributedRowMatrix-Jobs-td888980.html
> > > >
> > > > Stuti
> > > >
> > > > -----Original Message-----
> > > > From: Sean Owen [mailto:srowen@gmail.com]
> > > > Sent: Wednesday, January 16, 2013 4:46 PM
> > > > To: Mahout User List
> > > > Subject: RE: MatrixMultiplicationJob runs with 1 mapper only ?
> > > >
> > > > Why do you need multiple mappers? Is one too slow? Many are not
> > > > necessarily faster for small input.
> > > >
> > > > On Jan 16, 2013 10:46 AM, "Stuti Awasthi" <stutiawasthi@hcl.com> wrote:
> > > >
> > > > > Hi,
> > > > > I also tried calling it programmatically but am facing the same issue:
> > > > > only a single map task runs, and it keeps spilling the map output
> > > > > continuously.
> > > > > Hence I'm not able to generate the output for a large matrix
> > > > > multiplication.
> > > > >
> > > > > Code snippet:
> > > > >
> > > > > DistributedRowMatrix a = new DistributedRowMatrix(
> > > > >     new Path("/test/points/matrixA"), new Path("/test/temp"), 100, 100000);
> > > > > DistributedRowMatrix b = new DistributedRowMatrix(
> > > > >     new Path("/test/points/matrixA"), new Path("tempDir"), 100, 100000);
> > > > > Configuration conf = new Configuration();
> > > > > conf.set("fs.default.name", "hdfs://DS-1078D24B4736:10818");
> > > > > conf.set("mapred.child.java.opts", "-Xmx2048m");
> > > > > conf.set("mapred.max.split.size", "10485760");
> > > > > a.setConf(conf);
> > > > > b.setConf(conf);
> > > > > a.times(b);
> > > > >
> > > > > Where am I going wrong? Any idea?
> > > > >
> > > > > Thanks
> > > > > Stuti
> > > > > -----Original Message-----
> > > > > From: Stuti Awasthi
> > > > > Sent: Wednesday, January 16, 2013 2:55 PM
> > > > > To: Mahout User List
> > > > > Subject: RE: MatrixMultiplicationJob runs with 1 mapper only ?
> > > > >
> > > > > Hey Sean,
> > > > > Thanks for the response. The MatrixMultiplicationJob help shows the
> > > > > usage as:
> > > > > usage: <command> [Generic Options] [Job-Specific Options]
> > > > >
> > > > > Here a Generic Option can be provided with -D <property=value>. Hence I
> > > > > tried the command-line -D options, but they do not seem to have
> > > > > any effect. It is also suggested in:
> > > > >
> > > > >
> > > > > https://builds.apache.org/job/Mahout-Quality/javadoc/org/apache/mahout/common/AbstractJob.html
> > > > >
> > > > > After your suggestion I noticed one thing: I am currently passing
> > > > > arguments as -D<property=value> rather than -D <property=value>.
> > > > > I also tried a space between -D and property=value, but then it
> > > > > gives an error like:
> > > > > 13/01/16 14:21:47 ERROR common.AbstractJob: Unexpected
> > > > > /test/points/matrixA while processing Job-Specific Options:
> > > > >
> > > > > No such error occurs if I pass the arguments without a space
> > > > > between -D and the property.
> > > > >
> > > > > From Hadoop: The Definitive Guide: "Do not confuse setting
> > > > > Hadoop properties using the -D property=value option to
> > > > > GenericOptionsParser (and ToolRunner) with setting JVM system
> > > > > properties using the -Dproperty=value option to the java command.
> > > > > The syntax for JVM system properties does not allow any whitespace
> > > > > between the D and the property name, whereas GenericOptionsParser
> > > > > requires them to be separated by whitespace."
> > > > >
> > > > > Hence I suppose that generic options should be passed as
> > > > > -D property=value rather than -Dproperty=value.
> > > > >
> > > > > Additionally, I tried -Dmapred.max.split.size=10485760 through the
> > > > > command line, but again only a single map task started.
> > > > >
> > > > > Please Suggest
> > > > >
> > > > >
> > > > > -----Original Message-----
> > > > > From: Sean Owen [mailto:srowen@gmail.com]
> > > > > Sent: Wednesday, January 16, 2013 1:23 PM
> > > > > To: Mahout User List
> > > > > Subject: Re: MatrixMultiplicationJob runs with 1 mapper only ?
> > > > >
> > > > > It's up to Hadoop in the end.
> > > > >
> > > > > Try calling FileInputFormat.setMaxInputSplitSize() with a smallish
> > > > > value, like your 10MB (10000000).
> > > > >
> > > > > I don't know if Hadoop params can be set as sys properties like that
> > > > > anyway?
> > > > >
> > > > > On Wed, Jan 16, 2013 at 7:48 AM, Stuti Awasthi
> > > > > <stutiawasthi@hcl.com>
> > > > > wrote:
> > > > > > Hi,
> > > > > >
> > > > > > I am trying to multiply a dense matrix of size [100 x 100k]. The
> > > > > > size of the file is 104MB, and with the default block size of 64MB
> > > > > > only 2 blocks are created.
> > > > > > So I reduced the block size to 10MB, and now my file is divided into
> > > > > > 11 blocks across the cluster. The cluster has 10 nodes, with 1 NN/JT
> > > > > > and 9 DN/TT.
> > > > > >
> > > > > > Every time I run the Mahout MatrixMultiplicationJob through the
> > > > > > command line, I can see on the JobTracker web UI that only 1 map
> > > > > > task is launched. According to my understanding of InputSplit,
> > > > > > there should be 11 map tasks launched.
> > > > > > Apart from this, the map task stays at 0.99% completion, and in the
> > > > > > task logs I can see that the map task is spilling the map output.
> > > > > >
> > > > > > Mahout Command:
> > > > > >
> > > > > > mahout matrixmult -Dmapred.child.java.opts=-Xmx1024M
> > > > > > -Dfs.inmemory.size.mb=200 -Dio.sort.factor=100 -Dio.sort.mb=200
> > > > > > -Dio.file.buffer.size=131072 --inputPathA /test/matrixA
> > > > > > --numRowsA 100 --numColsA 100000 --inputPathB /test/matrixA
> > > > > > --numRowsB 100 --numColsB 100000 --tempDir /test/temp
> > > > > >
> > > > > > I want to know why only 1 map task is launched every time, and how
> > > > > > I can performance-tune the cluster so that I can perform dense
> > > > > > matrix multiplication of the order [90K x 1 Million].
> > > > > >
> > > > > > Thanks
> > > > > > Stuti
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > thanks
> > > ashish
> > >
> > > Blog: http://www.ashishpaliwal.com/blog
> > > My Photo Galleries: http://www.pbase.com/ashishpaliwal
> > >
> >
> >
> >
> >
>
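
The block counts quoted in the original question (2 blocks at 64MB, 11 at 10MB for a 104MB file) follow from simple ceiling division. The sketch below only reproduces that arithmetic; it is not Hadoop's split computation, and it assumes the file is exactly 104MB. The class and method names are hypothetical. As noted in the thread, CompositeInputFormat reportedly ignores splits, so the block count need not match the number of map tasks.

```java
/** Back-of-the-envelope block arithmetic; not Hadoop's actual split logic. */
final class BlockCountSketch {

    /** Number of blocks needed to hold a file: ceil(fileBytes / blockBytes). */
    static long blockCount(long fileBytes, long blockBytes) {
        if (fileBytes < 0 || blockBytes <= 0) {
            throw new IllegalArgumentException("file size must be >= 0 and block size > 0");
        }
        return (fileBytes + blockBytes - 1) / blockBytes;
    }

    public static void main(String[] args) {
        final long mb = 1024L * 1024L;
        long fileBytes = 104 * mb; // assumed exact; the thread only says "104MB"
        System.out.println("64MB blocks: " + blockCount(fileBytes, 64 * mb));
        System.out.println("10MB blocks: " + blockCount(fileBytes, 10 * mb));
    }
}
```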
