mahout-user mailing list archives

From Stuti Awasthi <stutiawas...@hcl.com>
Subject RE: MatrixMultiplicationJob runs with 1 mapper only ?
Date Mon, 28 Jan 2013 13:28:42 GMT
Hi,
I would like to consolidate all the steps I have performed so far.

Issue: The MatrixMultiplication example is executed with only 1 map task.

Steps :
1. I created a 104MB file, which is stored as 11 blocks with a 10MB block size. The
file contains a 200x100000 matrix.
2. I exported $MAHOUT_OPTS as follows:
          $   echo $MAHOUT_OPTS
          -Dmapred.min.split.size=10485760 -Dmapred.map.tasks=7
3. I tried to run the matrix multiplication example from the command line:
mahout matrixmult --inputPathA /test/points/matrixA --numRowsA 200 --numColsA 100000 \
    --inputPathB /test/points/matrixA --numRowsB 200 --numColsB 100000 --tempDir /test/temp

When I check the JobTracker UI, it shows the following for the running job:
Running Map Tasks : 1
Occupied Map Slots: 1

How can I distribute the map work across multiple mappers for MatrixMultiplicationJob dynamically?

Is it even possible for MatrixMultiplicationJob to run distributed across multiple mappers,
given that it internally uses CompositeInputFormat?
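
As a rough illustration of why the split settings in step 2 may have no effect, the sketch below reproduces the split-size arithmetic of the old-API org.apache.hadoop.mapred.FileInputFormat: split size = max(minSize, min(goalSize, blockSize)), with a 10% slack on the last split. It is a standalone approximation, not Hadoop code. The class and method names are hypothetical, the file is assumed to be exactly 104 * 1024 * 1024 bytes and stored as a single file, and the idea that CompositeInputFormat raises mapred.min.split.size to Long.MAX_VALUE before asking its child formats for splits is my reading of the Hadoop 1.x source, so it should be checked against the version in use.

```java
// Hypothetical stand-in for the old-API FileInputFormat split arithmetic.
class SplitSizeSketch {
    // The last split may be up to 10% larger than the split size.
    static final double SPLIT_SLOP = 1.1;

    static long computeSplitSize(long goalSize, long minSize, long blockSize) {
        return Math.max(minSize, Math.min(goalSize, blockSize));
    }

    // Counts the splits produced for a single splittable file.
    static int countSplits(long fileSize, long blockSize, long minSize, int requestedMaps) {
        long goalSize = fileSize / Math.max(1, requestedMaps);
        long splitSize = computeSplitSize(goalSize, minSize, blockSize);
        long remaining = fileSize;
        int splits = 0;
        while ((double) remaining / splitSize > SPLIT_SLOP) {
            remaining -= splitSize;
            splits++;
        }
        if (remaining != 0) {
            splits++; // the tail becomes one final split
        }
        return splits;
    }

    public static void main(String[] args) {
        long fileSize = 104L * 1024 * 1024;  // assumed exact size of the 104MB file
        long blockSize = 10L * 1024 * 1024;  // 10MB blocks, as in step 1
        // Settings from MAHOUT_OPTS: min split 10485760, 7 requested map tasks.
        System.out.println("plain FileInputFormat: "
                + countSplits(fileSize, blockSize, 10485760L, 7));
        // Assumed CompositeInputFormat behaviour: min split forced to Long.MAX_VALUE.
        System.out.println("min split = Long.MAX_VALUE: "
                + countSplits(fileSize, blockSize, Long.MAX_VALUE, 7));
    }
}
```

If that reading is right, the number of map tasks follows the number of part files in the input directory rather than the block count, so mapred.min.split.size and mapred.map.tasks would not change it.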

Please Suggest

Thanks
Stuti


-----Original Message-----
From: Sean Owen [mailto:srowen@gmail.com] 
Sent: Wednesday, January 23, 2013 6:42 PM
To: Mahout User List
Subject: Re: MatrixMultiplicationJob runs with 1 mapper only ?

Mappers are usually extremely fast since they start themselves on top of the data and their
job is usually just parsing and emitting key value pairs. Hadoop's choices are usually fine.

If not it is usually because the mapper is emitting far more data than it ingests. Are you
computing some kind of Cartesian product of input?

That's slow no matter what. More mappers may increase parallelism but it's still a lot of I/O.
Avoid it if you can by sampling or pruning unimportant values. Otherwise, try to implement
a Combiner.
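
To make the data-volume point concrete: as far as I can tell from its Javadoc, Mahout's DistributedRowMatrix.times(other) computes this.transpose().times(other), pairing row r of A with row r of B and emitting one scaled copy of B's row per non-zero entry of A's row, i.e. a sum of outer products. The sketch below is a plain-Java, in-memory imitation of that scheme, not the Mahout mapper itself. The class and method names are hypothetical, and the value count assumes fully dense rows with no combining.

```java
// Hypothetical in-memory imitation of a row-pair outer-product multiply.
class OuterProductSketch {
    // Computes A^T * B as a sum of per-row outer products.
    static double[][] transposeTimes(double[][] a, double[][] b) {
        int rows = a.length;
        int colsA = a[0].length;
        int colsB = b[0].length;
        double[][] out = new double[colsA][colsB];
        for (int r = 0; r < rows; r++) {
            for (int i = 0; i < colsA; i++) {
                double scale = a[r][i];
                if (scale == 0.0) {
                    continue; // zero entries contribute nothing
                }
                for (int j = 0; j < colsB; j++) {
                    out[i][j] += scale * b[r][j];
                }
            }
        }
        return out;
    }

    // Values a map phase would emit for dense inputs before any combining.
    static long emittedValues(long rows, long colsA, long colsB) {
        return rows * colsA * colsB;
    }

    public static void main(String[] args) {
        double[][] a = {{1, 2}, {3, 4}};
        double[][] b = {{5, 6}, {7, 8}};
        System.out.println(java.util.Arrays.deepToString(transposeTimes(a, b)));

        // Dimensions taken from the 200x100000 matrix described in this thread.
        long values = emittedValues(200, 100_000, 100_000);
        System.out.println(values + " values emitted, roughly "
                + (values * 8 / 1_000_000_000_000L) + " TB as raw doubles");
    }
}
```

Under these assumptions a dense 200x100000 input multiplied against itself would emit on the order of 2 * 10^12 values from the map phase, which would be consistent with a map task that appears stuck while spilling its output.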
On Jan 23, 2013 12:04 PM, "Jonas Grote" <jfgrote@gmail.com> wrote:

> I'd play with the mapred.map.tasks option. Setting it to something 
> bigger than 1 gave me performance improvements for various hadoop jobs 
> on my cluster.
>
>
> 2013/1/16 Ashish <paliwalashish@gmail.com>
>
> > I am afraid I don't know the answer. Need to experiment a bit more.
> > I have not used CompositeInputFormat so cannot comment.
> >
> > Probably, someone else on the ML (Mailing List) would be able to
> > guide here.
> >
> >
> > On Wed, Jan 16, 2013 at 6:01 PM, Stuti Awasthi <stutiawasthi@hcl.com> wrote:
> >
> > > Thanks Ashish,
> > >
> > > So according to the link, if one uses CompositeInputFormat then it
> > > will take the entire file as input to a mapper without considering
> > > InputSplits/block size.
> > > If I understand it correctly, it is asking to break
> > > [Original Input File] -> [file1, file2, ...].
> > >
> > > So if my file is [/test/MatrixA] --> [/test/smallfiles/file1,
> > > /test/smallfiles/file2, /test/smallfiles/file3, ...]
> > >
> > > will the input path in MatrixMultiplicationJob then be the directory
> > > path /test/smallfiles?
> > >
> > > Will breaking the file in this manner cause problems in the
> > > algorithmic execution of the MR job? I'm not sure the output will be correct.
> > >
> > > -----Original Message-----
> > > From: Ashish [mailto:paliwalashish@gmail.com]
> > > Sent: Wednesday, January 16, 2013 5:44 PM
> > > To: user@mahout.apache.org
> > > Subject: Re: MatrixMultiplicationJob runs with 1 mapper only ?
> > >
> > > MatrixMultiplicationJob internally sets InputFormat as CompositeInputFormat
> > >
> > > JobConf conf = new JobConf(initialConf, 
> > > MatrixMultiplicationJob.class); 
> > > conf.setInputFormat(CompositeInputFormat.class);
> > >
> > > and AFAIK, CompositeInputFormat ignores the splits. See this:
> > >
> > > http://stackoverflow.com/questions/8654200/hadoop-file-splits-compositeinputformat-inner-join
> > >
> > > Unfortunately, I don't know any other alternative as of now.
> > >
> > >
> > > On Wed, Jan 16, 2013 at 5:05 PM, Stuti Awasthi <stutiawasthi@hcl.com> wrote:
> > >
> > > > The issue is that currently my matrix is of dimension
> > > > (100x100k); later it can be (1M x 10M) or bigger.
> > > >
> > > > Even now my job runs with a single mapper for (100x100k)
> > > > and is not able to complete. As I mentioned, the map task
> > > > proceeds only to 0.99% and then starts spilling the map output.
> > > > Hence I want to tune my job so that Mahout is able to complete
> > > > it and I can utilize my cluster resources.
> > > >
> > > > As MatrixMultiplicationJob is an MR job, it should be able to
> > > > handle parallel map tasks. I am not sure if there are any
> > > > algorithmic constraints due to which it runs with only a single mapper.
> > > > I followed the thread below so that I can set the Configuration
> > > > myself rather than getting it with getConf(), but did not have
> > > > any success:
> > > >
> > > > http://lucene.472066.n3.nabble.com/Setting-Number-of-Mappers-and-Reducers-in-DistributedRowMatrix-Jobs-td888980.html
> > > >
> > > > Stuti
> > > >
> > > > -----Original Message-----
> > > > From: Sean Owen [mailto:srowen@gmail.com]
> > > > Sent: Wednesday, January 16, 2013 4:46 PM
> > > > To: Mahout User List
> > > > Subject: RE: MatrixMultiplicationJob runs with 1 mapper only ?
> > > >
> > > > Why do you need multiple mappers? Is one too slow? Many are not
> > > > necessarily faster for small input.
> > > >
> > > > On Jan 16, 2013 10:46 AM, "Stuti Awasthi" <stutiawasthi@hcl.com> wrote:
> > > >
> > > > > Hi,
> > > > > I also tried calling it programmatically but am facing the same issue:
> > > > > only a single map task is running, and it too keeps spilling the map
> > > > > output continuously.
> > > > > Hence I'm not able to generate the output for a large matrix
> > > > > multiplication.
> > > > >
> > > > > Code snippet:
> > > > >
> > > > > DistributedRowMatrix a = new DistributedRowMatrix(
> > > > >     new Path("/test/points/matrixA"), new Path("/test/temp"),
> > > > >     Integer.parseInt("100"), Integer.parseInt("100000"));
> > > > > DistributedRowMatrix b = new DistributedRowMatrix(
> > > > >     new Path("/test/points/matrixA"), new Path("tempDir"),
> > > > >     Integer.parseInt("100"), Integer.parseInt("100000"));
> > > > > Configuration conf = new Configuration();
> > > > > conf.set("fs.default.name", "hdfs://DS-1078D24B4736:10818");
> > > > > conf.set("mapred.child.java.opts", "-Xmx2048m");
> > > > > conf.set("mapred.max.split.size", "10485760");
> > > > > a.setConf(conf);
> > > > > b.setConf(conf);
> > > > > a.times(b);
> > > > >
> > > > > Where am I going wrong? Any idea?
> > > > >
> > > > > Thanks
> > > > > Stuti
> > > > > -----Original Message-----
> > > > > From: Stuti Awasthi
> > > > > Sent: Wednesday, January 16, 2013 2:55 PM
> > > > > To: Mahout User List
> > > > > Subject: RE: MatrixMultiplicationJob runs with 1 mapper only ?
> > > > >
> > > > > Hey Sean,
> > > > > Thanks for the response. The MatrixMultiplicationJob help shows the
> > > > > usage as:
> > > > > usage: <command> [Generic Options] [Job-Specific Options]
> > > > >
> > > > > Here generic options can be provided with -D <property=value>.
> > > > > Hence I tried the command line -D options, but they do not seem to
> > > > > have any effect. This is also suggested in:
> > > > >
> > > > > https://builds.apache.org/job/Mahout-Quality/javadoc/org/apache/mahout/common/AbstractJob.html
> > > > >
> > > > > After your suggestion I noticed one thing: I am currently
> > > > > passing arguments as -D<property=value> rather
> > > > > than -D <property=value>. I also tried with a space between -D and
> > > > > property=value, but then it gives an error like:
> > > > > 13/01/16 14:21:47 ERROR common.AbstractJob: Unexpected
> > > > > /test/points/matrixA while processing Job-Specific Options:
> > > > >
> > > > > No such error occurs if I pass the arguments without a space after
> > > > > -D.
> > > > >
> > > > > According to Hadoop: The Definitive Guide: "Do not confuse
> > > > > setting Hadoop properties using the -D property=value option 
> > > > > to GenericOptionsParser (and
> > > > > ToolRunner) with setting JVM system properties using the 
> > > > > -Dproperty=value option to the java command. The syntax for 
> > > > > JVM system properties does not allow any whitespace between 
> > > > > the D and the property name, whereas GenericOptionsParser 
> > > > > requires them to be separated by whitespace."
> > > > >
> > > > > Hence I suppose that GenericOptions should be parsed by -D 
> > > > > property=value rather than -Dproperty=value.
> > > > >
> > > > > Additionally I tried -Dmapred.max.split.size=10485760 also 
> > > > > through commandline but again only single MapTask started.
> > > > >
> > > > > Please Suggest
> > > > >
> > > > >
> > > > > -----Original Message-----
> > > > > From: Sean Owen [mailto:srowen@gmail.com]
> > > > > Sent: Wednesday, January 16, 2013 1:23 PM
> > > > > To: Mahout User List
> > > > > Subject: Re: MatrixMultiplicationJob runs with 1 mapper only ?
> > > > >
> > > > > It's up to Hadoop in the end.
> > > > >
> > > > > Try calling FileInputFormat.setMaxInputSplitSize() with a 
> > > > > smallish value, like your 10MB (10000000).
> > > > >
> > > > > I don't know if Hadoop params can be set as sys properties like
> > > > > that anyway?
> > > > >
> > > > > On Wed, Jan 16, 2013 at 7:48 AM, Stuti Awasthi <stutiawasthi@hcl.com> wrote:
> > > > > > Hi,
> > > > > >
> > > > > > I am trying to multiply a dense matrix of size [100 x 100k].
> > > > > > The size of the file is 104MB and with the default block size of
> > > > > > 64MB only 2 blocks are created.
> > > > > > So I reduced the block size to 10MB and now my file is divided into
> > > > > > 11 blocks across the cluster. Cluster size is 10 nodes with 1 NN/JT
> > > > > > and 9 DN/TT.
> > > > > >
> > > > > > Every time I run the Mahout MatrixMultiplicationJob through the
> > > > > > command line, I can see on the JobTracker web UI that only 1 map task
> > > > > > is launched. According to my understanding of InputSplit, there
> > > > > > should be 11 map tasks launched.
> > > > > > Apart from this, the map task stays at 0.99% completion and in the
> > > > > > task logs I can see that the map task is spilling the map output.
> > > > > >
> > > > > > Mahout Command:
> > > > > >
> > > > > > mahout matrixmult -Dmapred.child.java.opts=-Xmx1024M \
> > > > > >     -Dfs.inmemory.size.mb=200 -Dio.sort.factor=100 -Dio.sort.mb=200 \
> > > > > >     -Dio.file.buffer.size=131072 \
> > > > > >     --inputPathA /test/matrixA --numRowsA 100 --numColsA 100000 \
> > > > > >     --inputPathB /test/matrixA --numRowsB 100 --numColsB 100000 \
> > > > > >     --tempDir /test/temp
> > > > > >
> > > > > > Now I want to know why only 1 map task is launched every time,
> > > > > > and how I can performance-tune the cluster so that I can perform
> > > > > > dense matrix multiplication of the order [90K x 1 Million].
> > > > > >
> > > > > > Thanks
> > > > > > Stuti
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > thanks
> > > ashish
> > >
> > > Blog: http://www.ashishpaliwal.com/blog My Photo Galleries: 
> > > http://www.pbase.com/ashishpaliwal
> > >
> >
> >
> >
> > --
> > thanks
> > ashish
> >
> > Blog: http://www.ashishpaliwal.com/blog My Photo Galleries: 
> > http://www.pbase.com/ashishpaliwal
> >
>


::DISCLAIMER::
----------------------------------------------------------------------------------------------------------------------------------------------------

The contents of this e-mail and any attachment(s) are confidential and intended for the named
recipient(s) only.
E-mail transmission is not guaranteed to be secure or error-free as information could be intercepted,
corrupted,
lost, destroyed, arrive late or incomplete, or may contain viruses in transmission. The e
mail and its contents
(with or without referred errors) shall therefore not attach any liability on the originator
or HCL or its affiliates.
Views or opinions, if any, presented in this email are solely those of the author and may
not necessarily reflect the
views or opinions of HCL or its affiliates. Any form of reproduction, dissemination, copying,
disclosure, modification,
distribution and / or publication of this message without the prior written consent of authorized
representative of
HCL is strictly prohibited. If you have received this email in error please delete it and
notify the sender immediately.
Before opening any email and/or attachments, please check them for viruses and other defects.

----------------------------------------------------------------------------------------------------------------------------------------------------