mahout-user mailing list archives

From: Eshwaran Vijaya Kumar <evijayaku...@mozilla.com>
Subject: Re: Errors in SSVD
Date: Tue, 16 Aug 2011 20:57:51 GMT
I have decided to do something similar: do the pipeline in memory and not invoke map-reduce
for small datasets, which I think will handle the issue.
Thanks again for clearing that up.
Esh
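
A minimal sketch of that in-memory fallback, assuming Mahout's
org.apache.mahout.math.SingularValueDecomposition (the Colt-migrated solver mentioned
below); the 100,000-row cutoff echoes Dmitriy's estimate further down, and
runSsvdPipeline() is just a hypothetical stand-in for the existing map-reduce path:

    import org.apache.mahout.math.Matrix;
    import org.apache.mahout.math.SingularValueDecomposition;

    public class SvdDispatcher {
      // rough cutoff below which a dense in-memory decomposition is comfortable
      private static final int IN_MEMORY_ROW_LIMIT = 100000;

      public static double[] singularValues(Matrix a) {
        if (a.numRows() <= IN_MEMORY_ROW_LIMIT) {
          // small input: decompose directly in memory, no map-reduce involved
          return new SingularValueDecomposition(a).getSingularValues();
        }
        // large input: hand off to the distributed SSVD pipeline (not shown here)
        return runSsvdPipeline(a);
      }

      private static double[] runSsvdPipeline(Matrix a) {
        throw new UnsupportedOperationException("distributed path omitted from this sketch");
      }
    }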

On Aug 16, 2011, at 1:45 PM, Dmitriy Lyubimov wrote:

> PPS Mahout also has an in-memory SVD Colt-migrated solver which is BTW what i
> am using in local tests to assert SSVD results. Although it starts to feel
> slow pretty quickly and sometimes produces errors (i think it starts feeling
> slow at 10k x 1k inputs).
> 
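A small sketch of that kind of local assertion: compare the leading k singular values
from an SSVD run against the dense in-memory solver. Here ssvdSigma is a hypothetical
double[] pulled from the SSVD output, and the 10% tolerance is arbitrary:

    import org.apache.mahout.math.Matrix;
    import org.apache.mahout.math.SingularValueDecomposition;

    public class SsvdAssertions {
      /** Compares the leading k singular values of an SSVD run against the exact ones. */
      public static void assertCloseToExactSvd(Matrix a, double[] ssvdSigma, int k) {
        // getSingularValues() returns the values in descending order
        double[] exact = new SingularValueDecomposition(a).getSingularValues();
        for (int i = 0; i < k; i++) {
          double relErr = Math.abs(exact[i] - ssvdSigma[i]) / exact[i];
          if (relErr > 0.1) { // SSVD is approximate, so allow generous slack
            throw new AssertionError("singular value " + i + " off by " + relErr);
          }
        }
      }
    }
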
> On Tue, Aug 16, 2011 at 12:52 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
> 
>> also, with data as small as this, the stochastic noise ratio would be
>> significant (as in the law of large numbers), so if you really think you might
>> need to handle inputs that small, you'd better write a pipeline that detects
>> this as a corner case and just runs an in-memory decomposition. In fact, i
>> think dense matrices up to 100,000 rows can be quite comfortably computed
>> in-memory (Ted knows much more on practical limits of tools like R or even
>> something as simple as apache.math).
>> 
>> -d
>> 
>> 
>> On Tue, Aug 16, 2011 at 12:46 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
>> 
>>> yep that's what i figured. you have 193 rows or so but distributed across
>>> 7 files, so they are small and would generate several mappers, and there are
>>> probably some there with a small row count.
>>>
>>> See my other email. This method is for big data, big files. If you want to
>>> automate handling of small files, you can probably do some intermediate step
>>> with some heuristic that merges together all files, say, shorter than 1 MB.
>>> 
>>> -d
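
A rough sketch of that merge heuristic, assuming the inputs are DRM-style SequenceFiles
of IntWritable keys and VectorWritable rows; the directory layout and the 1 MB cutoff
are illustrative only:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.mahout.math.VectorWritable;

    public class SmallFileMerger {
      private static final long SMALL_FILE_BYTES = 1024L * 1024L; // the "1 MB" heuristic

      /** Folds every small part file under inputDir into a single SequenceFile. */
      public static void mergeSmallFiles(Configuration conf, Path inputDir, Path merged)
          throws IOException {
        FileSystem fs = FileSystem.get(conf);
        SequenceFile.Writer writer = SequenceFile.createWriter(
            fs, conf, merged, IntWritable.class, VectorWritable.class);
        IntWritable key = new IntWritable();
        VectorWritable row = new VectorWritable();
        try {
          for (FileStatus stat : fs.listStatus(inputDir)) {
            if (stat.isDir() || stat.getLen() >= SMALL_FILE_BYTES) {
              continue; // leave directories and big files alone
            }
            SequenceFile.Reader reader = new SequenceFile.Reader(fs, stat.getPath(), conf);
            try {
              // keys (row indices) are passed through unchanged; this assumes
              // they are already unique across the part files
              while (reader.next(key, row)) {
                writer.append(key, row);
              }
            } finally {
              reader.close();
            }
          }
        } finally {
          writer.close();
        }
      }
    }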
>>> 
>>> 
>>> 
>>> On Tue, Aug 16, 2011 at 12:43 PM, Eshwaran Vijaya Kumar <evijayakumar@mozilla.com> wrote:
>>> 
>>>> Number of mappers is 7. DFS block size is 128 MB; the reason I think there
>>>> are 7 mappers is that I am using a Pig script to generate the sequence file
>>>> of Vectors, and that script uses 7 reducers. I am not setting minSplitSize
>>>> though.
>>>> 
>>>> On Aug 16, 2011, at 12:15 PM, Dmitriy Lyubimov wrote:
>>>> 
>>>>> Hm. This is not common at all.
>>>>>
>>>>> This error would surface if a map split can't accumulate at least k+p rows.
>>>>>
>>>>> That's another requirement which usually is a non-issue -- any precomputed
>>>>> split must contain at least k+p rows, which normally fails only if the
>>>>> matrix is extra wide and dense, in which case --minSplitSize must be used
>>>>> to avoid this.
>>>>>
>>>>> But in your case, the matrix is so small it must fit in one split. Can you
>>>>> please verify how many mappers the job generates?
>>>>>
>>>>> If it's more than 1 then something fishy is going on with hadoop. Otherwise,
>>>>> something is fishy with the input (it's either not 293 rows, or k+p is more
>>>>> than 293).
>>>>>
>>>>> -d
>>>>> 
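A quick way to check both suspicions is to count rows per part file, which is a
reasonable proxy for rows per split when each small file becomes its own split. A
hedged sketch, assuming DRM SequenceFiles keyed by IntWritable with VectorWritable
rows; the input path is illustrative, and k = 4, p = 40 are the values from this thread:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.mahout.math.VectorWritable;

    public class DrmRowCountCheck {
      public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        int k = 4, p = 40;          // rank and oversampling under test
        long total = 0;
        for (FileStatus stat : fs.listStatus(new Path("/user/esh/drm-input"))) {
          if (stat.isDir()) {
            continue;
          }
          SequenceFile.Reader reader = new SequenceFile.Reader(fs, stat.getPath(), conf);
          IntWritable key = new IntWritable();
          VectorWritable row = new VectorWritable();
          long rows = 0;
          while (reader.next(key, row)) {
            rows++;
          }
          reader.close();
          total += rows;
          System.out.println(stat.getPath() + ": " + rows + " rows");
          if (rows < k + p) {
            // a file (and hence its split) this short cannot satisfy the
            // at-least-k+p-rows requirement on its own
            System.out.println("  -> fewer than k+p = " + (k + p) + " rows");
          }
        }
        System.out.println("total rows: " + total);
      }
    }
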
>>>>> On Tue, Aug 16, 2011 at 11:39 AM, Eshwaran Vijaya Kumar <evijayakumar@mozilla.com> wrote:
>>>>> 
>>>>>> 
>>>>>> On Aug 16, 2011, at 10:35 AM, Dmitriy Lyubimov wrote:
>>>>>> 
>>>>>>> This is unusually small input. What's the block size? Use large blocks
>>>>>>> (such as 30,000). Block size can't be less than k+p.
>>>>>>>
>>>>>>
>>>>>> I did set blockSize to 30,000 (as recommended in the PDF that you wrote
>>>>>> up). As far as the input size goes, the reason for it is that it is easier
>>>>>> to test and verify the map-reduce pipeline against my in-memory
>>>>>> implementation of the algorithm.
>>>>>>
>>>>>>> Can you please cut and paste the actual log of the qjob tasks that failed?
>>>>>>> This is a front-end error, but the actual problem is in the backend,
>>>>>>> ranging anywhere from hadoop problems to algorithm problems.
>>>>>> Sure. Refer http://esh.pastebin.mozilla.org/1302059
>>>>>> Input is a DistributedRowMatrix, 293 x 236, with k = 4, p = 40,
>>>>>> numReduceTasks = 1, blockHeight = 30,000. Reducing p to 20 ensures the job
>>>>>> goes through...
>>>>>>
>>>>>> Thanks again
>>>>>> Esh
>>>>>> 
>>>>>> 
>>>>>>> On Aug 16, 2011 9:44 AM, "Eshwaran Vijaya Kumar" <evijayakumar@mozilla.com> wrote:
>>>>>>>> Thanks again. I am using 0.5 right now. We will try to patch it up and
>>>>>>>> see how it performs. In the mean time, I am having another (possibly
>>>>>>>> user?) error: I have a 260 x 230 matrix. I set k+p = 40, and it fails
>>>>>>>> with
>>>>>>>>
>>>>>>>> Exception in thread "main" java.io.IOException: Q job unsuccessful.
>>>>>>>> at org.apache.mahout.math.hadoop.stochasticsvd.QJob.run(QJob.java:349)
>>>>>>>> at org.apache.mahout.math.hadoop.stochasticsvd.SSVDSolver.run(SSVDSolver.java:262)
>>>>>>>> at org.apache.mahout.math.hadoop.stochasticsvd.SSVDCli.run(SSVDCli.java:91)
>>>>>>>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>>>>>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
>>>>>>>> at org.apache.mahout.math.hadoop.stochasticsvd.SSVDCli.main(SSVDCli.java:131)
>>>>>>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>>>>> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>>>>>>> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>>>>>> at java.lang.reflect.Method.invoke(Method.java:597)
>>>>>>>> at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>>>>>>>> at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>>>>>>>> at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:187)
>>>>>>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>>>>> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>>>>>>> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>>>>>> at java.lang.reflect.Method.invoke(Method.java:597)
>>>>>>>> at org.apache.hadoop.util.RunJar.main(RunJar.java:186)
>>>>>>>>
>>>>>>>>
>>>>>>>> If I set k+p to be much lower, say around 20, it works fine. Is it just
>>>>>>>> that my dataset is of low rank or is there something else going on here?
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> Esh
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Aug 14, 2011, at 1:47 PM, Dmitriy Lyubimov wrote:
>>>>>>>> 
>>>>>>>>> ... i need to let some time for review before pushing to ASF repo )..
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Sun, Aug 14, 2011 at 1:47 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
>>>>>>>>> 
>>>>>>>>>> patch is posted as MAHOUT-786.
>>>>>>>>>>
>>>>>>>>>> also 0.6 trunk with patch applied is here:
>>>>>>>>>> https://github.com/dlyubimov/mahout-commits/tree/MAHOUT-786
>>>>>>>>>>
>>>>>>>>>> I will commit to ASF repo tomorrow night (even though it is extremely
>>>>>>>>>> simple, i need
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Sat, Aug 13, 2011 at 1:48 PM, Eshwaran Vijaya Kumar <evijayakumar@mozilla.com> wrote:
>>>>>>>>>> 
>>>>>>>>>>> Dmitriy,
>>>>>>>>>>> That sounds great. I eagerly await the patch.
>>>>>>>>>>> Thanks
>>>>>>>>>>> Esh
>>>>>>>>>>> On Aug 13, 2011, at 1:37 PM, Dmitriy Lyubimov wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> Ok, i got u0 working.
>>>>>>>>>>>>
>>>>>>>>>>>> The problem is of course that something called the BBt job has to be
>>>>>>>>>>>> coerced to have 1 reducer (it's fine: every mapper won't yield more
>>>>>>>>>>>> than an upper-triangular matrix of k+p x k+p geometry, so even if
>>>>>>>>>>>> you end up having thousands of them, the reducer would sum them up
>>>>>>>>>>>> just fine).
>>>>>>>>>>>>
>>>>>>>>>>>> it worked before apparently because the configuration holds 1
>>>>>>>>>>>> reducer by default if not set explicitly; i am not quite sure
>>>>>>>>>>>> whether it's something in the hadoop mr client or a mahout change
>>>>>>>>>>>> that now precludes it from working.
>>>>>>>>>>>>
>>>>>>>>>>>> anyway, i got a patch (really a one-liner) and an example equivalent
>>>>>>>>>>>> to yours worked fine for me with 3 reducers.
>>>>>>>>>>>>
>>>>>>>>>>>> In the tests, it also requests 3 reducers, but the reason it works
>>>>>>>>>>>> in tests and not in distributed mapred is that local mapred doesn't
>>>>>>>>>>>> support multiple reducers. I investigated this issue before and
>>>>>>>>>>>> apparently there were a couple of patches floating around, but for
>>>>>>>>>>>> some reason those changes did not take hold in cdh3u0.
>>>>>>>>>>>>
>>>>>>>>>>>> I will publish the patch in a jira shortly and will commit it
>>>>>>>>>>>> Sunday-ish.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks.
>>>>>>>>>>>> -d
>>>>>>>>>>>> 
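A hedged guess at the shape of that one-liner (the actual MAHOUT-786 change may well
differ), assuming the BBt job is set up through the new-API
org.apache.hadoop.mapreduce.Job:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class BBtJobSetup {
      public static Job createBBtJob(Configuration conf) throws IOException {
        Job job = new Job(conf, "BBt-job");
        // Each mapper emits at most one upper-triangular (k+p) x (k+p) partial
        // of BB', so a single reducer can sum however many of those arrive;
        // pin it to 1 rather than inheriting the user-requested reduce count.
        job.setNumReduceTasks(1);
        return job;
      }
    }
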
>>>>>>>>>>>> 
>>>>>>>>>>>> On Fri, Aug 5, 2011 at 7:06 PM, Eshwaran Vijaya Kumar <evijayakumar@mozilla.com> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>> OK. So to add more info to this, I tried setting the number of
>>>>>>>>>>>>> reducers to 1 and now I don't get that particular error. The
>>>>>>>>>>>>> singular values and left and right singular vectors appear to be
>>>>>>>>>>>>> correct though (verified using Matlab).
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Aug 5, 2011, at 1:55 PM, Eshwaran Vijaya Kumar wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> All,
>>>>>>>>>>>>>> I am trying to test Stochastic SVD and am facing some errors where
>>>>>>>>>>>>>> it would be great if someone could clarify what is going on. I am
>>>>>>>>>>>>>> trying to feed the solver a DistributedRowMatrix with the exact
>>>>>>>>>>>>>> same parameters that the test in
>>>>>>>>>>>>>> LocalSSVDSolverSparseSequentialTest uses, i.e., generate a 1000 x
>>>>>>>>>>>>>> 100 DRM with SequentialSparseVectors and then ask for blockHeight
>>>>>>>>>>>>>> = 251, p (oversampling) = 60, k (rank) = 40. I get the following
>>>>>>>>>>>>>> error:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Exception in thread "main" java.io.IOException: Unexpected overrun
>>>>>>>>>>>>>> in upper triangular matrix files
>>>>>>>>>>>>>> at org.apache.mahout.math.hadoop.stochasticsvd.SSVDSolver.loadUpperTriangularMatrix(SSVDSolver.java:471)
>>>>>>>>>>>>>> at org.apache.mahout.math.hadoop.stochasticsvd.SSVDSolver.run(SSVDSolver.java:268)
>>>>>>>>>>>>>> at com.mozilla.SSVDCli.run(SSVDCli.java:89)
>>>>>>>>>>>>>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>>>>>>>>>>>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
>>>>>>>>>>>>>> at com.mozilla.SSVDCli.main(SSVDCli.java:129)
>>>>>>>>>>>>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>>>>>>>>>>> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>>>>>>>>>>>>> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>>>>>>>>>>>> at java.lang.reflect.Method.invoke(Method.java:597)
>>>>>>>>>>>>>> at org.apache.hadoop.util.RunJar.main(RunJar.java:186)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Also, I am using CDH3 with Mahout recompiled to work with CDH3
>>>>>>>>>>>>>> jars.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>> Esh
>>>>>>>>>>>>>> 
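For reference, a rough sketch of writing a DRM input like the 1000 x 100 one described
above, using SequentialAccessSparseVector rows in an IntWritable/VectorWritable
SequenceFile; the output path, fill ratio, and random values are purely illustrative
and this is not how the Mahout test itself generates its data:

    import java.io.IOException;
    import java.util.Random;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.mahout.math.SequentialAccessSparseVector;
    import org.apache.mahout.math.Vector;
    import org.apache.mahout.math.VectorWritable;

    public class DrmInputWriter {
      public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path out = new Path("/user/esh/ssvd-test/A/part-00000"); // illustrative path
        SequenceFile.Writer writer = SequenceFile.createWriter(
            fs, conf, out, IntWritable.class, VectorWritable.class);
        Random rnd = new Random(1234L);
        try {
          for (int rowIdx = 0; rowIdx < 1000; rowIdx++) {
            Vector row = new SequentialAccessSparseVector(100);
            for (int col = 0; col < 100; col++) {
              if (rnd.nextDouble() < 0.1) {         // ~10% fill, purely illustrative
                row.setQuick(col, rnd.nextGaussian());
              }
            }
            writer.append(new IntWritable(rowIdx), new VectorWritable(row));
          }
        } finally {
          writer.close();
        }
      }
    }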
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>> 
>>>>>> 
>>>>>> 
>>>> 
>>>> 
>>> 
>> 

