mahout-user mailing list archives

From Dmitriy Lyubimov <dlie...@gmail.com>
Subject Re: SSVD + PCA
Date Mon, 20 Aug 2012 18:09:58 GMT
On Mon, Aug 20, 2012 at 11:03 AM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
> Ok, this just means that something in the A input is not really
> adhering to the <Writable,VectorWritable> specification. In particular,
> there seems to be a file in the input path that has a <?,VectorWritable>
> pair in its input.

Sorry, that should read:

> there seems to be a file in the input path that has a <?,Text>
> pair in its input.

Input seems to have Text values somewhere.
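A quick way to verify is to open each sequence file and print the key/value
classes stored in its header. A minimal sketch against the Hadoop 1.x
SequenceFile.Reader API (the input path argument is a placeholder; extend the
listing to recurse if the input has subdirectories):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileStatus;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.SequenceFile;

  public class SeqFileTypes {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.get(conf);
      // list the files directly under the SSVD input directory;
      // repeat for any subdirectories, since SSVD reads the whole subtree
      for (FileStatus stat : fs.listStatus(new Path(args[0]))) {
        if (stat.isDir()) {
          continue;
        }
        SequenceFile.Reader reader =
            new SequenceFile.Reader(fs, stat.getPath(), conf);
        // the key/value class names are recorded in the sequence file header
        System.out.println(stat.getPath() + " : "
            + reader.getKeyClassName() + " / " + reader.getValueClassName());
        reader.close();
      }
    }
  }

Any file that reports a Text value class rather than VectorWritable is the
likely culprit.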

>
> Can you check your input files for key/value types? Note that this includes
> the entire subtree of sequence files, not just the files in the input
> directory.
>
> The key/value classes are usually visible in the header of the sequence
> file (usually even if it is using compression).
>
> I am not quite sure what you mean by "rowid" processing.
>
>
>
> On Sun, Aug 19, 2012 at 7:40 PM, Pat Ferrel <pat.ferrel@gmail.com> wrote:
>> Getting an odd error on SSVD.
>>
>> Starting with the QJob I get 9 map tasks for the data set; 8 are run on the mini
>> cluster in parallel. Most of them complete with no errors, but there are usually two map
>> task failures for each QJob, and they die with this error:
>>
>> java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.mahout.math.VectorWritable
>>         at org.apache.mahout.math.hadoop.stochasticsvd.QJob$QMapper.map(QJob.java:74)
>>         at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>>         at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
>>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
>>         at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
>>         at java.security.AccessController.doPrivileged(Native Method)
>>         at javax.security.auth.Subject.doAs(Subject.java:416)
>>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
>>         at org.apache.hadoop.mapred.Child.main(Child.java:249)
>>
>> The data was created using seq2sparse and then running rowid to create the input
>> matrix. The data was encoded as named vectors. These are the two differences I could think
>> of between how I ran it from the API and from the CLI.
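For reference, that preparation pipeline looks roughly like this (paths are
placeholders and the option spellings are from memory, so treat it as a sketch
rather than exact syntax):

  # sequence files (from an earlier seqdirectory step) -> named TF-IDF vectors
  mahout seq2sparse -i b2/seqfiles -o b2/vectors -nv
  # assign integer row ids so the matrix becomes <IntWritable,VectorWritable>
  mahout rowid -i b2/vectors/tfidf-vectors -o b2/matrix

One thing worth checking, if memory serves: rowid writes two outputs under its
output path, the matrix itself and a docIndex file whose values are Text (the
original document keys). A <?,Text> sequence file sitting in the same
directory that SSVD reads would produce exactly this kind of
ClassCastException.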
>>
>>
>> On Aug 18, 2012, at 7:29 PM, Pat Ferrel <pat.ferrel@gmail.com> wrote:
>>
>> -t Param
>>
>> I'm no hadoop expert, but there are a couple of parameters for each node in a cluster
>> that specify the default number of mappers and reducers for that node. There is a rule of
>> thumb about how many mappers and reducers to allow per core. You can tweak them either way
>> depending on your typical jobs.
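(For reference, on Hadoop 1.x those per-node knobs are the task-slot maximums
set in mapred-site.xml on each worker, mapred.tasktracker.map.tasks.maximum
and mapred.tasktracker.reduce.tasks.maximum, while the per-job default reducer
count comes from mapred.reduce.tasks, which is 1 out of the box. Property
names are from memory, so double-check them against your Hadoop version.)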
>>
>> No idea what you mean about the total reducers being 1 for most configs. My very
>> small cluster at home, with 10 cores in three machines, is configured to produce a conservative
>> 10 mappers and 10 reducers, which is about what happens with balanced jobs. The reducers =
>> 1 default is probably for a non-clustered, single-machine setup.
>>
>> I'm suspicious that the -t parameter is not needed but would definitely defer to
>> a hadoop master. In any case I set it to 10 for my mini cluster.
>>
>> Variance Retained
>>
>> If one batch of data yields a greatly different estimate of VR than another, it would
>> be worth noticing, even if we don't know the actual error in it. To say that your estimate
>> of VR is valueless would require that we have some experience with it, no?
>>
>> On Aug 18, 2012, at 10:39 AM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
>>
>> On Aug 18, 2012 8:32 AM, "Pat Ferrel" <pat.ferrel@gmail.com> wrote:
>>>
>>> Switching from API to CLI
>>>
>>> the parameter -t is described in the PDF
>>>
>>> --reduceTasks <int-value> optional. The number of reducers to use (where
>>> applicable): depends on the size of the hadoop cluster. At this point it
>>> could also be overwritten by a standard hadoop property using the -D option.
>>> Probably always needs to be specified, as by default Hadoop would set it
>>> to 1, which is certainly far below the cluster capacity. Recommended value
>>> for this option: ~95% or ~190% of available reducer capacity to allow for
>>> opportunistic executions.
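(As a concrete example of that rule: on a cluster with 10 reducer slots, the
~95% setting works out to roughly -t 9 and the ~190% setting to roughly -t 19.)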
>>>
>>> The description above seems to say it will be taken from the hadoop
>>> config if not specified, which is probably all most people would ever
>>> want. I am unclear why this is needed. I cannot run SSVD without specifying
>>> it; in other words, it does not seem to be optional?
>>
>> This parameter was made mandatory because people were repeatedly forgetting to
>> set the number of reducers and kept coming back with questions about why it
>> was running so slow. So there was an issue in 0.7 where I made it mandatory.
>> I am actually not sure how other mahout methods ensure the reducer
>> count is ever set to anything other than 1.
>>
>>>
>>> As a first try using the CLI I'm running with 295625 rows and 337258
>>> columns using the following parameters to get a sort of worst-case run time
>>> result with best-case data output. The parameters will be tweaked later to
>>> get better dimensional reduction and runtime.
>>>
>>>   mahout ssvd -i b2/matrix -k 500 -q 1 -pca -o b2/ssvd-out -t (depends on cluster)
>>>
>>> Is there work being done to calculate the variance retained for the
>>> output or should I calculate it myself?
>>
>> No, there's no work being done on that, since it implies you are building your own pipeline
>> for a particular purpose. It also takes a lot of assumptions that may or
>> may not hold in a particular case, such as that you do something repeatedly
>> and the corpuses are of a similar nature. Also, I know of no paper that would do it
>> exactly the way I described, so there's no error estimate on either the
>> inequality approach or any sort of decay interpolation.
>>
>> It is not very difficult, though, to experiment a little with a subset of
>> your corpus and see what may work.
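For what it's worth, the basic arithmetic for the PCA case is simple once you
have the top-k singular values and the total variance of the mean-centered
input: variance retained is roughly (sum of sigma_i^2 over the k kept values)
divided by the squared Frobenius norm of the centered matrix, and the
denominator has to be computed in a separate pass because SSVD only returns
the top k singular values. A minimal sketch; loading the singular values and
precomputing the total are left to the reader and are not part of the SSVD
output API:

  // Rough variance-retained estimate for a PCA run of SSVD.
  // sigma: the k singular values produced by SSVD (largest first).
  // totalCenteredFrobNormSquared: sum of squares of all entries of the
  //   mean-centered input matrix, computed in a separate pass over the data.
  public static double varianceRetained(double[] sigma,
                                        double totalCenteredFrobNormSquared) {
    double kept = 0.0;
    for (double s : sigma) {
      // each sigma_i^2 is proportional to the variance captured by
      // component i; the 1/(m-1) factor cancels in the ratio
      kept += s * s;
    }
    return kept / totalCenteredFrobNormSquared;
  }

As noted above, this is only an estimate under the usual PCA assumptions;
whether it is meaningful for a given corpus is an empirical question.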
