mahout-dev mailing list archives

From Lance Norskog <goks...@gmail.com>
Subject Re: [jira] [Commented] (MAHOUT-904) SplitInput should support randomizing the input
Date Wed, 21 Dec 2011 02:36:51 GMT
Heck if I know. Stick with what you wrote until you decide it does not work.
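
For reference, here is a minimal sketch of the per-type approach discussed below: a concrete pair class with a no-arg constructor that serializes only its fields, so no class name is written per record. It is an illustration under stated assumptions, not the code in the MAHOUT-904 patch. SimpleWritable is a local stand-in for Hadoop's Writable interface so the sketch compiles without Hadoop on the classpath, and IntDoublesWritable, toBytes and fromBytes are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Assumption: local stand-in for org.apache.hadoop.io.Writable.
interface SimpleWritable {
  void write(DataOutput out) throws IOException;
  void readFields(DataInput in) throws IOException;
}

// Hypothetical concrete pair: an int key plus a dense array of doubles.
final class IntDoublesWritable implements SimpleWritable {
  private int key;
  private double[] values = new double[0];

  // No-arg constructor so a framework can instantiate the class reflectively.
  IntDoublesWritable() {}

  IntDoublesWritable(int key, double[] values) {
    this.key = key;
    this.values = values.clone();
  }

  int getKey() { return key; }

  double[] getValues() { return values.clone(); }

  // Writes key, length, then the values. No class name is written.
  @Override
  public void write(DataOutput out) throws IOException {
    out.writeInt(key);
    out.writeInt(values.length);
    for (double v : values) {
      out.writeDouble(v);
    }
  }

  // Reads the fields back in the same order they were written.
  @Override
  public void readFields(DataInput in) throws IOException {
    key = in.readInt();
    values = new double[in.readInt()];
    for (int i = 0; i < values.length; i++) {
      values[i] = in.readDouble();
    }
  }

  // Helper for the round trip: serialize to a byte array.
  static byte[] toBytes(SimpleWritable w) {
    try {
      ByteArrayOutputStream buffer = new ByteArrayOutputStream();
      w.write(new DataOutputStream(buffer));
      return buffer.toByteArray();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  // Helper for the round trip: start from the no-arg constructor, then fill in fields.
  static IntDoublesWritable fromBytes(byte[] bytes) {
    try {
      IntDoublesWritable result = new IntDoublesWritable();
      result.readFields(new DataInputStream(new ByteArrayInputStream(bytes)));
      return result;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

With a vector of three doubles the record is 32 bytes: 4 for the key, 4 for the length, and 24 for the values. The cost is the one raised in the thread: a class like this has to be written for each pairing of types.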

On Mon, Dec 19, 2011 at 12:08 AM, Raphael Cendrillon
<cendrillon1978@gmail.com> wrote:
> That's a very good point.  Using this type of framework will make things much cleaner.
>
> This comment (from the top of the TupleWritable file) is what makes me a little concerned:
>
> This is *not* a general-purpose tuple type. In almost all cases, users are encouraged
> to implement their own serializable types, which can perform better validation and provide
> more efficient encodings than this class is capable. TupleWritable relies on the join framework
> for type safety and assumes its instances will rarely be persisted, assumptions not only incompatible
> with, but contrary to the general case.
>
> If we don't mind storing the class name, would it be better to use ObjectWritable for
> the vector, or whatever else happens to be there?
>
>
> On 18 Dec, 2011, at 11:26 PM, Lance Norskog wrote:
>
>> But the Writables in each tuple include a vector which could be
>> hundreds of doubles. It's not a big deal.
>>
>> On Sun, Dec 18, 2011 at 9:29 PM, Raphael Cendrillon
>> <cendrillon1978@gmail.com> wrote:
>>> Yes, but TupleWritable is pretty inefficient since it stores the class name with
>>> every record.  This seems wasteful given that the class is always the same.
>>>
>>> On 18 Dec, 2011, at 9:19 PM, Lance Norskog wrote:
>>>
>>>> JIRA is acting up, so posting here instead.
>>>>
>>>> You have already made RandomPermuteJob extend AbstractJob. Never mind.
>>>>
>>>> bq. Does this seem like a reasonable approach? It would require that a
>>>> class be created for each object type of interest which is somewhat
>>>> painful. However, I can't see a simpler approach since
>>>> setMapOutputValueClass() needs to take a class that has a default
>>>> constructor (and PairWritable doesn't have a default constructor since
>>>> it doesn't know how to call new for first and second since it doesn't
>>>> know what class first and second belong to).
>>>>
>>>> TupleWritable handles this by writing the classname. Looking at this
>>>> again, can't this just use TupleWritable?
>>>>
>>>> http://grepcode.com/file/repo1.maven.org/maven2/org.jvnet.hudson.hadoop/hadoop-core/0.19.1-hudson-3/org/apache/hadoop/mapred/join/TupleWritable.java
>>>>
>>>> On Sun, Dec 18, 2011 at 7:48 PM, Raphael Cendrillon (Commented) (JIRA)
>>>> <jira@apache.org> wrote:
>>>>>
>>>>>    [ https://issues.apache.org/jira/browse/MAHOUT-904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13172021#comment-13172021 ]
>>>>>
>>>>> Raphael Cendrillon commented on MAHOUT-904:
>>>>> -------------------------------------------
>>>>>
>>>>> Hi Lance. Is that a general comment, or specifically for the issue regarding
>>>>> PairWritable/IntVectorWritable?
>>>>>
>>>>>> SplitInput should support randomizing the input
>>>>>> -----------------------------------------------
>>>>>>
>>>>>>                 Key: MAHOUT-904
>>>>>>                 URL: https://issues.apache.org/jira/browse/MAHOUT-904
>>>>>>             Project: Mahout
>>>>>>          Issue Type: Improvement
>>>>>>            Reporter: Grant Ingersoll
>>>>>>            Assignee: Raphael Cendrillon
>>>>>>              Labels: MAHOUT_INTRO_CONTRIBUTE
>>>>>>         Attachments: MAHOUT-904.patch, MAHOUT-904.patch, MAHOUT-904.patch
>>>>>>
>>>>>>
>>>>>> For some learning tasks, we need the input to be randomized (SGD)
>>>>>> instead of blocks of labels all at once.  SplitInput is a useful tool for setting up train/test
>>>>>> files but it currently doesn't support randomizing the input.
>>>>>
>>>>> --
>>>>> This message is automatically generated by JIRA.
>>>>> If you think it was sent incorrectly, please contact your JIRA administrators:
>>>>> https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
>>>>> For more information on JIRA, see: http://www.atlassian.com/software/jira
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Lance Norskog
>>>> goksron@gmail.com
>>>
>>
>>
>>
>> --
>> Lance Norskog
>> goksron@gmail.com
>



-- 
Lance Norskog
goksron@gmail.com
