mahout-dev mailing list archives

From Marty Kube <marty.kube.apa...@gmail.com>
Subject Re: Mahout Suggestions - Refactoring Effort
Date Wed, 27 Mar 2013 06:47:17 GMT
Hey Ted,

Here are the JIRA tickets...

https://issues.apache.org/jira/browse/MAHOUT-1163
https://issues.apache.org/jira/browse/MAHOUT-1164



On 03/27/2013 12:37 AM, Ted Dunning wrote:
> Can you post a list of those patches?
>
> I haven't been tracking carefully, and unless I have a moment when the
> email comes through (<10% chance lately), I lose track.
>
> On Wed, Mar 27, 2013 at 7:30 AM, Marty Kube <marty.kube.apache@gmail.com> wrote:
>
>> So I'd like to continue to improve the RF classifier code. I've been
>> posting patches along the lines of the refactoring discussed here. The
>> patches are not being looked at. Someone should be considering patches in
>> this area.  Maybe I could handle that :-)
>>
>>
>> Sent from my iPhone
>>
>> On Mar 27, 2013, at 12:14 AM, Sebastian Schelter <ssc@apache.org> wrote:
>>
>>> Totally agree on that. The impact of making Mahout more usable is much
>>> higher than that of adding a new algorithm.
>>>
>>> On 27.03.2013 05:41, Ted Dunning wrote:
>>>> It is critically important.
>>>>
>>>> On Wed, Mar 27, 2013 at 2:14 AM, Marty Kube <martykube@beavercreekconsulting.com> wrote:
>>>>
>>>>> IMHO usability is really important. I've posted a couple of patches
>>>>> recently around making the RF classifiers easier to use. I found
>>>>> myself working on consistent data format and command line option
>>>>> support. It's not glamorous but it's important.
>>>>>
>>>>>
>>>>> On 3/26/2013 8:26 PM, Ted Dunning wrote:
>>>>>
>>>>>> Gokhan,
>>>>>>
>>>>>> I think that the general drift of your recommendation is an
>>>>>> excellent suggestion, and it is something that we have wrestled with
>>>>>> a lot over time. The recommendations side of the house has more
>>>>>> coherence in this matter than other parts, largely because there was
>>>>>> a clear flow early on.
>>>>>>
>>>>>> Now, however, the flow is becoming clearer for the non-recommendation
>>>>>> parts of the system.
>>>>>>
>>>>>> - we have 2-3 external kinds of input. These include text and
>>>>>> matrices. Text comes in two major forms: text in files with
>>>>>> unspecified separators, and text in Lucene/Solr indexes. Matrices
>>>>>> come in several forms, including triples, CSV files, binary
>>>>>> matrices, and sequence files of vectors.
>>>>>>
>>>>>> - there are currently only a few ways to convert text and external
>>>>>> data to matrices. The two most prominent are dictionary-based and
>>>>>> hashed encoding. Hashed encoding is currently not as invertible as
>>>>>> it should be. Dictionary-based encoding has the virtue of being
>>>>>> invertible, but hashed encoding has considerably more generality. We
>>>>>> have almost no support for multiple fields in dictionary-based
>>>>>> encoding. (A sketch of hashed encoding appears after this list.)
>>>>>>
>>>>>> - good conversion backwards and forwards depends on having schema
>>>>>> information that we don't retain or specify well.
>>>>>>
>>>>>> - knowledge discovery pathways need more flexibility than
>>>>>> recommendation pathways regarding input and visualization.
>>>>>>
>>>>>> - key knowledge discovery pathways that I know about include (a)
>>>>>> input summarization, (b) vectorization, (c) unsupervised analysis
>>>>>> such as LDA, LLL, clustering, and SVD, (d) supervised training such
>>>>>> as SGD, Naive Bayes, and random forests, and (e) visualization of
>>>>>> results.
>>>>>>
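>>>>>> To make the hashed-encoding point concrete, here is a minimal sketch
>>>>>> using the feature-hashing encoders in
>>>>>> org.apache.mahout.vectorizer.encoders; the field name and vector
>>>>>> size are arbitrary choices for illustration:
>>>>>>
>>>>>>   import org.apache.mahout.math.RandomAccessSparseVector;
>>>>>>   import org.apache.mahout.math.Vector;
>>>>>>   import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;
>>>>>>
>>>>>>   // Each token is hashed into slots of a fixed-size vector, so no
>>>>>>   // dictionary is kept; that is exactly why inverting the encoding
>>>>>>   // back to tokens is lossy today.
>>>>>>   StaticWordValueEncoder encoder = new StaticWordValueEncoder("text");
>>>>>>   Vector v = new RandomAccessSparseVector(1000);
>>>>>>   for (String token : "usability matters more than features".split(" ")) {
>>>>>>     encoder.addToVector(token, v);
>>>>>>   }
>>>>>>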
>>>>>> I see that the major problems in Mahout are what Gokhan said, but
>>>>>> with a few extras:
>>>>>>
>>>>>> 1) As Gokhan said, the exploratory pathways are inconsistent.
>>>>>>
>>>>>> 2) I think that our visualization pathways are also hideous.
>>>>>>
>>>>>> 3) I think that we need a good document format with a reasonable
>>>>>> schema. Rather than create such a thing, I would nominate Lucene/Solr
>>>>>> indexes as a first-class object in Mahout. (A small reading sketch
>>>>>> follows this list.)
>>>>>>
>>>>>> 4) Our current command lines, with their many different options and
>>>>>> incompatible conventions, are a bit of a shambles.
>>>>>>
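>>>>>> For the Lucene/Solr idea, a minimal reading sketch against the
>>>>>> Lucene 3.x API (the index path and field name are assumptions, and
>>>>>> this belongs in a method that throws IOException):
>>>>>>
>>>>>>   import java.io.File;
>>>>>>   import org.apache.lucene.document.Document;
>>>>>>   import org.apache.lucene.index.IndexReader;
>>>>>>   import org.apache.lucene.store.FSDirectory;
>>>>>>
>>>>>>   // Treating an index as a first-class corpus: iterate the stored
>>>>>>   // documents and hand each text field to a vectorizer.
>>>>>>   IndexReader reader = IndexReader.open(FSDirectory.open(new File("index")));
>>>>>>   for (int i = 0; i < reader.maxDoc(); i++) {
>>>>>>     if (reader.isDeleted(i)) {
>>>>>>       continue; // skip deleted slots
>>>>>>     }
>>>>>>     Document doc = reader.document(i);
>>>>>>     String text = doc.get("body"); // assumes "body" is a stored field
>>>>>>     // ... vectorize `text` here
>>>>>>   }
>>>>>>   reader.close();
>>>>>>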
>>>>>> Expressed this way, I think that these usability issues are fixable.
>>>>>>
>>>>>> What does everybody else think? Would this leave us with a
>>>>>> significantly better system?
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, Mar 26, 2013 at 9:35 PM, Gokhan Capan <gkhncpn@gmail.com> wrote:
>>>>>>
>>>>>>> I am moving the email that I wrote to the Call to Action thread here
>>>>>>> upon request.
>>>>>>> I'll start with an example of what I experience when I use Mahout,
>>>>>>> and then list my humble suggestions.
>>>>>>>
>>>>>>> When I try to run Latent Dirichlet Allocation for topic discovery,
>>>>>>> these are the steps to follow:
>>>>>>>
>>>>>>> 1- First I use seq2sparse to convert text to vectors. The output is
>>>>>>> Text, VectorWritable pairs. (If I have a CSV data file, which is an
>>>>>>> understandable format, with lines of id, text pairs, I need to
>>>>>>> develop my own tool to convert it to vectors.)
>>>>>>>
>>>>>>> 2- I run LDA on the data I transformed, but it doesn't work,
>>>>>>> because LDA needs IntWritable, VectorWritable pairs.
>>>>>>>
>>>>>>> 3- I convert the Text keys to IntWritable ones with a custom tool
>>>>>>> (a sketch of such a tool appears after this list).
>>>>>>>
>>>>>>> 4- Then I run LDA, and to see the results I need to run vectordump
>>>>>>> with the sort flag (it usually throws OutOfMemoryError). An ldadump
>>>>>>> tool does not exist. What I see is fairly different from clusterdump
>>>>>>> results, so I spend some time understanding what it means. (And I
>>>>>>> need to know in the first place that a vectordump tool exists for
>>>>>>> viewing the results.)
>>>>>>>
>>>>>>> 5- After running LDA, when I have a document that I want to assign
>>>>>>> to a topic, there is no way (or none that I am aware of) to use my
>>>>>>> learned LDA model to assign this document to a topic.
>>>>>>>
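>>>>>>> As an illustration of the kind of glue code step 3 forces on users,
>>>>>>> here is a minimal sketch (assuming the old-style Hadoop SequenceFile
>>>>>>> API; the paths are made up, and it belongs in a main that throws
>>>>>>> IOException):
>>>>>>>
>>>>>>>   import org.apache.hadoop.conf.Configuration;
>>>>>>>   import org.apache.hadoop.fs.FileSystem;
>>>>>>>   import org.apache.hadoop.fs.Path;
>>>>>>>   import org.apache.hadoop.io.IntWritable;
>>>>>>>   import org.apache.hadoop.io.SequenceFile;
>>>>>>>   import org.apache.hadoop.io.Text;
>>>>>>>   import org.apache.mahout.math.VectorWritable;
>>>>>>>
>>>>>>>   // Rewrite the <Text, VectorWritable> pairs that seq2sparse emits
>>>>>>>   // as the <IntWritable, VectorWritable> pairs that LDA expects,
>>>>>>>   // assigning sequential document ids and dropping the Text keys.
>>>>>>>   Configuration conf = new Configuration();
>>>>>>>   FileSystem fs = FileSystem.get(conf);
>>>>>>>   SequenceFile.Reader reader = new SequenceFile.Reader(
>>>>>>>       fs, new Path("tfidf-vectors/part-r-00000"), conf);
>>>>>>>   SequenceFile.Writer writer = SequenceFile.createWriter(
>>>>>>>       fs, conf, new Path("lda-input/part-r-00000"),
>>>>>>>       IntWritable.class, VectorWritable.class);
>>>>>>>   Text key = new Text();
>>>>>>>   VectorWritable value = new VectorWritable();
>>>>>>>   int docId = 0;
>>>>>>>   while (reader.next(key, value)) {
>>>>>>>     writer.append(new IntWritable(docId++), value);
>>>>>>>   }
>>>>>>>   reader.close();
>>>>>>>   writer.close();
>>>>>>>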
>>>>>>> I can give further examples, but I believe this will make my point
>>>>>>> clear.
>>>>>>>
>>>>>>> Would you consider refactoring Mahout so that the project follows a
>>>>>>> clear, layered structure for all algorithms, and documenting that
>>>>>>> structure?
>>>>>>>
>>>>>>> IMO the knowledge discovery process has a certain path, and Mahout
>>>>>>> can define rules that would constrain developers and guide users.
>>>>>>> For example:
>>>>>>>
>>>>>>>
>>>>>>>     - All algorithms take Mahout matrices as input and output.
>>>>>>>     - All preprocessing tools should be generic enough that they
>>>>>>>     produce appropriate input for Mahout algorithms.
>>>>>>>     - All algorithms should output a model that users can use
>>>>>>>     beyond training and testing.
>>>>>>>     - Tools that dump results should follow a strictly defined
>>>>>>>     format suggested by the community.
>>>>>>>     - All similar kinds of algorithms should use the same
>>>>>>>     evaluation tools.
>>>>>>>     - ...
>>>>>>>
>>>>>>> There may be separate layers: a preprocessing layer, an algorithms
>>>>>>> layer, an evaluation layer, and so on. One hypothetical shape for
>>>>>>> those contracts is sketched below.
>>>>>>>
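>>>>>>> A minimal sketch, assuming made-up interface names (none of these
>>>>>>> interfaces exist in Mahout today):
>>>>>>>
>>>>>>>   import org.apache.hadoop.fs.Path;
>>>>>>>   import org.apache.mahout.math.Matrix;
>>>>>>>   import org.apache.mahout.math.Vector;
>>>>>>>
>>>>>>>   // Preprocessing layer: any raw input becomes a Mahout matrix.
>>>>>>>   interface Preprocessor {
>>>>>>>     Matrix vectorize(Path rawInput);
>>>>>>>   }
>>>>>>>
>>>>>>>   // Algorithms layer: every algorithm consumes a matrix and yields
>>>>>>>   // a model that is usable beyond training and testing.
>>>>>>>   interface Learner<M extends Model> {
>>>>>>>     M train(Matrix input);
>>>>>>>   }
>>>>>>>
>>>>>>>   interface Model {
>>>>>>>     Vector apply(Vector instance); // e.g. assign a doc to topics
>>>>>>>   }
>>>>>>>
>>>>>>>   // Evaluation layer: similar algorithms share one evaluator.
>>>>>>>   interface Evaluator<M extends Model> {
>>>>>>>     double score(M model, Matrix heldOut);
>>>>>>>   }
>>>>>>>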
>>>>>>> This way users would be aware of the steps they need to perform,
>>>>>>> and one step can be replaced by an alternative.
>>>>>>>
>>>>>>> Developers would contribute to the layer they feel comfortable
>>>>>>> with, and would satisfy the expected input and output, to preserve
>>>>>>> the system's integrity.
>>>>>>>
>>>>>>> Mahout has tools for nearly all of these layers, but personally,
>>>>>>> when I use Mahout (and I've been using it for a long time), I feel
>>>>>>> lost in the steps I should follow.
>>>>>>>
>>>>>>> Moreover, the refactoring could eliminate duplicate data structures
>>>>>>> and stick to Mahout matrices where available; all similarity
>>>>>>> measures would operate on Mahout Vectors, for example.
>>>>>>>
>>>>>>> We, in the lab and in our company, do some of that. An example:
>>>>>>>
>>>>>>> We implemented an HBase-backed Mahout Matrix, which we use in our
>>>>>>> projects where online learning algorithms operate on large inputs
>>>>>>> and learn a big parameter matrix (one needs this for matrix
>>>>>>> factorization based recommenders). The persistent parameter matrix
>>>>>>> then becomes an input for the live system. We also used the same
>>>>>>> matrix implementation as the underlying data store of Recommender
>>>>>>> DataModels. This was advantageous in many ways:
>>>>>>>
>>>>>>>     - Everyone knows that any dataset should be in Mahout matrix
>>>>>>>     format, and applies appropriate preprocessing, or writes a new
>>>>>>>     preprocessor
>>>>>>>     - We can use different recommenders interchangeably (see the
>>>>>>>     sketch after this list)
>>>>>>>     - Any optimization on matrix operations applies everywhere
>>>>>>>     - Different people can work on different parts (evaluation,
>>>>>>>     model optimization, recommender algorithms) without bothering
>>>>>>>     others
>>>>>>>
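>>>>>>> On the interchangeability point, a minimal sketch against the Taste
>>>>>>> interfaces (a FileDataModel stands in here for our HBase-backed
>>>>>>> store; the file name is made up, and the calls belong in a method
>>>>>>> that throws TasteException and IOException):
>>>>>>>
>>>>>>>   import java.io.File;
>>>>>>>   import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
>>>>>>>   import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
>>>>>>>   import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
>>>>>>>   import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
>>>>>>>   import org.apache.mahout.cf.taste.model.DataModel;
>>>>>>>   import org.apache.mahout.cf.taste.recommender.Recommender;
>>>>>>>   import org.apache.mahout.cf.taste.similarity.UserSimilarity;
>>>>>>>
>>>>>>>   // Recommenders are programmed against the DataModel interface,
>>>>>>>   // so swapping the backing store does not touch algorithm code.
>>>>>>>   DataModel model = new FileDataModel(new File("ratings.csv"));
>>>>>>>   UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
>>>>>>>   Recommender rec = new GenericUserBasedRecommender(
>>>>>>>       model, new NearestNUserNeighborhood(10, similarity, model),
>>>>>>>       similarity);
>>>>>>>   System.out.println(rec.recommend(42L, 5));
>>>>>>>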
>>>>>>> Apart from all that, I should say that I am always eager to
>>>>>>> contribute to Mahout, as some of the committers already know.
>>>>>>>
>>>>>>> Best Regards
>>>>>>>
>>>>>>> Gokhan

