mahout-dev mailing list archives

From Ted Dunning
Subject Re: Mahout Suggestions - Refactoring Effort
Date Wed, 27 Mar 2013 04:41:16 GMT
It is critically important.

On Wed, Mar 27, 2013 at 2:14 AM, Marty Kube wrote:

> IMHO usability is really important. I've posted a couple of patches
> recently around making the RF classifiers easier to use.  I found myself
> working on consistent data format and command line option support. It's not
> glamorous but it's important.
> On 3/26/2013 8:26 PM, Ted Dunning wrote:
>> Gokhan,
>> I think that the general drift of your recommendation is an excellent
>> suggestion and it is something that we have wrestled with a lot over time.
>>   The recommendations side of the house has more coherence in this matter
>> than other parts largely because there was a clear flow early on.
>> Now, however, the flow is becoming more clear for non-recommendation parts
>> of the system.
>> - we have 2-3 external kinds of input.  These include text and matrices.
>>   Text comes in two major forms, those being text in files with unspecified
>>   separators and text in Lucene/Solr indexes.  Matrices come in several
>>   forms including triples, CSV files, binary matrices and sequence files of
>>   vectors.
>> - there are currently only a few ways to convert text and external data to
>>   matrices.  The two most prominent are dictionary based and hashed
>>   encoding.  Hashed encoding is currently not as invertible as it should be.
>>   Dictionary based has the virtue of being invertible, but hashed encoding
>>   has considerably more generality.  We have almost no support for multiple
>>   fields in dictionary based encoding.
>> - good conversion backwards and forwards depends on having schema
>>   information that we don't retain or specify well.
>> - knowledge discovery pathways need more flexibility than recommendation
>>   pathways regarding input and visualization.
>> - key knowledge discovery pathways that I know about include (a) input
>>   summarization, (b) vectorization, (c) unsupervised analysis such as LDA,
>>   LLL, clustering, SVD, (d) supervised training such as SGD, Naive Bayes and
>>   random forest, and (e) visualization of results
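The trade-off between dictionary based and hashed encoding described in the list above can be sketched in plain Java. This is a toy illustration under stated assumptions, not Mahout's encoder API: the names `EncodingSketch`, `DictionaryEncoder` and `HashedEncoder` are hypothetical, and the hash used is simply `String.hashCode()`.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy contrast of dictionary based and hashed term encoding (hypothetical names, not Mahout's API). */
class EncodingSketch {

    /** Dictionary encoding: each new term gets the next free index and the mapping is kept. */
    static final class DictionaryEncoder {
        private final Map<String, Integer> indexOf = new HashMap<>();
        private final List<String> termOf = new ArrayList<>();

        int encode(String term) {
            Integer idx = indexOf.get(term);
            if (idx == null) {
                idx = termOf.size();
                indexOf.put(term, idx);
                termOf.add(term);
            }
            return idx;
        }

        /** Invertible: an index leads straight back to its term. */
        String decode(int index) {
            return termOf.get(index);
        }
    }

    /** Hashed encoding: no dictionary and a fixed dimension, but collisions make it lossy. */
    static final class HashedEncoder {
        private final int dimension;

        HashedEncoder(int dimension) {
            this.dimension = dimension;
        }

        /** Mixing the field name into the hash lets several fields share one vector. */
        int encode(String field, String term) {
            return Math.floorMod((field + ":" + term).hashCode(), dimension);
        }
    }

    public static void main(String[] args) {
        DictionaryEncoder dict = new DictionaryEncoder();
        int idx = dict.encode("mahout");
        System.out.println(idx + " decodes to " + dict.decode(idx));

        HashedEncoder hashed = new HashedEncoder(1024);
        System.out.println("title:mahout -> " + hashed.encode("title", "mahout"));
        System.out.println("body:mahout  -> " + hashed.encode("body", "mahout"));
    }
}
```

The dictionary encoder can always map an index back to a term but has to store every term it has seen; the hashed encoder needs no stored state and handles multiple fields naturally, at the cost of having no `decode` at all unless extra bookkeeping is added.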
>> I see that the major problems in Mahout are what Gokhan said, but with a
>> few extras:
>> 1) as Gokhan said, the exploratory pathways are inconsistent
>> 2) I think that our visualization pathways are also hideous
>> 3) I think that we need a good document format with a reasonable schema.
>>   Rather than create such a thing, I would nominate Lucene/Solr indexes as
>>   a first class object in Mahout.
>> 4) our current command lines, with all the (many) different options and
>>   incompatible conventions, are a bit of a shambles
>> Expressed this way, I think that these usability issues are fixable.
>> What does everybody else think?  Would this leave us with a significantly
>> better system?
>> On Tue, Mar 26, 2013 at 9:35 PM, Gokhan Capan wrote:
>>> I am moving my email that I wrote to Call to Action upon request.
>>> I'll start with an example that I experience when I use Mahout, and list
>>> my humble suggestions.
>>> When I try to run Latent Dirichlet Allocation for topic discovery, here
>>> are the steps to follow:
>>> 1- First I use seq2sparse to convert text to vectors. The output is Text,
>>> VectorWritable pairs. (If I have a CSV data file with lines of id, text
>>> pairs, which is an understandable format, I need to develop my own tool
>>> to convert it to vectors.)
>>> 2- I run LDA on the data I transformed, but it doesn’t work, because LDA
>>> needs IntWritable, VectorWritable pairs.
>>> 3- I convert the Text keys to IntWritable ones with a custom tool.
>>> 4- Then I run LDA, and to see the results, I need to run vectordump with
>>> the sort flag (it usually throws OutOfMemoryError). An ldadump tool does
>>> not exist. What I see is fairly different from clusterdump results, so I
>>> spend some time working out what it means. (And I need to know that a
>>> vectordump tool exists to see the results at all.)
>>> 5- After running LDA, when I have a document that I want to assign to a
>>> topic, there is no way, or none that I am aware of, to use my learned LDA
>>> model to assign this document to a topic.
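The re-keying in steps 2 and 3 above, where string keys are replaced by integer keys, can be sketched as follows. This is a minimal stand-in in plain Java, assuming in-memory maps and `double[]` rows in place of Hadoop sequence files and `VectorWritable` values; `RekeySketch`, `Rekeyed` and `rekey` are hypothetical names, not an existing Mahout tool.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy re-keying of string-keyed rows to integer-keyed rows (hypothetical, not a Mahout tool). */
class RekeySketch {

    /** Rows re-keyed by integer id, plus the id -> original key index needed to map results back. */
    static final class Rekeyed {
        final Map<Integer, double[]> rows = new LinkedHashMap<>();
        final List<String> originalKeys = new ArrayList<>();
    }

    /** Assigns ids 0, 1, 2, ... in iteration order and remembers which key each id replaced. */
    static Rekeyed rekey(Map<String, double[]> keyed) {
        Rekeyed out = new Rekeyed();
        for (Map.Entry<String, double[]> entry : keyed.entrySet()) {
            int id = out.originalKeys.size();
            out.originalKeys.add(entry.getKey());
            out.rows.put(id, entry.getValue());
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, double[]> docs = new LinkedHashMap<>();
        docs.put("doc-a", new double[] {1.0, 0.0});
        docs.put("doc-b", new double[] {0.0, 2.0});

        Rekeyed rekeyed = rekey(docs);
        for (Map.Entry<Integer, double[]> entry : rekeyed.rows.entrySet()) {
            System.out.println(entry.getKey() + " was " + rekeyed.originalKeys.get(entry.getKey()));
        }
    }
}
```

Keeping the id-to-key index alongside the re-keyed rows is what lets later output, such as per-document topic weights, be reported against the original document ids.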
>>> I can give further examples, but I believe this will make my point clear.
>>> Would you consider refactoring Mahout, so that the project follows a
>>> clear, layered structure for all algorithms, and documenting it?
>>> IMO the Knowledge Discovery process has a certain path, and Mahout can
>>> define rules that would constrain developers and guide users. For example:
>>>     - All algorithms take Mahout matrices as input and output.
>>>     - All preprocessing tools should be generic enough that they produce
>>>     appropriate input for Mahout algorithms.
>>>     - All algorithms should output a model that users can use beyond
>>>     training and testing
>>>     - Tools that dump results should follow a strictly defined format
>>>     suggested by the community
>>>     - All similar kinds of algorithms should use the same evaluation tools
>>>     - ...
>>> There may be separate layers: preprocessing layer, algorithms layer,
>>> evaluation layer, and so on.
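One way to picture the layering proposed above is a small set of interfaces that every tool would implement. The sketch below is purely illustrative and assumes a `double[][]` as a stand-in for a Mahout matrix; the interface names (`Preprocessor`, `Trainer`, `Model`) and the two toy implementations are hypothetical and do not correspond to existing Mahout classes.

```java
/** Toy layered pipeline: preprocessing -> training -> reusable model (hypothetical names). */
class LayersSketch {

    /** Preprocessing layer: raw matrix in, algorithm-ready matrix out. */
    interface Preprocessor {
        double[][] apply(double[][] raw);
    }

    /** What every algorithm would hand back: something usable beyond training and testing. */
    interface Model {
        double score(double[] row);
    }

    /** Algorithms layer: matrix in, model out. */
    interface Trainer {
        Model train(double[][] matrix);
    }

    /** Example preprocessor: scale each row so that its entries sum to 1. */
    static final Preprocessor ROW_NORMALIZER = raw -> {
        double[][] out = new double[raw.length][];
        for (int i = 0; i < raw.length; i++) {
            double sum = 0;
            for (double v : raw[i]) {
                sum += v;
            }
            out[i] = new double[raw[i].length];
            for (int j = 0; j < raw[i].length; j++) {
                out[i][j] = sum == 0 ? 0 : raw[i][j] / sum;
            }
        }
        return out;
    };

    /** Example trainer: learn the centroid; the model scores a row by squared distance to it. */
    static final Trainer CENTROID_TRAINER = matrix -> {
        int cols = matrix[0].length;
        double[] centroid = new double[cols];
        for (double[] r : matrix) {
            for (int j = 0; j < cols; j++) {
                centroid[j] += r[j] / matrix.length;
            }
        }
        return row -> {
            double dist = 0;
            for (int j = 0; j < cols; j++) {
                double diff = row[j] - centroid[j];
                dist += diff * diff;
            }
            return dist;
        };
    };

    /** Any preprocessor can be combined with any trainer because both speak "matrix". */
    static Model fit(Preprocessor preprocessor, Trainer trainer, double[][] raw) {
        return trainer.train(preprocessor.apply(raw));
    }

    public static void main(String[] args) {
        Model model = fit(ROW_NORMALIZER, CENTROID_TRAINER, new double[][] {{1, 1}, {2, 2}});
        System.out.println(model.score(new double[] {1, 0}));
    }
}
```

Because each layer only depends on the shared matrix type, one step can be swapped for an alternative without touching the others, which is the property the proposal is after.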
>>> This way users would be aware of the steps they need to perform, and one
>>> step can be replaced by an alternative.
>>> Developers would contribute to the layer they feel comfortable with, and
>>> would satisfy the expected input and output, to preserve the integrity.
>>> Mahout has tools for nearly all of these layers, but personally when I
>>> use Mahout (and I’ve been using it for a long time), I feel lost in the
>>> steps I should follow.
>>> Moreover, the refactoring may eliminate duplicate data structures, and
>>> stick to Mahout matrices if available. All similarity measures operate on
>>> Mahout Vectors, for example.
>>> We, in the lab and in our company, do some of that. An example:
>>> We implemented an HBase backed Mahout Matrix, which we use for our
>>> projects where online learning algorithms operate on large input and
>>> learn a big parameter matrix (one needs this for matrix factorization
>>> based recommenders). Then the persistent parameter matrix becomes an
>>> input for the live system. Then we used the same matrix implementation as
>>> the underlying data store of Recommender DataModels. This was advantageous
>>> in many ways:
>>>     - Everyone knows that any dataset should be in Mahout matrix format,
>>>     and applies an appropriate preprocessing tool, or writes one
>>>     - We can use different recommenders interchangeably
>>>     - Any optimization on matrix operations applies everywhere
>>>     - Different people can work on different parts (evaluation, model
>>>     optimization, recommender algorithms) without bothering others
>>> Apart from all this, I should say that I am always eager to contribute to
>>> Mahout, as some of the committers already know.
>>> Best Regards
>>> Gokhan
