mahout-user mailing list archives

From "Razon, Oren" <oren.ra...@intel.com>
Subject RE: Mahout beginner questions...
Date Sun, 25 Mar 2012 13:04:45 GMT
Thanks for the detailed answer Sean.
I want to understand the non-distributed code limitations more clearly.
I saw that you advise that beyond roughly 100,000,000 ratings the non-distributed engine won't
do the job. The question is why? Is it a memory issue (in which case a bigger machine would
let me scale up), or is it the time it takes to compute recommendations?

Thanks,
Oren

-----Original Message-----
From: Sean Owen [mailto:srowen@gmail.com] 
Sent: Thursday, March 22, 2012 17:57
To: user@mahout.apache.org
Subject: Re: Mahout beginner questions...

A distributed and non-distributed recommender are really quite
separate. They perform the same task in quite different ways. I don't
think you would mix them per se.

Depends on what you mean by a model-based recommender... I would call
the matrix-factorization-based and clustering-based approaches
"model-based" in the sense that they assume the existence of some
underlying structure and discover it. There are no Bayesian-style
approaches in the code.
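
As a minimal sketch of the matrix-factorization flavor, using the non-distributed
Taste API: this assumes a Mahout 0.x classpath, and the file name and ALS
hyperparameters below are only placeholders.

import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.svd.ALSWRFactorizer;
import org.apache.mahout.cf.taste.impl.recommender.svd.SVDRecommender;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;

public class SvdExample {
  public static void main(String[] args) throws Exception {
    // userID,itemID,rating file; name is illustrative
    DataModel model = new FileDataModel(new File("ratings.csv"));
    // Factorize into 20 latent features with ALS; lambda and iteration count are placeholders
    ALSWRFactorizer factorizer = new ALSWRFactorizer(model, 20, 0.05, 10);
    SVDRecommender recommender = new SVDRecommender(model, factorizer);
    // Top-5 recommendations for user 123
    List<RecommendedItem> top = recommender.recommend(123L, 5);
    for (RecommendedItem item : top) {
      System.out.println(item.getItemID() + " " + item.getValue());
    }
  }
}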

They scale in different ways; I am not sure they are unilaterally a
solution to scale, no. I do agree in general that these have good
scaling properties for real-world use cases, like the
matrix-factorization approaches.


A "real" scalable architecture would have a real-time component and a
big distributed computation component. Mahout has elements of both and
can be the basis for piecing that together, but it's not a question of
strapping together the distributed and non-distributed implementation.
It's a bit harder than that.


I am actually quite close to being ready to show off something in this
area -- I have been working separately on a more complete rec system
that has both the real-time element but integrated directly with a
distributed element to handle the large-scale computation. I think
this is typical of big data architectures. You have (at least) a
real-time distributed "Serving Layer" and a big distributed batch
"Computation Layer". More on this in about... 2 weeks.


On Thu, Mar 22, 2012 at 3:16 PM, Razon, Oren <oren.razon@intel.com> wrote:
> Hi Sean,
> Thanks for your fast response. I really appreciate the quality of your book ("Mahout
> in Action") and the support you give in forums like this.
> Just to clarify my second question...
> I want to build a recommender framework that will support different use cases, so my
> intention is to have both a distributed and a non-distributed solution in one framework. The
> question is: is it a good design to put them both on the same machine (one of the machines
> in the Hadoop cluster)?
>
> BTW... another question: it seems that a good solution to recommender scalability
> would be to use model-based recommenders.
> That said, I wonder why there are so few model-based recommenders, especially considering
> that Mahout already contains several implemented data mining models?
>
>
> -----Original Message-----
> From: Sean Owen [mailto:srowen@gmail.com]
> Sent: Thursday, March 22, 2012 13:51
> To: user@mahout.apache.org
> Subject: Re: Mahout beginner questions...
>
> 1. These are the JDBC-related classes. For example see
> MySQLJDBCDiffStorage or MySQLJDBCDataModel in integration/
>
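A minimal sketch of wiring a MySQL preferences table into the non-distributed code
through MySQLJDBCDataModel from integration/; the table and column names here are
placeholders, and the DataSource comes from MySQL Connector/J.

import javax.sql.DataSource;
import com.mysql.jdbc.jdbc2.optional.MysqlDataSource;  // MySQL Connector/J
import org.apache.mahout.cf.taste.impl.model.jdbc.MySQLJDBCDataModel;
import org.apache.mahout.cf.taste.model.DataModel;

public class JdbcModelExample {
  public static void main(String[] args) throws Exception {
    MysqlDataSource dataSource = new MysqlDataSource();
    dataSource.setServerName("localhost");   // connection details are illustrative
    dataSource.setDatabaseName("recsys");
    dataSource.setUser("mahout");
    dataSource.setPassword("secret");
    // preference table, then user ID / item ID / preference / timestamp columns
    DataModel model = new MySQLJDBCDataModel(
        dataSource, "taste_preferences", "user_id", "item_id", "preference", "timestamp");
    System.out.println("users: " + model.getNumUsers());
  }
}
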
> 2. The distributed and non-distributed code are quite separate. At
> this scale I don't think you can use the non-distributed code to a
> meaningful degree. For example you could pre-compute item-item
> similarities over this data and use a non-distributed item-based
> recommender but you probably have enough items that this will strain
> memory. You would probably be looking at pre-computing recommendations
> in batch.
>
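A minimal sketch of that pattern: item-item similarities computed elsewhere in batch
(for example, output of the Hadoop ItemSimilarityJob), loaded into a GenericItemSimilarity
and served by a non-distributed item-based recommender. The hard-coded pairs and file
name are placeholders; in practice the pairs would be parsed from the batch job's output.

import java.io.File;
import java.util.ArrayList;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.GenericItemSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;

public class PrecomputedSimilarityExample {
  public static void main(String[] args) throws Exception {
    DataModel model = new FileDataModel(new File("ratings.csv"));
    // These pairs stand in for precomputed (itemA, itemB, similarity) output
    List<GenericItemSimilarity.ItemItemSimilarity> pairs =
        new ArrayList<GenericItemSimilarity.ItemItemSimilarity>();
    pairs.add(new GenericItemSimilarity.ItemItemSimilarity(101L, 102L, 0.8));
    pairs.add(new GenericItemSimilarity.ItemItemSimilarity(101L, 103L, 0.3));
    GenericItemSimilarity similarity = new GenericItemSimilarity(pairs);
    GenericItemBasedRecommender recommender =
        new GenericItemBasedRecommender(model, similarity);
    System.out.println(recommender.recommend(1L, 5));
  }
}
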
> 3. I don't think Netezza will help much here. It's still not fast
> enough at this scale to use with a real-time recommender (nothing is).
> If it's just a place you store data to feed into Hadoop it's not
> adding value. All the JDBC-related integrations ultimately load data
> into memory and that's out of the question with 500M data points.
>
> I'd also suggest you have a think about whether you "really" have 500M
> data points. Often you can know that most of the data is noise or not
> useful, and can get useful recommendations on a fraction of the data
> (maybe 5M). That makes a lot of things easier.
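
A rough sketch of that kind of trimming, in plain Java rather than any Mahout API:
keep only users with some minimum number of ratings before building a DataModel.
The threshold and file names are placeholders.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.PrintWriter;
import java.util.HashMap;
import java.util.Map;

public class TrimRatings {
  public static void main(String[] args) throws Exception {
    int minPrefs = 5;  // arbitrary example threshold
    // First pass: count ratings per user in a userID,itemID,rating file
    Map<String, Integer> counts = new HashMap<String, Integer>();
    BufferedReader in = new BufferedReader(new FileReader("ratings.csv"));
    String line;
    while ((line = in.readLine()) != null) {
      String user = line.substring(0, line.indexOf(','));
      Integer c = counts.get(user);
      counts.put(user, c == null ? 1 : c + 1);
    }
    in.close();
    // Second pass: keep only lines from users with at least minPrefs ratings
    in = new BufferedReader(new FileReader("ratings.csv"));
    PrintWriter out = new PrintWriter(new FileWriter("ratings-trimmed.csv"));
    while ((line = in.readLine()) != null) {
      String user = line.substring(0, line.indexOf(','));
      if (counts.get(user) >= minPrefs) {
        out.println(line);
      }
    }
    out.close();
    in.close();
  }
}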
>
> On Thu, Mar 22, 2012 at 11:35 AM, Razon, Oren <oren.razon@intel.com> wrote:
>> Hi,
>> As a data mining developer who needs to build a recommender engine POC (proof of concept)
>> to support several future use cases, I've found the Mahout framework an appealing place to
>> start. But as I'm new to Mahout and Hadoop in general, I have a couple of questions...
>>
>> 1.      In "Mahout in Action", under section 3.2.5 (Database-based data), it says:
>> "...Several classes in Mahout's recommender implementation will attempt to push computations
>> into the database for performance...". I've looked in the documentation and inside the code
>> itself, but couldn't find any reference to which calculations are pushed into
>> the DB. Could you please explain what can be done inside the DB?
>> 2.      My future use will include use cases with small-to-medium data volumes (where
>> I guess the non-distributed algorithms will do the job), but also use cases that involve huge
>> amounts of data (over 500,000,000 ratings). From my understanding, this is where the
>> distributed code should come in handy. My question here is: since I will need to use both
>> distributed and non-distributed code, how can I build a good design?
>>      Should I build two separate solutions on different machines? Could I do
>> part of the job distributed (for example, the similarity calculation) and have the output
>> used by the non-distributed code? Is that a BKM? Also, if I deploy the entire Mahout code
>> on a Hadoop environment, what does that mean for the non-distributed code? Will it all run
>> as a separate Java process on the name node?
>> 3.      As of now, besides the Hadoop cluster we are building, we have some
>> strong SQL machines (a Netezza appliance) that can handle big (structured) data and offer
>> good integration with third-party analytics providers and Java development, but don't
>> include such a rich recommender framework as Mahout. I'm trying to understand how I could
>> utilize both solutions (Netezza & Mahout) to handle big data recommender system use cases.
>> I thought maybe to move the data into Netezza, do all the data manipulation and
>> transformation there, and in the end prepare a file that contains the classic data model
>> structure needed by Mahout. But can you think of a better solution / architecture? Maybe
>> keeping the data only inside Netezza and extracting it into Mahout over JDBC when needed?
>> I will be glad to hear your ideas :)
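
For reference, the "classic" structure the non-distributed code expects is a plain
userID,itemID,rating file loaded through FileDataModel; a minimal sketch, with the
exported file name purely illustrative:

import java.io.File;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.model.DataModel;

public class ExportedFileExample {
  public static void main(String[] args) throws Exception {
    // Each line: userID,itemID,rating  e.g.  12345,678,4.5
    DataModel model = new FileDataModel(new File("netezza_export.csv"));
    System.out.println("loaded " + model.getNumUsers() + " users, "
        + model.getNumItems() + " items");
  }
}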
>>
>> Thanks,
>> Oren