mahout-user mailing list archives

From Chris Bates <>
Subject Re: Question about data warehousing and mining through Mahout
Date Tue, 31 Aug 2010 22:38:02 GMT
From my experience (merging machine learning with business goals), I'll
offer a few pieces of advice that may help guide you.

1.  First determine what data you have (and how much of it), and how you
want to store and query it.
-  If you have 1.5 TB of log data, you are in the realm of Hadoop.  If you
find, however, that you only need to operate on a subset of this data
(~100 MB), you may just want to load it into memory and use something like
Octave, R, MATLAB, or Python to run algorithms against it.  That is probably
the easiest route.  In fact, I'd say do that first before you go whole-hog
on a distributed system.
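
A minimal sketch of the in-memory route, assuming the subset is a
tab-separated log with a header row. The column names (user_id, region,
device) and the sample rows are hypothetical, not taken from the original
data:

```python
import csv
import io
from collections import Counter


def device_counts(lines, delimiter="\t"):
    """Count rows per device from delimited text lines that start with a header."""
    reader = csv.DictReader(lines, delimiter=delimiter)
    return Counter(row["device"] for row in reader)


# Hypothetical sample standing in for a small extract of the real logs.
sample = io.StringIO(
    "user_id\tregion\tdevice\n"
    "u1\tEMEA\tphone\n"
    "u2\tAPAC\ttablet\n"
    "u1\tEMEA\tphone\n"
)
counts = device_counts(sample)
```

For a real extract, an open file object can be passed in place of the
StringIO sample, since csv.DictReader accepts any iterable of lines.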

2.  Second, come up with questions about your data that you want to answer
(or have someone give those questions to you).  Make those questions as
specific as possible.
- The type of question will tell you what tool you need to use. Sometimes
this means querying with Hive (e.g. How many unique users viewed this type
of page?) if the data is too large or too sparse to put into MySQL.
Sometimes this means just writing a Python/Ruby script with a few regexes
and hunting through the data.  If the questions are predictive in nature,
you may need to use some machine learning tools.
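
As a sketch of the "script with a few regexes" option, the snippet below
answers the unique-users question for one page type. The log line layout
(timestamp, user=..., GET path) and the /products/ prefix are assumptions
made for illustration only:

```python
import re

# Assumed line layout: "<timestamp> user=<id> GET <path>"
LINE_RE = re.compile(r"user=(?P<user>\S+)\s+GET\s+(?P<path>\S+)")


def unique_viewers(lines, path_prefix):
    """Return the set of users who requested any path starting with path_prefix."""
    users = set()
    for line in lines:
        match = LINE_RE.search(line)
        # Lines that do not match the assumed layout are skipped.
        if match and match.group("path").startswith(path_prefix):
            users.add(match.group("user"))
    return users


sample_log = [
    "2010-08-31T22:38:02 user=u1 GET /products/42",
    "2010-08-31T22:38:05 user=u2 GET /help/faq",
    "2010-08-31T22:39:10 user=u1 GET /products/7",
    "2010-08-31T22:40:00 user=u3 GET /products/42",
    "malformed line",
]
viewers = unique_viewers(sample_log, "/products/")
```

Because the function only iterates over lines, it can be fed an open file
directly, so the whole log never has to sit in memory at once.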

3.  Simple techniques often will get you 80% of the way to your goal.
 Machine Learning gets you the other 20% (or sometimes only 5%!).
- I would say to use machine learning only once you know the domain of the
problem you're trying to solve extremely well, because it will take effort
and you should be immediately skeptical of any result you get back.  It's a
black box whose inner workings you should really know, so my advice is to
exhaust all non-machine-learning options first, then go for that extra
accuracy if it's warranted.
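
One way to apply that skepticism is to measure a trivial baseline before
trusting any model. The sketch below is an illustration, not a method from
this thread: it predicts each user's latest device as the most frequent
device in their earlier sessions. The data layout and names are
hypothetical:

```python
from collections import Counter


def most_common_device(history):
    """Return the device that appears most often in a non-empty history."""
    return Counter(history).most_common(1)[0][0]


def baseline_accuracy(sessions):
    """Fraction of users whose last device matches their earlier most common one.

    sessions maps a user id to that user's devices in time order.
    Users with fewer than two sessions are skipped.
    """
    hits = 0
    total = 0
    for devices in sessions.values():
        if len(devices) < 2:
            continue
        prediction = most_common_device(devices[:-1])
        hits += prediction == devices[-1]
        total += 1
    return hits / total if total else 0.0


# Hypothetical sessions: u1 and u4 are hits, u2 is a miss, u3 is skipped.
sessions = {
    "u1": ["phone", "phone", "tablet", "phone"],
    "u2": ["tablet", "phone"],
    "u3": ["desktop"],
    "u4": ["phone", "phone"],
}
accuracy = baseline_accuracy(sessions)
```

If a learned model cannot clearly beat a number like this on held-out
data, the extra complexity is probably not paying for itself.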

Good luck!

On Tue, Aug 31, 2010 at 6:03 PM, Sean Owen <> wrote:

> On Tue, Aug 31, 2010 at 10:55 PM, hdev ml <> wrote:
> > Per my understanding of hive, we can do some statistical reporting, like
> > frequency of user sessions, which geographical region, which device he is
> > using the most etc.
> Yes that's about what Hive is good for, if you're looking for some
> open-source libraries along those lines.
> >
> > But we also want to mine this data to get some predictive capabilities
> > like what is the likelihood that the user will use the same device again
> > or if we get sales/marketing data (on the roadmap for future), we want to
> > possibly predict which region to put more marketing/sales efforts. What
> > is the pattern for growth of user base, in which geographical regions
> > etc. What is the pattern of user requests failing and a number of
> > requirements like these from the business.
> This is pretty broad but I can try to give you the names of problems
> this sounds like, to guide your search.
> Predicting user usage of device sounds like a classification problem,
> like developing a probabilistic model of behavior.
> Deciding where to put marketing dollars sounds like a business
> problem, not machine learning. I don't think a computer can tell you
> that. Some techniques might help you identify trends in sales, but
> this is simple regression, not really machine learning.
> Looking for patterns in failure sounds a bit like frequent pattern
> mining -- trying to find events that go together unusually often.
