mahout-user mailing list archives

From hdev ml <>
Subject Re: Question about data warehousing and mining through Mahout
Date Tue, 31 Aug 2010 23:03:26 GMT
Thanks Chris for the answers.

1. The data is just going to grow. This 1.5TB is from just one module; other
modules may have similar volumes of log data. Even after discarding unneeded
data, the raw 3.0TB only comes down to 1.5TB, which is still huge. And since
that is just one month's data, and we would want to make use of at least the
past 6-12 months, the total size goes into the 10TB-20TB area. So I am
guessing Hadoop is the right answer. I am just not sure which sub-project to
use in this case.

2. When you say querying with Hive, note that I want to use the same Hive
data for future data mining, so my question was: can that be done with
Mahout integrating with the Hive layer, instead of with Hadoop directly? Or
maybe we can use the Hive data files directly, if at all I can reverse
engineer the data format that Hive uses internally. Hopefully it is not
compressed.
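On the "reverse engineer the data format" point: for tables stored in Hive's
default TEXTFILE format, the files under the warehouse directory are plain
text with fields separated by the Ctrl-A character ('\x01'), so they can be
parsed directly. A minimal sketch (the column names here are invented for
illustration):

```python
def parse_hive_text(lines, columns):
    """Parse rows from a Hive default-TEXTFILE table: one row per line,
    fields separated by Ctrl-A ('\x01')."""
    for line in lines:
        fields = line.rstrip("\n").split("\x01")
        yield dict(zip(columns, fields))

# Stand-in for two lines read from a file under the Hive warehouse directory:
sample = ["u1\x01mobile\x01us\n", "u2\x01desktop\x01eu\n"]
rows = list(parse_hive_text(sample, ["user", "device", "region"]))
```

The same parsed rows could then be written out in whatever input format a
downstream mining job expects. Note that this only holds for uncompressed
text tables; SequenceFile or compressed storage would need different handling.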

3. Hmm... that seems like a very good suggestion. I am not averse to the
idea of writing my own implementations of mining algorithms; I am just
worried about their accuracy and stability. So in summary: do the
transformation and statistical parts first, and when it comes to data
mining, write my own algorithms or use Mahout (if at all Hive integration is
possible, or maybe reuse the raw text files or an output dump of the Hive
tables).
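As a sense of what "write your own" can look like before reaching for
Mahout: the device-reuse question mentioned later in this thread can start
as a plain frequency estimate. A deliberately simple baseline sketch, with
made-up event data:

```python
from collections import Counter, defaultdict

def device_reuse_probability(events):
    """events: iterable of (user, device) session records.
    Returns, per user, the empirical probability that a session uses that
    user's most frequent device -- a frequency baseline, not a trained model."""
    per_user = defaultdict(Counter)
    for user, device in events:
        per_user[user][device] += 1
    return {
        user: counts.most_common(1)[0][1] / sum(counts.values())
        for user, counts in per_user.items()
    }

events = [("u1", "mobile"), ("u1", "mobile"), ("u1", "desktop"), ("u2", "tablet")]
probs = device_reuse_probability(events)
```

If such a baseline already answers the business question, the heavier
machinery may not be needed at all.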

On Tue, Aug 31, 2010 at 3:38 PM, Chris Bates <> wrote:

> From my experience (merging machine learning with business goals), I'll
> offer a few pieces of advice that may help guide you.
> 1.  First determine what data you have (and how much of it), and how you
> want to store/query it.
> -  If you have 1.5 TB of log data, you are in the realm of Hadoop.  If you
> find, however, that you only need to operate on a subset of this data
> (~100 MB), you may just want to stick with loading it up in memory and
> using something like Octave, R, Matlab, or Python to run algorithms
> against it.  That's probably the easiest.  In fact, I'd say do that first
> before you go whole-hog on the distributed system.
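The in-memory route above can be as small as a script with a regex and a
couple of counters. A sketch, with a toy stand-in for a log slice that fits
in memory (the log layout and field names are invented):

```python
import re
from collections import Counter

# Toy stand-in for a small log subset loaded in memory:
# "<date> <user> <device> <page> <status>"
log_lines = [
    "2010-08-01 u1 mobile /home 200",
    "2010-08-01 u2 desktop /search 200",
    "2010-08-02 u1 mobile /home 500",
]

pattern = re.compile(r"^(\S+) (\S+) (\S+) (\S+) (\d+)$")
device_counts = Counter()
status_counts = Counter()
for line in log_lines:
    m = pattern.match(line)
    if m:
        _, user, device, page, status = m.groups()
        device_counts[device] += 1
        status_counts[status] += 1
```

Summaries like these often answer the first round of questions without any
distributed infrastructure.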
> 2.  Second, come up with questions about your data that you want to answer
> (or have someone give those questions to you).  Make those questions as
> specific as possible.
> - The type of question will tell you what tool you need to use.  Sometimes
> this means querying with Hive (i.e. how many unique users viewed this type
> of page?) if the data is too much/too sparse to put into MySQL.  Sometimes
> this means just writing a Python/Ruby script with a few regexes and
> hunting through the data.  If the questions are predictive in nature, you
> may need to use some machine learning tools.
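The "script with a few regexes" option for a question like "how many unique
users viewed this type of page?" can be sketched in a few lines (the log
format and URL pattern below are hypothetical):

```python
import re

log_lines = [
    "u1 GET /product/42",
    "u2 GET /product/7",
    "u1 GET /product/42",
    "u3 GET /home",
]

# Count unique users who viewed product pages; the regex captures the user
# id from each matching request line.
product_re = re.compile(r"^(\S+) GET /product/\d+")
unique_users = set()
for line in log_lines:
    m = product_re.match(line)
    if m:
        unique_users.add(m.group(1))
```

The equivalent Hive query would be a `SELECT COUNT(DISTINCT user)` with a
`WHERE` clause on the page pattern; the script route just skips the cluster.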
> 3.  Simple techniques often will get you 80% of the way to your goal.
>  Machine Learning gets you the other 20% (or sometimes only 5%!).
> - I would say to use machine learning only once you know the domain of the
> problem you're trying to solve extremely well, because it will take
> effort, and you should be immediately skeptical of any result you get
> back.  It's a black box whose inner workings you should really understand,
> so my advice is to exhaust all non-machine-learning options first, then go
> for that extra accuracy if it's warranted.
> Good luck!
> On Tue, Aug 31, 2010 at 6:03 PM, Sean Owen <> wrote:
> > On Tue, Aug 31, 2010 at 10:55 PM, hdev ml <> wrote:
> > > Per my understanding of Hive, we can do some statistical reporting,
> > > like frequency of user sessions, which geographical region, which
> > > device he is using the most, etc.
> >
> > Yes that's about what Hive is good for, if you're looking for some
> > open-source libraries along those lines.
> >
> > >
> > > But we also want to mine this data to get some predictive
> > > capabilities, like what is the likelihood that the user will use the
> > > same device again, or, if we get sales/marketing data (on the roadmap
> > > for the future), we want to possibly predict which region to put more
> > > marketing/sales effort into.  What is the pattern for growth of the
> > > user base, in which geographical regions, etc.  What is the pattern of
> > > user requests failing, and a number of requirements like these from
> > > the business.
> >
> > This is pretty broad but I can try to give you the names of problems
> > this sounds like, to guide your search.
> >
> > Predicting user usage of device sounds like a classification problem,
> > like developing a probabilistic model of behavior.
> >
> > Deciding where to put marketing dollars sounds like a business
> > problem, not machine learning. I don't think a computer can tell you
> > that. Some techniques might help you identify trends in sales, but
> > this is simple regression, not really machine learning.
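The "simple regression" mentioned above can be a hand-rolled ordinary
least-squares trend line; no ML library is needed. A sketch with made-up
monthly sales figures:

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept

months = [1, 2, 3, 4]
sales = [10.0, 12.0, 14.0, 16.0]  # invented numbers for illustration
slope, intercept = fit_line(months, sales)
```

The fitted slope is the month-over-month sales trend; extrapolating it gives
a crude forecast that can be sanity-checked before anything fancier.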
> >
> > Looking for patterns in failure sounds a bit like frequent pattern
> > mining -- trying to find events that go together unusually often.
> >
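The frequent-pattern-mining idea at the end of the thread, finding events
that go together unusually often in failed requests, can be prototyped with
plain pair counting before moving to Mahout's FP-Growth implementation. The
event names below are invented:

```python
from collections import Counter
from itertools import combinations

# Each set holds the events observed on one failed request.
failures = [
    {"timeout", "region_eu", "mobile"},
    {"timeout", "region_eu", "desktop"},
    {"disk_full", "region_us"},
    {"timeout", "region_eu"},
]

# Count how often each pair of events co-occurs in the same failure.
pair_counts = Counter()
for events in failures:
    for pair in combinations(sorted(events), 2):
        pair_counts[pair] += 1

# Keep pairs that meet a minimum support threshold (here, 3 failures).
frequent = {pair for pair, n in pair_counts.items() if n >= 3}
```

This brute-force pass is quadratic in events per record, which is fine for a
prototype; FP-Growth-style algorithms exist precisely to scale this up.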
