mahout-user mailing list archives

From Sean Owen <>
Subject Re: Question about data warehousing and mining through Mahout
Date Wed, 01 Sep 2010 07:58:54 GMT
Hive does something fairly unrelated to Mahout. It's an indexing and
query system. Both might start from the same source data, but they do
different things with it. There is no common format. Mahout generally
operates on text files or "Vectors" in SequenceFiles, so there's some
translation needed at least.
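
As a rough illustration of that translation step, here is a minimal Python sketch that turns a Hive text export into comma-separated lines. It assumes the table was stored with Hive's default text settings (fields separated by Ctrl-A, NULL written as \N). The function name, the sample rows, and the column choice are hypothetical, and whether plain CSV is the right input depends on which Mahout job you run.

```python
import csv
import io

HIVE_FIELD_DELIMITER = "\x01"  # Hive's default field separator for text tables (Ctrl-A)
HIVE_NULL = "\\N"              # Hive's default textual representation of NULL


def hive_rows_to_csv(lines, columns):
    """Convert Hive-delimited text lines to CSV, keeping only the given column indexes."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for line in lines:
        fields = line.rstrip("\n").split(HIVE_FIELD_DELIMITER)
        selected = [fields[i] for i in columns]
        if HIVE_NULL in selected:
            continue  # skip rows with a missing value in a selected column
        writer.writerow(selected)
    return out.getvalue()


# Hypothetical rows: user, device, region, session count
sample = [
    "u1\x01tablet\x01EU\x013\n",
    "u2\x01\\N\x01US\x011\n",
    "u3\x01phone\x01US\x015\n",
]
print(hive_rows_to_csv(sample, columns=[0, 1, 3]))
```

Anything that needs Vectors in SequenceFiles would still require a further conversion on the Hadoop side; this only covers the plain-text case.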

But I think a message here is that there's more preparation and
thought necessary to start data mining. It's not like you point a data
mining tool at some data and answers start flowing automatically.
You'd have to be deliberately extracting and preparing data anyhow.

On Tue, Aug 31, 2010 at 11:41 PM, hdev ml <> wrote:
> Thanks Sean for the answers. Thanks to Ted for validation.
> Now my question is, since I also want to do reporting on large data
> (data warehousing), let's assume I choose Hive for that.
> Now can Mahout integrate with Hive to make use of this data for learning,
> mining etc.? Or do I have to export the Hive data into text files
> hosted on Hadoop/HDFS, which Mahout can later use for data mining?
> In short, can the data warehousing part be done by Hive and the data mining
> part then be done by Mahout on this Hive data?
> -H
> On Tue, Aug 31, 2010 at 3:03 PM, Sean Owen <> wrote:
>> On Tue, Aug 31, 2010 at 10:55 PM, hdev ml <> wrote:
>> > Per my understanding of hive, we can do some statistical reporting, like
>> > frequency of user sessions, which geographical region, which device he is
>> > using the most etc.
>> Yes that's about what Hive is good for, if you're looking for some
>> open-source libraries along those lines.
>> >
>> > But we also want to mine this data to get some predictive capabilities like
>> > what is the likelihood that the user will use the same device again or if we
>> > get sales/marketing data (on the roadmap for future), we want to possibly
>> > predict which region to put more marketing/sales efforts. What is the
>> > pattern for growth of user base, in which geographical regions etc. What is
>> > the pattern of user requests failing and a number of requirements like these
>> > from the business.
>> This is pretty broad but I can try to give you the names of problems
>> this sounds like, to guide your search.
>> Predicting user usage of device sounds like a classification problem,
>> like developing a probabilistic model of behavior.
>> Deciding where to put marketing dollars sounds like a business
>> problem, not machine learning. I don't think a computer can tell you
>> that. Some techniques might help you identify trends in sales, but
>> this is simple regression, not really machine learning.
>> Looking for patterns in failure sounds a bit like frequent pattern
>> mining -- trying to find events that go together unusually often.
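
The idea of "events that go together" can be sketched in a few lines of Python as pair counting with a minimum support threshold. This is a toy illustration, not Mahout's implementation: the function name, the event labels, and the threshold are all hypothetical, and it only counts raw co-occurrence without judging whether a pair is more frequent than chance would predict.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(transactions, min_support):
    """Return pairs of events that appear together in at least min_support transactions."""
    counts = Counter()
    for events in transactions:
        # Deduplicate and sort so each pair is counted once per transaction
        # and always appears in the same order.
        for pair in combinations(sorted(set(events)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}


# Hypothetical failed-request records, each described by a few attributes
failures = [
    ["timeout", "device_a", "region_eu"],
    ["timeout", "device_a", "region_us"],
    ["auth_error", "device_b", "region_eu"],
    ["timeout", "device_a", "region_eu"],
]
print(frequent_pairs(failures, min_support=3))
```

On real log volumes the same counting would have to be distributed, which is where a Hadoop-based tool comes in.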
