hadoop-hive-dev mailing list archives

From "Zheng Shao (JIRA)" <j...@apache.org>
Subject [jira] Updated: (HIVE-672) Integrate weka with Hive
Date Tue, 28 Jul 2009 06:06:15 GMT

     [ https://issues.apache.org/jira/browse/HIVE-672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel

Zheng Shao updated HIVE-672:

    Attachment: HIVE-672.1.not.to.be.included.patch


This patch successfully integrates Weka LogisticRegression with Hive. It contains an example
query that trains a model and uses the model to predict.
It does not yet support classifier options or model evaluation such as cross-validation or ROC.

While implementing this, I found several problems:
1. GenericUDAF/GenericUDF are not easy to use (although they have superior performance). I
don't think we should ask our users to implement GenericUDAF/GenericUDF just because they
need variable-length arguments. We should be able to pass Java primitive objects to a UDF
like Object evaluate(Object[] parameters). This is not efficient, but that is OK for machine learning/data
mining work, since the learning process takes much longer. (HIVE-699)
2. There is no way to "create temporary function" for a GenericUDAF. (HIVE-698)
3. There is a bug in the GroupByOperator initialization order. (HIVE-697)

I will work on these 3 items first.
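
For item 1, here is a minimal sketch of what a function with the proposed Object evaluate(Object[] parameters) signature might look like. The class name VarargsUdfSketch and the summing behavior are hypothetical, chosen only for illustration, and the class deliberately uses no Hive API. It shows where the inefficiency comes from: every argument has to be type-checked and unboxed at run time on every row.

```java
// Hypothetical illustration only: this class does not extend or call any Hive API.
final class VarargsUdfSketch {

    // Mirrors the proposed signature: each argument arrives as a plain Java object.
    // This example sums the numeric arguments and skips nulls.
    static Object evaluate(Object[] parameters) {
        double sum = 0.0;
        for (Object p : parameters) {
            if (p == null) {
                continue; // treat a SQL NULL as a missing value
            }
            if (!(p instanceof Number)) {
                throw new IllegalArgumentException(
                        "expected a number, got " + p.getClass().getName());
            }
            // Per-row runtime type check and unboxing: the cost mentioned in item 1.
            sum += ((Number) p).doubleValue();
        }
        return Double.valueOf(sum);
    }

    public static void main(String[] args) {
        // Any number of arguments of mixed numeric types can be passed.
        System.out.println(evaluate(new Object[] {1, 2.5, null, 4L}));
    }
}
```

The same method accepts any number of arguments, which is the convenience being asked for; the price is the per-call inspection that GenericUDF avoids.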

> Integrate weka with Hive
> ------------------------
>                 Key: HIVE-672
>                 URL: https://issues.apache.org/jira/browse/HIVE-672
>             Project: Hadoop Hive
>          Issue Type: New Feature
>            Reporter: Zheng Shao
>         Attachments: HIVE-672.1.not.to.be.included.patch
> Weka is one of the most popular data mining packages on the planet. It's used by numerous
people around the world. Since Weka is in Java, it should be pretty straightforward to integrate
Weka with Hive.
> We just need to create some GenericUDAF functions that map to the Weka classifier training
process. The output of the GenericUDAF can just be the serialized version of the trained classifiers.
> We should add another GenericUDF to load the classifier to classify new instances.
> The Hive syntax can be as simple as this: (Note: In the example below, most of the "table."
prefixes can be omitted. I put them there just for easier understanding of the query semantics.)
> The query builds a model (logistic regression) for predicting the CTR of each link on
each page, based on user information, and evaluates the model on some data.
> {code}
> -- Train one logistic regression model per (pageid, linkid).
> SELECT logdata.pageid, logdata.linkid,
>        LogisticRegression(logdata.clicked, userinfo.age, userinfo.gender,
>                           userinfo.country, userinfo.interests) AS model
> FROM logdata JOIN userinfo
> ON logdata.userid = userinfo.userid
> GROUP BY logdata.pageid, logdata.linkid;
> -- Apply the models. This assumes the result of the first query is available as the table "classifiers".
> SELECT logdata.pageid, logdata.linkid, logdata.clicked,
>        LogisticRegressionEvaluate(classifiers.model, userinfo.age, userinfo.gender,
>                                   userinfo.country, userinfo.interests) AS predicted
> FROM logdata JOIN userinfo
> ON logdata.userid = userinfo.userid
> JOIN classifiers
> ON logdata.pageid = classifiers.pageid AND logdata.linkid = classifiers.linkid;
> {code}
> References:
> Use Weka in your Java Code: http://weka.wiki.sourceforge.net/Use+Weka+in+your+Java+code
> Note:
> Weka is licensed under the GPL. We won't be able to include the code directly into Hive, but
we can keep the discussions here.
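
The train-then-load flow in the issue description (an aggregate function emits a serialized classifier, and a second function loads it to classify new rows) can be sketched with standard Java serialization. This is only an illustration under stated assumptions: TinyModel is a hypothetical stand-in for a trained classifier, not a Weka class, ModelRoundTripSketch is a made-up name, and the sketch assumes the real classifier object implements java.io.Serializable. It is not taken from the attached patch.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical stand-in for a trained logistic regression model; not a Weka class.
final class TinyModel implements Serializable {
    private static final long serialVersionUID = 1L;
    final double[] weights;
    final double bias;

    TinyModel(double[] weights, double bias) {
        this.weights = weights.clone();
        this.bias = bias;
    }

    // Logistic function applied to a linear score of the features.
    double predict(double[] features) {
        double z = bias;
        for (int i = 0; i < weights.length; i++) {
            z += weights[i] * features[i];
        }
        return 1.0 / (1.0 + Math.exp(-z));
    }
}

final class ModelRoundTripSketch {

    // What the training aggregate would output: the model as a byte array.
    static byte[] serialize(Serializable model) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(model);
        }
        return bytes.toByteArray();
    }

    // What the evaluation function would do first: rebuild the model from bytes.
    static Object deserialize(byte[] data) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        TinyModel trained = new TinyModel(new double[] {0.5, -0.25}, 0.1);
        byte[] stored = serialize(trained);               // value of the "model" column
        TinyModel loaded = (TinyModel) deserialize(stored);
        double[] row = {2.0, 4.0};
        // The reloaded model should score a row exactly like the original.
        System.out.println(trained.predict(row) == loaded.predict(row));
    }
}
```

In a real integration the evaluation function would probably want to cache the deserialized model rather than rebuild it for every row, since deserialization is far more expensive than a single prediction.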

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
