hive-dev mailing list archives

From "Zheng Shao (JIRA)"
Subject [jira] Created: (HIVE-672) Integrate weka with Hive
Date Wed, 22 Jul 2009 06:18:15 GMT
Integrate weka with Hive

                 Key: HIVE-672
             Project: Hadoop Hive
          Issue Type: New Feature
            Reporter: Zheng Shao

Weka is one of the most popular data mining packages and is widely used around the world.
Since Weka is written in Java, it should be fairly straightforward to integrate it with
Hive.

We just need to create some GenericUDAF functions that map to the Weka classifier training process.
The output of each GenericUDAF can simply be the serialized form of the trained classifier.
We should also add a GenericUDF that loads the serialized classifier to classify new instances.
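
As a rough illustration of the train, serialize, and load flow described above, here is a minimal Java sketch. It deliberately uses neither Weka nor the Hive UDF interfaces: ToyLogisticModel, SerializedModelSketch, and their methods are hypothetical stand-ins invented for this example, and the training loop is plain stochastic gradient descent rather than Weka's logistic regression. The only assumption carried over from the proposal is that the trained classifier can be turned into bytes with standard Java serialization.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical stand-in for a trained classifier; this is NOT a Weka class.
class ToyLogisticModel implements Serializable {
    private static final long serialVersionUID = 1L;
    final double[] weights; // weights[0] is the bias term

    ToyLogisticModel(double[] weights) {
        this.weights = weights;
    }

    // Returns the predicted probability of the positive class.
    double predict(double[] features) {
        double z = weights[0];
        for (int i = 0; i < features.length; i++) {
            z += weights[i + 1] * features[i];
        }
        return 1.0 / (1.0 + Math.exp(-z));
    }
}

class SerializedModelSketch {
    // Roughly the role of the proposed GenericUDAF: consume one group's rows
    // and produce a trained model (here via simple stochastic gradient descent).
    static ToyLogisticModel train(double[][] x, int[] y, int epochs, double learningRate) {
        double[] w = new double[x[0].length + 1];
        for (int e = 0; e < epochs; e++) {
            for (int r = 0; r < x.length; r++) {
                double z = w[0];
                for (int i = 0; i < x[r].length; i++) {
                    z += w[i + 1] * x[r][i];
                }
                double error = y[r] - 1.0 / (1.0 + Math.exp(-z));
                w[0] += learningRate * error;
                for (int i = 0; i < x[r].length; i++) {
                    w[i + 1] += learningRate * error * x[r][i];
                }
            }
        }
        return new ToyLogisticModel(w);
    }

    // The aggregate's output: the trained model as a byte array.
    static byte[] serialize(ToyLogisticModel model) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(model);
        }
        return buffer.toByteArray();
    }

    // Roughly the role of the proposed GenericUDF: rebuild the model from bytes.
    static ToyLogisticModel deserialize(byte[] bytes) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return (ToyLogisticModel) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        double[][] x = {{0.0}, {1.0}, {2.0}, {3.0}};
        int[] y = {0, 0, 1, 1};
        byte[] blob = serialize(train(x, y, 2000, 0.1));
        ToyLogisticModel model = deserialize(blob);
        System.out.println("P(positive | x=0) = " + model.predict(new double[] {0.0}));
        System.out.println("P(positive | x=3) = " + model.predict(new double[] {3.0}));
    }
}
```

In an actual integration, the byte array would presumably be the value stored in the model column, and the evaluation function would deserialize it before scoring each row. Caching the deserialized model between rows is a design choice this sketch leaves out.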

The Hive syntax could be as simple as this:

SELECT logdata.pageid, logdata.linkid,
       LogisticRegression(logdata.clicked, userinfo.age, userinfo.gender, userinfo.interests) AS model
FROM logdata JOIN userinfo
ON logdata.userid = userinfo.userid
GROUP BY logdata.pageid, logdata.linkid;

SELECT logdata.pageid, logdata.linkid, logdata.clicked,
       LogisticRegressionEvaluate(classifiers.model, userinfo.age, userinfo.gender, userinfo.interests) AS predicted
FROM logdata JOIN userinfo
ON logdata.userid = userinfo.userid
JOIN classifiers
ON logdata.pageid = classifiers.pageid AND logdata.linkid = classifiers.linkid;
Use Weka in your Java Code:

Weka is licensed under the GPL, so we won't be able to include its code directly in Hive, but we
can keep the discussion here.

