hivemall-issues mailing list archives

From chezou <...@git.apache.org>
Subject [GitHub] incubator-hivemall pull request #158: [HIVEMALL-215] Add step-by-step tutori...
Date Fri, 31 Aug 2018 02:55:32 GMT
Github user chezou commented on a diff in the pull request:

    https://github.com/apache/incubator-hivemall/pull/158#discussion_r214233514
  
    --- Diff: docs/gitbook/supervised_learning/tutorial.md ---
    @@ -0,0 +1,461 @@
    +<!--
    +  Licensed to the Apache Software Foundation (ASF) under one
    +  or more contributor license agreements.  See the NOTICE file
    +  distributed with this work for additional information
    +  regarding copyright ownership.  The ASF licenses this file
    +  to you under the Apache License, Version 2.0 (the
    +  "License"); you may not use this file except in compliance
    +  with the License.  You may obtain a copy of the License at
    +
    +    http://www.apache.org/licenses/LICENSE-2.0
    +
    +  Unless required by applicable law or agreed to in writing,
    +  software distributed under the License is distributed on an
    +  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    +  KIND, either express or implied.  See the License for the
    +  specific language governing permissions and limitations
    +  under the License.
    +-->
    +
    +# Step-by-Step Tutorial on Supervised Learning with Apache Hivemall
    +
    +<!-- toc -->
    +
    +## What is Hivemall?
    +
    +[Apache Hive](https://hive.apache.org/) is a data warehousing solution that lets us easily process large-scale data with SQL-like queries. Assume that you have a table named `purchase_history`, which can be created artificially as:
    +
    +```sql
    +create table if not exists purchase_history as
    +select 1 as id, "Saturday" as day_of_week, "male" as gender, 600 as price, "book" as category, 1 as label
    +union all
    +select 2 as id, "Friday" as day_of_week, "female" as gender, 4800 as price, "sports" as category, 0 as label
    +union all
    +select 3 as id, "Friday" as day_of_week, "other" as gender, 18000 as price, "entertainment" as category, 0 as label
    +union all
    +select 4 as id, "Thursday" as day_of_week, "male" as gender, 200 as price, "food" as category, 0 as label
    +union all
    +select 5 as id, "Wednesday" as day_of_week, "female" as gender, 1000 as price, "electronics" as category, 1 as label
    +;
    +```
    +
    +The syntax of Hive queries, namely **HiveQL**, is very similar to SQL:
    +
    +```sql
    +select count(1) from purchase_history;
    +```
    +
    +> 5
    +
    +[Apache Hivemall](https://github.com/apache/incubator-hivemall) is a collection of user-defined functions (UDFs) for HiveQL that is highly optimized for machine learning (ML) and data science. For example, you can efficiently build a logistic regression model optimized with stochastic gradient descent (SGD) by issuing the following query of about 10 lines:
    +
    +```sql
    +SELECT
    +  train_classifier(
    +    features,
    +    label,
    +    '-loss_function logloss -optimizer SGD'
    +  ) as (feature, weight)
    +FROM
    +  training
    +;
    +```
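To make the idea behind `-loss_function logloss -optimizer SGD` more concrete, here is a minimal Python sketch of logistic regression trained with plain SGD. It is only an illustration of the technique, not Hivemall's implementation: the function names (`train_logistic_sgd`, `predict_proba`), the fixed learning rate `eta`, the epoch count, and the toy data are all assumptions made for this example. Like the query above, it ends up with one weight per feature.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic_sgd(rows, eta=0.1, epochs=100, seed=0):
    """Fit logistic regression with plain SGD (illustrative only).

    rows: list of (features, label) pairs, where features is a dict
          mapping feature name -> numeric value and label is 0 or 1.
    Returns a dict mapping feature name -> learned weight.
    """
    rng = random.Random(seed)
    rows = list(rows)
    weights = {}
    for _ in range(epochs):
        rng.shuffle(rows)  # visit the examples in random order
        for features, label in rows:
            z = sum(weights.get(f, 0.0) * v for f, v in features.items())
            # Gradient of the log loss with respect to the score z
            grad = sigmoid(z) - label
            for f, v in features.items():
                weights[f] = weights.get(f, 0.0) - eta * grad * v
    return weights


def predict_proba(weights, features):
    """Probability of label 1 under the learned weights."""
    z = sum(weights.get(f, 0.0) * v for f, v in features.items())
    return sigmoid(z)


# Hypothetical toy data: label is 1 when x is positive
training = [
    ({"bias": 1.0, "x": 2.0}, 1),
    ({"bias": 1.0, "x": 1.0}, 1),
    ({"bias": 1.0, "x": -1.0}, 0),
    ({"bias": 1.0, "x": -2.0}, 0),
]
model = train_logistic_sgd(training)
for feature, weight in sorted(model.items()):
    print(feature, weight)
```

Each update moves the weights of the features present in one example against the gradient of the log loss, which is why the result can be stored as simple `(feature, weight)` rows. A production trainer would typically add details this sketch leaves out, such as learning-rate scheduling and regularization.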
    +
    +
    +The Hivemall function [`hivemall_version()`](../misc/funcs.html#others) shows the current Hivemall version, for example:
    +
    +```sql
    +select hivemall_version();
    +```
    +
    +> "0.5.1-incubating-SNAPSHOT"
    +
    +Below is a list of ML and related problems that Hivemall can solve:
    +
    +- [Binary and multi-class classification](../binaryclass/general.html)
    +- [Regression](../regression/general.html)
    +- [Recommendation](../recommend/cf.html)
    +- [Anomaly detection](../anomaly/lof.html)
    +- [Natural language processing](../misc/tokenizer.html)
    +- [Clustering](../misc/tokenizer.html) (e.g., topic modeling)
    +- [Data sketching](../misc/funcs.html#sketching)
    +- Evaluation
    +
    +Our [YouTube demo video](https://www.youtube.com/watch?v=cMUsuA9KZ_c) gives a helpful overview of Hivemall.
    +
    +This tutorial explains the basic usage of Hivemall through examples of supervised learning: a simple regressor and a binary classifier.
    +
    +## Binary classification
    +
    +Imagine a scenario in which we would like to build a binary classifier from the mock `purchase_history` data and predict unseen purchases in order to run a new campaign effectively:
    +
    +| day\_of\_week | gender | price | category | label |
    +|:---:|:---:|:---:|:---:|:---|
    +|Saturday | male | 600 | book | 1 |
    +|Friday | female | 4800 | sports | 0 |
    +|Friday | other | 18000  | entertainment | 0 |
    +|Thursday | male | 200 | food | 0 |
    +|Wednesday | female | 1000 | electronics | 1 |
    +
    --- End diff --
    
    Added 0f593c4


---
