hivemall-issues mailing list archives

From takuti <...@git.apache.org>
Subject [GitHub] incubator-hivemall pull request #158: [HIVEMALL-215] Add step-by-step tutori...
Date Thu, 30 Aug 2018 04:05:56 GMT
Github user takuti commented on a diff in the pull request:

    https://github.com/apache/incubator-hivemall/pull/158#discussion_r213896045
  
    --- Diff: docs/gitbook/getting_started/tutorial.md ---
    @@ -0,0 +1,493 @@
    +<!--
    +  Licensed to the Apache Software Foundation (ASF) under one
    +  or more contributor license agreements.  See the NOTICE file
    +  distributed with this work for additional information
    +  regarding copyright ownership.  The ASF licenses this file
    +  to you under the Apache License, Version 2.0 (the
    +  "License"); you may not use this file except in compliance
    +  with the License.  You may obtain a copy of the License at
    +
    +    http://www.apache.org/licenses/LICENSE-2.0
    +
    +  Unless required by applicable law or agreed to in writing,
    +  software distributed under the License is distributed on an
    +  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    +  KIND, either express or implied.  See the License for the
    +  specific language governing permissions and limitations
    +  under the License.
    +-->
    +
    +# Step-by-Step Tutorial on Supervised Learning with Apache Hivemall
    +
    +<!-- toc -->
    +
    +## What is Hivemall?
    +
    +[Apache Hive](https://hive.apache.org/) is a data warehousing solution that enables us
to process large-scale data in the form of SQL easily. Assume that you have a table named
`purchase_history` which can be artificially created as:
    +
    +```sql
    +create table if not exists purchase_history
    +(id bigint, day_of_week string, gender string, price int, category string, label int)
    +;
    +```
    +
    +
    +```sql
    +insert overwrite table purchase_history
    +select 1 as id, "Saturday" as day_of_week, "male" as gender, 600 as price, "book" as category, 1 as label
    +union all
    +select 2 as id, "Friday" as day_of_week, "female" as gender, 4800 as price, "sports" as category, 0 as label
    +union all
    +select 3 as id, "Friday" as day_of_week, "other" as gender, 18000 as price, "entertainment" as category, 0 as label
    +union all
    +select 4 as id, "Thursday" as day_of_week, "male" as gender, 200 as price, "food" as category, 0 as label
    +union all
    +select 5 as id, "Wednesday" as day_of_week, "female" as gender, 1000 as price, "electronics" as category, 1 as label
    +;
    +```
    +
    +The syntax of Hive queries, namely **HiveQL**, is very similar to SQL:
    +
    +```sql
    +select count(1) from purchase_history
    +```
    +
    +> 5
    +
    +[Apache Hivemall](https://github.com/apache/incubator-hivemall) is a collection of user-defined functions (UDFs) for HiveQL that are strongly optimized for machine learning (ML) and data science. For example, you can efficiently build a logistic regression model with stochastic gradient descent (SGD) optimization by issuing the following query of roughly 10 lines:
    +
    +```sql
    +SELECT
    +  train_classifier(
    +    features,
    +    label,
    +    '-loss_function logloss -optimizer SGD'
    +  ) as (feature, weight)
    +FROM
    +  training
    +;
    +```
    +
    +
    +On the TD console, the Hivemall function [`hivemall_version()`](http://hivemall.incubator.apache.org/userguide/misc/funcs.html#others) shows the Hivemall version currently available on TD, for example:
    +
    +```sql
    +select hivemall_version()
    +```
    +
    +> "0.5.1-20180703-SNAPSHOT-31924dc" (as of July 23, 2018)
    +
    +Below we list the ML and related problems that Hivemall and TD can solve:
    +
    +- Binary and multi-class classification
    +- Regression
    +- Recommendation
    +- Anomaly detection
    +- Natural language processing
    +- Clustering (e.g., topic modeling)
    +- Data sketching
    +- Evaluation
    +
    +Our [YouTube demo video](https://www.youtube.com/watch?v=cMUsuA9KZ_c) gives an overview of Hivemall.
    +
    +This tutorial explains the basic usage of Hivemall through supervised learning examples: a simple regressor and a binary classifier.
    +
    +## Binary classification
    +
    +Imagine a scenario in which we would like to build a binary classifier from the mock `purchase_history` data and predict unforeseen purchases in order to run a new campaign effectively:
    +
    +| day\_of\_week | gender | price | category | label |
    +|:---:|:---:|:---:|:---:|:---|
    +|Saturday | male | 600 | book | 1 |
    +|Friday | female | 4800 | sports | 0 |
    +|Friday | other | 18000  | entertainment | 0 |
    +|Thursday | male | 200 | food | 0 |
    +|Wednesday | female | 1000 | electronics | 1 |
    +
    +Use the Hivemall [`train_classifier()`](http://hivemall.incubator.apache.org/userguide/misc/funcs.html#binary-classification) UDF to tackle the problem as follows.
    +
    +### Step 1. Feature representation
    +
    +First of all, we have to convert each record into a pair of a feature vector and its corresponding target value. Hivemall requires input features to be represented in a specific format.
    +
    +More precisely, Hivemall represents a single feature as a concatenation of an **index** (i.e., **name**) and its **value**:
    +
    +- Quantitative feature: `<index>:<value>`
    +  - e.g., `price:600.0`
    +- Categorical feature: `<index>#<value>`
    +  - e.g., `gender#male`
    +
    +Each of those features is a string value in Hive, and "feature vector" means an array
of string values like:
    +
    +```
    +["price:600.0", "day of week#Saturday", "gender#male", "category#book"]
    +```
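    +
    +To make the format concrete, the following is a minimal sketch that builds the same kind of feature strings by hand, using only standard Hive functions (`array`, `concat`, `cast`). It assumes that the `purchase_history` table has the `gender` column used in this tutorial. It is only an illustration of the string format, not how Hivemall constructs features internally.
    +
    +```sql
    +-- Illustration only: hand-built feature strings in the <index>:<value>
    +-- and <index>#<value> formats described above.
    +select
    +  id,
    +  array(
    +    -- quantitative feature: name and numeric value joined by ':'
    +    concat('price:', cast(cast(price as double) as string)),
    +    -- categorical features: name and category joined by '#'
    +    concat('day of week#', day_of_week),
    +    concat('gender#', gender),
    +    concat('category#', category)
    +  ) as features,
    +  label
    +from
    +  purchase_history
    +;
    +```
    +
    +Writing the strings by hand quickly becomes tedious and error-prone as the number of columns grows, which is the motivation for the helper functions.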
    +
    +Therefore, what we first need to do is to convert the records into an array of feature
strings, and Hivemall functions [`quantitative_features()`](https://hivemall.incubator.apache.org/userguide/getting_started/input-format.html#quantitative-features),
[`categorical_features()`](https://hivemall.incubator.apache.org/userguide/getting_started/input-format.html#categorical-features)
and [`array_concat()`](https://hivemall.incubator.apache.org/userguide/misc/generic_funcs.html#array)
provide a simple way to create the pairs of feature vector and target value:
    +
    +```sql
    +create table if not exists training
    +(id bigint, features array<string>, label int)
    +;
    +```
    +
    +```sql
    +insert overwrite table training
    +select
    +  id,
    +  array_concat( -- concatenate the quantitative and categorical feature arrays into a single array
    +    quantitative_features(
    +      array("price"), -- quantitative feature names
    +      price -- corresponding column
    +    ),
    +    categorical_features(
    +      array("day of week", "gender", "category"), -- categorical feature names
    +      day_of_week, gender, category -- corresponding column names
    +    )
    +  ) as features,
    +  label
    +from
    +  purchase_history
    +;
    +```
    +
    +|id | features |  label |
    +|:---:|:---|:---|
    +|1 |["price:600.0","day of week#Saturday","gender#male","category#book"] | 1 |
    +|2 |["price:4800.0","day of week#Friday","gender#female","category#sports"] |  0 |
    +|3 |["price:18000.0","day of week#Friday","gender#other","category#entertainment"] | 0 |
    +|4 |["price:200.0","day of week#Thursday","gender#male","category#food"] | 0 |
    +|5 |["price:1000.0","day of week#Wednesday","gender#female","category#electronics"] | 1 |
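    +
    +As a quick sanity check, a sketch like the one below flattens each feature vector into one row per feature using Hive's standard `lateral view explode()`. The alias names `t` and `feature` are arbitrary choices for this example.
    +
    +```sql
    +-- Show one feature string per row to inspect the generated vectors.
    +select
    +  id,
    +  feature
    +from
    +  training
    +  lateral view explode(features) t as feature
    +;
    +```
    +
    +Each record in `training` should appear once per element of its `features` array, which makes malformed feature strings easier to spot.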
    +
    +The output table `training` will be directly used as an input to Hivemall's ML functions
in the next step.
    --- End diff --
    
    `s/The output table training/The output of the above query/` due to the above deletion
of CREATE and INSERT training.


---
