hivemall-issues mailing list archives

From myui <...@git.apache.org>
Subject [GitHub] incubator-hivemall pull request #158: [HIVEMALL-215] Add step-by-step tutori...
Date Thu, 30 Aug 2018 02:52:38 GMT
Github user myui commented on a diff in the pull request:

    https://github.com/apache/incubator-hivemall/pull/158#discussion_r213890012
  
    --- Diff: docs/gitbook/getting_started/tutorial.md ---
    @@ -0,0 +1,493 @@
    +<!--
    +  Licensed to the Apache Software Foundation (ASF) under one
    +  or more contributor license agreements.  See the NOTICE file
    +  distributed with this work for additional information
    +  regarding copyright ownership.  The ASF licenses this file
    +  to you under the Apache License, Version 2.0 (the
    +  "License"); you may not use this file except in compliance
    +  with the License.  You may obtain a copy of the License at
    +
    +    http://www.apache.org/licenses/LICENSE-2.0
    +
    +  Unless required by applicable law or agreed to in writing,
    +  software distributed under the License is distributed on an
    +  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    +  KIND, either express or implied.  See the License for the
    +  specific language governing permissions and limitations
    +  under the License.
    +-->
    +
    +# Step-by-Step Tutorial on Supervised Learning with Apache Hivemall
    +
    +<!-- toc -->
    +
    +## What is Hivemall?
    +
    +[Apache Hive](https://hive.apache.org/) is a data warehousing solution that enables us to process large-scale data in the form of SQL easily. Assume that you have a table named `purchase_history` which can be artificially created as:
    +
    +```sql
    +create table if not exists purchase_history
    +(id bigint, day_of_week string, gender string, price int, category string, label int)
    +;
    +```
    +
    +
    +```sql
    +insert overwrite table purchase_history
    +select 1 as id, "Saturday" as day_of_week, "male" as gender, 600 as price, "book" as category, 1 as label
    +union all
    +select 2 as id, "Friday" as day_of_week, "female" as gender, 4800 as price, "sports" as category, 0 as label
    +union all
    +select 3 as id, "Friday" as day_of_week, "other" as gender, 18000 as price, "entertainment" as category, 0 as label
    +union all
    +select 4 as id, "Thursday" as day_of_week, "male" as gender, 200 as price, "food" as category, 0 as label
    +union all
    +select 5 as id, "Wednesday" as day_of_week, "female" as gender, 1000 as price, "electronics" as category, 1 as label
    +;
    +```
    +
    +The syntax of Hive queries, namely **HiveQL**, is very similar to SQL:
    +
    +```sql
    +select count(1) from purchase_history
    +```
    +
    +> 5
    +
    +[Apache Hivemall](https://github.com/apache/incubator-hivemall) is a collection of user-defined functions (UDFs) for HiveQL which is strongly optimized for machine learning (ML) and data science. To give an example, you can efficiently build a logistic regression model with the stochastic gradient descent (SGD) optimization by issuing the following ~10 lines of query:
    +
    +```sql
    +SELECT
    +  train_classifier(
    +    features,
    +    label,
    +    '-loss_function logloss -optimizer SGD'
    +  ) as (feature, weight)
    +FROM
    +  training
    +;
    +```
    +
    +
    +On the TD console, Hivemall function [`hivemall_version()`](http://hivemall.incubator.apache.org/userguide/misc/funcs.html#others) shows current Hivemall version that is available on TD, for example:
    --- End diff --
    
    `TD console` should not appear here.


---
