From: myui
To: issues@hivemall.incubator.apache.org
Reply-To: issues@hivemall.incubator.apache.org
Subject: [GitHub] incubator-hivemall pull request #158: [HIVEMALL-215] Add step-by-step tutori...
Date: Thu, 30 Aug 2018 02:52:38 +0000 (UTC)

Github user myui commented on a diff in the pull request:

    https://github.com/apache/incubator-hivemall/pull/158#discussion_r213890012

    --- Diff: docs/gitbook/getting_started/tutorial.md ---
    @@ -0,0 +1,493 @@
    +
    +
    +# Step-by-Step Tutorial on Supervised Learning with Apache Hivemall
    +
    +
    +
    +## What is Hivemall?
    +
    +[Apache Hive](https://hive.apache.org/) is a data warehousing solution that enables us to process large-scale data in the form of SQL easily. Assume that you have a table named `purchase_history` which can be artificially created as:
    +
    +```sql
    +create table if not exists purchase_history
    +(id bigint, day_of_week string, price int, category string, label int)
    +;
    +```
    +
    +
    +```sql
    +insert overwrite table purchase_history
    +select 1 as id, "Saturday" as day_of_week, "male" as gender, 600 as price, "book" as category, 1 as label
    +union all
    +select 2 as id, "Friday" as day_of_week, "female" as gender, 4800 as price, "sports" as category, 0 as label
    +union all
    +select 3 as id, "Friday" as day_of_week, "other" as gender, 18000 as price, "entertainment" as category, 0 as label
    +union all
    +select 4 as id, "Thursday" as day_of_week, "male" as gender, 200 as price, "food" as category, 0 as label
    +union all
    +select 5 as id, "Wednesday" as day_of_week, "female" as gender, 1000 as price, "electronics" as category, 1 as label
    +;
    +```
    +
    +The syntax of Hive queries, namely **HiveQL**, is very similar to SQL:
    +
    +```sql
    +select count(1) from purchase_log
    +```
    +
    +> 5
    +
    +[Apache Hivemall](https://github.com/apache/incubator-hivemall) is a collection of user-defined functions (UDFs) for HiveQL which is strongly optimized for machine learning (ML) and data science. To give an example, you can efficiently build a logistic regression model with the stochastic gradient descent (SGD) optimization by issuing the following ~10 lines of query:
    +
    +```sql
    +SELECT
    +  train_classifier(
    +    features,
    +    label,
    +    '-loss_function logloss -optimizer SGD'
    +  ) as (feature, weight)
    +FROM
    +  training
    +;
    +```
    +
    +
    +On the TD console, Hivemall function [`hivemall_version()`](http://hivemall.incubator.apache.org/userguide/misc/funcs.html#others) shows current Hivemall version that is available on TD, for example:
    --- End diff --

    `TD console` should not appear here.

---
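
For context, `hivemall_version()` is an ordinary Hivemall UDF, so the flagged sentence could presumably be illustrated without naming any particular console. A minimal sketch, assuming a Hive session in which the Hivemall functions are already registered (the returned version string depends on the installed build, so no output is shown):

```sql
-- Assumption: the Hivemall UDFs have already been loaded into this Hive session.
-- Returns the version string of the Hivemall build that is in use.
select hivemall_version();
```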