From: takuti
To: issues@hivemall.incubator.apache.org
Reply-To: issues@hivemall.incubator.apache.org
Subject: [GitHub] incubator-hivemall pull request #83: [HIVEMALL-109][HIVEMALL-112] Fix topic ...
Content-Type: text/plain
Date: Fri, 2 Jun 2017 22:24:34 +0000 (UTC)
archived-at: Fri, 02 Jun 2017 22:24:39 -0000

GitHub user takuti opened a pull request:

https://github.com/apache/incubator-hivemall/pull/83

[HIVEMALL-109][HIVEMALL-112] Fix topic model and tokenize UDFs

## What changes were proposed in this pull request?

#82

- Topic model: `train_plsa` and `train_lda`
  - Fix bugs caused by multi-byte input
  - Fix wrong `recordBytes` calculation for iteration utilizing file IO
  - Refactor and update unit tests accordingly
- `tokenize()`
  - Support NULL input; the UDF simply returns NULL itself

## What type of PR is it?

Bug Fix

## What is the Jira issue?

- https://issues.apache.org/jira/browse/HIVEMALL-109
- https://issues.apache.org/jira/browse/HIVEMALL-112

## How was this patch tested?

- Unit tests
- Manual tests on EMR

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/takuti/incubator-hivemall fix-topicmodel

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/incubator-hivemall/pull/83.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #83

----
commit 988666a58801e1cf62b0c91c5815e973084ba972
Author: Takuya Kitazawa
Date:   2017-06-02T07:40:12Z

    Fix multi-byte-related issue in topic model UDFs and validate it as unit test

commit b08f73aed98064059773ba8c2342814d03b991ff
Author: Takuya Kitazawa
Date:   2017-06-02T08:07:19Z

    Use `char`s instead of `byte`s

commit c1239fe7938a724147554d0c1c769ec7c3025013
Author: Takuya Kitazawa
Date:   2017-06-02T08:24:20Z

    Fix record bytes calculation

commit accee7a938c8034bd3c2a250bbdd27d57871092d
Author: Takuya Kitazawa
Date:   2017-06-02T09:15:53Z

    Use NIOUtils for writing strings to a byte buffer

commit ceff765de725cddc5e9f556433ab76272e4d9720
Author: Takuya Kitazawa
Date:   2017-06-02T09:52:25Z

    Fix record size related to iteration using temporary file

    Since now iteration works correctly, manual for-loops are removed from unit tests.

commit e9ec0f31ea2a6b5b67c89a141be197a734f66567
Author: Takuya Kitazawa
Date:   2017-06-02T10:06:45Z

    Fix `tokenize` for null input

commit dda972405c893277edb13add5fc2b4e7a5a96d83
Author: Takuya Kitazawa
Date:   2017-06-02T11:35:20Z

    Refactor on `recordTrainSampleToTempFile`

----

---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
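
As a rough illustration of the multi-byte and record-size fixes listed in this pull request (the commits "Use `char`s instead of `byte`s" and "Fix record bytes calculation"), the Java sketch below shows why a string's `length()` cannot stand in for its encoded byte count, and how a record size can be computed consistently when every character is written as a 2-byte `char`. The class `RecordSizeSketch`, its methods, and the buffer layout are hypothetical, chosen only for this example; they are not taken from the Hivemall source and do not describe its actual temporary-file format.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch, not Hivemall code.
// Assumed layout: [int wordCount] then, per word, [int charLength][UTF-16 chars].
final class RecordSizeSketch {

    /** Number of bytes the assumed layout needs for the given words. */
    static int recordBytes(String[] words) {
        int bytes = Integer.BYTES; // leading word count
        for (String w : words) {
            // length() counts UTF-16 code units; each one is written as a 2-byte char
            bytes += Integer.BYTES + w.length() * Character.BYTES;
        }
        return bytes;
    }

    /** Writes the words into the buffer using the assumed layout. */
    static void write(String[] words, ByteBuffer buf) {
        buf.putInt(words.length);
        for (String w : words) {
            buf.putInt(w.length());
            for (int i = 0; i < w.length(); i++) {
                buf.putChar(w.charAt(i));
            }
        }
    }

    /** Reads the words back from a buffer written by write(). */
    static String[] read(ByteBuffer buf) {
        int n = buf.getInt();
        String[] words = new String[n];
        for (int i = 0; i < n; i++) {
            char[] chars = new char[buf.getInt()];
            for (int j = 0; j < chars.length; j++) {
                chars[j] = buf.getChar();
            }
            words[i] = new String(chars);
        }
        return words;
    }

    public static void main(String[] args) {
        // Three hiragana characters, written as Unicode escapes to keep the source ASCII-only.
        String multiByte = "\u308a\u3093\u3054";
        System.out.println(multiByte.length());                                // 3 code units
        System.out.println(multiByte.getBytes(StandardCharsets.UTF_8).length); // 9 UTF-8 bytes

        String[] words = {"apple:1", multiByte + ":2"};
        ByteBuffer buf = ByteBuffer.allocate(recordBytes(words));
        write(words, buf); // fills the buffer exactly; a size based on the wrong unit would not match
        buf.flip();
        System.out.println(String.join(",", read(buf)));
    }
}
```

The point of the sketch is only that the size calculation and the write path must agree on the unit: if the size is derived from `length()` while the data is written as UTF-8 bytes (or the reverse), multi-byte input produces a record whose declared size does not match what was written.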