From: Anirudh
Date: Mon, 26 Feb 2018 17:18:25 -0800
Subject: Re: UTF-8 Support for TextParser
To: dev@mxnet.incubator.apache.org
In-Reply-To: <46E9B12B-48EF-4161-81D6-33589FB7F79C@amazon.com>

Hi Marco,

I understand that there needs to be a separate discussion on the strong
dependency between mxnet and dmlc-core and how to fix it. Having said that,
I think the goals of dmlc-core and mxnet are somewhat aligned. Posting to the
MXNet dev list in this case is a good way to gather feedback from both
communities, since I consider the MXNet community to be mostly a superset of
the dmlc-core community.

Anirudh

On Mon, Feb 26, 2018 at 5:00 PM, Subramanian, Anirudh wrote:

> Hi Tianqi,
>
> UTF-8 support would make other formats like CSV more usable. Otherwise,
> users have to normalize their data in some way before using mxnet.
> I understand that there is a tradeoff here because of the efficiency gains
> from the parser, but the expectation of having to normalize their UTF-8
> files may turn users away.
>
> Anirudh
>
> On 2/26/18, 3:54 PM, "workcrow@gmail.com on behalf of Tianqi Chen" <
> workcrow@gmail.com on behalf of tqchen@cs.washington.edu> wrote:
>
>     Since LibSVM format is only going to involve numbers and possibly ascii
>     characters, is there any reason to add UTF-8 support? Note that
>     generalization always comes at the cost of efficiency, and some
>     effort has been spent on making the parser fast.
>
>     Tianqi
>
>     On Mon, Feb 26, 2018 at 3:38 PM, Anirudh wrote:
>
>     > Hi all,
>     >
>     > Currently there is no UTF-8 support for the LibSVM, LibFM or CSV text
>     > parsers. I am working on adding UTF-8 support for the text parsers.
>     > Since C++ doesn't have great built-in support for UTF-8, I am looking
>     > at third-party libraries which provide Unicode support. I am
>     > considering ICU currently. Any comments, suggestions, past experience,
>     > or gotchas about Unicode third-party libraries or adding Unicode
>     > support in general are highly appreciated.
>     >
>     > I have created an issue about the same:
>     > https://github.com/dmlc/dmlc-core/issues/372
>     > Please feel free to reply to this email or comment on the github
>     > issue if you have any inputs.
>     >
>     > Anirudh