From: Anirudh
Date: Fri, 9 Mar 2018 12:43:04 -0800
Subject: Re: UTF-8 Support for TextParser
To: dev@mxnet.incubator.apache.org

Hi,

After looking more closely at the customer's requirements, we found that the
customer uses only ASCII data with MXNet. They just need files that start with
a UTF-8 BOM, and files that use different control characters for newlines, to
parse correctly. dmlc-core already supports the different newline control
characters. Since a UTF-8 BOM at the start of a file is a common case for other
MXNet users too (for example, saving an Excel sheet as a UTF-8 CSV), I will add
support for handling the UTF-8 BOM in dmlc-core. I won't be working on
UTF8CSVParser unless a customer requirement comes up later.

Anirudh

On Wed, Feb 28, 2018 at 11:50 PM, Anirudh wrote:

> Hi Tianqi,
>
> What do you think about adding a separate parser for CSV with UTF-8 support
> in dmlc-core? We can then just add a flag in MXNet for UTF-8 and use the
> UTF-8 or the ASCII parser based on this flag. (This idea was suggested by
> Mu.)
>
> I think some small changes will be required to the base class
> "TextParserBase", as the method "BackFindEndLine" will need more logic to
> check for other line-break code points, and this can be refactored.
> This approach will likely retain the performance of the existing ASCII CSV
> parser, while letting MXNet users choose between the usability of the
> UTF-8 CSV parser and the performance of the ASCII CSV parser.
>
> Thanks,
> Anirudh
>
>
> On Mon, Feb 26, 2018 at 5:18 PM, Anirudh wrote:
>
>> Hi Marco,
>>
>> I understand that there needs to be a separate discussion on the strong
>> dependency between mxnet and dmlc-core and how to fix it.
>>
>> Having said that, I think the goals of dmlc-core and mxnet are somewhat
>> aligned. Posting to the MXNet dev list in this case is a good way to
>> gather feedback from both communities, since I consider the MXNet
>> community to be mostly a superset of the dmlc-core community.
>>
>> Anirudh
>>
>> On Mon, Feb 26, 2018 at 5:00 PM, Subramanian, Anirudh wrote:
>>
>>> Hi Tianqi,
>>>
>>> UTF-8 support would make other formats like CSV more usable.
>>> Otherwise, users have to normalize their data in some way before
>>> using mxnet.
>>> I understand that there is a tradeoff here because of the efficiency
>>> gains from the parser, but the expectation of having to normalize their
>>> UTF-8 files may turn users away.
>>>
>>> Anirudh
>>>
>>> On 2/26/18, 3:54 PM, "workcrow@gmail.com on behalf of Tianqi Chen" <
>>> workcrow@gmail.com on behalf of tqchen@cs.washington.edu> wrote:
>>>
>>>     Since the LibSVM format is only going to involve numbers and possibly
>>>     ASCII characters, is there any reason to add UTF-8 support? Note that
>>>     generalization always comes at a cost in efficiency, and some effort
>>>     has been spent on making the parser fast.
>>>
>>>     Tianqi
>>>
>>>     On Mon, Feb 26, 2018 at 3:38 PM, Anirudh wrote:
>>>
>>>     > Hi all,
>>>     >
>>>     > Currently there is no UTF-8 support in the LibSVM, LibFM or CSV
>>>     > text parsers. I am working on adding UTF-8 support for the text
>>>     > parsers. Since C++ doesn't have great built-in support for UTF-8,
>>>     > I am looking at third-party libraries that provide Unicode
>>>     > support. I am considering ICU currently. Any comments,
>>>     > suggestions, past experience, or gotchas about Unicode
>>>     > third-party libraries, or about adding Unicode support in
>>>     > general, are highly appreciated.
>>>     >
>>>     > I have created an issue about the same:
>>>     > https://github.com/dmlc/dmlc-core/issues/372
>>>     > Please feel free to reply to this email or comment on the GitHub
>>>     > issue if you have any input.
>>>     >
>>>     > Anirudh
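
For reference, the BOM handling proposed in the most recent message of this thread could look roughly like the sketch below. It is only an illustration under stated assumptions: `SkipUTF8BOM` and `StripUTF8BOM` are hypothetical names, not existing dmlc-core functions, and the thread does not say where in the parser such a check would live. The only fact relied on is that the UTF-8 byte order mark is the three-byte sequence EF BB BF at the very start of a file.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper (not part of dmlc-core): returns a pointer just past a
// leading UTF-8 byte order mark (bytes EF BB BF), or `begin` unchanged if the
// buffer does not start with one.
inline const char* SkipUTF8BOM(const char* begin, const char* end) {
  assert(begin <= end);
  const std::ptrdiff_t n = end - begin;
  if (n >= 3 &&
      static_cast<unsigned char>(begin[0]) == 0xEF &&
      static_cast<unsigned char>(begin[1]) == 0xBB &&
      static_cast<unsigned char>(begin[2]) == 0xBF) {
    return begin + 3;
  }
  return begin;
}

// Hypothetical convenience wrapper: copies a chunk without its leading BOM.
// Intended only for the first chunk of a file; a BOM-like byte sequence
// later in the data is left untouched.
inline std::string StripUTF8BOM(const std::string& chunk) {
  const char* begin = chunk.data();
  const char* end = begin + chunk.size();
  return std::string(SkipUTF8BOM(begin, end), end);
}
```

Because the check touches at most three bytes once per file, it should not affect the per-line cost of the ASCII parser that the earlier messages were concerned about.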