From: Matthias Boehm
Subject: Re: [Discuss] String requirements for data passed to SystemML Frames.
To: dev@systemml.incubator.apache.org
Date: Sat, 22 Oct 2016 11:45:19 +0200

OK, let me clarify a couple of things and provide an easy solution that resolves this issue altogether.

1) Escaping: transformencode, transformdecode, and transformapply do not remove quotes, which keeps their semantics easy to understand. If users want strings with different escaping policies to match the same entry, it is the user's responsibility to handle the unquoting. A nice side effect is that transformencode/transformapply and transformdecode are truly inverse operations, at least for reversible transformations like recoding and dummy coding.

2) Metadata frames: The schema of a metadata frame is one string column per original column, where each transformation type has its own serialization format. For example, for recoding, we serialize the distinct tokens together with their codes, in the order token, delim, code (one entry per row). We use quote-aware splitting when parsing this metadata as a best effort to handle cases where delim occurs inside a quoted token. Simply splitting on delim (as done in the "fix" by PR 274) would fail in this situation.

3) Solution: We could, however, simply flip the serialization format to code, delim, token, which allows splitting on the first occurrence of delim because the code is guaranteed not to include delim. Note that this would lose binary backwards compatibility with existing metadata frames, though.
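To make point 3 concrete, here is a minimal Python sketch of the two serialization orders. It is an illustration only, not SystemML's actual code: the comma delimiter and all function names are assumptions chosen for the example.

```python
# Illustrative sketch; DELIM and the helper names are hypothetical,
# not taken from the SystemML code base.
DELIM = ","


def serialize_token_first(token: str, code: int) -> str:
    """Entry layout: token, delim, code."""
    return f"{token}{DELIM}{code}"


def serialize_code_first(token: str, code: int) -> str:
    """Flipped entry layout: code, delim, token."""
    return f"{code}{DELIM}{token}"


def parse_token_first_naive(entry: str) -> tuple[str, int]:
    """Split on the first delimiter; breaks if the token contains DELIM."""
    token, code = entry.split(DELIM, 1)
    return token, int(code)


def parse_code_first(entry: str) -> tuple[str, int]:
    """Split on the first delimiter; safe because an integer code never contains DELIM."""
    code, token = entry.split(DELIM, 1)
    return token, int(code)


if __name__ == "__main__":
    token = '"San Jose, CA"'  # quoted token that contains the delimiter

    # The flipped layout round-trips without any quote-aware parsing.
    assert parse_code_first(serialize_code_first(token, 7)) == (token, 7)

    # The token-first layout fails with a plain split on the first delimiter.
    try:
        parse_token_first_naive(serialize_token_first(token, 7))
    except ValueError:
        print("token-first entry could not be parsed with a plain split")
```

With the code in front, the parser needs no knowledge of quoting rules at all, so tokens are stored and returned verbatim, quotes included.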
Regards,
Matthias

On 10/22/2016 11:14 AM, Berthold Reinwald wrote:
> Reading SystemML frames from CSV files, and splitting strings honoring
> quotes, separators, and escaping rules follows the RFC 4180
> specification (https://tools.ietf.org/html/rfc4180#page-2). Populating
> SystemML frames from CSV files is one way, but we can also bind and
> pass Spark DataFrames with string columns to SystemML frames. Today,
> we take the Spark DataFrame strings *as is* without any checking
> whether these string values e.g. contain quotes or separator symbols,
> and whether they are escaped accordingly. Our transform capabilities
> can deal with this situation but I am a little uneasy about the fact
> that depending on where the data strings in our frames come from, they
> comply with different rules. In the case of CSV files, the fields
> comply with RFC 4180, and in the case of Spark Dataframes, the strings
> are any Java/Scala string.
>
> This may or may not be an issue but I wanted to collect some thoughts on
> this topic. Things to consider are:
>
> - reading and writing a CSV file with and without
>   transformencode/transformdecode ... should it result in the same
>   input file?
>
> - through MLContext we receive a Spark Dataframe with strings, and in
>   SystemML, we write out the CSV file, and a subsequent DML script
>   wants to read the CSV file? Would you expect the CSV file to be
>   readable by SystemML? Keep in mind that the original scala/java
>   strings may not be properly escaped.
>
> Thoughts?
>
> Regards,
> Berthold Reinwald
> IBM Almaden Research Center
> office: (408) 927 2208; T/L: 457 2208
> e-mail: reinwald@us.ibm.com
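
The second question in the quoted message (arbitrary in-memory strings written to CSV and read back) can be illustrated with a small round trip. This sketch uses Python's standard csv module as a stand-in for an RFC 4180-style writer and reader; it says nothing about how SystemML's own CSV writer behaves, and the helper names are made up for the example.

```python
import csv
import io


def write_rows(rows: list[list[str]]) -> str:
    """Write rows as CSV text, quoting fields only where needed."""
    buf = io.StringIO(newline="")
    csv.writer(buf, quoting=csv.QUOTE_MINIMAL).writerows(rows)
    return buf.getvalue()


def read_rows(text: str) -> list[list[str]]:
    """Parse CSV text back into rows of unescaped strings."""
    return list(csv.reader(io.StringIO(text, newline="")))


if __name__ == "__main__":
    # Raw strings as they might arrive from a DataFrame: unescaped
    # separators and quotes inside the values.
    rows = [["San Jose, CA", 'say "hi"'], ["plain", ""]]

    text = write_rows(rows)
    # Fields with separators or quotes are quoted, embedded quotes doubled.
    assert read_rows(text) == rows
```

The round trip only holds because the writer escapes on output. Writing the raw strings joined by the separator, without quoting, would produce a file that a conforming reader splits differently.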