systemml-dev mailing list archives

From: Acs S <ac...@yahoo.com.INVALID>
Subject: Re: [Discuss] String requirements for data passed to SystemML Frames.
Date: Sat, 22 Oct 2016 16:32:22 GMT

1. I don't believe we are suggesting that quotes be removed. If we follow the RFC 4180 string
format, then input coming from any source should be converted to that format before any
processing occurs. This avoids exceptions when a string is not RFC 4180 compliant. We have
at least three options:
   a. Make the input string RFC 4180 compliant within the code by escaping quotes etc. (No
      burden on the user, but we need to be consistent to avoid different output depending
      on the input source.)
   b. Ask the user (programmer) to provide data in RFC 4180 format, for example through a
      utility we add to SystemML that the user applies to the input data as needed. (Users
      may not be happy with this, as it requires additional steps.)
   c. Output an error message and exit processing. (Again, users may not be happy with this.)
2. We know the limitation of the fix, which is documented in the code. It concerns Unicode that
may occur in the original string (token), a very uncommon scenario, but possible.
3. Regarding backward compatibility of existing meta data frames: are these meta data frames
part of previously released SystemML code? Does file-based transform generate such meta data
frames?
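
To make option (a) concrete, here is a minimal sketch of RFC 4180 style field escaping. The
function name and the default delimiter are hypothetical and chosen only for illustration;
this is not the SystemML implementation.

```python
def to_rfc4180_field(value: str, delim: str = ",") -> str:
    """Return value as an RFC 4180 style field (hypothetical helper).

    A field is wrapped in double quotes only if it contains the delimiter,
    a double quote, or a line break.
    """
    needs_quoting = any(ch in value for ch in (delim, '"', "\r", "\n"))
    if not needs_quoting:
        return value
    # RFC 4180: a double quote inside a quoted field is escaped by doubling it.
    return '"' + value.replace('"', '""') + '"'


# Strings as they might arrive from an arbitrary Java/Scala source.
for raw in ["plain", "a,b", 'say "hi"']:
    print(repr(raw), "->", to_rfc4180_field(raw))
```

Applying such a conversion at every entry point would be one way to get the same behavior
regardless of input source, at the cost of changing the stored string values.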
-Arvind

From: Matthias Boehm <mboehm7@googlemail.com>
To: dev@systemml.incubator.apache.org
Sent: Saturday, October 22, 2016 3:23 AM
Subject: Re: [Discuss] String requirements for data passed to SystemML Frames.

OK, let me clarify a couple of things and provide an easy solution that
resolves this issue altogether.

1) Escaping: transformencode, transformdecode, and transformapply do not
remove quotes, in order to provide easy-to-understand semantics. If users
want to match strings with different escaping policies to the same entry,
it's the user's responsibility to handle the unquoting. The nice side
effect is that transformencode/transformapply and transformdecode are
truly inverse operations, at least for reversible transformations like
recoding and dummy coding.

2) Metadata frames: The schema for meta data frames is a string column 
per original column where each transformation type has its special 
serialization format. For example, for recoding, we serialize distinct
{token}{delim}{code} entries (one entry per row). The reason why we use
quote-aware splitting when parsing this meta data is a best effort to
handle cases where {delim} occurs inside the quoted token. Simply
splitting on {delim} (as done in the "fix" by PR 274) would fail in this
situation.
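
The difference can be sketched as follows. This is only an illustration under assumptions:
the comma delimiter, the sample entries, and the helper name are hypothetical, and the
splitter is a simplified stand-in rather than the actual SystemML code.

```python
def split_quote_aware(entry: str, delim: str = ",") -> list:
    """Split on delim, ignoring delimiters inside double quotes.

    Quotes are kept in the output, matching the "do not remove quotes" policy.
    """
    parts, current, in_quotes = [], [], False
    for ch in entry:
        if ch == '"':
            in_quotes = not in_quotes  # toggle on every quote character
            current.append(ch)
        elif ch == delim and not in_quotes:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    parts.append("".join(current))
    return parts


# {token}{delim}{code} entry whose quoted token contains the delimiter.
entry = '"New York, NY",3'
print(entry.split(","))          # naive split: three pieces, token is broken
print(split_quote_aware(entry))  # quote-aware: token and code

# Best effort only: an unbalanced quote (possible in arbitrary Java/Scala
# strings) hides every later delimiter, so the entry is not split at all.
print(split_quote_aware('a"b,c,7'))
```

The last case shows why quote-aware splitting only helps when the token follows consistent
quoting rules.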

3) Solution: We could, however, simply flip the serialization format to 
{code}{delim}{token} which allows splitting on the first occurrence of 
{delim} because {code} is guaranteed not to include {delim}. Note that
this would lose binary backwards compatibility with existing meta data
frames though.
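
A minimal sketch of the flipped format, assuming an integer code and a comma as {delim}
(both assumptions for illustration; the function names are hypothetical):

```python
def serialize_recode(code: int, token: str, delim: str = ",") -> str:
    """Serialize one recode entry as {code}{delim}{token}."""
    return f"{code}{delim}{token}"


def parse_recode(entry: str, delim: str = ",") -> tuple:
    """Split on the first delim only; the rest is the token, verbatim."""
    code, token = entry.split(delim, 1)
    return int(code), token


# The token may contain delimiters and unbalanced quotes without any escaping.
token = 'a"b,c'
entry = serialize_recode(7, token)
print(entry)
print(parse_recode(entry))
```

Because the integer code never contains the delimiter, the parse needs no quote handling
and the token round-trips unchanged.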

Regards,
Matthias


On 10/22/2016 11:14 AM, Berthold Reinwald wrote:
> Reading SystemML frames from CSV files, and splitting strings honoring
> quotes, separators, and escaping rules follows the RFC 4180
> specification (https://tools.ietf.org/html/rfc4180#page-2). Populating
> SystemML frames from CSV files is one way, but we can also bind and
> pass Spark DataFrames with string columns to SystemML frames. Today,
> we take the Spark DataFrame strings *as is* without any checking
> whether these string values e.g. contain quotes or separator symbols,
> and whether they are escaped accordingly. Our transform capabilities
> can deal with this situation but I am a little uneasy about the fact
> that depending on where the data strings in our frames come from, they
> comply with different rules. In the case of CSV files, the fields
> comply with RFC 4180, and in the case of Spark DataFrames, the strings
> are arbitrary Java/Scala strings.
>
> This may or may not be an issue but I wanted to collect some thoughts on
> this topic. Things to consider are:
>
> - reading and writing a CSV file with and without
>  transformencode/transformdecode ... should it result in the same
>  input file?
>
> - through MLContext we receive a Spark DataFrame with strings, and in
>  SystemML, we write out the CSV file, and a subsequent DML script
>  wants to read the CSV file. Would you expect the CSV file to be
>  readable by SystemML? Keep in mind that the original Scala/Java
>  strings may not be properly escaped.
>
> Thoughts?
>
> Regards,
> Berthold Reinwald
> IBM Almaden Research Center
> office: (408) 927 2208; T/L: 457 2208
> e-mail: reinwald@us.ibm.com
>
>


   
Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message