systemml-dev mailing list archives

From "Frederick R Reiss" <frre...@us.ibm.com>
Subject Re: [Discuss] String requirements for data passed to SystemML Frames.
Date Wed, 26 Oct 2016 23:29:09 GMT

The standard way to deal with escape characters in CSV files is to convert
every string in the CSV file into an unambiguous binary format before
performing any data wrangling tasks.

By "standard", I mean:
In Python, standard practice is to read CSV data into a dataframe with
pandas.read_csv(), passing appropriate arguments to pandas.read_csv() to
de-escape any escaped characters in the file. Default behavior is to
de-escape quoted strings.
In R, standard practice is to read CSV data into a data frame with read.csv
(), passing appropriate arguments to read.csv() to de-escape any escaped
characters in the file. Default behavior is to de-escape quoted strings.
In all flavors of Spark, standard practice is to read CSV data into a
dataframe with the spark-csv package (com.databricks.spark.csv), passing
appropriate arguments to spark-csv to de-escape any escaped characters in
the file. Default behavior is to de-escape quoted strings.
In Java, standard practice is to read strings in CSV files into Java
Strings with commons-CSV or a similar library, passing appropriate
arguments to commons-CSV to de-escape any escaped characters in the file.
Default behavior is to de-escape quoted strings.
In DB2, standard practice is to read CSV files into tables with IMPORT FROM
<file> OF DEL, passing appropriate arguments to IMPORT to de-escape any
escaped characters in the file. Default behavior is to de-escape quoted
strings.

I could go on, but hopefully people get the point.
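As a concrete illustration of the default behavior all of these readers share, here is a minimal sketch using Python's standard-library csv module (chosen here only because it is self-contained; pandas, spark-csv, and commons-CSV de-escape quoted strings the same way by default). The sample data is hypothetical.

```python
import csv
import io

# A CSV cell containing a comma and embedded (doubled) quotes,
# quoted per the usual CSV convention.
raw = 'id,comment\n1,"She said ""hi"", then left"\n'

rows = list(csv.reader(io.StringIO(raw)))
# The reader de-escapes: outer quotes are stripped and "" becomes ".
print(rows[1][1])  # She said "hi", then left
```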

There are multiple ways of representing a given string using escapes, but
all of those representations denote the same string. Treating different
representations of the same string as different strings is incorrect.
Perhaps in a different world with different conventions it would be
considered correct, but in the world we inhabit, users expect strings to be
translated into an unambiguous canonical representation when they are
imported into a system for data wrangling.

When we read a string from a CSV file, we should convert it into its binary
representation as a Java string, removing any escapes that were added in
order to represent the string in CSV format. Similarly, when we write a
string to a CSV file, we should escape special characters according to the
same policy applied in reverse.

It would be better to support more than one policy for escaping and
de-escaping, but it is sufficient for now to have a single global policy
that we document clearly and implement consistently.

I think the correct way forward here is:
1. Document exactly the type of string escaping that SystemML supports in
CSV files. It would be best to adopt the industry-standard policy of using
double quotes to delimit strings and escaping any double quotes within a
string. Since we want to read files from multiple threads, we'll want to
require newlines within strings to be escaped.
2. Ensure that the escaping policy is implemented in a consistent way
across all methods that read and write CSV data.
3. Add tests to ensure that the escaping policy is correctly implemented
for different combinations of read and write operations. Be sure to test
cases like writing strings from Spark dataframes into CSV files via
SystemML dataframes.
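To make step 1 concrete, here is a sketch of such a policy: doubled quotes per the industry standard, plus a hypothetical backslash escape for newlines (one possible choice, not an existing SystemML behavior) so that every physical line is a complete record and files can be read from multiple threads. A full implementation would also need to escape backslashes themselves; that is omitted here for brevity.

```python
def escape_field(s: str) -> str:
    """Escape one field for writing: double embedded quotes, replace
    literal newlines, and wrap in quotes when special characters occur."""
    needs_quotes = any(c in s for c in ',"\n')
    s = s.replace('"', '""').replace('\n', '\\n')
    return f'"{s}"' if needs_quotes else s

def unescape_field(s: str) -> str:
    """Inverse of escape_field: strip quotes, undo both escapes."""
    if s.startswith('"') and s.endswith('"'):
        s = s[1:-1]
    return s.replace('""', '"').replace('\\n', '\n')

# The two functions are exact inverses for this policy.
f = 'a "quoted"\nvalue'
assert unescape_field(escape_field(f)) == f
```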

Fred



From:	Acs S <acs_s@yahoo.com.INVALID>
To:	"dev@systemml.incubator.apache.org"
            <dev@systemml.incubator.apache.org>
Date:	10/22/2016 09:33 AM
Subject:	Re: [Discuss] String requirements for data passed to SystemML
            Frames.



1. I don't believe we are suggesting removing quotes. If we follow the RFC
4180 string format, then input coming from any source should be converted
to the RFC 4180 format before any processing occurs. This avoids exceptions
when a string is not compliant with RFC 4180. We have at least three
options:
   a. Make the input string compliant with RFC 4180 within the code by
escaping quotes etc. (No burden on the user, but we need to be consistent
to avoid different output depending on the input source.)
   b. Ask the user (programmer) to put the data in RFC 4180 format, e.g.
through some utility we add to SystemML that the user applies to the input
data. (The user may not be happy, as this requires additional steps.)
   c. Output an error message and stop processing. (Again, the user may not
be happy.)
2. We know the limitation of the fix, which is documented in the code:
Unicode characters may occur in the original string (token). This is a very
uncommon scenario, but possible.
3. Regarding backward compatibility of existing metadata frames: are these
metadata frames part of earlier released SystemML code? Does the file-based
transform generate such metadata frames?
-Arvind

      From: Matthias Boehm <mboehm7@googlemail.com>
 To: dev@systemml.incubator.apache.org
 Sent: Saturday, October 22, 2016 3:23 AM
 Subject: Re: [Discuss] String requirements for data passed to SystemML
Frames.

OK, let me clarify a couple of things and provide an easy solution that
resolves this issue altogether.

1) Escaping: transformencode, transformdecode, and transformapply do not
remove quotes, in order to provide easy-to-understand semantics. If users
want to match strings with different escaping policies to the same entry,
it is the user's responsibility to handle the unquoting. The nice side
effect is that transformencode/transformapply and transformdecode are truly
inverse operations, at least for reversible transformations like recoding
and dummy coding.

2) Metadata frames: The schema for metadata frames is a string column per
original column, where each transformation type has its own serialization
format. For example, for recoding, we serialize distinct
{token}{delim}{code} pairs (one entry per row). The reason we use
quote-aware splitting when parsing this metadata is a best-effort attempt
to handle cases where {delim} occurs inside the quoted token. Simply
splitting on {delim} (as done in the "fix" by PR 274) would fail in this
situation.
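A small sketch of the failure mode, with a hypothetical metadata entry in the {token}{delim}{code} format where delim = ',' and the token itself contains the delimiter (so the token is quoted on disk):

```python
import csv
import io

# Hypothetical recode-map entry: token "New York, NY" mapped to code 7.
entry = '"New York, NY",7'

# A naive split on the delimiter breaks the token into pieces:
print(entry.split(','))    # ['"New York', ' NY"', '7']

# Quote-aware splitting recovers the intended (token, code) pair:
token, code = next(csv.reader(io.StringIO(entry)))
print(token, code)         # New York, NY 7
```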

3) Solution: We could, however, simply flip the serialization format to
{code}{delim}{token}, which allows splitting on the first occurrence of
{delim} because {code} is guaranteed not to include {delim}. Note that this
would lose binary backward compatibility with existing metadata frames,
though.
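With the flipped format the parse becomes trivial, since only the first delimiter matters (same hypothetical entry as above, serialized the other way around):

```python
# {code}{delim}{token}: the code cannot contain the delimiter, so
# splitting on the FIRST occurrence is always unambiguous -- no
# quote-aware parsing needed.
entry = '7,New York, NY'

code, token = entry.split(',', 1)   # split on the first delimiter only
print(code)    # 7
print(token)   # New York, NY
```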

Regards,
Matthias


On 10/22/2016 11:14 AM, Berthold Reinwald wrote:
> Reading SystemML frames from CSV files, and splitting strings honoring
> quotes, separators, and escaping rules follows the RFC 4180
> specification (https://tools.ietf.org/html/rfc4180#page-2). Populating
> SystemML frames from CSV files is one way, but we can also bind and
> pass Spark DataFrames with string columns to SystemML frames. Today,
> we take the Spark DataFrame strings *as is* without any checking
> whether these string values e.g. contain quotes or separator symbols,
> and whether they are escaped accordingly. Our transform capabilities
> can deal with this situation but I am a little uneasy about the fact
> that depending on where the data strings in our frames come from, they
> comply with different rules. In the case of CSV files, the fields
> comply with RFC 4180, and in the case of Spark Dataframes, the strings
> are any Java/Scala string.
>
> This may or may not be an issue but I wanted to collect some thoughts on
> this topic. Things to consider are:
>
> - reading and writing a CSV file with and without
>  transformencode/transformdecode ... should it result in the same
>  input file?
>
> - through MLContext we receive a Spark Dataframe with strings, and in
>  SystemML, we write out the CSV file, and a subsequent DML script
>  wants to read the CSV file? Would you expect the CSV file to be
>  readable by SystemML? Keep in mind that the original scala/java
>  strings may not be properly escaped.
>
> Thoughts?
>
> Regards,
> Berthold Reinwald
> IBM Almaden Research Center
> office: (408) 927 2208; T/L: 457 2208
> e-mail: reinwald@us.ibm.com
>
>




