spark-issues mailing list archives

From "Jose Torres (JIRA)" <>
Subject [jira] [Commented] (SPARK-23325) DataSourceV2 readers should always produce InternalRow.
Date Tue, 06 Mar 2018 19:37:00 GMT


Jose Torres commented on SPARK-23325:

How hard would it be to just declare that InternalRow is stable? The file has been touched
about once per year for the past 3 years, and I doubt we'd be able to change it to any significant
degree without risking serious regressions.

From my perspective, and I think (but correct me if I'm wrong) the perspective of the
SPIP, a stable interface which can match the performance of Spark's internal data sources
is one of the core goals of DataSourceV2. If high-performance sources must implement InternalRow
reads and writes, then DataSourceV2 isn't stable until InternalRow is stable anyway.

> DataSourceV2 readers should always produce InternalRow.
> -------------------------------------------------------
>                 Key: SPARK-23325
>                 URL:
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Ryan Blue
>            Priority: Major
> DataSourceV2 row-oriented implementations are limited to producing either {{Row}} instances or
{{UnsafeRow}} instances by implementing {{SupportsScanUnsafeRow}}. Instead, I think that implementations
should always produce {{InternalRow}}.
> The problem with the choice between {{Row}} and {{UnsafeRow}} is that neither one is
appropriate for implementers.
> File formats don't produce {{Row}} instances or the data values used by {{Row}}, like {{java.sql.Timestamp}}
and {{java.sql.Date}}. An implementation that uses {{Row}} instances must produce data that
is immediately translated from the representation that was just produced by Spark. In my experience,
it made little sense to translate a timestamp in microseconds to a (milliseconds, nanoseconds)
pair, create a {{Timestamp}} instance, and pass that instance to Spark for immediate translation back.
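
The round trip described above can be sketched in plain Java. This is only an illustration, assuming timestamps are held as a long count of microseconds since the epoch; the class and method names are hypothetical and this is not Spark's own conversion code.

```java
import java.sql.Timestamp;

// Hypothetical helper, for illustration only (not part of Spark).
final class TimestampRoundTrip {
    // Step 1: split microseconds into whole seconds plus a sub-second part,
    // then build a java.sql.Timestamp from (milliseconds, nanoseconds).
    static Timestamp toJavaTimestamp(long micros) {
        long seconds = Math.floorDiv(micros, 1_000_000L);
        long microsOfSecond = Math.floorMod(micros, 1_000_000L);
        Timestamp ts = new Timestamp(seconds * 1000L);
        ts.setNanos((int) (microsOfSecond * 1000L));
        return ts;
    }

    // Step 2: the consumer immediately undoes step 1 to get microseconds back.
    static long fromJavaTimestamp(Timestamp ts) {
        long seconds = Math.floorDiv(ts.getTime(), 1000L);
        return seconds * 1_000_000L + ts.getNanos() / 1000L;
    }
}
```

Both steps allocate or unpack an object only to arrive back at the original long, which is the overhead the description objects to.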
> On the other hand, {{UnsafeRow}} is very difficult to produce unless data is already
held in memory. Even the Parquet support built into Spark deserializes to {{InternalRow}} and
then uses {{UnsafeProjection}} to produce unsafe rows. When I went to build an implementation
that deserializes Parquet or Avro directly to {{UnsafeRow}} (I tried both), I found that
it couldn't be done without first deserializing into memory because the size of an array must
be known before any values are written.
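
A minimal sketch of why a size-prefixed layout forces buffering. The layout here (an 8-byte element count followed by 8-byte values) is a deliberate simplification assumed for illustration, not the actual {{UnsafeRow}} array format, and the class name is hypothetical.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical writer for a simplified size-prefixed array (not Spark's format).
final class SizePrefixedArray {
    static ByteBuffer write(Iterator<Long> values) {
        // A streaming decoder does not know the element count up front,
        // so every value must be held in memory before the header is written.
        List<Long> buffered = new ArrayList<>();
        while (values.hasNext()) {
            buffered.add(values.next());
        }
        ByteBuffer out = ByteBuffer.allocate(8 + 8 * buffered.size());
        out.putLong(buffered.size());   // the count comes first in the layout
        for (long v : buffered) {
            out.putLong(v);             // values follow the count
        }
        out.flip();
        return out;
    }
}
```

The intermediate list is the "deserializing into memory" step: the values are materialized once to learn the size and then copied again into the final layout.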
> I ended up deciding to deserialize to {{InternalRow}} and use {{UnsafeProjection}}
to convert to unsafe. There are two problems with this: first, this is Scala and was difficult
to call from Java (it required reflection), and second, this causes double projection in
the physical plan (a copy for unsafe to unsafe) if there is a projection that wasn't fully
pushed to the data source.
> I think the solution is to have a single interface for readers that expects {{InternalRow}}.
Then, a projection should be added in the Spark plan to convert to unsafe and avoid projection
in the plan and in the data source. If the data source already produces unsafe rows by deserializing
directly, this still minimizes the number of copies because the unsafe projection will check
whether the incoming data is already {{UnsafeRow}}.
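
The pass-through check mentioned above might look roughly like the following. All types here are hypothetical stand-ins ({{PackedRow}} playing the role of an unsafe row) and do not reproduce Spark's {{UnsafeProjection}}; the sketch only shows the idea of skipping the copy when the input is already in the target format.

```java
// Hypothetical stand-in types, for illustration only (not Spark classes).
final class PassThroughProjection {
    interface Row {
        int numFields();
        long getLong(int ordinal);
    }

    // Generic row backed by boxed objects, as a deserializer might produce.
    static final class ObjectRow implements Row {
        private final Long[] values;
        ObjectRow(Long... values) { this.values = values; }
        public int numFields() { return values.length; }
        public long getLong(int ordinal) { return values[ordinal]; }
    }

    // Compact row backed by primitive words, standing in for the unsafe format.
    static final class PackedRow implements Row {
        private final long[] words;
        PackedRow(long[] words) { this.words = words; }
        public int numFields() { return words.length; }
        public long getLong(int ordinal) { return words[ordinal]; }
    }

    // Convert to the packed format, copying only when the input is not packed yet.
    static PackedRow toPacked(Row row) {
        if (row instanceof PackedRow) {
            return (PackedRow) row;     // already packed: no copy
        }
        long[] words = new long[row.numFields()];
        for (int i = 0; i < words.length; i++) {
            words[i] = row.getLong(i);
        }
        return new PackedRow(words);
    }
}
```

With a check like this in one place in the plan, a source that already emits the packed format pays for no extra copy, and a source that emits generic rows pays for exactly one.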
> Using {{InternalRow}} would also match the interface on the write side.
