spark-issues mailing list archives

From "Joe Halliwell (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SPARK-7366) Support multi-line JSON objects via a depth hint
Date Tue, 05 May 2015 10:59:59 GMT

     [ https://issues.apache.org/jira/browse/SPARK-7366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joe Halliwell updated SPARK-7366:
---------------------------------
    Description: 
h2. Background

The present object-per-line format for ingesting JSON data has a couple of deficiencies:
1. It's not itself JSON
2. It's often harder for humans to read

The object-per-file format addresses these, but at the cost of producing many files, which can
be unwieldy.

Since it is feasible to read and write large JSON files via streaming (and many systems do),
it seems reasonable to support them directly as an input format.

h2. Suggested approach

The key challenge is to find record boundaries without parsing the file from the start, i.e.
given an offset, locate a nearby boundary. In the general case this is impossible, as you can't
be sure you've identified the start of a top-level record without tracing back to the start
of the file.

However, if you know something about the format of the file, e.g. its maximum object depth, it
seems plausible that we can do better.
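
As a rough illustration of how a depth hint might help, here is a minimal Python sketch. It is
one possible reading of the idea, not a worked-out design. It assumes that the file is a
sequence of whitespace-separated top-level objects, that string values never contain braces or
brackets, and that max_depth is the exact deepest nesting that occurs (a top-level record counts
as depth 1). The name find_record_start and its parameters are hypothetical. The scan tracks
depth relative to the unknown depth at the offset; once the observed range spans max_depth
levels, the lowest level seen must be the gap between records, so the first '{' opened at that
level starts a top-level record.

```python
def find_record_start(text, offset, max_depth):
    """Return the index of the first provable top-level '{' at or after offset.

    Assumptions (all hypothetical): top-level records are objects, strings
    contain no braces or brackets, and max_depth is actually reached.
    Returns None if the hint never pins down the absolute depth.
    """
    rel = 0          # depth relative to the unknown depth at `offset`
    lo = hi = 0      # lowest / highest relative depth seen so far
    first_open = {}  # relative depth -> index of the first '{' opened there
    for i in range(offset, len(text)):
        ch = text[i]
        if ch in "{[":
            if ch == "{":
                first_open.setdefault(rel, i)
            rel += 1
        elif ch in "}]":
            rel -= 1
        else:
            continue
        lo, hi = min(lo, rel), max(hi, rel)
        # Absolute depth lies in [0, max_depth]; a span of max_depth levels
        # means `lo` is absolute depth 0, i.e. outside every record.
        if hi - lo == max_depth and lo in first_open:
            return first_open[lo]
    return None


sample = '{"a": {"b": 1}}\n{"c": {"d": 2}}\n{"e": {"f": 3}}\n'
start = find_record_start(sample, 7, max_depth=2)  # offset 7 is inside a nested object
print(start, sample[start:].splitlines()[0])
```

If the data near the offset is shallower than the hint, the function returns None rather than
guessing, so a caller would need a fallback such as scanning further or from the start of the
file.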

  was:
The present object-per-line format for ingesting JSON data has a couple of deficiencies:
1. It's not itself JSON
2. It's often harder for humans to read

The object-per-file format addresses these, but at a cost of producing many files which can
be unwieldy.

Since it is feasible to read and write large JSON files via streaming (and many systems do)
it seems reasonable to support them directly as an input format.

The key challenge is to find record boundaries without parsing the file from the start i.e.
given an offset, locate a nearby boundary. In the general case this is impossible as you can't
be sure you've identified the start of a top-level record without tracing back to the start
of a file.

However, if you know something about the format of the file i.e. maximum object depth it seems
plausible that we can do better.


> Support multi-line JSON objects via a depth hint 
> -------------------------------------------------
>
>                 Key: SPARK-7366
>                 URL: https://issues.apache.org/jira/browse/SPARK-7366
>             Project: Spark
>          Issue Type: Improvement
>          Components: Input/Output
>            Reporter: Joe Halliwell
>            Priority: Minor



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

