spark-user mailing list archives

From Steve Loughran <ste...@hortonworks.com>
Subject Re: Why the json file used by sparkSession.read.json must be a valid json object per line
Date Thu, 20 Oct 2016 10:28:34 GMT

> On 19 Oct 2016, at 21:46, Jakob Odersky <jakob@odersky.com> wrote:
> 
> Another reason I could imagine is that files are often read from HDFS,
> which by default uses line terminators to separate records.
> 
> It is possible to implement your own hdfs delimiter finder, however
> for arbitrary json data, finding that delimiter would require stateful
> parsing of the file and would be difficult to parallelize across a
> cluster.
> 


good point. 

If you are creating your own files containing a list of JSON records, then you could do your own encoding:
say a header for each record ('J'+'S'+'O'+'N' + int64 length), and split on that.
You don't need to scan a record to know its length, and you can count the records in a large
document simply through a sequence of skip + read(byte[8]) operations.
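As a minimal sketch of that framing idea (not Spark/HDFS code; the 'JSON' magic and big-endian int64 length are assumptions taken from the suggestion above), each record is written as a fixed header plus its payload, and counting records only ever reads headers:

```python
import io
import json
import struct

MAGIC = b"JSON"  # hypothetical 4-byte record marker, per the suggestion above

def write_records(buf, records):
    # Frame each record as MAGIC + int64 length + UTF-8 JSON payload.
    for rec in records:
        payload = json.dumps(rec).encode("utf-8")
        buf.write(MAGIC)
        buf.write(struct.pack(">q", len(payload)))  # big-endian int64 length
        buf.write(payload)

def count_records(buf):
    # Count records by reading only the 12-byte headers and seeking
    # past each payload; the JSON itself is never parsed.
    buf.seek(0)
    count = 0
    while True:
        header = buf.read(12)  # 4-byte magic + 8-byte length
        if len(header) < 12:
            break
        if header[:4] != MAGIC:
            raise ValueError("corrupt stream: bad record marker")
        (length,) = struct.unpack(">q", header[4:])
        buf.seek(length, io.SEEK_CUR)  # skip the payload
        count += 1
    return count

buf = io.BytesIO()
write_records(buf, [{"a": 1}, {"b": [2, 3]}, {"c": "x"}])
print(count_records(buf))  # → 3
```

Because record boundaries are found by fixed-size header reads rather than by parsing, a splitter can also locate the next record start without any stateful JSON parsing.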

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org

