spark-user mailing list archives

From Steve Loughran <>
Subject Re: Why the json file used by must be a valid json object per line
Date Thu, 20 Oct 2016 10:28:34 GMT

> On 19 Oct 2016, at 21:46, Jakob Odersky <> wrote:
> Another reason I could imagine is that files are often read from HDFS,
> which by default uses line terminators to separate records.
> It is possible to implement your own hdfs delimiter finder, however
> for arbitrary json data, finding that delimiter would require stateful
> parsing of the file and would be difficult to parallelize across a
> cluster.

good point. 
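To illustrate the point: with one JSON object per line, record boundaries are just newlines, so any chunk of the file that starts and ends at a newline can be processed independently. A minimal sketch (the object and method names here are made up for illustration):

```scala
// One JSON object per line means counting (or splitting) records needs
// no JSON parsing at all -- newline is the record delimiter.
object JsonLines {
  def countRecords(text: String): Int =
    text.split('\n').count(_.trim.nonEmpty)
}
```

With arbitrary multi-line JSON you would instead need a stateful parser tracking brace depth and string escapes, which is exactly what makes splitting hard.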

If you are creating your own files containing a list of JSON records, then you could do your own encoding: say a header for each record ('J'+'S'+'O'+'N' + int64 length), and split on that. You don't need to scan a record to know its length, and you can count the records in a large document simply through a sequence of skip + read operations.
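A rough sketch of that framing (the object name and helper methods are hypothetical, not anything Spark or HDFS provides): each record is the 4 magic bytes plus an int64 length, followed by the payload, so counting records is a loop of header reads and payload skips.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, DataInputStream, DataOutputStream}
import java.nio.charset.StandardCharsets

// Hypothetical length-prefixed framing: 'J','S','O','N' + int64 length + payload.
object FramedJson {
  val Magic: Array[Byte] = "JSON".getBytes(StandardCharsets.US_ASCII)

  def write(records: Seq[String]): Array[Byte] = {
    val bytes = new ByteArrayOutputStream()
    val out = new DataOutputStream(bytes)
    for (r <- records) {
      val payload = r.getBytes(StandardCharsets.UTF_8)
      out.write(Magic)               // 4-byte record marker
      out.writeLong(payload.length)  // int64 payload length
      out.write(payload)
    }
    out.flush()
    bytes.toByteArray
  }

  // Count records without touching the JSON itself: read each 12-byte
  // header, then skip the payload it describes.
  def count(data: Array[Byte]): Int = {
    val in = new DataInputStream(new ByteArrayInputStream(data))
    var n = 0
    while (in.available() >= 12) {
      in.skipBytes(4)                // skip the 'JSON' marker
      val len = in.readLong()
      in.skipBytes(len.toInt)        // skip the payload
      n += 1
    }
    n
  }
}
```

The marker bytes also give a reader something to resynchronise on if it is dropped into the middle of a split, which is the property newline gives the line-per-record format.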
