hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Gopal Vijayaraghavan <>
Subject Re: Loading data containing newlines
Date Fri, 15 Jan 2016 22:37:24 GMT

> You can open a file as an RDD of lines, and map whatever custom
>tokenisation function you want over it;

That's what a SerDe does in Hive (like OpenCSVSerDe).

Once your record gets split into multiple lines, then the problem becomes
more complex since Spark's functional nature demands side-effect free
map() operations.

You cannot depend on the previous row for any map(), particularly because
the natural operations are unordered lazy.

The only way to get proper contractual ordering is to get something via
OrderedRDDFunctions [1].

> alternatively you can partition down to a reasonable size and use
>map_partitions to map the standard python csv parser over the partitions.

Yet again, this does not work once you get to delimiter interspersed data
like the case discussed.

Partitioning the RDD, when you have a mixed newlines will give you half a
row as the first item in a partition.

> In general, the advantage of spark is that you can do anything you like
>rather than being limited to a specific set of primitives.

That is true once you've cut up the record boundaries into an RDD, but as
long as you're using .textFile() you have the exact same annoyances that
hadoop TextInputFormat has.

HadoopRDD has the same issues with unescaped interspersed delimiters,
where a "\n" cannot be identified as a delimiter purely by itself or its
previous byte.

Once you write an InputFormat which has stateful readers across lines for
Hadoop, then that can be used by Spark too - but Spark by itself can't fix
delimiter interspersing.

I've had to implement this before and it wasn't simple when you scale it
up past 1 HDFS block of input without losing data.

However, when you're doing data cleansing Spark makes it really easy to
drop partial rows and move ahead unlike something like Hive.

[1] -

View raw message