flink-user mailing list archives

From Ken Krugler <kkrugler_li...@transpac.com>
Subject Re: Regarding json/xml/csv file splitting
Date Mon, 04 Feb 2019 19:25:33 GMT
Normally, parallel processing of text input files is handled via Hadoop's TextInputFormat, which
supports splitting files on line boundaries at (roughly) HDFS block boundaries.
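The split-handling trick is that each reader skips the partial line at the start of its split (the previous reader finishes it), and reads past the end of its split to complete its last record. A rough sketch of that logic in plain Java over an in-memory byte array (not the actual Hadoop implementation, just an illustration of the semantics):

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class SplitReadSketch {

    // Read the records belonging to the split [start, end): skip the partial
    // line at the split start (it belongs to the previous split), then read
    // every line that *starts* before `end`, running past `end` if needed
    // to finish the last record.
    static List<String> readSplit(byte[] data, int start, int end) {
        List<String> records = new ArrayList<>();
        int pos = start;
        if (start != 0) {
            // Advance to the first byte after the next newline, unless the
            // split happens to begin exactly at a line boundary.
            while (pos < data.length && data[pos - 1] != '\n') {
                pos++;
            }
        }
        while (pos < end && pos < data.length) {
            int lineEnd = pos;
            while (lineEnd < data.length && data[lineEnd] != '\n') {
                lineEnd++;
            }
            records.add(new String(data, pos, lineEnd - pos, StandardCharsets.UTF_8));
            pos = lineEnd + 1;
        }
        return records;
    }

    public static void main(String[] args) {
        byte[] data = "emp1\nemp2\nemp3\nemp4\nemp5\n".getBytes(StandardCharsets.UTF_8);
        // Three splits whose boundaries fall mid-record:
        System.out.println(readSplit(data, 0, 8));   // [emp1, emp2]
        System.out.println(readSplit(data, 8, 16));  // [emp3, emp4]
        System.out.println(readSplit(data, 16, 25)); // [emp5]
    }
}
```

Note how a record straddling a split boundary (emp2, emp4) is read by exactly one reader, so no records are lost or duplicated.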

There are various XML Hadoop InputFormats available that try to sync up with splittable
locations. The one I’ve used in the past <https://github.com/apache/mahout/blob/ad84344e4055b1e6adff5779339a33fa29e1265d/examples/src/main/java/org/apache/mahout/classifier/bayes/XmlInputFormat.java>
is part of the Mahout project.

If each JSON record is on its own line, then you can just use a regular source, and parse
each line in a subsequent map function.
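For newline-delimited JSON the whole pipeline is "read lines, parse each one in a map step". A minimal stdlib-only sketch of that shape (the regex extraction and the `name` field are purely illustrative stand-ins; a real job would use a proper JSON parser such as Jackson inside a Flink MapFunction):

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class JsonLinesSketch {

    // Naive stand-in for real JSON parsing, only for illustration.
    private static final Pattern NAME = Pattern.compile("\"name\"\\s*:\\s*\"([^\"]*)\"");

    static String extractName(String jsonLine) {
        Matcher m = NAME.matcher(jsonLine);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // The "source": one JSON record per line.
        String input = "{\"name\":\"emp1\",\"dept\":\"eng\"}\n"
                     + "{\"name\":\"emp2\",\"dept\":\"ops\"}\n";
        List<String> names = input.lines()
                                  .map(JsonLinesSketch::extractName) // the "map function"
                                  .collect(Collectors.toList());
        System.out.println(names); // [emp1, emp2]
    }
}
```

Because each record is confined to one line, this parallelizes exactly like plain text: the line-boundary split logic above works unchanged.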

Otherwise, you can still create a custom input format, as long as there’s some unique JSON
text that identifies the beginning/end of each record.

See https://stackoverflow.com/questions/18593595/custom-inputformat-for-reading-json-in-hadoop
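The core of such a custom format is just scanning for that unique start marker. A sketch under the assumption that a known token (here the hypothetical `{"emp"`) uniquely marks the start of each record, and never appears inside a record's string values:

```java
import java.util.ArrayList;
import java.util.List;

public class DelimitedJsonSketch {

    // Split concatenated JSON records on a start token that is assumed to
    // uniquely mark the beginning of each record. If the token could occur
    // inside a string value, this naive scan would mis-split.
    static List<String> splitRecords(String data, String startToken) {
        List<String> records = new ArrayList<>();
        int pos = data.indexOf(startToken);
        while (pos >= 0) {
            int next = data.indexOf(startToken, pos + startToken.length());
            records.add(data.substring(pos, next < 0 ? data.length() : next).trim());
            pos = next;
        }
        return records;
    }

    public static void main(String[] args) {
        String data = "{\"emp\":\"emp1\"}\n{\"emp\":\"emp2\"}\n";
        System.out.println(splitRecords(data, "{\"emp\""));
        // [{"emp":"emp1"}, {"emp":"emp2"}]
    }
}
```

In an actual InputFormat, the same scan would also serve as the sync step: after seeking to a split boundary, advance to the next occurrence of the token before emitting records.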

And failing that, you can always build a list of file paths as your input, and then in your
map function explicitly open, read, and process each file as you would any JSON file. In
a past project with a similar requirement, the only interesting challenge was building
N lists of files (for N mappers) such that the sum of file sizes was roughly equal for each
parallel map task, since there was significant skew in the file sizes.
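That balancing step can be approximated with a greedy longest-processing-time heuristic: sort files by size descending, then always assign the next file to the currently lightest list. A sketch (the file names and sizes are hypothetical):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class BalancedFileLists {

    static class FileInfo {
        final String path;
        final long bytes;
        FileInfo(String path, long bytes) { this.path = path; this.bytes = bytes; }
    }

    static class Bucket {
        final List<String> paths = new ArrayList<>();
        long totalBytes = 0;
    }

    // Greedy LPT: largest files first, each into the bucket with the
    // smallest running total of bytes.
    static List<Bucket> partition(List<FileInfo> files, int n) {
        List<Bucket> buckets = new ArrayList<>();
        PriorityQueue<Bucket> byLoad =
            new PriorityQueue<>(Comparator.comparingLong(b -> b.totalBytes));
        for (int i = 0; i < n; i++) {
            Bucket b = new Bucket();
            buckets.add(b);
            byLoad.add(b);
        }
        files.sort(Comparator.comparingLong((FileInfo f) -> f.bytes).reversed());
        for (FileInfo f : files) {
            Bucket lightest = byLoad.poll(); // remove before mutating its load
            lightest.paths.add(f.path);
            lightest.totalBytes += f.bytes;
            byLoad.add(lightest);            // re-insert with updated load
        }
        return buckets;
    }

    public static void main(String[] args) {
        List<FileInfo> files = new ArrayList<>(List.of(
            new FileInfo("a.json", 900), new FileInfo("b.json", 500),
            new FileInfo("c.json", 400), new FileInfo("d.json", 100)));
        for (Bucket b : partition(files, 2)) {
            System.out.println(b.paths + " -> " + b.totalBytes + " bytes");
        }
    }
}
```

LPT doesn't give an optimal packing, but for this kind of skew it keeps the per-mapper byte totals close enough that no single task dominates the job's runtime.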

— Ken

> On Feb 4, 2019, at 9:12 AM, madan <madan.yellanki@gmail.com> wrote:
> Hi,
> Can someone please tell me how to split a json/xml data file? Since these files are structured
(i.e., a parent/child hierarchy), is it possible to split the file and process it in parallel
with 2 or more instances of the source operator?
> Also, please confirm whether my understanding of csv splitting, described below, is correct:
> When parallelism greater than 1 is used, the file is split into more or less equal parts,
and each operator instance gets the start position of its file partition. The start position
of a partition may fall in the middle of a delimited line, as shown below. When reading starts,
the operator instance ignores the initial partial record and reads the full records that
follow. i.e.,
> # Operator1 reads the emp1 and emp2 records (it reads emp2 since the record's starting
character position fell in its reading range)
> # Operator2 ignores the partial emp2 record and reads emp3 and emp4
> # Operator3 ignores the partial emp4 record and reads emp5
> The record delimiter is used to skip the partial record and to identify the start of a new record.
> <csv_reader_positions.jpg>
> -- 
> Thank you,
> Madan.

Ken Krugler
+1 530-210-6378
Custom big data solutions & training
Flink, Solr, Hadoop, Cascading & Cassandra
