spark-user mailing list archives

From Paul Brown <>
Subject Re: Parsing a large XML file using Spark
Date Fri, 21 Nov 2014 18:46:25 GMT
Unfortunately, unless you impose restrictions on the XML file (e.g., where
namespaces are declared, whether entity replacement is used, etc.), you
really can't parse only a piece of it, even if you have start/end elements
grouped together: a fragment can depend on namespace or entity declarations
that appear earlier in the document.  If you want to deal effectively (and
scalably) with large XML files consisting of many records, the right thing
to do is to write them as one XML document per line, just like the
one-JSON-document-per-line convention, at which point the data can be split
effectively.  Something like Woodstox and a little custom code should make
an effective pre-processor.
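
As a rough illustration of such a pre-processor, here is a minimal sketch. It
uses Python's standard-library streaming parser rather than Woodstox, and the
function name xml_to_lines and its record_tag parameter are made up for the
example. It assumes records are not nested inside one another. Each record is
re-serialized with its own namespace declarations and with entities already
expanded, so every output line is a self-contained document.

```python
import io
import xml.etree.ElementTree as ET


def xml_to_lines(source, sink, record_tag):
    """Stream `source` and write each <record_tag> element to `sink` as one line.

    Hypothetical helper: assumes records with this tag are never nested.
    Returns the number of records written.
    """
    count = 0
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == record_tag:
            elem.tail = None  # do not carry inter-record whitespace along
            line = ET.tostring(elem, encoding="unicode")
            # Literal line breaks can only occur in text content here, so
            # character references keep the record on a single line.
            line = line.replace("\r", "&#13;").replace("\n", "&#10;")
            sink.write(line + "\n")
            count += 1
            root.clear()  # drop processed records so memory stays bounded
    return count


# Small in-memory demonstration; a real run would pass open file objects.
demo_source = io.BytesIO(
    b'<feed xmlns:x="urn:x"><page id="1"><x:title>A\nB</x:title></page>\n'
    b'<page id="2"><text>C &amp; D</text></page></feed>'
)
demo_sink = io.StringIO()
written = xml_to_lines(demo_source, demo_sink, "page")
```

The output can then be split on line boundaries without any record being cut
in half.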

Once you have the line-delimited XML, you can shred it however you want:
 JAXB, Jackson XML, etc.
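
For illustration only, here is what shredding one line at a time might look
like in Python instead of JAXB or Jackson. The parse_record function, the
sample <page> records, and the HDFS path in the comment are all hypothetical;
the point is just that each line parses on its own, so the work can be mapped
over the lines independently.

```python
import xml.etree.ElementTree as ET


def parse_record(line):
    """Parse one self-contained XML record (one line) into a plain dict."""
    elem = ET.fromstring(line)
    # Collect the text of each direct child, keyed by tag name.
    fields = {child.tag: (child.text or "") for child in elem}
    return {"attrs": dict(elem.attrib), "fields": fields}


# Hypothetical sample lines standing in for the line-delimited file.
lines = [
    '<page id="1"><title>Spark</title></page>',
    '<page id="2"><title>XML &amp; JSON</title></page>',
]
records = [parse_record(line) for line in lines]

# With PySpark, the same function could be mapped over the file, e.g.
# (path is a placeholder):
#   records = sc.textFile("hdfs:///path/to/pages.lines").map(parse_record)
```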

Multifarious, Inc.

On Fri, Nov 21, 2014 at 3:38 AM, Prannoy wrote:

> Hi,
> Parallel processing of XML files may be an issue because of the tags in
> the XML file. The file has to be intact because the parser matches start
> and end tags while parsing; if the file is distributed in parts to
> workers, a worker may not find both the start and end tags within its own
> part, which will raise an exception.
> Thanks.
> On Wed, Nov 19, 2014 at 6:26 AM, ssimanta [via Apache Spark User List] wrote:
>> If there is one big XML file (e.g., the 44 GB Wikipedia dump, or the
>> larger dump that also has all revision information) stored in HDFS, is it
>> possible to parse it in parallel/faster using Spark? Or do we have to use
>> something like a PullParser or Iteratee?
>> My current solution is to read the single XML file in a first pass, write
>> it to HDFS as small files, and then read the small files in parallel on
>> the Spark workers.
>> Thanks
>> -Soumya