From: Paul Brown <prb@mult.ifario.us>
Date: Fri, 21 Nov 2014 10:46:25 -0800
Subject: Re: Parsing a large XML file using Spark
To: user@spark.incubator.apache.org

Unfortunately, unless you impose restrictions on the XML file (e.g., where namespaces are declared, whether entity replacement is used, etc.), you really can't parse only a piece of it, even if you have start/end elements grouped together. If you want to deal effectively (and scalably) with large XML files consisting of many records, the right thing to do is to write them out as one XML document per line, just like the one-JSON-document-per-line convention, at which point the data can be split effectively. Something like Woodstox and a little custom code should make an effective pre-processor.

Once you have the line-delimited XML, you can shred it however you want: JAXB, Jackson XML, etc.
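For concreteness, here is the sort of pre-processor I mean: an untested sketch that uses the plain StAX streaming API (Woodstox is picked up automatically when it is on the classpath), ignores attributes and namespaces, and takes a made-up record element name on the command line, so treat it as a starting point rather than a finished tool.

import java.io.{FileInputStream, PrintWriter}
import javax.xml.stream.{XMLInputFactory, XMLStreamConstants}

object XmlToLines {
  def main(args: Array[String]): Unit = {
    // args: <input.xml> <output.txt> <recordTag>, e.g. "page" for a Wikipedia dump
    val Array(inputPath, outputPath, recordTag) = args

    val reader = XMLInputFactory.newInstance()
      .createXMLStreamReader(new FileInputStream(inputPath))
    val out = new PrintWriter(outputPath)

    val sb = new StringBuilder
    var depth = 0 // > 0 while we are inside a record element

    while (reader.hasNext) {
      reader.next() match {
        case XMLStreamConstants.START_ELEMENT if depth == 0 && reader.getLocalName == recordTag =>
          sb.clear()
          sb.append('<').append(recordTag).append('>')
          depth = 1
        case XMLStreamConstants.START_ELEMENT if depth > 0 =>
          sb.append('<').append(reader.getLocalName).append('>')
          depth += 1
        case XMLStreamConstants.CHARACTERS if depth > 0 =>
          // re-escape markup and newlines so each record stays on one physical line
          sb.append(escape(reader.getText))
        case XMLStreamConstants.END_ELEMENT if depth > 0 =>
          sb.append("</").append(reader.getLocalName).append('>')
          depth -= 1
          if (depth == 0) out.println(sb.toString)
        case _ => // prolog, whitespace between records, elements outside records
      }
    }
    out.close(); reader.close()
  }

  private def escape(s: String): String =
    s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;").replace("\n", "&#10;")
}

Once the file is in that one-document-per-line form, the Spark side is straightforward: textFile can split the input on line boundaries and each worker shreds its own lines. The sketch below uses Jackson XML's XmlMapper and readTree; the HDFS path and the "title" field are made up for illustration.

import com.fasterxml.jackson.dataformat.xml.XmlMapper
import org.apache.spark.{SparkConf, SparkContext}

object ShredXmlLines {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("shred-xml-lines"))

    val titles = sc.textFile("hdfs:///data/records-one-per-line.xml")
      .mapPartitions { lines =>
        // build the (relatively expensive) mapper once per partition, not once per line
        val mapper = new XmlMapper()
        lines.map { line =>
          val doc = mapper.readTree(line) // shred one XML document into a tree
          doc.path("title").asText()      // pull out whatever fields you need
        }
      }

    titles.take(10).foreach(println)
    sc.stop()
  }
}

Either way, the point is that the order-dependent parse of the big file happens once, up front, and everything after that is embarrassingly parallel.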
--
prb@mult.ifario.us | Multifarious, Inc. | http://mult.ifario.us/

On Fri, Nov 21, 2014 at 3:38 AM, Prannoy <prannoy@sigmoidanalytics.com> wrote:

> Hi,
>
> Parallel processing of XML files may be an issue because of the tags in
> the XML file. The XML file has to be intact: while parsing, the parser
> matches start and end entities, and if the file is distributed in parts
> to the workers, a worker may or may not find matching start and end tags
> within its part, which will raise an exception.
>
> Thanks.
>
> On Wed, Nov 19, 2014 at 6:26 AM, ssimanta [via Apache Spark User List] wrote:
>
>> If there is one big XML file (e.g., the 44 GB Wikipedia dump, or the
>> larger dump that also contains all revision information) stored in HDFS,
>> is it possible to parse it in parallel/faster using Spark? Or do we have
>> to use something like a PullParser or Iteratee?
>>
>> My current solution is to read the single XML file in a first pass,
>> write it back out to HDFS, and then read the small files in parallel on
>> the Spark workers.
>>
>> Thanks
>> -Soumya