Subject: Re: XML files in Hadoop
From: Peyman Mohajerian <mohajeri@gmail.com>
To: user@hadoop.apache.org
Date: Sat, 3 Jan 2015 08:38:17 -0800

You can land the data in HDFS as XML files and use the Hive XML SerDe to read the data and write it back in a more optimal format, e.g. ORC or Parquet (depending somewhat on your choice of Hadoop distro). Querying XML data directly via Hive is also doable but slow. Converting to Avro is also doable, but in my experience not as fast as ORC or Parquet. Columnar formats give you better query performance, but Avro has its own strengths, e.g. handling schema changes better.
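To make that concrete, here is a rough, untested sketch of the one-time conversion, driven through HiveServer2's JDBC interface. The table layout, XPath mappings, and HDFS paths are invented for illustration, and it assumes the hivexmlserde jar is on Hive's classpath:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class XmlToOrc {
    public static void main(String[] args) throws Exception {
        // Sketch only: table names, columns, XPaths and paths below are invented.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://localhost:10000/default", "", "");
             Statement stmt = conn.createStatement()) {
            // External table over the raw XML already landed in HDFS.
            stmt.execute(
                "CREATE EXTERNAL TABLE IF NOT EXISTS products_xml (product STRING, store STRING) "
              + "ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe' "
              + "WITH SERDEPROPERTIES ("
              + " 'column.xpath.product'='/record/product/text()',"
              + " 'column.xpath.store'='/record/store/text()') "
              + "STORED AS INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat' "
              + "OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat' "
              + "LOCATION '/data/raw/xml' "
              + "TBLPROPERTIES ('xmlinput.start'='<record','xmlinput.end'='</record>')");
            // One-time rewrite into a columnar format.
            stmt.execute("CREATE TABLE IF NOT EXISTS products_orc STORED AS ORC "
              + "AS SELECT product, store FROM products_xml");
        }
    }
}

After the rewrite, queries go against products_orc and never touch the raw XML again.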
You can also convert the format before you land the data in HDFS, e.g. using Flume or some other tool for changing the format in flight.
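If you go the Flume route, the usual hook for in-flight conversion is a custom interceptor. A minimal, untested sketch (the element names and the CSV target format are invented; only the Interceptor interface itself is Flume's):

import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

public class XmlToCsvInterceptor implements Interceptor {
    // Illustrative only: pull two assumed elements out of each XML event body.
    private static final Pattern PRODUCT = Pattern.compile("<product>(.*?)</product>");
    private static final Pattern STORE = Pattern.compile("<store>(.*?)</store>");

    @Override public void initialize() { }

    @Override
    public Event intercept(Event event) {
        String xml = new String(event.getBody(), StandardCharsets.UTF_8);
        Matcher p = PRODUCT.matcher(xml);
        Matcher s = STORE.matcher(xml);
        if (p.find() && s.find()) {
            // Rewrite the body as CSV before it is written to HDFS.
            event.setBody((p.group(1) + "," + s.group(1))
                .getBytes(StandardCharsets.UTF_8));
        }
        return event;
    }

    @Override
    public List<Event> intercept(List<Event> events) {
        for (Event e : events) {
            intercept(e);
        }
        return events;
    }

    @Override public void close() { }

    public static class Builder implements Interceptor.Builder {
        @Override public Interceptor build() { return new XmlToCsvInterceptor(); }
        @Override public void configure(Context context) { }
    }
}

You would then wire it into the agent configuration through the interceptor's Builder class.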



On Sat, Jan 3, 2015 at 8:33 AM, Shashidhar Rao <raoshashidhar123@gmail.com> wrote:
Sorry, not Hive files but xml files: will converting them to some Avro format and storing these into Hive be fast?

On Sat, Jan 3, 2015 at 9:59 PM, Shashidhar Rao <raoshashidhar123@gmail.com> wrote:
Hi,

The exact number of files is not known, but it will run into millions of files, depending on the client, who collects terabytes of xml data every day. Basically, storing is just one part; the main part will be how to query these data: aggregations, counts, and some analytics. Fast retrieval is required, e.g. for a particular year, what are the top 10 products, top 10 manufacturers, top 10 stores, etc.
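For example, the kind of query we would need, sketched here over Hive JDBC against a hypothetical table (all names invented):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class TopProducts {
    public static void main(String[] args) throws Exception {
        // Sketch only: products_orc and yr are assumed, not real objects.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://localhost:10000/default", "", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                 "SELECT product, COUNT(*) AS cnt FROM products_orc "
               + "WHERE yr = 2014 "   // yr: assumed year (partition) column
               + "GROUP BY product ORDER BY cnt DESC LIMIT 10")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}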

Will Hive be a better choice? And will converting these Hive files to some format work out?

Thanks
Shashi

On Sat, Jan 3, 2015 at 9:44 PM, Wilm Schumacher <wilm.schumacher@gmail.com> wrote:
Hi,

How many xml files are you planning to store? Perhaps it is possible to
store them directly on HDFS and save the metadata in HBase. This sounds
more reasonable to me.
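For example, roughly like this with the HBase 1.x client API (untested sketch; the table name, column family and qualifiers are made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class XmlMetaIndexer {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // "xmlmeta" table and "m" family are invented for this sketch.
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("xmlmeta"))) {
            // Row key = document id; the columns just point at the file on HDFS.
            Put put = new Put(Bytes.toBytes("order-00001"));
            put.addColumn(Bytes.toBytes("m"), Bytes.toBytes("hdfs_path"),
                          Bytes.toBytes("/data/xml/2015/01/order-00001.xml"));
            put.addColumn(Bytes.toBytes("m"), Bytes.toBytes("ingested_at"),
                          Bytes.toBytes("2015-01-03T16:39:00Z"));
            table.put(put);
        }
    }
}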

If the number of xml files is too large (millions and billions), then you
can use Hadoop MapFiles to put files together, e.g. based on years or months.
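A small, untested sketch of writing one MapFile bundle per month (paths and the key scheme are invented):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class XmlBundleWriter {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // One bundle per month; the path is an assumption for this sketch.
        Path dir = new Path("/data/xml/2015-01.map");
        try (MapFile.Writer writer = new MapFile.Writer(conf, dir,
                MapFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(Text.class))) {
            // MapFile requires keys in sorted order; a sortable name works well.
            writer.append(new Text("2015-01-01/order-00001.xml"),
                          new Text("<record>...</record>"));
            writer.append(new Text("2015-01-01/order-00002.xml"),
                          new Text("<record>...</record>"));
        }
    }
}

A MapFile.Reader can then fetch a single document by key without scanning the whole bundle.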

Regards,

Wilm

On 03.01.2015 at 17:06, Shashidhar Rao wrote:
> Hi,
>
> Can someone help me by suggesting the best way to solve this use case?
>
> 1. XML files keep flowing from an external system and need to be stored
> into HDFS.
> 2. These files can be directly stored using a NoSql database, e.g. any
> xml-supporting NoSql; or
> 3. These files need to be processed and stored in one of the databases:
> HBase, Hive, etc.
> 4. There won't be any updates, only reads, and data has to be retrieved
> based on some queries; a dashboard has to be created, plus bits of analytics.
>
> The xml files are huge and the expected number of nodes is roughly around 12.
> I am stuck on the storage part: say I convert xml to json and store
> it into HBase; the processing from xml to json will be huge.
>
> It will be only reading, no updates.
>
> Please suggest how to store these xml files.
>
> Thanks
> Shashi



