Subject: Re: Merging small files
From: Mark Kerzner
To: Hadoop User <user@hadoop.apache.org>
Date: Sun, 20 Jul 2014 14:08:45 -0500

Bob,

you don't have to wait for batch. Here is my project (under development)
where I am using Storm for continuous file processing:
https://github.com/markkerzner/3VEed

Mark

On Sun, Jul 20, 2014 at 1:31 PM, Adaryl "Bob" Wakefield, MBA <
adaryl.wakefield@hotmail.com> wrote:

> Yeah, I'm sorry, I'm not talking about processing the files in Oracle. I
> mean collect/store invoices in Oracle, then flush them in a batch to
> Hadoop. This is not real time, right? So you take your EDI, CSV and XML
> from their sources. Store them in Oracle. Once you have a decent size,
> flush them to Hadoop in one big file, process them, then store the
> results of the processing in Oracle.
>
> Source file -> Oracle -> Hadoop -> Oracle
>
> Adaryl "Bob" Wakefield, MBA
> Principal
> Mass Street Analytics
> 913.938.6685
> www.linkedin.com/in/bobwakefieldmba
>
> From: Shashidhar Rao
> Sent: Sunday, July 20, 2014 12:47 PM
> To: user@hadoop.apache.org
> Subject: Re: Merging small files
>
> Spring Batch is used to process the files, which come in EDI, CSV and
> XML format, and to store them in Oracle after processing, but this is
> for a very small division. Imagine invoices generated roughly by 5
> million customers every week from all stores plus online purchases. The
> time to process such massive data would not be acceptable, even though
> Oracle would be a good choice, as Adaryl Bob has suggested. Each invoice
> is not even 10 KB, and we have no choice but to use Hadoop, but we need
> further processing of the input files just to make Hadoop happy.
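(A back-of-the-envelope check, not stated in the thread: 5,000,000
invoices a week at roughly 10 KB each is only about 50 GB of data, some
200 blocks' worth at a 256 MB block size, yet it is also roughly
5,000,000 filesystem objects per week. The file count, not the data
volume, is what strains HDFS, which is why merging comes up at all.)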
>
> On Sun, Jul 20, 2014 at 10:07 PM, Adaryl "Bob" Wakefield, MBA <
> adaryl.wakefield@hotmail.com> wrote:
>
>> "Even if we kept the discussion to the mailing list's technical Hadoop
>> usage focus, any company/organization looking to use a distro is going
>> to have to consider the costs, support, platform, partner ecosystem,
>> market share, company strategy, etc."
>>
>> Yeah, good point.
>>
>> Adaryl "Bob" Wakefield, MBA
>> Principal
>> Mass Street Analytics
>> 913.938.6685
>> www.linkedin.com/in/bobwakefieldmba
>>
>> From: Shahab Yunus
>> Sent: Sunday, July 20, 2014 11:32 AM
>> To: user@hadoop.apache.org
>> Subject: Re: Merging small files
>>
>> As for why it isn't appropriate to discuss too many vendor-specific
>> topics on a vendor-neutral Apache mailing list, check out this thread:
>>
>> http://mail-archives.apache.org/mod_mbox/hadoop-mapreduce-user/201309.mbox/%3CCAJ1NbZcoCw1RSNCf3H-ikjKK4uqxQXT7avsJ-6NahQ_e4dXYGA@mail.gmail.com%3E
>>
>> You can always discuss vendor-specific issues on their respective
>> mailing lists.
>>
>> As for merging files: yes, one can use HBase, but then you have to keep
>> in mind that you are adding the overhead of developing and maintaining
>> another store (i.e. HBase). If your use case can be satisfied with HDFS
>> alone, then why not keep it simple? And given the knowledge of the
>> requirements that the OP provided, I think the SequenceFile format
>> should work, as I suggested initially. Of course, if things get too
>> complicated from a requirements perspective, then one might try out
>> HBase.
>>
>> Regards,
>> Shahab
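A minimal sketch of the SequenceFile packing suggested above, for
reference: every small invoice becomes one record in a single large
file, keyed by its original file name. The paths, key/value types, and
compression choice here are illustrative assumptions, not details from
the thread.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class InvoicePacker {
      public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path in = new Path(args[0]);   // directory of small invoice files
        Path out = new Path(args[1]);  // one large SequenceFile

        SequenceFile.Writer writer = SequenceFile.createWriter(conf,
            SequenceFile.Writer.file(out),
            SequenceFile.Writer.keyClass(Text.class),
            SequenceFile.Writer.valueClass(BytesWritable.class),
            SequenceFile.Writer.compression(
                SequenceFile.CompressionType.BLOCK));
        try {
          for (FileStatus stat : fs.listStatus(in)) {
            if (!stat.isFile()) {
              continue;                // skip subdirectories
            }
            byte[] body = new byte[(int) stat.getLen()];
            FSDataInputStream is = fs.open(stat.getPath());
            try {
              is.readFully(body);      // each invoice is only a few KB
            } finally {
              is.close();
            }
            // key = original file name, value = raw invoice bytes
            writer.append(new Text(stat.getPath().getName()),
                          new BytesWritable(body));
          }
        } finally {
          IOUtils.closeStream(writer);
        }
      }
    }

Block compression groups many small records together before compressing,
which suits records this small, and keeping the original file name in
the key leaves each invoice addressable after the merge.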
>>
>> On Sun, Jul 20, 2014 at 12:24 PM, Adaryl "Bob" Wakefield, MBA <
>> adaryl.wakefield@hotmail.com> wrote:
>>
>>> It isn't? I don't wanna hijack the thread or anything, but it seems to
>>> me that MapR is an implementation of Hadoop, and this is a great place
>>> to discuss its merits vis-a-vis the Hortonworks or Cloudera offerings.
>>>
>>> A little bit more on topic: every single thing I read or watch about
>>> Hadoop says that many small files are a bad idea and that you should
>>> merge them into larger files. I'll take this a step further: if your
>>> invoice data is so small, perhaps Hadoop isn't the proper solution to
>>> whatever it is you are trying to do, and a more traditional RDBMS
>>> approach would be more appropriate. Someone suggested HBase, and I was
>>> going to suggest maybe one of the other NoSQL databases; however, I
>>> remember that Eddie Satterly of Splunk says that financial data is the
>>> ONE use case where a traditional approach is more appropriate. You can
>>> watch his talk here:
>>>
>>> https://www.youtube.com/watch?v=-N9i-YXoQBE&index=77&list=WL
>>>
>>> Adaryl "Bob" Wakefield, MBA
>>> Principal
>>> Mass Street Analytics
>>> 913.938.6685
>>> www.linkedin.com/in/bobwakefieldmba
>>>
>>> From: Kilaru, Sambaiah
>>> Sent: Sunday, July 20, 2014 3:47 AM
>>> To: user@hadoop.apache.org
>>> Subject: Re: Merging small files
>>>
>>> This is not the place to discuss the merits or demerits of MapR, but
>>> small files screw things up very badly with MapR too: small files go
>>> into one container (to fill up 256 MB, or whatever the container size
>>> is), and with locality most of the mappers go to three datanodes.
>>>
>>> You should be looking into the SequenceFile format.
>>>
>>> Thanks,
>>> Sam
>>>
>>> From: "M. C. Srivas" <mcsrivas@gmail.com>
>>> Reply-To: "user@hadoop.apache.org" <user@hadoop.apache.org>
>>> Date: Sunday, July 20, 2014 at 8:01 AM
>>> To: "user@hadoop.apache.org" <user@hadoop.apache.org>
>>> Subject: Re: Merging small files
>>>
>>> You should look at MapR ... a few hundred billion small files are
>>> absolutely no problem. (Disclosure: I work for MapR.)
>>>
>>> On Sat, Jul 19, 2014 at 10:29 AM, Shashidhar Rao <
>>> raoshashidhar123@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> Has anybody worked on a retail use case? My production Hadoop cluster
>>>> block size is 256 MB, but the retail invoice data we have to process
>>>> is tiny: each invoice is merely, let's say, 4 KB. Do we merge the
>>>> invoice data to make one large file, say 1 GB? What is the best
>>>> practice in this scenario?
>>>>
>>>> Regards
>>>> Shashi
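Once the invoices are packed this way, the downstream job reads whole
invoices as records instead of opening millions of tiny files. A sketch
of that consuming side, again under assumed names and simplified to a
map-only job:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class InvoiceJob {

      // One map() call per invoice record from the SequenceFile.
      public static class InvoiceMapper
          extends Mapper<Text, BytesWritable, Text, Text> {
        @Override
        protected void map(Text fileName, BytesWritable invoice, Context ctx)
            throws IOException, InterruptedException {
          // A real job would parse the EDI/CSV/XML payload here; this
          // placeholder just reports the record size.
          ctx.write(fileName,
                    new Text("parsed " + invoice.getLength() + " bytes"));
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "invoice-processing");
        job.setJarByClass(InvoiceJob.class);
        job.setInputFormatClass(SequenceFileInputFormat.class);
        job.setMapperClass(InvoiceMapper.class);
        job.setNumReduceTasks(0);      // map-only for this sketch
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }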