From: "Chandra Mohan, Ananda Vel Murugan" <Ananda.Murugan@honeywell.com>
To: user@hadoop.apache.org
Subject: RE: Large number of small files
Date: Fri, 24 Apr 2015 09:33:03 +0000

Marko,

The Parquet file would be created once, when you load the data. You don't have to store your small files in HDFS just for the sake of subsetting the data by time range. You can store data and metadata in the same Parquet file. As already pointed out, Parquet files work well with other tools in the Hadoop ecosystem. Apart from the performance of your MapReduce jobs, another aspect is storage efficiency: serialization formats like Avro and Parquet provide better compression, and hence the data occupies less space.

Regards,
Anand

From: Alexander Alten-Lorenz [mailto:wget.null@gmail.com]
Sent: Friday, April 24, 2015 2:49 PM
To: user@hadoop.apache.org
Subject: Re: Large number of small files

Marko,

Cassandra is a NoSQL DB, much like HBase is for Hadoop. Pros and cons won't be discussed here.

Parquet is a columnar storage format. At a high level it is a bit like a NoSQL DB, but at the storage level: it allows users to "query" the data with MR, Pig, or similar tools. Additionally, Parquet works perfectly with Hive and Cloudera Impala, as well as Apache Drill.

https://parquet.incubator.apache.org/documentation/latest/
http://www.cloudera.com/content/cloudera/en/documentation/cloudera-impala/v2-0-x/topics/impala_parquet.html
https://zoomdata.zendesk.com/hc/en-us/articles/200865073-Loading-My-CSV-Data-into-Impala-as-a-Parquet-Table

--
Alexander Alten-Lorenz
m: wget.null@gmail.com
b: mapredit.blogspot.com
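To make the Parquet suggestion concrete: a minimal sketch, assuming the parquet-avro and avro libraries are on the classpath, that packs many small, timestamped measurement vectors into one Parquet file. The schema, field names, and paths below are illustrative assumptions, not anything prescribed in this thread; a later Hive, Impala, or MapReduce job can then filter on the ts column instead of listing millions of 8 KB files.

    // Minimal sketch: pack small "measurement" records into one Parquet file.
    // Assumes parquet-avro and avro on the classpath; schema, field names,
    // and paths are illustrative only.
    import java.util.Arrays;
    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.avro.AvroParquetWriter;
    import org.apache.parquet.hadoop.ParquetWriter;
    import org.apache.parquet.hadoop.metadata.CompressionCodecName;

    public class MeasurementParquetWriter {

        // One row per small input file: upload timestamp, metadata, and the vector.
        private static final Schema SCHEMA = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Measurement\",\"fields\":["
            + "{\"name\":\"ts\",\"type\":\"long\"},"
            + "{\"name\":\"source\",\"type\":\"string\"},"
            + "{\"name\":\"values\",\"type\":{\"type\":\"array\",\"items\":\"double\"}}]}");

        public static void main(String[] args) throws Exception {
            try (ParquetWriter<GenericRecord> writer =
                     AvroParquetWriter.<GenericRecord>builder(new Path("/data/measurements.parquet"))
                         .withSchema(SCHEMA)
                         .withCompressionCodec(CompressionCodecName.SNAPPY)
                         .build()) {
                // In reality you would loop over the small 8 KB files here.
                GenericRecord rec = new GenericData.Record(SCHEMA);
                rec.put("ts", System.currentTimeMillis());
                rec.put("source", "sensor-42");
                rec.put("values", Arrays.asList(1.0, 2.5, 3.7));
                writer.write(rec);
            }
        }
    }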
On Apr 24, 2015, at 11:10 AM, Marko Dinic <marko.dinic@nissatech.com> wrote:

Anand,

Thank you for your answer, but wouldn't that mean that I would have to serialize the files each time I need to run the job? And I would still need to save the original files, so the NameNode still needs to take care of them?

Please correct me if I'm missing something, I'm not very experienced with Hadoop.

What do you think about using Cassandra?

Thanks

On Fri 24 Apr 2015 11:03:19 AM CEST, Chandra Mohan, Ananda Vel Murugan wrote:

Apart from databases like Cassandra, you may check serialization formats like Avro or Parquet.

Regards,
Anand

-----Original Message-----
From: Marko Dinic [mailto:marko.dinic@nissatech.com]
Sent: Friday, April 24, 2015 2:23 PM
To: user@hadoop.apache.org
Subject: Large number of small files

Hello,

I'm not sure if this is the place to ask this question, but I'm still hoping for an answer/advice.

A large number of small files are uploaded, each about 8 KB. I am aware that this is not something you hope for when working with Hadoop.

I was thinking about using HAR files and combined input, or sequence files. The problem is that the files are timestamped, and I need a different subset at different times - for example, one job needs to run on files uploaded during the last 3 months, while the next job might consider the last 6 months. Naturally, as time passes, a different subset of files is needed.

This means that I would need to make a sequence file (or a HAR) each time I run a job, to have a smaller number of mappers. On the other hand, I need the original files so I can subset them. This means that the NameNode is under constant pressure, keeping all of this in its memory.

How can I solve this problem?

I was also considering using Cassandra, or something like that, and saving the file content inside of it instead of saving it to files on HDFS. The file content is actually some measurement, that is, a vector of numbers, with some metadata.

Thanks
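For the SequenceFile route mentioned above, one common pattern is to append each small file once, at upload time, into a monthly SequenceFile keyed by timestamp, rather than rebuilding an archive before every job. A minimal sketch, with hypothetical paths and no error handling:

    // Minimal sketch: append small files into one SequenceFile keyed by timestamp.
    // The input and output paths are hypothetical.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.SequenceFile;

    public class SmallFilePacker {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Path in = new Path("/uploads/2015-04");       // directory of small 8 KB files
            Path out = new Path("/packed/2015-04.seq");   // one container file per month

            try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                    SequenceFile.Writer.file(out),
                    SequenceFile.Writer.keyClass(LongWritable.class),
                    SequenceFile.Writer.valueClass(BytesWritable.class))) {
                for (FileStatus status : fs.listStatus(in)) {
                    byte[] content = new byte[(int) status.getLen()];
                    try (FSDataInputStream is = fs.open(status.getPath())) {
                        IOUtils.readFully(is, content, 0, content.length);
                    }
                    // Key by upload timestamp so a job can skip records outside its range.
                    writer.append(new LongWritable(status.getModificationTime()),
                                  new BytesWritable(content));
                }
            }
        }
    }

Packing per month means a 3- or 6-month job reads only a handful of container files, and once the 8 KB originals are packed they can be removed, which is what actually relieves the NameNode's memory.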