From: "Chandra Mohan, Ananda Vel Murugan" <Ananda.Murugan@honeywell.com>
To: user@hadoop.apache.org
Subject: RE: Large number of small files
Date: Fri, 24 Apr 2015 09:33:03 +0000

Marko,

The Parquet file would be created once, when you load the data. You don't have to store your small files in HDFS just for the sake of subsetting the data by time range. You can store data and metadata in the same Parquet file. As already pointed out, Parquet files work well with other tools in the Hadoop ecosystem. Apart from the performance of your MapReduce jobs, another aspect is storage efficiency: serialization formats like Avro and Parquet provide better compression, and hence the data occupies less space.

Regards,
Anand

From: Alexander Alten-Lorenz [mailto:wget.null@gmail.com]
Sent: Friday, April 24, 2015 2:49 PM
To: user@hadoop.apache.org
Subject: Re: Large number of small files

Marko,

Cassandra is a NoSQL DB, much like HBase is for Hadoop. Pros and cons won't be discussed here.

Parquet is a columnar storage format. At a high level it is a bit like a NoSQL DB, but at the storage level: it allows users to "query" the data with MR, Pig, or similar tools. Additionally, Parquet works perfectly with Hive and Cloudera Impala, as well as Apache Drill.

https://parquet.incubator.apache.org/documentation/latest/
http://www.cloudera.com/content/cloudera/en/documentation/cloudera-impala/v2-0-x/topics/impala_parquet.html
https://zoomdata.zendesk.com/hc/en-us/articles/200865073-Loading-My-CSV-Data-into-Impala-as-a-Parquet-Table

--
Alexander Alten-Lorenz
m: wget.null@gmail.com
b: mapredit.blogspot.com
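To make the Parquet suggestion concrete: a minimal sketch, assuming the parquet-avro and avro libraries are on the classpath, that packs many small, timestamped measurement vectors into one Parquet file. The schema, field names, and paths below are illustrative assumptions, not anything prescribed in this thread; a later Hive, Impala, or MapReduce job can then filter on the ts column instead of listing millions of 8 KB files.

    // Minimal sketch: pack small "measurement" records into one Parquet file.
    // Assumes parquet-avro and avro on the classpath; schema, field names,
    // and paths are illustrative only.
    import java.util.Arrays;
    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.avro.AvroParquetWriter;
    import org.apache.parquet.hadoop.ParquetWriter;
    import org.apache.parquet.hadoop.metadata.CompressionCodecName;

    public class MeasurementParquetWriter {

        // One row per small input file: upload timestamp, metadata, and the vector.
        private static final Schema SCHEMA = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Measurement\",\"fields\":["
            + "{\"name\":\"ts\",\"type\":\"long\"},"
            + "{\"name\":\"source\",\"type\":\"string\"},"
            + "{\"name\":\"values\",\"type\":{\"type\":\"array\",\"items\":\"double\"}}]}");

        public static void main(String[] args) throws Exception {
            try (ParquetWriter<GenericRecord> writer =
                     AvroParquetWriter.<GenericRecord>builder(new Path("/data/measurements.parquet"))
                         .withSchema(SCHEMA)
                         .withCompressionCodec(CompressionCodecName.SNAPPY)
                         .build()) {
                // In reality you would loop over the small 8 KB files here.
                GenericRecord rec = new GenericData.Record(SCHEMA);
                rec.put("ts", System.currentTimeMillis());
                rec.put("source", "sensor-42");
                rec.put("values", Arrays.asList(1.0, 2.5, 3.7));
                writer.write(rec);
            }
        }
    }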
On Apr 24, 2015, at 11:10 AM, Marko Dinic <marko.dinic@nissatech.com> wrote:

Anand,

Thank you for your answer, but wouldn't that mean that I would have to serialize the files each time I need to run the job? And I would still need to save the original files, so the NameNode still needs to take care of them?

Please correct me if I'm missing something, I'm not very experienced with Hadoop.

What do you think about using Cassandra?

Thanks

On Fri 24 Apr 2015 11:03:19 AM CEST, Chandra Mohan, Ananda Vel Murugan wrote:

Apart from databases like Cassandra, you may check serialization formats like Avro or Parquet.

Regards,
Anand

-----Original Message-----
From: Marko Dinic [mailto:marko.dinic@nissatech.com]
Sent: Friday, April 24, 2015 2:23 PM
To: user@hadoop.apache.org
Subject: Large number of small files

Hello,

I'm not sure if this is the place to ask this question, but I'm still hoping for an answer/advice.

A large number of small files are uploaded, each about 8 KB. I am aware that this is not something you hope for when working with Hadoop.

I was thinking about using HAR files and combined input, or sequence files. The problem is that the files are timestamped, and I need a different subset at different times - for example, one job needs to run on files uploaded during the last 3 months, while the next job might consider the last 6 months. Naturally, as time passes, a different subset of files is needed.

This means that I would need to make a sequence file (or a HAR) each time I run a job, to have a smaller number of mappers. On the other hand, I need the original files so I can subset them. This means that the NameNode is under constant pressure, keeping all of this in its memory.

How can I solve this problem?

I was also considering using Cassandra, or something like that, and saving the file content inside of it instead of saving it to files on HDFS. The file content is actually some measurement, that is, a vector of numbers, with some metadata.

Thanks
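For the SequenceFile route mentioned above, one common pattern is to append each small file once, at upload time, into a monthly SequenceFile keyed by timestamp, rather than rebuilding an archive before every job. A minimal sketch, with hypothetical paths and no error handling:

    // Minimal sketch: append small files into one SequenceFile keyed by timestamp.
    // The input and output paths are hypothetical.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.SequenceFile;

    public class SmallFilePacker {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Path in = new Path("/uploads/2015-04");       // directory of small 8 KB files
            Path out = new Path("/packed/2015-04.seq");   // one container file per month

            try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                    SequenceFile.Writer.file(out),
                    SequenceFile.Writer.keyClass(LongWritable.class),
                    SequenceFile.Writer.valueClass(BytesWritable.class))) {
                for (FileStatus status : fs.listStatus(in)) {
                    byte[] content = new byte[(int) status.getLen()];
                    try (FSDataInputStream is = fs.open(status.getPath())) {
                        IOUtils.readFully(is, content, 0, content.length);
                    }
                    // Key by upload timestamp so a job can skip records outside its range.
                    writer.append(new LongWritable(status.getModificationTime()),
                                  new BytesWritable(content));
                }
            }
        }
    }

Packing per month means a 3- or 6-month job reads only a handful of container files, and once the 8 KB originals are packed they can be removed, which is what actually relieves the NameNode's memory.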