From: daemeon reiydelle
Date: Mon, 4 Sep 2017 21:26:53 -0700
To: "Zheng, Kai"
Cc: Hayati Gonultas, Alexey Eremihin, Uwe Geercken, Ralph Soika, user@hadoop.apache.org
Subject: Re: Re: Is Hadoop basically not suitable for a photo archive?

Kai, this is great. It is well down the path to solving the small-file / object-as-file problem. Good show!

Daemeon C.M. Reiydelle
San Francisco 1.415.501.0198
London 44 020 8144 9872

On Mon, Sep 4, 2017 at 8:56 PM, Zheng, Kai <kai.zheng@intel.com> wrote:

> A nice discussion about support of small files in Hadoop.

> Not sure if this really helps, but I'd like to mention that at Intel we have actually spent some time on this interesting problem domain before, and again recently. We plan to develop a small-files compaction optimization in the Smart Storage Management project (derived from https://issues.apache.org/jira/browse/HDFS-7343) that can support writing-a-small-file, reading-a-small-file, reading-batch-of-small-files, and compacting-small-files-together-in-background. This support is transparent to applications, but users need to use an HDFS-compatible client. If you're interested, please refer to the following links. We have a rough design and plans; one important target is to support Deep Learning use cases that want to train on lots of small samples stored in HDFS as files. We will implement it, but your feedback would be very welcome.

> https://github.com/Intel-bigdata/SSM
> https://github.com/Intel-bigdata/SSM/blob/trunk/docs/small-file-solution.md

> Regards,
> Kai
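
To make the "HDFS compatible client" point above a bit more concrete, below is a minimal sketch of the stock org.apache.hadoop.fs.FileSystem calls that such a client would have to stand in for. The paths and file names are made up, and treating the SSM client as a drop-in replacement for this interface is an assumption here, not something verified against the project.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class SmallFileRoundTrip {
        public static void main(String[] args) throws Exception {
            // fs.defaultFS comes from core-site.xml on the classpath; an SSM-style
            // client is assumed (not verified) to slot in behind this same API.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Hypothetical local photo and hypothetical HDFS target path.
            byte[] data = Files.readAllBytes(Paths.get("IMG_0001.jpg"));
            Path photo = new Path("/archive/2017/09/IMG_0001.jpg");

            // One HDFS file per photo: simple, but every file costs NameNode
            // metadata, which is what the compaction work tries to amortize.
            try (FSDataOutputStream out = fs.create(photo, true)) {
                out.write(data);
            }

            // Read it back in full.
            try (FSDataInputStream in = fs.open(photo)) {
                byte[] buf = new byte[(int) fs.getFileStatus(photo).getLen()];
                in.readFully(buf);
            }
        }
    }
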
> From: Hayati Gonultas [mailto:hayati.gonultas@gmail.com]
> Sent: Tuesday, September 05, 2017 6:06 AM
> To: Alexey Eremihin <a.eremihin@corp.badoo.com.invalid>; Uwe Geercken <uwe.geercken@web.de>
> Cc: Ralph Soika <ralph.soika@imixs.com>; user@hadoop.apache.org
> Subject: Re: Re: Is Hadoop basically not suitable for a photo archive?

> I would recommend an object store such as OpenStack Swift as another option.

> On Mon, Sep 4, 2017 at 1:09 PM Uwe Geercken <uwe.geercken@web.de> wrote:

> Just my two cents:

> Maybe you can use Hadoop for storing, and pack multiple files together to use HDFS in a smarter way, while in parallel storing a limited, time-based amount of data/photos in a different solution. I assume you won't need high-performance access to the whole time span.

> Yes, it would be a duplication, but maybe - without knowing all the details - that would be acceptable and an easy way to go.

> Cheers,

> Uwe

> Sent: Monday, 04 September 2017 at 21:32
> From: "Alexey Eremihin"
> To: "Ralph Soika"
> Cc: "user@hadoop.apache.org"
> Subject: Re: Is Hadoop basically not suitable for a photo archive?

> Hi Ralph,

> In general, Hadoop is able to store such data. Even HAR archives can be used in conjunction with WebHDFS (by passing offset and limit attributes). What are your reading requirements? FS metadata are not distributed, and reading the data is limited by the HDFS NameNode server performance. So if you would like to download files at a high RPS, that would not work well.
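
For reference, the ranged read Alexey mentions maps to the WebHDFS OPEN operation, which takes offset and length query parameters. A rough sketch follows; the host, port, archive path and byte range are invented, and in practice the offset/length of a particular photo inside a HAR part file would first have to be looked up in the archive's _index file.

    import java.io.InputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.nio.file.StandardCopyOption;

    public class WebHdfsRangedRead {
        public static void main(String[] args) throws Exception {
            // Host, port, archive path, offset and length are all made up; a real
            // offset/length pair would come out of the HAR's _index file.
            String url = "http://namenode.example.com:50070/webhdfs/v1"
                    + "/archives/photos-2017.har/part-0"
                    + "?op=OPEN&offset=10485760&length=2097152";

            // The NameNode answers OPEN with a redirect to a DataNode, which then
            // streams only the requested byte range.
            HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
            try (InputStream in = conn.getInputStream()) {
                Files.copy(in, Paths.get("IMG_0001.jpg"),
                        StandardCopyOption.REPLACE_EXISTING);
            }
        }
    }

On the metadata side, a common rule of thumb is roughly 150 bytes of NameNode heap per file and per block, so 10 million photos a year kept as individual files is on the order of 10M x 2 x 150 B, i.e. about 3 GB of extra heap per year, which is why the packing and compaction suggestions keep coming up.
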
> On Monday, September 4, 2017, Ralph Soika <ralph.soika@imixs.com> wrote:

> Hi,

> I know that the issue around the small-file problem has been asked about frequently, not only on this mailing list.
> I have also already read some books about Hadoop and started to work with Hadoop, but I still do not really understand whether Hadoop is the right choice for my goals.

> To simplify my problem domain, I would like to use the use case of a photo archive:

> - An external application produces about 10 million photos in one year. The files contain important, business-critical data.
> - A single photo file has a size between 1 and 10 MB.
> - The photos need to be stored for several years (10-30 years).
> - The data store should support replication over several servers.
> - A checksum concept is needed to guarantee the data integrity of all files over a long period of time (see the sketch after this list).
> - A REST API is preferred for writing and reading the files.
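
Regarding the checksum requirement in the list above: HDFS already keeps CRC checksums for every block, verifies them on each read, and re-checks them with a background block scanner. In addition, the client API can report a whole-file checksum that could be stored next to each photo for long-term integrity audits. A minimal sketch, with a made-up path:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileChecksum;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class PhotoChecksum {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path photo = new Path("/archive/2017/09/IMG_0001.jpg"); // hypothetical

            // For HDFS this is an MD5-of-MD5-of-CRC32 value derived from the block
            // checksums; it may be null on file systems that do not support it.
            FileChecksum checksum = fs.getFileChecksum(photo);
            System.out.println(photo + " -> " + checksum);
        }
    }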

> So far Hadoop seems to be absolutely the perfect solution. But my last requirement seems to throw Hadoop out of the race:

> - The photos need to be readable with very short latency from an external enterprise application.

> With Hadoop HDFS and the Web Proxy everything seems perfect. But it seems that most Hadoop experts advise against this usage when the size of my data files (1-10 MB) is well below the Hadoop block size of 64 or 128 MB.

> I think I understood the concepts of HAR or sequence files. But if I pack, for example, my files together into a large file of many gigabytes, it is impossible to access a single photo from the Hadoop repository in a reasonable time. In my eyes it makes no sense to pack thousands of files into a large file just so that Hadoop jobs can handle it better. For simply accessing a single file from a web interface - as in my case - it all seems counterproductive.
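
One note on the paragraph above: random access into a HAR is an index lookup, not a scan of the whole archive, so reading a single member stays cheap even when the part files are many gigabytes. Below is a minimal sketch of reading one photo back out through the har:// scheme; the archive and member names are made up, and the archive is assumed to have been created beforehand with something like "hadoop archive -archiveName photos-2017.har -p /raw 2017 /archives".

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReadOnePhotoFromHar {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // har:// paths are served by HarFileSystem: the member name is resolved
            // through the archive's _index file to an offset/length inside a part-*
            // file, so a single photo does not require reading the whole archive.
            // The archive and member names below are made up.
            Path photo = new Path("har:///archives/photos-2017.har/2017/09/IMG_0001.jpg");
            FileSystem fs = photo.getFileSystem(conf);

            try (FSDataInputStream in = fs.open(photo)) {
                byte[] buf = new byte[(int) fs.getFileStatus(photo).getLen()];
                in.readFully(buf);
            }
        }
    }

The trade-off is that a HAR is immutable once written, so newly arriving photos would go into new archives, for example one per day or per month.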

> So my question is: is Hadoop only feasible for archiving large web-server log files, and not designed to handle big archives of small files that also contain business-critical data?


> Thanks in advance for your advice.

> Ralph

> --

> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscribe@hadoop.apache.org
> For additional commands, e-mail: user-help@hadoop.apache.org

> --
> Hayati Gonultas