hadoop-user mailing list archives

From Muhamad Dimas Adiputro <MA0C69...@TechMahindra.com>
Subject RE: Is Hadoop basically not suitable for a photo archive?
Date Wed, 06 Sep 2017 07:41:11 GMT
I think MapR-FS is your solution.

From: Anu Engineer [mailto:aengineer@hortonworks.com]
Sent: Tuesday, September 05, 2017 10:33 PM
To: Hayati Gonultas; Alexey Eremihin; Uwe Geercken
Cc: Ralph Soika; user@hadoop.apache.org
Subject: Re: Is Hadoop basically not suitable for a photo archive?

Please take a look at HDFS-7240; we are developing an object store that uses HDFS to store
small files. HDFS-7240, also known as Ozone, is designed for the small-file use case.
Caveat emptor: this is a work in progress. Most of the core functionality is complete,
but it is not yet ready for production use.


From: Hayati Gonultas <hayati.gonultas@gmail.com>
Date: Monday, September 4, 2017 at 3:05 PM
To: Alexey Eremihin <a.eremihin@corp.badoo.com.INVALID>, Uwe Geercken <uwe.geercken@web.de>
Cc: Ralph Soika <ralph.soika@imixs.com>, "user@hadoop.apache.org"
Subject: Re: Re: Is Hadoop basically not suitable for a photo archive?

I would recommend an object store such as OpenStack Swift as another option.

On Mon, Sep 4, 2017 at 1:09 PM Uwe Geercken <uwe.geercken@web.de> wrote:
Just my two cents:

Maybe you can use Hadoop for storage and pack multiple files together to use HDFS in a
smarter way, and at the same time keep a limited, time-based subset of the photos in parallel
in a different solution. I assume you won't need high-performance access to the whole time span.

Yes, it would be a duplication, but maybe, without knowing all the details, that would be
acceptable and an easy way to go.



Sent: Monday, 04 September 2017 at 21:32
From: "Alexey Eremihin" <a.eremihin@corp.badoo.com.INVALID>
To: "Ralph Soika" <ralph.soika@imixs.com>
Cc: "user@hadoop.apache.org" <user@hadoop.apache.org>
Subject: Re: Is Hadoop basically not suitable for a photo archive?
Hi Ralph,
In general Hadoop is able to store such data. HAR archives can even be used in conjunction
with WebHDFS (by passing the offset and length parameters). What are your read requirements?
File system metadata is not distributed, and reads are limited by the performance of the HDFS
NameNode server. So if you need to download files at a high request rate (RPS), that would
not work well.
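
As a rough illustration of the ranged read mentioned above, the sketch below builds a WebHDFS `OPEN` URL with the `offset` and `length` query parameters. The host name, the port (50070 is assumed here as a typical Hadoop 2.x NameNode HTTP port), the file path, and the helper name `webhdfs_range_url` are all hypothetical; the byte range would have to come from whatever index describes where a photo sits inside the packed file.

```python
from urllib.parse import quote, urlencode


def webhdfs_range_url(host, port, hdfs_path, offset, length):
    """Build a WebHDFS OPEN URL that asks for `length` bytes starting at `offset`."""
    query = urlencode({"op": "OPEN", "offset": offset, "length": length})
    # quote() keeps "/" intact and escapes characters that are unsafe in a URL path.
    return f"http://{host}:{port}/webhdfs/v1{quote(hdfs_path)}?{query}"


# Hypothetical values: a 2 MiB photo stored 1 MiB into a packed file.
url = webhdfs_range_url(
    "namenode.example.com", 50070, "/archive/2017/part-0", 1048576, 2097152
)
print(url)
```

An HTTP GET on such a URL is normally redirected by the NameNode to a DataNode that serves the bytes, so the client needs to follow redirects.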

On Monday, September 4, 2017, Ralph Soika <ralph.soika@imixs.com> wrote:


I know that the small-file problem has been asked about frequently, and not only on this
mailing list.
I have also read some books about Hadoop and have started to work with it.
But I still do not really understand whether Hadoop is the right choice for my goals.

To simplify my problem domain I would like to use the use case of a photo archive:

- An external application produces about 10 million photos per year. The files contain
important business-critical data.
- A single photo file has a size between 1 and 10 MB.
- The photos need to be stored for several years (10-30 years).
- The data store should support replication across several servers.
- A checksum concept is needed to guarantee the data integrity of all files over a long period
of time.
- A REST API is preferred for writing and reading the files.

So far Hadoop seems to be absolutely the perfect solution. But my last requirement seems to
throw Hadoop out of the race.

- The photos need to be readable with very short latency from an external enterprise application

With Hadoop HDFS and the Web Proxy everything seems perfect. But it seems that most of the
Hadoop experts advise against this usage if the size of my data files (1-10 MB) is well below
the Hadoop block size of 64 or 128 MB.
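
To put rough numbers on this concern: the NameNode holds the whole namespace in memory, so its cost grows with the number of files and blocks rather than with the bytes stored. The sketch below assumes about 150 bytes of heap per namespace object (a commonly quoted rule of thumb, not a measured value) and an average photo size of 5 MB. Both figures, and the helper name `namenode_heap_bytes`, are illustrative assumptions.

```python
PHOTOS_PER_YEAR = 10_000_000   # from the requirements above
YEARS = 30                     # upper end of the 10-30 year retention period
BLOCK_SIZE_MB = 128
AVG_PHOTO_MB = 5               # assumption: photos are 1-10 MB
BYTES_PER_OBJECT = 150         # assumption: rule-of-thumb heap cost per object
OBJECTS_PER_FILE = 2           # one file entry plus one block (file < block size)


def namenode_heap_bytes(file_count):
    """Estimate NameNode heap for `file_count` single-block files."""
    return file_count * OBJECTS_PER_FILE * BYTES_PER_OBJECT


photos = PHOTOS_PER_YEAR * YEARS
unpacked = namenode_heap_bytes(photos)

# Packing photos into containers of roughly one block each.
photos_per_container = BLOCK_SIZE_MB // AVG_PHOTO_MB
containers = (photos + photos_per_container - 1) // photos_per_container
packed = namenode_heap_bytes(containers)

print(f"photos: {photos:,}")
print(f"unpacked heap estimate: {unpacked / 1e9:.1f} GB")
print(f"packed heap estimate:   {packed / 1e9:.1f} GB")
```

Under these assumptions the estimate is driven almost entirely by the file count, which is why packing is the usual advice.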

I think I understand the concepts of HAR and sequence files.
But if I pack, for example, my files together into a large file of many gigabytes, it seems
impossible to access one single photo from the Hadoop repository in a reasonable time. It
makes no sense in my eyes to pack thousands of files into a large file just so that Hadoop
jobs can handle it better. For simply accessing a single file from a web interface, as in my
case, it all seems counterproductive.
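
Whether a packed file can serve a single photo quickly depends on having an index. The sketch below is a minimal, in-memory illustration of that idea, not the actual HAR or sequence-file format: each photo is appended to one container, and a hypothetical index maps the name to an (offset, length) pair so that one entry can be read with a single seek.

```python
import io


def pack(files, out):
    """Append each file's bytes to `out`; return {name: (offset, length)}."""
    index = {}
    for name, data in files.items():
        index[name] = (out.tell(), len(data))
        out.write(data)
    return index


def read_one(container, index, name):
    """Read a single entry by seeking to its recorded offset."""
    offset, length = index[name]
    container.seek(offset)
    return container.read(length)


# Stand-in data; real photos would be 1-10 MB each.
photos = {"a.jpg": b"AAAA", "b.jpg": b"BBBBBB", "c.jpg": b"CC"}
container = io.BytesIO()
index = pack(photos, container)

print(index["b.jpg"])
print(read_one(container, index, "b.jpg"))
```

The read cost here depends on the size of the requested entry, not on the size of the container, as long as the index lookup itself is cheap.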

So my question is: Is Hadoop only feasible for archiving large web-server log files, and not
designed to handle big archives of small files that also contain business-critical data?

Thanks for your advice in advance.


Hayati Gonultas
