hadoop-user mailing list archives

From daemeon reiydelle <daeme...@gmail.com>
Subject Re: Re: Is Hadoop basically not suitable for a photo archive?
Date Tue, 05 Sep 2017 04:26:53 GMT
Kai, this is great. It is well down the path to solving the
small/object-as-file problem. Good show!




Daemeon C.M. Reiydelle
San Francisco 1.415.501.0198
London 44 020 8144 9872


On Mon, Sep 4, 2017 at 8:56 PM, Zheng, Kai <kai.zheng@intel.com> wrote:

> A nice discussion about support of small files in Hadoop.
>
>
>
> Not sure if this really helps, but I’d like to mention that at Intel we
> have actually spent some time on this interesting problem domain before and
> again recently. We plan to develop a small-files compaction optimization
> in the Smart Storage Management project (derived from
> https://issues.apache.org/jira/browse/HDFS-7343) that supports
> writing a small file, reading a small file, reading a batch of small files,
> and compacting small files together in the background. This support is
> transparent to applications, but users need to use an HDFS-compatible
> client. If you’re interested, please refer to the following links. We have
> a rough design and plans; one important target is to support Deep Learning
> use cases that train on lots of small samples stored as files in HDFS. We
> will implement it, but your feedback would be very welcome.
>
>
>
> https://github.com/Intel-bigdata/SSM
>
> https://github.com/Intel-bigdata/SSM/blob/trunk/docs/small-file-solution.md
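>
> The intent is that an application keeps using the ordinary FileSystem API
> and drops in an SSM-aware, HDFS-compatible client underneath. A minimal
> sketch of such a read (the path and class name below are just placeholders,
> not part of the project):
>
>   import org.apache.hadoop.conf.Configuration;
>   import org.apache.hadoop.fs.FSDataInputStream;
>   import org.apache.hadoop.fs.FileSystem;
>   import org.apache.hadoop.fs.Path;
>   import org.apache.hadoop.io.IOUtils;
>   import java.io.ByteArrayOutputStream;
>
>   public class SmallFileRead {
>     public static void main(String[] args) throws Exception {
>       Configuration conf = new Configuration();
>       // Standard HDFS FileSystem API; an HDFS-compatible client would be
>       // swapped in here without changing the application code.
>       FileSystem fs = FileSystem.get(conf);
>       Path photo = new Path("/archive/2017/09/photo-000123.jpg"); // hypothetical path
>       ByteArrayOutputStream out = new ByteArrayOutputStream();
>       try (FSDataInputStream in = fs.open(photo)) {
>         IOUtils.copyBytes(in, out, conf, false);
>       }
>       System.out.println("Read " + out.size() + " bytes");
>     }
>   }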
>
>
>
> Regards,
>
> Kai
>
>
>
> *From:* Hayati Gonultas [mailto:hayati.gonultas@gmail.com]
> *Sent:* Tuesday, September 05, 2017 6:06 AM
> *To:* Alexey Eremihin <a.eremihin@corp.badoo.com.invalid>; Uwe Geercken <
> uwe.geercken@web.de>
> *Cc:* Ralph Soika <ralph.soika@imixs.com>; user@hadoop.apache.org
> *Subject:* Re: Re: Is Hadoop basically not suitable for a photo archive?
>
>
>
> I would recommend an object store such as OpenStack Swift as another
> option.
>
>
>
> On Mon, Sep 4, 2017 at 1:09 PM Uwe Geercken <uwe.geercken@web.de> wrote:
>
> just my two cents:
>
>
>
> Maybe you can use Hadoop for storage, packing multiple files together to
> use HDFS in a smarter way, and at the same time keep a limited amount of
> data/photos - based on time - in parallel in a different solution. I assume
> you won't need high-performance access to the whole time span.
>
>
>
> Yes, it would be a duplication, but maybe - without knowing all the details
> - that would be acceptable and an easy way to go.
>
>
>
> Cheers,
>
>
>
> Uwe
>
>
>
> *Sent:* Monday, 04 September 2017 at 21:32
> *From:* "Alexey Eremihin" <a.eremihin@corp.badoo.com.INVALID>
> *To:* "Ralph Soika" <ralph.soika@imixs.com>
> *Cc:* "user@hadoop.apache.org" <user@hadoop.apache.org>
> *Subject:* Re: Is Hadoop basically not suitable for a photo archive?
>
> Hi Ralph,
>
> In general, Hadoop is able to store such data, and even HAR archives can be
> used in conjunction with WebHDFS (by passing offset and length attributes).
> What are your reading requirements? The filesystem metadata is not
> distributed, and reading the data is limited by the performance of the HDFS
> NameNode server. So if you would like to download files at a high rate of
> requests per second (RPS), that would not work well.
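>
> For example, a ranged read over WebHDFS could look roughly like this (the
> host, port, path and byte range are placeholders; op=OPEN accepts offset
> and length query parameters):
>
>   import java.io.InputStream;
>   import java.net.HttpURLConnection;
>   import java.net.URL;
>
>   public class WebHdfsRangeRead {
>     public static void main(String[] args) throws Exception {
>       // op=OPEN with offset/length reads one member out of a larger packed file.
>       URL url = new URL("http://namenode.example.com:50070/webhdfs/v1"
>           + "/archive/photos.har/part-0?op=OPEN&offset=1048576&length=2097152");
>       HttpURLConnection conn = (HttpURLConnection) url.openConnection();
>       conn.setInstanceFollowRedirects(true); // the NameNode redirects to a DataNode
>       long total = 0;
>       try (InputStream in = conn.getInputStream()) {
>         byte[] buf = new byte[8192];
>         for (int n; (n = in.read(buf)) > 0; ) total += n;
>       }
>       System.out.println("Read " + total + " bytes");
>     }
>   }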
>
> On Monday, September 4, 2017, Ralph Soika <ralph.soika@imixs.com> wrote:
>
> Hi,
>
> I know that questions around the small-file problem have been asked
> frequently, not only on this mailing list.
> I have already read some books about Hadoop and have also started working
> with it. But I still do not really understand whether Hadoop is the right
> choice for my goals.
>
> To simplify my problem domain, I would like to use the use case of a photo
> archive:
>
> - An external application produces about 10 million photos per year.
> The files contain important, business-critical data.
> - A single photo file has a size between 1 and 10 MB.
> - The photos need to be stored for several years (10-30 years).
> - The data store should support replication across several servers.
> - A checksum concept is needed to guarantee the data integrity of all
> files over a long period of time.
> - To write and read the files, a REST API is preferred.
>
> So far Hadoop seems to be absolutely the perfect solution. But my last
> requirement seems to throw Hadoop out of the race.
>
> - The photos need to be readable with very short latency from an external
> enterprise application
>
> With Hadoop HDFS and the web proxy, everything seems perfect. But it seems
> that most Hadoop experts advise against this usage when the size of my
> data files (1-10 MB) is well below the Hadoop block size of 64 or 128 MB.
>
> I think I understand the concepts of HAR and sequence files.
> But if I pack my files together into a large file of many gigabytes, for
> example, it seems impossible to access one single photo from the Hadoop
> repository in a reasonable time. In my eyes it makes no sense to pack
> thousands of files into a large file just so that Hadoop jobs can handle
> them better. For simply accessing a single file from a web interface - as
> in my case - it all seems counterproductive.
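>
> As far as I understand, a single member of a HAR is at least addressable
> directly, roughly like the sketch below (the archive and file names are
> made up by me) - but I am not sure this kind of access is fast enough for
> a web front end:
>
>   import org.apache.hadoop.conf.Configuration;
>   import org.apache.hadoop.fs.FSDataInputStream;
>   import org.apache.hadoop.fs.FileSystem;
>   import org.apache.hadoop.fs.Path;
>
>   public class HarSingleFileRead {
>     public static void main(String[] args) throws Exception {
>       Configuration conf = new Configuration();
>       // A har:// URI addresses one member file inside the archive.
>       Path photo = new Path("har:///user/archive/photos-2017.har/09/photo-000123.jpg");
>       FileSystem fs = photo.getFileSystem(conf);
>       long size = fs.getFileStatus(photo).getLen();
>       byte[] header = new byte[16];
>       try (FSDataInputStream in = fs.open(photo)) {
>         in.readFully(0, header); // read only the first bytes of this one photo
>       }
>       System.out.println("Photo is " + size + " bytes; first byte " + header[0]);
>     }
>   }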
>
> So my question is: Is Hadoop only feasible for archiving large web-server
> log files, and not designed to handle big archives of small files that also
> contain business-critical data?
>
>
> Thanks for your advice in advance.
>
> Ralph
>
>
> --
>
> Hayati Gonultas
>
