From: daemeon reiydelle
Date: Mon, 4 Sep 2017 21:26:53 -0700
To: "Zheng, Kai"
Cc: Hayati Gonultas, Alexey Eremihin, Uwe Geercken, Ralph Soika, user@hadoop.apache.org
Subject: Re: Re: Is Hadoop basically not suitable for a photo archive?

Kai, this is great. It is well down the path to solving the small-file / object-as-file problem. Good show!

Daemeon C.M. Reiydelle
San Francisco 1.415.501.0198
London 44 020 8144 9872

On Mon, Sep 4, 2017 at 8:56 PM, Zheng, Kai <kai.zheng@intel.com> wrote:

> A nice discussion about support of small files in Hadoop.

> Not sure if this really helps, but I'd like to mention that at Intel we have actually spent some time on this interesting problem domain before, and again recently. We plan to develop a small-files compaction optimization in the Smart Storage Management project (derived from https://issues.apache.org/jira/browse/HDFS-7343) that can support writing-a-small-file, reading-a-small-file, reading-batch-of-small-files, and compacting-small-files-together-in-background. This support is transparent to applications, but users need to use an HDFS-compatible client. If you're interested, please refer to the following links. We have a rough design and plans; one important target is to support Deep Learning use cases that want to train on lots of small samples stored in HDFS as files. We will implement it, but your feedback would be very welcome.

> https://github.com/Intel-bigdata/SSM
> https://github.com/Intel-bigdata/SSM/blob/trunk/docs/small-file-solution.md

> Regards,
> Kai
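
To make the "HDFS compatible client" point above a bit more concrete, below is a minimal sketch of the stock org.apache.hadoop.fs.FileSystem calls that such a client would have to stand in for. The paths and file names are made up, and treating the SSM client as a drop-in replacement for this interface is an assumption here, not something verified against the project.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class SmallFileRoundTrip {
        public static void main(String[] args) throws Exception {
            // fs.defaultFS comes from core-site.xml on the classpath; an SSM-style
            // client is assumed (not verified) to slot in behind this same API.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Hypothetical local photo and hypothetical HDFS target path.
            byte[] data = Files.readAllBytes(Paths.get("IMG_0001.jpg"));
            Path photo = new Path("/archive/2017/09/IMG_0001.jpg");

            // One HDFS file per photo: simple, but every file costs NameNode
            // metadata, which is what the compaction work tries to amortize.
            try (FSDataOutputStream out = fs.create(photo, true)) {
                out.write(data);
            }

            // Read it back in full.
            try (FSDataInputStream in = fs.open(photo)) {
                byte[] buf = new byte[(int) fs.getFileStatus(photo).getLen()];
                in.readFully(buf);
            }
        }
    }
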
> From: Hayati Gonultas [mailto:hayati.gonultas@gmail.com]
> Sent: Tuesday, September 05, 2017 6:06 AM
> To: Alexey Eremihin <a.eremihin@corp.badoo.com.invalid>; Uwe Geercken <uwe.geercken@web.de>
> Cc: Ralph Soika <ralph.soika@imixs.com>; user@hadoop.apache.org
> Subject: Re: Re: Is Hadoop basically not suitable for a photo archive?

> I would recommend an object store such as OpenStack Swift as another option.

> On Mon, Sep 4, 2017 at 1:09 PM Uwe Geercken <uwe.geercken@web.de> wrote:

> Just my two cents:

> Maybe you can use Hadoop for storing, and pack multiple files together to use HDFS in a smarter way, while in parallel storing a limited, time-based amount of data/photos in a different solution. I assume you won't need high-performance access to the whole time span.

> Yes, it would be a duplication, but maybe - without knowing all the details - that would be acceptable and an easy way to go.

> Cheers,

> Uwe

> Sent: Monday, 04 September 2017 at 21:32
> From: "Alexey Eremihin"
> To: "Ralph Soika"
> Cc: "user@hadoop.apache.org"
> Subject: Re: Is Hadoop basically not suitable for a photo archive?

> Hi Ralph,

> In general, Hadoop is able to store such data. Even HAR archives can be used in conjunction with WebHDFS (by passing offset and limit attributes). What are your reading requirements? FS metadata are not distributed, and reading the data is limited by the HDFS NameNode server performance. So if you would like to download files at a high RPS, that would not work well.
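
For reference, the ranged read Alexey mentions maps to the WebHDFS OPEN operation, which takes offset and length query parameters. A rough sketch follows; the host, port, archive path and byte range are invented, and in practice the offset/length of a particular photo inside a HAR part file would first have to be looked up in the archive's _index file.

    import java.io.InputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.nio.file.StandardCopyOption;

    public class WebHdfsRangedRead {
        public static void main(String[] args) throws Exception {
            // Host, port, archive path, offset and length are all made up; a real
            // offset/length pair would come out of the HAR's _index file.
            String url = "http://namenode.example.com:50070/webhdfs/v1"
                    + "/archives/photos-2017.har/part-0"
                    + "?op=OPEN&offset=10485760&length=2097152";

            // The NameNode answers OPEN with a redirect to a DataNode, which then
            // streams only the requested byte range.
            HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
            try (InputStream in = conn.getInputStream()) {
                Files.copy(in, Paths.get("IMG_0001.jpg"),
                        StandardCopyOption.REPLACE_EXISTING);
            }
        }
    }

On the metadata side, a common rule of thumb is roughly 150 bytes of NameNode heap per file and per block, so 10 million photos a year kept as individual files is on the order of 10M x 2 x 150 B, i.e. about 3 GB of extra heap per year, which is why the packing and compaction suggestions keep coming up.
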
> On Monday, September 4, 2017, Ralph Soika <ralph.soika@imixs.com> wrote:

> Hi,

> I know that the issue around the small-file problem has been asked about frequently, not only on this mailing list.
> I have also already read some books about Hadoop and started to work with Hadoop, but I still do not really understand whether Hadoop is the right choice for my goals.

> To simplify my problem domain, I would like to use the use case of a photo archive:

> - An external application produces about 10 million photos in one year. The files contain important, business-critical data.
> - A single photo file has a size between 1 and 10 MB.
> - The photos need to be stored for several years (10-30 years).
> - The data store should support replication over several servers.
> - A checksum concept is needed to guarantee the data integrity of all files over a long period of time (see the sketch after this list).
> - A REST API is preferred for writing and reading the files.
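
Regarding the checksum requirement in the list above: HDFS already keeps CRC checksums for every block, verifies them on each read, and re-checks them with a background block scanner. In addition, the client API can report a whole-file checksum that could be stored next to each photo for long-term integrity audits. A minimal sketch, with a made-up path:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileChecksum;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class PhotoChecksum {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path photo = new Path("/archive/2017/09/IMG_0001.jpg"); // hypothetical

            // For HDFS this is an MD5-of-MD5-of-CRC32 value derived from the block
            // checksums; it may be null on file systems that do not support it.
            FileChecksum checksum = fs.getFileChecksum(photo);
            System.out.println(photo + " -> " + checksum);
        }
    }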

> So far Hadoop seems to be absolutely the perfect solution. But my last requirement seems to throw Hadoop out of the race:

> - The photos need to be readable with very short latency from an external enterprise application.

> With Hadoop HDFS and the Web Proxy everything seems perfect. But it seems that most Hadoop experts advise against this usage when the size of my data files (1-10 MB) is well below the Hadoop block size of 64 or 128 MB.

> I think I understood the concepts of HAR or sequence files. But if I pack, for example, my files together into a large file of many gigabytes, it is impossible to access a single photo from the Hadoop repository in a reasonable time. In my eyes it makes no sense to pack thousands of files into a large file just so that Hadoop jobs can handle it better. For simply accessing a single file from a web interface - as in my case - it all seems counterproductive.
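
One note on the paragraph above: random access into a HAR is an index lookup, not a scan of the whole archive, so reading a single member stays cheap even when the part files are many gigabytes. Below is a minimal sketch of reading one photo back out through the har:// scheme; the archive and member names are made up, and the archive is assumed to have been created beforehand with something like "hadoop archive -archiveName photos-2017.har -p /raw 2017 /archives".

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReadOnePhotoFromHar {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // har:// paths are served by HarFileSystem: the member name is resolved
            // through the archive's _index file to an offset/length inside a part-*
            // file, so a single photo does not require reading the whole archive.
            // The archive and member names below are made up.
            Path photo = new Path("har:///archives/photos-2017.har/2017/09/IMG_0001.jpg");
            FileSystem fs = photo.getFileSystem(conf);

            try (FSDataInputStream in = fs.open(photo)) {
                byte[] buf = new byte[(int) fs.getFileStatus(photo).getLen()];
                in.readFully(buf);
            }
        }
    }

The trade-off is that a HAR is immutable once written, so newly arriving photos would go into new archives, for example one per day or per month.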

> So my question is: is Hadoop only feasible for archiving large web-server log files, and not designed to handle big archives of small files that also contain business-critical data?


> Thanks in advance for your advice.

> Ralph

> --

> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscribe@hadoop.apache.org
> For additional commands, e-mail: user-help@hadoop.apache.org

> --
> Hayati Gonultas