From: Alexey Eremihin
Date: Mon, 4 Sep 2017 22:32:27 +0300
Subject: Re: Is Hadoop basically not suitable for a photo archive?
To: Ralph Soika
Cc: "user@hadoop.apache.org"

Hi Ralph,

In general Hadoop is able to store such data, and HAR archives can even be used in conjunction with WebHDFS (by passing offset and length attributes; a sketch of such a ranged read is appended below the quoted message). What are your reading requirements? The filesystem metadata is not distributed, so reads are limited by the performance of the HDFS NameNode. If you want to download files at a high request rate, that will not work well.

On Monday, September 4, 2017, Ralph Soika <ralph.soika@imixs.com> wrote:

> Hi,
>
> I know that the small-files problem has been discussed frequently, not only on this mailing list.
> I have also read some books about Hadoop and have started working with it. But I still do not really understand whether Hadoop is the right choice for my goals.
>
> To simplify my problem domain, consider the use case of a photo archive:
>
> - An external application produces about 10 million photos per year. The files contain important, business-critical data.
> - A single photo file is between 1 and 10 MB in size.
> - The photos need to be stored for many years (10-30 years).
> - The data store should support replication across several servers.
> - A checksum concept is needed to guarantee the integrity of all files over a long period of time.
> - A REST API for writing and reading the files is preferred.
>
> So far Hadoop seems to be the perfect solution. But my last requirement seems to throw Hadoop out of the race:
>
> - The photos need to be readable with very low latency from an external enterprise application.
>
> With Hadoop HDFS and the web proxy everything seems perfect. But most Hadoop experts advise against this usage when the data files (1-10 MB) are well below the HDFS block size of 64 or 128 MB.
>
> I think I understand the concepts of HAR and sequence files.
> But if I pack my files together into large files of many gigabytes, for example, it seems impossible to access a single photo from the Hadoop repository in a reasonable time. In my eyes it makes no sense to pack thousands of files into one large file just so that Hadoop jobs can handle it better; for simply accessing a single file from a web interface, as in my case, it seems counterproductive.
>
> So my question is: is Hadoop only feasible for archiving large web-server log files, and not designed to handle big archives of small but business-critical files?
>
> Thanks in advance for your advice.
>
> Ralph
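As a rough illustration of the HAR + WebHDFS approach described above, the Java sketch below issues a WebHDFS OPEN request with offset and length parameters to pull a single photo's bytes out of a HAR part file. The host, port, user, paths, offset and length are placeholders (the real offset and length of a photo would come from the HAR's _index file); treat this as a sketch, not a tested client.

import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class WebHdfsRangeRead {
    public static void main(String[] args) throws Exception {
        // Hypothetical NameNode address and HAR part file; the offset and length
        // of the embedded photo would come from the archive's _index file.
        String url = "http://namenode.example.com:50070/webhdfs/v1"
                + "/archive/photos-2017.har/part-0"
                + "?op=OPEN&user.name=hdfs&offset=1048576&length=2097152";

        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestMethod("GET");
        // WebHDFS answers with a 307 redirect to a DataNode; HttpURLConnection
        // follows it and streams only the requested byte range.
        try (InputStream in = conn.getInputStream()) {
            byte[] buf = new byte[8192];
            long total = 0;
            for (int n; (n = in.read(buf)) != -1; ) {
                total += n; // a real client would write these bytes to the HTTP response
            }
            System.out.println("Read " + total + " bytes");
        } finally {
            conn.disconnect();
        }
    }
}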
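On the checksum requirement: HDFS already maintains CRC block checksums and exposes a composite file checksum through the FileSystem API (and over WebHDFS via op=GETFILECHECKSUM). A minimal sketch, assuming the default filesystem is HDFS and using a placeholder path:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileChecksum;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.util.StringUtils;

public class HdfsChecksumCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder path on the default (HDFS) filesystem.
        Path photo = new Path("/photos/2017/09/04/img_000123.jpg");

        FileSystem fs = photo.getFileSystem(conf);
        // For HDFS this is a composite checksum computed from the block
        // checksums the cluster already maintains.
        FileChecksum checksum = fs.getFileChecksum(photo);

        // Store this value in your own records and re-check it periodically
        // to detect silent corruption of archived photos.
        System.out.println(checksum.getAlgorithmName() + " "
                + StringUtils.byteToHexString(checksum.getBytes()));
    }
}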
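Regarding the concern about packing photos into multi-gigabyte archives: a HAR keeps an index of its member files, so a single photo can be opened through the har:// filesystem without reading the rest of the archive. A minimal sketch, assuming an archive created with the hadoop archive tool and placeholder paths:

import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HarSingleFileRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Hypothetical archive, created for example with:
        //   hadoop archive -archiveName photos-2017.har -p /photos/2017 /archive
        Path photo = new Path(
                "har:///archive/photos-2017.har/09/04/img_000123.jpg");

        FileSystem fs = photo.getFileSystem(conf);
        try (InputStream in = fs.open(photo)) {
            // HarFileSystem resolves the photo via the archive's index and seeks
            // into the right part file, so only this photo's bytes are read.
            IOUtils.copyBytes(in, System.out, conf, false);
        }
    }
}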