hadoop-mapreduce-user mailing list archives

From Aleks Laz <al-userhad...@none.at>
Subject Newbie Question about 37TB binary storage on HDFS
Date Thu, 27 Nov 2014 16:49:14 GMT
Dear All.

Since ~2012 we have collected a lot of binary data (JPG files).

The storage hierarchy looks like this:

                       <YEAR>/<MONTH>/<DAY>
<MOUNT_ROOT>/cams/<ID>/2014/11/19/

The binary files are in the directory below <DAY>, ~1000 files per
directory, and the storage is mounted as XFS.
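
The layout above can be sketched in Python as follows. This is only a
minimal illustration: the function name `day_directory`, the mount root
`/mnt/storage`, the camera ID `cam42`, and the zero-padding of month and
day are assumptions, not taken from the actual setup.

```python
from datetime import date
from pathlib import PurePosixPath


def day_directory(mount_root: str, cam_id: str, day: date) -> PurePosixPath:
    """Build <MOUNT_ROOT>/cams/<ID>/<YEAR>/<MONTH>/<DAY> for one camera and day."""
    return (
        PurePosixPath(mount_root)
        / "cams"
        / cam_id
        / f"{day.year:04d}"
        / f"{day.month:02d}"  # zero-padding is an assumption
        / f"{day.day:02d}"    # zero-padding is an assumption
    )


# Hypothetical mount root and camera ID, using the date from the example above.
print(day_directory("/mnt/storage", "cam42", date(2014, 11, 19)))
# prints /mnt/storage/cams/cam42/2014/11/19
```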

Because the platform is now growing, we need to create a more scalable
setup.

I have read

http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html
http://wiki.apache.org/hadoop/FAQ#HDFS
...
and hope that I have understood the main concepts behind HDFS.

Since there are more experienced Hadoop and HDFS users on this list
than I am, I hope you can answer some basic questions for me.

Our application is an nginx/php-fpm/PostgreSQL setup.
The target design is nginx + proxy features / php-fpm / $DB / $Storage.

.) Can I use HDFS for both binary data storage and data analysis?

.) What is the preferred way to use HDFS with PHP?
.) How difficult is it to use HDFS with PHP?
    Google has a lot of answers to this question (WebHDFS, NFS, Thrift,
    ...), but which one is now 'the' solution and still 'supported' by
    the Hadoop community?
    Btw.: the PHP link on http://wiki.apache.org/hadoop/HDFS-APIs
    returns a 404.


.) What's a good way to distribute the 37 TB, or the upcoming ~120 TB?
   [ ] N servers with one 37 TB mountpoint per server?
   [ ] N servers with x TB mountpoints per server?
   [ ] other:

.) Is HDFS a good choice for $Storage?
.) Is HBase a good choice for $DB?
    The DB size is smaller than 1 GB; I would use HBase just for the HA
    features of Hadoop.

.) Since HDFS is a file system, I could use
      /cams , for binary data
      /DB   , for DB storage
      /logs , for log storage
    but is this wise? On the 'disk' they are on different RAIDs.

.) Should I plan a dedicated network + card for the 'cluster
communication', as for most other cluster software?
    From what I have read it does not look necessary, but from a
    security point of view, yes.

.) Maybe the communication between the components (hadoop, zk, ...)
could be set up with TLS?

Thank you very much for reading the mail up to this line ;-)

Thank you also for any feedback, which is very welcome and appreciated.

Best Regards
Aleks
