hadoop-common-user mailing list archives

From Marko Dinic <marko.di...@nissatech.com>
Subject Large number of small files
Date Fri, 24 Apr 2015 08:53:29 GMT

I'm not sure if this is the place to ask this question, but I'm still
hoping for an answer/advice.

A large number of small files, about 8KB each, are being uploaded. I am
aware that this is not something you hope for when working with Hadoop.

I was thinking about using HAR files and combined input, or sequence
files. The problem is that the files are timestamped, and I need a
different subset at different times. For example, one job needs to run on
files uploaded during the last 3 months, while the next job might consider the
last 6 months. Naturally, as time passes, a different subset of files is
needed.

This means that I would need to make a sequence file (or a HAR) each
time I run a job, to have a smaller number of mappers. On the other hand,
I need the original files so I could subset them. This means that the
NameNode is under constant pressure, keeping metadata for all of these
files in its memory.
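
To make the subsetting requirement concrete, here is a minimal sketch in
plain Python (no Hadoop calls). It assumes, purely for illustration, that
files could be grouped into one bundle per calendar month of upload, so
that a job over the last N months reads N bundles instead of thousands of
small files. The function names, the month-key format, and the
calendar-month granularity are all hypothetical and are not part of my
current setup.

```python
from collections import defaultdict
from datetime import datetime


def group_by_month(files):
    """Group (name, upload_time) pairs into buckets keyed by 'YYYY-MM'.

    Each bucket stands in for one hypothetical monthly bundle
    (for example one sequence file or HAR per month).
    """
    buckets = defaultdict(list)
    for name, uploaded in files:
        buckets[uploaded.strftime("%Y-%m")].append(name)
    return dict(buckets)


def months_in_window(now, months_back):
    """Return the 'YYYY-MM' keys of the last `months_back` calendar months,
    newest first, counting the month of `now` as the first one."""
    keys = []
    year, month = now.year, now.month
    for _ in range(months_back):
        keys.append(f"{year:04d}-{month:02d}")
        month -= 1
        if month == 0:
            year, month = year - 1, 12
    return keys


def select_files(buckets, now, months_back):
    """Collect the file names from every monthly bucket inside the window."""
    selected = []
    for key in months_in_window(now, months_back):
        selected.extend(buckets.get(key, []))
    return selected


if __name__ == "__main__":
    # Hypothetical file names and upload times.
    files = [
        ("m1.dat", datetime(2015, 4, 1)),
        ("m2.dat", datetime(2015, 2, 10)),
        ("m3.dat", datetime(2014, 12, 5)),
    ]
    buckets = group_by_month(files)
    now = datetime(2015, 4, 24)
    print(select_files(buckets, now, 3))
    print(select_files(buckets, now, 6))
```

With bundles like these, the 3-month job and the 6-month job would share
the same monthly units, and only the bundle for the current month would
need rebuilding as new files arrive. Whether calendar months are a fine
enough granularity for my jobs is an open question.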

How can I solve this problem?

I was also considering using Cassandra, or something similar, and
saving the file content in it instead of in files on HDFS. Each file's
content is actually a measurement, that is, a vector of numbers with
some metadata.

