hadoop-common-user mailing list archives

From Marko Dinic <marko.di...@nissatech.com>
Subject Re: Large number of small files
Date Fri, 24 Apr 2015 09:10:40 GMT
Anand,

Thank you for your answer, but wouldn't that mean that I would have to 
serialize the files each time I need to run a job? And I would still 
need to keep the original files so I could subset them, so the NameNode 
would still have to track all of them?
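
To make my concern concrete, here is roughly the packing step I imagine 
I would have to rerun for every new time window (a minimal sketch; the 
/measurements input directory, the output path, and the 3-month cutoff 
are made-up examples):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class PackSmallFiles {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            // Window cutoff: files modified in the last ~3 months.
            long cutoff = System.currentTimeMillis()
                    - 90L * 24 * 60 * 60 * 1000;
            SequenceFile.Writer writer = SequenceFile.createWriter(
                    fs, conf, new Path("/packed/last3months.seq"),
                    Text.class, BytesWritable.class);
            try {
                for (FileStatus st : fs.listStatus(new Path("/measurements"))) {
                    if (!st.isFile() || st.getModificationTime() < cutoff)
                        continue;
                    // Files are small (~8 KB), so read each one whole.
                    byte[] buf = new byte[(int) st.getLen()];
                    FSDataInputStream in = fs.open(st.getPath());
                    try {
                        in.readFully(0, buf);
                    } finally {
                        in.close();
                    }
                    // Key by file name so the original stays identifiable.
                    writer.append(new Text(st.getPath().getName()),
                            new BytesWritable(buf));
                }
            } finally {
                writer.close();
            }
        }
    }

The originals would still have to stay on HDFS for the next window, 
which is exactly the NameNode pressure I'm worried about.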

Please correct me if I'm missing something; I'm not very experienced 
with Hadoop.

What do you think about using Cassandra?
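
What I have in mind there is something like this sketch using the 
DataStax Java driver (the "metrics" keyspace and the table and column 
names are hypothetical): one partition per device and month bucket, so 
a job could read exactly the buckets for its window instead of 
re-packing files each run.

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.ResultSet;
    import com.datastax.driver.core.Row;
    import com.datastax.driver.core.Session;

    public class MeasurementStore {
        public static void main(String[] args) {
            Cluster cluster = Cluster.builder()
                    .addContactPoint("127.0.0.1").build();
            // Assumes a "metrics" keyspace already exists.
            Session session = cluster.connect("metrics");

            // One partition per (device, month bucket), rows ordered by
            // timestamp within the partition.
            session.execute(
                "CREATE TABLE IF NOT EXISTS measurements (" +
                "  device_id text, bucket text, ts timestamp," +
                "  vector list<double>, metadata map<text,text>," +
                "  PRIMARY KEY ((device_id, bucket), ts))");

            // Read one device's measurement vectors for one month bucket.
            ResultSet rs = session.execute(
                "SELECT vector FROM measurements " +
                "WHERE device_id = 'sensor-42' AND bucket = '2015-04'");
            for (Row row : rs) {
                System.out.println(row.getList("vector", Double.class));
            }
            cluster.close();
        }
    }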

Thanks

On Fri 24 Apr 2015 11:03:19 AM CEST, Chandra Mohan, Ananda Vel Murugan 
wrote:
> Apart from databases like Cassandra, you may check serialization
> formats like Avro or Parquet
>
> Regards,
> Anand
>
> -----Original Message-----
> From: Marko Dinic [mailto:marko.dinic@nissatech.com]
> Sent: Friday, April 24, 2015 2:23 PM
> To: user@hadoop.apache.org
> Subject: Large number of small files
>
> Hello,
>
> I'm not sure if this is the place to ask this question, but I'm still
> hoping for an answer/advice.
>
> A large number of small files are uploaded, each about 8 KB. I am aware
> that this is not something you're hoping for when working with Hadoop.
>
> I was thinking about using HAR files and combined input, or sequence
> files. The problem is that the files are timestamped, and I need a
> different subset at different times; for example, one job needs to run
> on files uploaded during the last 3 months, while the next job might
> consider the last 6 months. Naturally, as time passes, a different
> subset of files is needed.
>
> This means that I would need to make a sequence file (or a HAR) each
> time I run a job, to have a smaller number of mappers. On the other
> hand, I need the original files so that I can subset them. This means
> that the NameNode is under constant pressure, keeping all of this
> metadata in memory.
>
> How can I solve this problem?
>
> I was also considering using Cassandra, or something like it, and
> saving the file content there instead of in files on HDFS. The file
> content is actually a measurement, that is, a vector of numbers, with
> some metadata.
>
> Thanks
