hadoop-mapreduce-user mailing list archives

From Mirko Kämpf <mirko.kae...@gmail.com>
Subject Re: HDFS - many files, small size
Date Thu, 02 Oct 2014 09:30:56 GMT
Hi Roger,

you can use Apache Flume to ingest these files into your cluster. Store them
in an HBase table for fast random access and extract the "metadata" on the
fly using morphlines (see:
http://kitesdk.org/docs/0.11.0/kite-morphlines/index.html). Even the
base64 conversion can be done on the fly if you like. For MapReduce jobs
you can consider sequence files as intermediate storage, or Avro, which is
more flexible. HCatalog allows you to access datasets stored in HBase (see:
https://cwiki.apache.org/confluence/display/HCATALOG/HCatalog+HBase+Integration+Design).
If random access to all files is not required, I suggest not using HBase.
SolrCloud can also store the raw content alongside the extracted metadata,
but running MapReduce jobs on it is not that simple.
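
To illustrate the idea of converting to base64 while bundling many small
files into one larger file, here is a minimal Python sketch. It is not a
Hadoop SequenceFile or Avro writer; it uses a plain text file with one
"name<TAB>base64" record per line as a stand-in, which a line-oriented
MapReduce job could read. The function names pack_files and iter_records
are hypothetical, and the sketch assumes file names contain no tab or
newline characters.

```python
import base64
import os


def pack_files(paths, out_path):
    """Write one 'name<TAB>base64' line per input file; return the record count."""
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for path in paths:
            # Read the raw bytes and encode them; standard base64 output
            # contains no tabs or newlines, so one record fits on one line.
            with open(path, "rb") as src:
                encoded = base64.b64encode(src.read()).decode("ascii")
            out.write(f"{os.path.basename(path)}\t{encoded}\n")
            count += 1
    return count


def iter_records(packed_path):
    """Yield (name, raw_bytes) pairs from a file written by pack_files."""
    with open(packed_path, "r", encoding="utf-8") as packed:
        for line in packed:
            # Split on the first tab only: the name comes first, payload second.
            name, encoded = line.rstrip("\n").split("\t", 1)
            yield name, base64.b64decode(encoded)
```

The resulting single file could then be uploaded with one 'put' call
instead of one call per PDF. Keep in mind that base64 grows the payload by
roughly one third.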

Good luck and
Best wishes,
Mirko

2014-10-02 9:12 GMT+01:00 Roger Maillist <darkchanterlist@gmail.com>:

> Hi there
> I have millions of rather small PDF files which I want to load into HDFS
> for later analysis. I also need to re-encode them as a base64 stream to get
> the MR job for parsing to work.
>
> Is there any better/faster method than just calling the 'put' function in a
> huge (bash) loop? Maybe I could implement encoding and loading as an MR job
> itself?
>
> Second thing is, according to a Cloudera blog I read, it's a bad idea to
> store small files on HDFS, especially if there are large numbers of them.
> They recommend HBase instead. However I want to take further action via
> HCatalog...
>
> Thanks for your suggestions
> Roger
>
