hadoop-common-user mailing list archives

From: Edward Capriolo <edlinuxg...@gmail.com>
Subject: Re: FileSystem Caching in Hadoop
Date: Wed, 07 Oct 2009 15:20:47 GMT
On Wed, Oct 7, 2009 at 10:48 AM, Todd Lipcon <todd@cloudera.com> wrote:
> On Wed, Oct 7, 2009 at 7:45 AM, Edward Capriolo <edlinuxguru@gmail.com>wrote:
>> Todd,
>> I do think it could be an inherent problem. With all the reading and
>> writing of intermediate data Hadoop does, the file system cache would
>> likely never contain the initial raw data you want to work with.
>> The HBase RegionServer seems to be successful, so there must be some
>> place for caching.
>> Once I get something into HDFS, like the last hour's log data, about 40
>> different processes are going to repeatedly re-read it from disk. I
>> think if I can force that data into a cache I can get much faster
>> processing.
> In cases like this, we should expose access type hints like posix_fadvise
> POSIX_FADV_DONTNEED for the data we don't want to end up in the cache.
> There's already a JIRA out there for a JNI library for platform-specific
> optimization, and I think this is one that will be worth doing.
> -Todd
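For reference, a minimal C sketch of the hint Todd describes — dropping a file's pages from the OS page cache once the intermediate data is no longer needed. The helper name and usage are assumptions; a real Hadoop integration would reach this through the JNI library mentioned above.

```c
/* Sketch: advise the kernel to evict a file's pages from the page
 * cache, so intermediate data doesn't crowd out the raw input.
 * Hypothetical helper, not the actual Hadoop JNI code. */
#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <unistd.h>

int drop_from_cache(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    /* Flush dirty pages first so DONTNEED can actually evict them. */
    fsync(fd);
    /* offset 0, len 0 => apply the advice to the whole file. */
    int rc = posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
    close(fd);
    return rc; /* 0 on success, an errno value on failure */
}
```

Calling this right after a task finishes writing its intermediate output would keep that data from displacing the hot input files other readers still want cached.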

Those make sense.

This started with the HBase RegionServer; now we are at VFS hints and JNI.

I think the optimizations could be done in lots of places: close to the
application with an InputFormat and memcache, or at the other end we
could go the Oracle route and write to raw disk partitions :)
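The opposite hint also fits Edward's scenario of forcing the last hour's log data into the cache before ~40 processes re-read it. A minimal sketch, again with an assumed helper name and a local file path (HDFS block files would be the real target):

```c
/* Sketch: ask the kernel to read a file into the page cache ahead
 * of many readers, using POSIX_FADV_WILLNEED readahead.
 * Hypothetical helper for illustration only. */
#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <unistd.h>

int warm_cache(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    /* Kicks off asynchronous readahead of the whole file;
     * returns without waiting for the I/O to finish. */
    int rc = posix_fadvise(fd, 0, 0, POSIX_FADV_WILLNEED);
    close(fd);
    return rc;
}
```

Since the advice is only a hint, the kernel may still evict the pages under memory pressure — which is why the discussion above also considers caches closer to the application.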
