hadoop-hdfs-issues mailing list archives

From "Jay Booth (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HDFS-516) Low Latency distributed reads
Date Mon, 03 Aug 2009 22:43:14 GMT

    [ https://issues.apache.org/jira/browse/HDFS-516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12738651#action_12738651 ]

Jay Booth commented on HDFS-516:
--------------------------------

Wow, thanks Raghu, that's awesome and will save me a ton of time.  A couple of points for discussion:

* The random 4k byte grabber is awesome and I will be using it as part of my benchmarking
at the first opportunity.  However, I think it's also worth testing some likely applications
to really show the strength of client-side caching.  10MB or so of properly warmed cache
could mean the first 20 lookups in a binary search are almost free, and having the frontmost
10% of a Lucene index in cache will mean that almost all of the scoring portion of a search
is computed against local memory.  Meanwhile, for truly random reads, a cache
that's, say, 5-10% of the size of the data will only get you a small improvement.  So I'd
like to get some numbers for use cases that really thrive on caching, in addition to truly
random access.  The random-read test will still be extremely useful for tuning the IO layer
and establishing a baseline for cache-miss performance, so thanks for the heads up.

* I have a feeling that my implementation is significantly slower than the default when it
comes to streaming, since it relies on successive small positioned reads and a heavy memory
footprint rather than a simple stream of bytes.  It sure seemed that way watching my unit
tests run on my laptop, although with a ton of confounding factors that's not a scientific
measurement (one more item to benchmark).  So while I agree with the urge for simplicity, I
feel like we need to make that performance tradeoff clear; otherwise, we could end up with a
lot of very slow MapReduce jobs.  Given that MapReduce is the primary use case for Hadoop, my
instinct was to make RadFileSystem a non-default implementation.  Point very well taken about
the BlockLocations and CRC verification.  Maybe the best way to handle future integration with
DataNode would be to develop separately, reuse as much code as possible, and then revisit a
merge with DistributedFileSystem once RadFileSystem is mature and benchmarked?
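
As a rough sanity check on the binary-search point in the first bullet: the first d levels of a binary search can touch at most 2^d - 1 distinct midpoints, so a cache budget bounds how many levels can be fully warmed. The sketch below is only back-of-the-envelope arithmetic; the class name, method name, and the bytesPerProbe parameter (how many bytes get cached for each probed midpoint) are assumptions for illustration, since the comment does not state a caching granularity.

```java
/** Hypothetical helper: how many binary-search levels fit in a cache budget. */
final class BinarySearchCacheMath {

    /**
     * Returns the largest d such that all 2^d - 1 possible midpoints of the
     * first d levels fit in the cache, assuming each cached midpoint costs
     * bytesPerProbe bytes (an assumed granularity, e.g. a record or a page).
     */
    static int levelsCovered(long cacheBytes, long bytesPerProbe) {
        long slots = cacheBytes / bytesPerProbe;
        int levels = 0;
        // (1L << (levels + 1)) - 1 is the midpoint count for one more level.
        while (levels < 62 && (1L << (levels + 1)) - 1 <= slots) {
            levels++;
        }
        return levels;
    }
}
```

Under these assumptions, levelsCovered(10L * 1024 * 1024, 10) gives 20 levels when only about 10 bytes are kept per midpoint, while caching whole 4096-byte pages gives 11. Later levels of a search narrow to within a single page, so real hit rates depend on file size and page size as well.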

Thanks again, I'll try to write a post later tonight with an explicit plan for benchmarking,
and then maybe people can comment and poke holes in it as they see fit?

> Low Latency distributed reads
> -----------------------------
>
>                 Key: HDFS-516
>                 URL: https://issues.apache.org/jira/browse/HDFS-516
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>            Reporter: Jay Booth
>            Priority: Minor
>         Attachments: radfs.patch
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> I created a method for low latency random reads using NIO on the server side and simulated
> OS paging with LRU caching and lookahead on the client side.  Some applications could include
> Lucene searching (term->doc and doc->offset mappings are likely to be in local cache,
> thus much faster than Nutch's current FsDirectory impl) and binary search through record files
> (bytes at the 1/2, 1/4, 1/8 marks are likely to be cached).
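
The attached radfs.patch is not reproduced in this message, so the following is only a minimal sketch of the general idea in the description: a client-side page cache with LRU eviction and simple lookahead. Every name here (LruPageCache, PageSource, readByte) is hypothetical, and PageSource stands in for whatever remote read the real client performs.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical client-side page cache: LRU eviction plus simple lookahead. */
final class LruPageCache {

    /** Stand-in for a remote read; must return exactly one page of bytes. */
    interface PageSource {
        byte[] fetch(long pageIndex);
    }

    private final int pageSize;
    private final int lookahead;
    private final PageSource source;
    private final Map<Long, byte[]> pages;
    long hits;
    long misses;

    LruPageCache(int pageSize, final int maxPages, int lookahead, PageSource source) {
        this.pageSize = pageSize;
        this.lookahead = lookahead;
        this.source = source;
        // accessOrder=true keeps entries ordered least-recently-used first,
        // so removeEldestEntry evicts the LRU page once the cache is full.
        this.pages = new LinkedHashMap<Long, byte[]>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Long, byte[]> eldest) {
                return size() > maxPages;
            }
        };
    }

    /** Reads one byte, loading the page (and the next few pages) on a miss. */
    byte readByte(long position) {
        long index = position / pageSize;
        byte[] page = pages.get(index);
        if (page != null) {
            hits++;
        } else {
            misses++;
            // Prefetch the following pages first, so the requested page is
            // inserted last and ends up as the most recently used entry.
            for (long next = index + lookahead; next > index; next--) {
                if (!pages.containsKey(next)) {
                    pages.put(next, source.fetch(next));
                }
            }
            page = source.fetch(index);
            pages.put(index, page);
        }
        return page[(int) (position % pageSize)];
    }
}
```

With a lookahead of 1, a miss on page 0 also loads page 1, so a following read at the start of page 1 is served from memory. Choosing maxPages larger than lookahead + 1 keeps a prefetch from evicting the pages it just loaded.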

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

