hadoop-hive-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Joydeep Sen Sarma (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HIVE-250) shared memory java dbm for map-side joins
Date Mon, 26 Jan 2009 18:55:59 GMT

    [ https://issues.apache.org/jira/browse/HIVE-250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12667351#action_12667351
] 

Joydeep Sen Sarma commented on HIVE-250:
----------------------------------------

ok - i guess i was quite off on this.

These modules use a java object cache - they deserialize objects from underlying dbm files.
the object cache (in the way it's implemented) cannot be shared across processes. the act
of mmaping the file and then reading it has some performance advantages (less system calls)
- but doesn't really help that much from memory perspective. if the file is small and there
are lots of readers - then the file system will anyway hold the file in cache (and that's
1X memory consumption).

so for v1 - what i would say is just use one of these modules with efficient deserialization
(jdbm uses object serialization by default and that's way too expensive). and trust file caching
to do it's job. The object cache in dbm (configurable) will have to be kept to reasonable
size - quite likely a subset of the entire data size.

if we get time - we can do something smarter - the object cache should have pointers to data
in the underlying mmap'ed file. This would mean that data across all object caches is shared
(we will still have to pay for object overhead though). with the same amount of memory - the
object cache can be sized much larger (since it's not storing data anymore).

--
i did do some experiments:
- replaced file io with mmaped buffers in jdbm - works perfectly
- bulk of compute cost is in object deserialization. On my T60 with a single thread - lookup
throughput went from 6.5K op/s to 41K op/s if deserialization was avoided (by making the object
cache large enough to store all values). cpu bound of course.



> shared memory java dbm for map-side joins
> -----------------------------------------
>
>                 Key: HIVE-250
>                 URL: https://issues.apache.org/jira/browse/HIVE-250
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Query Processor
>            Reporter: Joydeep Sen Sarma
>            Assignee: Joydeep Sen Sarma
>
> can use either:
> - sdbm: http://freshmeat.net/projects/solingerjavasdbm/
> - jdbm: http://sourceforge.net/projects/jdbm/
> both need modifications to use file mmaps instead of regular file io. will do some testing
to see if there's a major difference between the two.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Mime
View raw message