hive-dev mailing list archives

From "Namit Jain (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HIVE-1700) Optimiza JDBM to make mapjoin faster
Date Tue, 12 Oct 2010 20:54:34 GMT

    [ https://issues.apache.org/jira/browse/HIVE-1700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12920339#action_12920339 ]

Namit Jain commented on HIVE-1700:
----------------------------------

Let us break this into two separate JIRAs:

1. HTree.get() deserializes both the key and the value of each entry it examines until it
finds a matching key. We could deserialize only the key, and defer deserializing the value
until the key matches.
As Joydeep suggested, it seems like we should move all deserialization to Hive land. JDBM
should just work on byte arrays for both keys and values (since the output of the serializer
used by Hive is byte-comparable, that seems to suffice).

2. HTree.get() accounts for 70% of total time. A Bloom filter could help a lot here by
avoiding unneeded get() calls when we know for sure that the given key is not in JDBM. (We
can generate the Bloom filter when doing the JDBM sink, and read it into memory when doing
the read.)
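
As a rough illustration of point 1, the sketch below keeps keys and values as byte arrays and deserializes a value only after the raw key bytes match. ByteBucket, its methods, and the UTF-8 string serializer are hypothetical stand-ins chosen for the example; they are not JDBM's HTree internals or Hive's serializer.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for one hash bucket; not JDBM's actual HTree code. */
final class ByteBucket {
    private final List<byte[]> keys = new ArrayList<>();
    private final List<byte[]> values = new ArrayList<>();

    /** Counts value deserializations so the saving is observable. */
    int valueDeserializations = 0;

    void put(byte[] key, byte[] value) {
        keys.add(key.clone());
        values.add(value.clone());
    }

    /** Compares raw key bytes; deserializes a value only after a key matches. */
    String get(byte[] probeKey) {
        for (int i = 0; i < keys.size(); i++) {
            if (Arrays.equals(keys.get(i), probeKey)) {
                valueDeserializations++;
                return deserialize(values.get(i));
            }
        }
        return null; // no match: no value was deserialized
    }

    /** Placeholder for a Hive-side serializer (assumption: UTF-8 strings). */
    static byte[] serialize(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    /** Placeholder for a Hive-side deserializer (assumption: UTF-8 strings). */
    static String deserialize(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }
}
```

In this sketch a lookup that misses never touches a value, and a lookup that hits deserializes exactly one value, regardless of how many entries were scanned first.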
   
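
For point 2, a minimal Bloom filter over serialized keys might look like the following. This is only a sketch: the class name KeyBloomFilter, the bit-array size, the number of hashes, and the choice of hash functions are all assumptions made for illustration, and a real patch would also need to persist the bits at sink time and load them at read time.

```java
import java.util.Arrays;
import java.util.BitSet;

/** Minimal Bloom filter over serialized keys; sizes and hashes are illustrative. */
final class KeyBloomFilter {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    KeyBloomFilter(int numBits, int numHashes) {
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    /** Called once per key while the table is written (the sink phase). */
    void add(byte[] key) {
        for (int i = 0; i < numHashes; i++) {
            bits.set(index(key, i));
        }
    }

    /** false means the key is definitely absent; true means it may be present. */
    boolean mightContain(byte[] key) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(index(key, i))) {
                return false;
            }
        }
        return true;
    }

    /** Double hashing: derive the i-th probe position from two base hashes. */
    private int index(byte[] key, int i) {
        int h1 = Arrays.hashCode(key);
        int h2 = secondHash(key);
        return Math.floorMod(h1 + i * h2, numBits);
    }

    /** 32-bit FNV-1a, used here as the second hash function. */
    private static int secondHash(byte[] key) {
        int h = 0x811c9dc5;
        for (byte b : key) {
            h ^= (b & 0xff);
            h *= 0x01000193;
        }
        return h;
    }
}
```

On the read side, the caller would consult mightContain() first and call HTree.get() only when it returns true. Since a Bloom filter has no false negatives, skipping get() on a false result cannot drop a matching row; false positives only cost an occasional unneeded get().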


> Optimiza JDBM to make mapjoin faster
> ------------------------------------
>
>                 Key: HIVE-1700
>                 URL: https://issues.apache.org/jira/browse/HIVE-1700
>             Project: Hadoop Hive
>          Issue Type: Improvement
>            Reporter: He Yongqiang
>
> copied from email:
> From: Joydeep Sen Sarma
> Sent: Tuesday, October 12, 2010 11:11 AM
> To: Yongqiang He; Liyin Tang; Namit Jain
> Subject: RE: Optimize jdbm
> seems like we should move all deserialization to hive land. jdbm should just work on
> byte arrays for both keys and values. (since the output of the serializer used by hive is
> byte comparable - that seems to suffice)
> ________________________________________
> From: Yongqiang He
> Sent: Tuesday, October 12, 2010 10:22 AM
> To: Liyin Tang; Namit Jain
> Cc: Joydeep Sen Sarma
> Subject: Optimize jdbm
>   1.  Htree.get() cost 70% total time.  It could help a lot if there is bloom filter
>       here to avoid unneeded get() if we know for sure the given key is not in JDBM. (we can
>       generate the bloom filter when doing the jdbm sink, and read into memory when doing read. )
>   2.  HTree.get() will deserialize both key and value until find a matched key. We can
>       only de-serialize the key, and de-serialize the value until the key match.
> Any others?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.