hbase-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Duo Zhang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-15352) FST BlockEncoder
Date Sat, 27 Feb 2016 07:36:18 GMT

    [ https://issues.apache.org/jira/browse/HBASE-15352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15170424#comment-15170424
] 

Duo Zhang commented on HBASE-15352:
-----------------------------------

I used to think about this but there is a little difference between lucene's index and our
index.

For lucene, the index only need to report hit or not hit, and if hit, return the offset. The
algorithm is straight forward on a sorted list.

But for us, our index does not reject any input, that means, for every input we need to return
an offset for it.

For example, two pairs

aa -> 10
az -> 20

For lucene, input 'aa' will return 10 and input 'az' will return 20, input 'ab' or 'ac' or
'aaz' will just be rejected.

But for us, besides 'aa' and 'az', we need to return 10 for every input that between 'aa'
and 'az'. So we need to modify the algorithm otherwise we will have an infinite number of
states. And how to compress the states will be the key issue here.
And I'm even not sure whether it could be represented by an FSA... maybe a PDA here? If so,
that's a completely different problem...

Thanks.

> FST BlockEncoder
> ----------------
>
>                 Key: HBASE-15352
>                 URL: https://issues.apache.org/jira/browse/HBASE-15352
>             Project: HBase
>          Issue Type: New Feature
>          Components: regionserver
>            Reporter: Nick Dimiduk
>             Fix For: 2.0.0, 0.98.19, 1.4.0
>
>
> We could improve on the existing [PREFIX_TREE block|http://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/codec/prefixtree/package-summary.html]
encoder by upgrading the persistent data structure from a trie to a finite state transducer.
This would theoretically allow us to reuse bytes not just for rowkey prefixes, but infixes
and suffixes as well. My read of the literature means we may also be able to encode values
as well, further reducing storage size when values are repeated (ie, a "customer id" field
with very low cardinality -- probably happens a lot in our denormalized world). There's a
really nice [blog post|http://blog.burntsushi.net/transducers/] about this data structure,
and apparently our siblings in Lucene make heavy use of [their implementation|http://lucene.apache.org/core/5_5_0/core/org/apache/lucene/util/fst/package-summary.html#package_description].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message