commons-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "J.Pietschmann" <>
Subject [collections] [lang] Req: Trie/ternary tree data structure
Date Sat, 29 Nov 2003 21:34:43 GMT
Hi all,
I'd like to tap into the community wisdom for some thoughts.

  == Background ==
There are various task which require all substrings of a certain
string to be looked up in a collection. The application I'm
concerned with is pattern based word hyphenation, but there are
others (spell check, stemming).
An example: take the string "word". The task requires looking up
the 10 strings "w", "wo", "wor", "word", "o", "or", "ord", "r",
"rd" and "d".

The idea is to take advantage of the fact that the strings to look
up are related because they are substrings of another string. Another
point is that you know that you deal with strings/characters, not
arbitrary data as keys.
Liang, Knuth et al. used a trie for handling this efficiently, both
in terms of performance and memory. A node in a trie is basically
a sorted array of characters, so that the character can be looked up
quickly. The slot then points to the stored value as well as to
another node.
If it is known that a key string in the trie is never a substring
of another key string, the pointers to the value and the following
node can take up the same storage. All of the patters for hyphenation
i've seen have this property, but there is no guarantee.

The Apache FOP hyphenator uses a ternary tree as implementation.
Basically it's the same, except the sorted array is replaced by a
binary tree. Instread of using Java object for the tree nodes, tree
char arrays are used for storing indices to the respective following
nodes. Ideally, the binary trees should be balanced in order to provide
the same O(log(n)) time for looking up like the sorted arrays in the trie
do. Of course, if the ternary tree is build by inserting random key
strings this requires red-black-trees or a similar mechanism.
for the current state of the art.

An example how strings are looked up in a trie: Suppose the collection
contains the key strings "ord", "ot" and "rda".
The trie is roughly
     (0) o  r
        /    \
  (1) r t    d (2)
      | -    |
  (3) d      a (4)
      -      -

Now for the string lookup
i=0 j=0 c='w' no match in (0) -> continue
i=1 j=0 c='o' match 'o' in (0) -> (1)
     j=1 c='r' match 'r' in (1) -> (3)
     j=3 c='d' match 'd' in (3), leaf node, get value
i=2 j=0 c='r' match 'r' in (0) -> (2)
     j=1 c='d' match 'd' in (2) -> (4)
     eos but no leaf node -> continue
i=3 j=0 c='d' no match (0) -> continue

You see, none of the longer substrings starting with "w" is actually
looked up, which should provide some performance gain over simply using
a hashtable.

  == Requirements ==

The data structure is built once from mor or less randomly distributed key
strings. After this, it is used for many lookups. This means the key
strings can first be gathered and analyzed, then stored into an optimized
data structure.

The number of expected key strings is of the order of several thousend to
a few hundred thousand. Efficient memory usage is a must. Tradeoffs between
using less memory and lookup performance should be carefully considered.

It should be possible to make the optimized data structure persistent. It
should load fast from a data or serialized object file. Unfortunately, I
have no experience whether it is more advanteous to serialize a tree of
Java objects or an object with a one or few big arrays.

  == Considerations ==

1. As already said, for the current hyphenation patters no key string
is a substring at the beginning of another key. I.e if there is a "wor"
key, there won't be "word" or "wora" keys, but there may be a "kword"
key. This may allow some storage optimizations. (see
for an example, the keys are the strings in the <patterns> sections sans
the digits)
However for using such a data structure for stemming, there may be keys
which are also at the satrt of other keys. Actually, the curent state for
hyphenation patters is to some extent an artefact of how the patterns are
created, and it may well be it means we are using suboptimal patterns.

2. In most cases, the set of unicode characters in the key strings is limited.
It is possible that the charaters can be mapped onto bytes. For hyphenation,
a mapping of the characters of the original string is necessary anyway
(normalizing uppercase characters to lowercase etc.).

Ok, now what do you think?


To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message