>> I think, if we add ord as an output to the FST, then it builds
>> everything we need? Ie no further data structures should be needed?
>> Maybe I'm confused :)
>
> If you put the ord as an output the common part will be shifted towards the
> front of the tree. This will work if you want to look up a given value
> assigned to some string, but will not work if you need to look up the string
> from its value. The latter case can be solved if you "know" which branch to
> take while descending from root and the "shared prefix" alone won't give you
> this information. At least I don't see how it could.
>
> I am familiar with the basic prefix hashing procedure suggested by Daciuk
> (and other authors), but maybe some progress has been made there, I don't
> know... the one I know is really conceptually simple  since each arc
> encodes the number of leaves (or input sequences) in the automaton, you know
> which path must lead you to your string. For example if you have a node like
> this and seek for the 12th term:
>
> 0  10  ...
> + 10  ...
> + 5  ..
> you look at the first path, it'd give you terms 1..10, then the next one
> contains terms 11..20 so you add 10 to an internal counter which is added to
> further computations, descend and repeat the procedure until you find a leaf
> node.
>
> Dawid
There's a possible speedup here. If, instead of storing the count of
all downstream leaves, you store the sum of counts for all previous
siblings, you can do a binary lookup instead of linear scan on each
node.
Taking your example:
0  0  ...
+ 10  ... We know that for 12th term we should descend along
this edge, as it has the biggest tag less than 12.
+ 15  ...
That's what I invented, and yes, it was invented by countless people before :)

Kirill Zakharenko/Кирилл Захаренко
EMail/Jabber: earwin@gmail.com
Phone: +7 (495) 6835674
ICQ: 104465785

To unsubscribe, email: devunsubscribe@lucene.apache.org
For additional commands, email: devhelp@lucene.apache.org
