lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Earwin Burrfoot <ear...@gmail.com>
Subject Re: FST and FieldCache?
Date Thu, 19 May 2011 12:31:18 GMT
>> I think, if we add ord as an output to the FST, then it builds
>> everything we need?  Ie no further data structures should be needed?
>> Maybe I'm confused :)
>
> If you put the ord as an output the common part will be shifted towards the
> front of the tree. This will work if you want to look up a given value
> assigned to some string, but will not work if you need to look up the string
> from its value. The latter case can be solved if you "know" which branch to
> take while descending from root and the "shared prefix" alone won't give you
> this information. At least I don't see how it could.
>
> I am familiar with the basic prefix hashing procedure suggested by Daciuk
> (and other authors), but maybe some progress has been made there, I don't
> know... the one I know is really conceptually simple -- since each arc
> encodes the number of leaves (or input sequences) in the automaton, you know
> which path must lead you to your string. For example if you have a node like
> this and seek for the 12-th term:
>
> 0 -- 10 -- ...
>   +- 10 -- ...
>   +- 5 -- ..
> you look at the first path, it'd give you terms 1..10, then the next one
> contains terms 11..20 so you add 10 to an internal counter which is added to
> further computations, descend and repeat the procedure until you find a leaf
> node.
>
> Dawid

There's a possible speedup here. If, instead of storing the count of
all downstream leaves, you store the sum of counts for all previous
siblings, you can do a binary lookup instead of linear scan on each
node.
Taking your example:

0 -- 0 -- ...
  +- 10 -- ... We know that for 12-th term we should descend along
this edge, as it has the biggest tag less than 12.
  +- 15 -- ...

That's what I invented, and yes, it was invented by countless people before :)

-- 
Kirill Zakharenko/Кирилл Захаренко
E-Mail/Jabber: earwin@gmail.com
Phone: +7 (495) 683-567-4
ICQ: 104465785

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Mime
View raw message