lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Michael McCandless <luc...@mikemccandless.com>
Subject Re: FST and FieldCache?
Date Thu, 19 May 2011 12:44:27 GMT
This (storing sums) is, I think, exactly what the FST stores as
outputs on the arcs.  Ie, it bounds the range of outputs if you were
to recurse on that arc.

So, from any node, we can unambiguously determine which arc to recurse
on, when looking up by value (only if the value is strictly
monotonic).

It should be straightforward to implement, ie should not require any
additional data structure / storage in the FST.  It's a lookup-only
change, I think.

Mike

http://blog.mikemccandless.com

On Thu, May 19, 2011 at 8:31 AM, Earwin Burrfoot <earwin@gmail.com> wrote:
>>> I think, if we add ord as an output to the FST, then it builds
>>> everything we need?  Ie no further data structures should be needed?
>>> Maybe I'm confused :)
>>
>> If you put the ord as an output the common part will be shifted towards the
>> front of the tree. This will work if you want to look up a given value
>> assigned to some string, but will not work if you need to look up the string
>> from its value. The latter case can be solved if you "know" which branch to
>> take while descending from root and the "shared prefix" alone won't give you
>> this information. At least I don't see how it could.
>>
>> I am familiar with the basic prefix hashing procedure suggested by Daciuk
>> (and other authors), but maybe some progress has been made there, I don't
>> know... the one I know is really conceptually simple -- since each arc
>> encodes the number of leaves (or input sequences) in the automaton, you know
>> which path must lead you to your string. For example if you have a node like
>> this and seek for the 12-th term:
>>
>> 0 -- 10 -- ...
>>   +- 10 -- ...
>>   +- 5 -- ..
>> you look at the first path, it'd give you terms 1..10, then the next one
>> contains terms 11..20 so you add 10 to an internal counter which is added to
>> further computations, descend and repeat the procedure until you find a leaf
>> node.
>>
>> Dawid
>
> There's a possible speedup here. If, instead of storing the count of
> all downstream leaves, you store the sum of counts for all previous
> siblings, you can do a binary lookup instead of linear scan on each
> node.
> Taking your example:
>
> 0 -- 0 -- ...
>  +- 10 -- ... We know that for 12-th term we should descend along
> this edge, as it has the biggest tag less than 12.
>  +- 15 -- ...
>
> That's what I invented, and yes, it was invented by countless people before :)
>
> --
> Kirill Zakharenko/Кирилл Захаренко
> E-Mail/Jabber: earwin@gmail.com
> Phone: +7 (495) 683-567-4
> ICQ: 104465785
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Mime
View raw message