lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Robert Muir <rcm...@gmail.com>
Subject Re: Fst - arc output?
Date Fri, 30 Mar 2012 18:29:36 GMT
Well I think your picture demonstrates why its hard to beat? all the
'huge numbers' are in the root arcs which are easily cached (and
cached for latin-1 by default).

So its bad in that it bloats the FST, but ultimately bloats it where
it doesn't hurt.

Of course if someone wants to use this for say, CJK, then we need to
add an option to cache more root arcs (just like Kuromoji does). If it
costs you 1 or 2MB ram to have a faster suggester... i think thats an
ok tradeoff.

but yeah, if there is a more efficient way thats also not bloated, i'm
interested!

On Fri, Mar 30, 2012 at 2:23 PM, Dawid Weiss
<dawid.weiss@cs.put.poznan.pl> wrote:
> Clear, I didn't think of that and had positive weights in my mind
> only. Thanks for explanation.
>
> Dawid
>
> On Fri, Mar 30, 2012 at 8:17 PM, Robert Muir <rcmuir@gmail.com> wrote:
>> I think its listed in the documentation (at least the jira issue!).
>> Now that shortestPaths() is generified to any outputs, please feel
>> free to find a more compact/faster output!!!!!!!
>>
>> Basically I experimented with tons of ideas (but the simplest ugliest
>> hack was most successful it seemed?)
>>
>> we take the cost, and encode it as Integer.MAX_VALUE - cost to turn it
>> into a "weight".
>>
>> This is bad because its huge vints, but other approaches i tried were
>> sketchier, requiring encoding of negatives and such (see the recent
>> JIRA issue about that which i tripped in the process!)
>>
>> I think i wrote 5 or 6 outputs impls trying to make this cleaner
>> before just deciding to commit the simple approach...
>>
>> On Fri, Mar 30, 2012 at 2:13 PM, Dawid Weiss
>> <dawid.weiss@cs.put.poznan.pl> wrote:
>>> I'm doing this on trunk:
>>>
>>>
>>>        WFSTCompletionLookup cl = new WFSTCompletionLookup();
>>>        TermFreq [] input = new TermFreq [] {
>>>            new TermFreq("cat",   0),
>>>            new TermFreq("chat",  2),
>>>            new TermFreq("fat",   3),
>>>            new TermFreq("feat",  1),
>>>            new TermFreq("sea",   0),
>>>            new TermFreq("seat",  3),
>>>            new TermFreq("swat",  0),
>>>            new TermFreq("sweat", 3),
>>>        };
>>>        cl.build(new TermFreqArrayIterator(input));
>>>
>>>        StringWriter sw = new StringWriter();
>>>        Field f = cl.getClass().getDeclaredField("fst");
>>>        f.setAccessible(true);
>>>        FST<?> fst = (FST<?>) f.get(cl);
>>>        Util.toDot(fst, sw, false, true);
>>>
>>>        for (TermFreq tf : input) {
>>>            List<LookupResult> lookup =
>>> cl.lookup(tf.term.utf8ToString(), true, 1);
>>>            System.out.println(lookup.get(0));
>>>        }
>>>
>>>        System.out.println(sw.toString());
>>>
>>> The output is attached. How come the first arcs have such high outputs?
>>>
>>> Dawid
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: dev-help@lucene.apache.org
>>
>>
>>
>> --
>> lucidimagination.com
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: dev-help@lucene.apache.org
>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>



-- 
lucidimagination.com

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Mime
View raw message