Mailing-List: contact dev-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@lucene.apache.org
Received-SPF: pass (athena.apache.org: domain of dawid.weiss@gmail.com
 designates 209.85.161.176 as permitted sender)
DomainKey-Signature: a=rsa-sha1; c=nofws;
        d=gmail.com; s=gamma;
        h=mime-version:sender:in-reply-to:references:from:date
         :x-google-sender-auth:message-id:subject:to:content-type;
        b=DKjmQKOM4loBXtinu8nkRFel/htdalTxNCgEMmSoO0ixuKX4lz22is727rdlytjuY0
         Lt3vNAoORIGNDydZ28xKkvghP9df1VuWmI+9R1nknimiX7hHtVjbFhkPdPnf/O3Wl9Pr
         vs3keeVEE7SpJ/s8r+zpQnhLaJQJbtEEmNyP4=
MIME-Version: 1.0
Sender: dawid.weiss@gmail.com
In-Reply-To: <BANLkTimu72Sh2kjLUqgtjbhxtKPFqvsUOw@mail.gmail.com>
References: <1305780790222-2960030.post@n3.nabble.com>
 <BANLkTi=gERmisv_Pf4j0z55j8b2_erX6-A@mail.gmail.com>
 <BANLkTikU1Uf+urMfG4b6ZBBgtzwkOV0rPw@mail.gmail.com>
 <BANLkTimtV=ZRONdq0XKby1doAGWaJRCDdw@mail.gmail.com>
 <BANLkTim734V7gKDOyiKME2=Pwxht=2tcMQ@mail.gmail.com>
 <BANLkTimu72Sh2kjLUqgtjbhxtKPFqvsUOw@mail.gmail.com>
From: Dawid Weiss <dawid.weiss@cs.put.poznan.pl>
Date: Thu, 19 May 2011 12:36:33 +0200
Message-ID: <BANLkTimp6SCCX-E=JYKFhM2uUFAhh8TJWA@mail.gmail.com>
Subject: Re: FST and FieldCache?
To: dev@lucene.apache.org
Content-Type: multipart/alternative; boundary=0016e640981460194904a39e9592

--0016e640981460194904a39e9592
Content-Type: text/plain; charset=UTF-8

> I think, if we add ord as an output to the FST, then it builds
> everything we need?  Ie no further data structures should be needed?
> Maybe I'm confused :)

If you put the ord as an output, the common part of the outputs will be
shifted towards the front of the tree (the root side). This works if you want
to look up the value assigned to a given string, but it does not work if you
need to look up the string from its value. The latter case is solvable only if
you "know" which branch to take while descending from the root, and the
"shared prefix" alone won't give you this information. At least I don't see
how it could.

I am familiar with the basic prefix hashing procedure suggested by Daciuk
(and other authors); maybe some progress has been made there since, I don't
know. The one I know is conceptually simple: each arc stores the number of
leaves (or input sequences) reachable through it, so you know which path must
lead you to your string. For example, if you have a node like this and seek
the 12th term:

0 -- 10 -- ...
  +- 10 -- ...
  +- 5 -- ...

You look at the first arc: it covers terms 1..10. The next one covers terms
11..20, so that is the one to take. You add 10 (the terms skipped on the first
arc) to an internal counter that carries over into further computations,
descend, and repeat the procedure until you reach a leaf node.
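
A minimal sketch of that descent, under some loud assumptions: it uses a plain
(unminimized) trie rather than a minimal automaton or Lucene's FST, it stores
the count on the target node instead of on the arc, and OrdinalTrie, add() and
select() are hypothetical names made up for illustration. It also subtracts the
skipped counts from the ordinal being sought rather than adding them to a
separate counter, which should be equivalent bookkeeping.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical illustration class; not part of Lucene. */
final class OrdinalTrie {
    private static final class Node {
        final TreeMap<Character, Node> arcs = new TreeMap<>(); // sorted by label
        boolean isFinal; // true if a term ends at this node
        int count;       // number of terms ending at or below this node
    }

    private final Node root = new Node();

    /** Adds a term; duplicates are ignored so the counts stay exact. */
    void add(String term) {
        if (contains(term)) return;
        Node node = root;
        node.count++;
        for (char c : term.toCharArray()) {
            node = node.arcs.computeIfAbsent(c, k -> new Node());
            node.count++;
        }
        node.isFinal = true;
    }

    boolean contains(String term) {
        Node node = root;
        for (char c : term.toCharArray()) {
            node = node.arcs.get(c);
            if (node == null) return false;
        }
        return node.isFinal;
    }

    /** Returns the ord-th term in sorted order (1-based). */
    String select(int ord) {
        if (ord < 1 || ord > root.count) {
            throw new IndexOutOfBoundsException("ord=" + ord);
        }
        StringBuilder term = new StringBuilder();
        Node node = root;
        int remaining = ord; // ordinal still to be located below this node
        while (true) {
            if (node.isFinal) {
                if (remaining == 1) return term.toString();
                remaining--; // the term ending here sorts before its extensions
            }
            for (Map.Entry<Character, Node> arc : node.arcs.entrySet()) {
                int below = arc.getValue().count;
                if (remaining <= below) { // the sought term lies behind this arc
                    term.append(arc.getKey());
                    node = arc.getValue();
                    break;
                }
                remaining -= below; // skip every term behind this arc
            }
        }
    }

    public static void main(String[] args) {
        OrdinalTrie trie = new OrdinalTrie();
        for (String s : new String[] {"door", "cat", "car", "dog", "card"}) {
            trie.add(s);
        }
        // Sorted order is: car, card, cat, dog, door
        System.out.println(trie.select(4)); // prints "dog"
    }
}
```

In a minimized automaton the same idea would need the counts attached to arcs
(or to states, counting the right language of each state), because a shared
suffix state is reached from many different prefixes.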

Dawid

--0016e640981460194904a39e9592--