Return-Path: X-Original-To: apmail-lucene-dev-archive@www.apache.org Delivered-To: apmail-lucene-dev-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id D161942AD for ; Thu, 19 May 2011 10:37:21 +0000 (UTC) Received: (qmail 81270 invoked by uid 500); 19 May 2011 10:37:20 -0000 Delivered-To: apmail-lucene-dev-archive@lucene.apache.org Received: (qmail 81208 invoked by uid 500); 19 May 2011 10:37:20 -0000 Mailing-List: contact dev-help@lucene.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@lucene.apache.org Delivered-To: mailing list dev@lucene.apache.org Received: (qmail 81201 invoked by uid 99); 19 May 2011 10:37:20 -0000 Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 19 May 2011 10:37:20 +0000 X-ASF-Spam-Status: No, hits=1.8 required=5.0 tests=FREEMAIL_FROM,HTML_FONT_FACE_BAD,HTML_MESSAGE,RCVD_IN_DNSWL_LOW,RFC_ABUSE_POST,SPF_PASS,T_TO_NO_BRKTS_FREEMAIL X-Spam-Check-By: apache.org Received-SPF: pass (athena.apache.org: domain of dawid.weiss@gmail.com designates 209.85.161.176 as permitted sender) Received: from [209.85.161.176] (HELO mail-gx0-f176.google.com) (209.85.161.176) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 19 May 2011 10:37:14 +0000 Received: by gxk7 with SMTP id 7so1219970gxk.35 for ; Thu, 19 May 2011 03:36:53 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:mime-version:sender:in-reply-to:references:from :date:x-google-sender-auth:message-id:subject:to:content-type; bh=NbEhYXZjV3Dp/ufg+umT/n8JNuq8ykTkCCv6okJXdKE=; b=BOQ1FJxMunIUGS6q5SPdsJtHA/2iz8PwW66dXagdl47YjDZ3Cv+WcX/WBmB8Bg4b6F NQ9O/9135yPu+K1yF4bfOwBhv9Ivj99BZ7o4cLT3Q0mj/sBJlid2ZbzVUuhXtEzK4/fB FVfDikHM5HM1anvElxi596+OibOGw3GUvCyRk= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=mime-version:sender:in-reply-to:references:from:date :x-google-sender-auth:message-id:subject:to:content-type; b=DKjmQKOM4loBXtinu8nkRFel/htdalTxNCgEMmSoO0ixuKX4lz22is727rdlytjuY0 Lt3vNAoORIGNDydZ28xKkvghP9df1VuWmI+9R1nknimiX7hHtVjbFhkPdPnf/O3Wl9Pr vs3keeVEE7SpJ/s8r+zpQnhLaJQJbtEEmNyP4= Received: by 10.91.164.33 with SMTP id r33mr2323348ago.129.1305801413105; Thu, 19 May 2011 03:36:53 -0700 (PDT) MIME-Version: 1.0 Sender: dawid.weiss@gmail.com Received: by 10.90.29.18 with HTTP; Thu, 19 May 2011 03:36:33 -0700 (PDT) In-Reply-To: References: <1305780790222-2960030.post@n3.nabble.com> From: Dawid Weiss Date: Thu, 19 May 2011 12:36:33 +0200 X-Google-Sender-Auth: by8zIMHaidTTVAHtSJSERHO0uwU Message-ID: Subject: Re: FST and FieldCache? To: dev@lucene.apache.org Content-Type: multipart/alternative; boundary=0016e640981460194904a39e9592 --0016e640981460194904a39e9592 Content-Type: text/plain; charset=UTF-8 > I think, if we add ord as an output to the FST, then it builds > everything we need? Ie no further data structures should be needed? > Maybe I'm confused :) If you put the ord as an output the common part will be shifted towards the front of the tree. This will work if you want to look up a given value assigned to some string, but will not work if you need to look up the string from its value. The latter case can be solved if you "know" which branch to take while descending from root and the "shared prefix" alone won't give you this information. At least I don't see how it could. I am familiar with the basic prefix hashing procedure suggested by Daciuk (and other authors), but maybe some progress has been made there, I don't know... the one I know is really conceptually simple -- since each arc encodes the number of leaves (or input sequences) in the automaton, you know which path must lead you to your string. For example if you have a node like this and seek for the 12-th term: 0 -- 10 -- ... +- 10 -- ... +- 5 -- .. you look at the first path, it'd give you terms 1..10, then the next one contains terms 11..20 so you add 10 to an internal counter which is added to further computations, descend and repeat the procedure until you find a leaf node. Dawid --0016e640981460194904a39e9592 Content-Type: text/html; charset=UTF-8 Content-Transfer-Encoding: quoted-printable
> I think, if we add ord as an output to the FST, then it builds
= > everything we need? =C2=A0Ie no further data structures should be need= ed?
> Maybe I'm confused :)

If you put the ord as an outpu= t the common part will be shifted towards the front of the tree. This will = work if you want to look up a given value assigned to some string, but will= not work if you need to look up the string from its value. The latter case= can be solved if you "know" which branch to take while descendin= g from root and the "shared prefix" alone won't give you this= information. At least I don't see how it could.

I am familiar with the basic prefix hashing procedure suggested by Daci= uk (and other authors), but maybe some progress has been made there, I don&= #39;t know... the one I know is really conceptually simple -- since each ar= c encodes the number of leaves (or input sequences) in the automaton, you k= now which path must lead you to your string. For example if you have a node= like this and seek for the 12-th term:

0 -- 10 -- ...
=C2=A0 +- 10 -- ...
=C2=A0 +- 5 -- ..
<= font class=3D"Apple-style-span" face=3D"'courier new', monospace"><= br>
you look at the first path, it'd give you terms 1..10, then the ne= xt one contains terms 11..20 so you add 10 to an internal counter which is = added to further computations, descend and repeat the procedure until you f= ind a leaf node.

Dawid
--0016e640981460194904a39e9592--