Subject: Re: Numerical ids for terms?
From: Toke Eskildsen
Reply-To: te@statsbiblioteket.dk
To: "dev@lucene.apache.org"
Organization: State and University Library, Denmark
Date: Wed, 13 Apr 2011 15:42:31 +0200

On Tue, 2011-04-12 at 11:41 +0200, Gregor Heinrich wrote:
> Hi -- has there been any effort to create a numerical representation of Lucene
> indices. That is, to use the Lucene Directory backend as a large term-document
> matrix at index level. As this would require bijective mapping between terms
> (per-field, as customary in Lucene) and a numerical index (integer, monotonous
> from 0 to numTerms()-1), I guess this requires some special modifications
> to the Lucene core.

Maybe you're thinking about something like TermsEnum?
https://hudson.apache.org/hudson/job/Lucene-trunk/javadoc/all/org/apache/lucene/index/TermsEnum.html

It provides ordinal access to terms, with the ordinals represented as longs.
These ordinals are per segment. To get access at index level rather than
segment level, you will have to merge the ordinals from the different
segments yourself.

Unfortunately, support for ordinal-based term access is optional for a codec,
and the default codec does not provide it, so you will have to explicitly
select a codec that does when you build your index.
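
To illustrate the merge of per-segment ordinals into index-level ordinals,
here is a minimal sketch in plain Java. It deliberately does not use the
Lucene API: each segment is modelled as a sorted list of its terms, where a
term's position in the list is its segment-level ordinal. The class name
GlobalOrdinals and its fields are hypothetical, and String's natural order
stands in for whatever term order the index actually uses. It also builds
the union of terms in memory with a TreeMap; a real implementation would
more likely stream a k-way merge over the segment enumerators.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Lucene.
final class GlobalOrdinals {
    // Sorted union of all segment terms; list position = index-level ordinal.
    final List<String> globalTerms;
    // localToGlobal[segment][segmentOrdinal] = index-level ordinal.
    final long[][] localToGlobal;

    private GlobalOrdinals(List<String> globalTerms, long[][] localToGlobal) {
        this.globalTerms = globalTerms;
        this.localToGlobal = localToGlobal;
    }

    // Each inner list is assumed to be sorted and free of duplicates,
    // mirroring the per-segment term dictionary.
    static GlobalOrdinals merge(List<List<String>> segments) {
        // Collect the sorted union of terms; values are filled in below.
        TreeMap<String, Long> ordByTerm = new TreeMap<>();
        for (List<String> segment : segments) {
            for (String term : segment) {
                ordByTerm.put(term, 0L);
            }
        }

        // Assign index-level ordinals 0..numTerms-1 in sorted order.
        List<String> globalTerms = new ArrayList<>(ordByTerm.size());
        for (Map.Entry<String, Long> entry : ordByTerm.entrySet()) {
            entry.setValue((long) globalTerms.size());
            globalTerms.add(entry.getKey());
        }

        // Build the segment-ordinal to index-ordinal lookup tables.
        long[][] localToGlobal = new long[segments.size()][];
        for (int s = 0; s < segments.size(); s++) {
            List<String> segment = segments.get(s);
            localToGlobal[s] = new long[segment.size()];
            for (int ord = 0; ord < segment.size(); ord++) {
                localToGlobal[s][ord] = ordByTerm.get(segment.get(ord));
            }
        }
        return new GlobalOrdinals(globalTerms, localToGlobal);
    }

    public static void main(String[] args) {
        GlobalOrdinals g = merge(Arrays.asList(
                Arrays.asList("apple", "cherry"),
                Arrays.asList("banana", "cherry", "date")));
        System.out.println(g.globalTerms);
        System.out.println(Arrays.toString(g.localToGlobal[0]));
        System.out.println(Arrays.toString(g.localToGlobal[1]));
    }
}
```

The resulting mapping is bijective between terms and the range
0..numTerms-1, which is what the term-document matrix use case asks for. The
lookup tables have to be rebuilt whenever the set of segments changes.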