Date: Fri, 15 Feb 2008 13:02:02 +0100
From: "Adrian Smith"
To: java-user@lucene.apache.org
Subject: Re: Design questions

Hi,

I have a similar situation. I also considered using $. But to avoid running
into (potential) problems with Tokenisers, I just defined a string in a
config file which is certain never to occur in a document and will never be
searched for, e.g.

dfgjkjrkruigduhfkdgjrugr

Cheers, Adrian

--
Java Software Developer
http://www.databasesandlife.com/

On 15/02/2008, Chris Hostetter wrote:
>
> I haven't really been following this thread that closely, but...
>
> : Why not just use $$$$$$$$? Check to ensure that it makes
> : it through whatever analyzer you choose though. For instance,
> : LetterTokenizer will remove it...
>
> 1) i'm 99% sure you can do something like this...
>
>     import org.apache.lucene.document.Document;
>     import org.apache.lucene.document.Field;
>
>     // one Document holds every page; all values share the field name "text"
>     Document doc = new Document();
>     for (int i = 0; i < pages.length; i++) {
>         // the page text goes through the analyzer as usual
>         doc.add(new Field("text", pages[i], Field.Store.NO,
>                           Field.Index.TOKENIZED));
>         // the boundary marker is indexed verbatim, bypassing the analyzer
>         doc.add(new Field("text", "$$", Field.Store.NO,
>                           Field.Index.UN_TOKENIZED));
>     }
>
> ...and you'll get your magic token regardless of whether it would normally
> make it through your analyzer. In fact: you want it to be something your
> analyzer could never produce, even if it appears in the original text, so
> you don't get false boundaries (ie: if you use an Analyzer that lowercases
> everything, then "A" makes a perfectly fine boundary token).
>
> 2) if your goal is just to be able to make sure you can query for phrases
> without crossing page boundaries, it's a lot simpler just to use a
> really big positionIncrementGap with your analyzer (and add each page as a
> separate Field instance). Boundary tokens like these are really only
> necessary if you want more complex queries (like "find X and Y on
> the same page but not in the same sentence")
>
> -Hoss
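
To make point 2 concrete, here is a minimal sketch in plain Java of why a large position gap between pages stops phrase matches from crossing a page boundary. It is a simulation only and does not call Lucene: the class PageGapSketch, its methods index and phraseMatches, the whitespace/lowercase "analyzer", and the ordered-slop rule are all assumptions made for illustration (Lucene's real sloppy phrase matching is more involved).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical simulation of token positions; not Lucene API. */
public class PageGapSketch {

    /**
     * Assigns a position to every token of every page, as if each page were
     * a separate value of one field, and skips `gap` positions between pages.
     */
    public static Map<String, List<Integer>> index(String[] pages, int gap) {
        Map<String, List<Integer>> positions = new HashMap<>();
        int next = 0; // position the next token will receive
        for (String page : pages) {
            // stand-in analyzer: lowercase, split on non-word characters
            for (String word : page.toLowerCase(Locale.ROOT).split("\\W+")) {
                if (word.isEmpty()) {
                    continue;
                }
                positions.computeIfAbsent(word, k -> new ArrayList<>()).add(next);
                next++;
            }
            next += gap; // leave unused positions before the next page
        }
        return positions;
    }

    /**
     * Simplified phrase check: true if `second` occurs after `first` with at
     * most `slop` positions in between (slop 0 means directly adjacent).
     */
    public static boolean phraseMatches(Map<String, List<Integer>> positions,
                                        String first, String second, int slop) {
        List<Integer> a = positions.getOrDefault(first, Collections.emptyList());
        List<Integer> b = positions.getOrDefault(second, Collections.emptyList());
        for (int pa : a) {
            for (int pb : b) {
                int between = pb - pa - 1;
                if (between >= 0 && between <= slop) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String[] pages = {"the quick brown fox", "jumps over the lazy dog"};
        Map<String, List<Integer>> noGap = index(pages, 0);
        Map<String, List<Integer>> bigGap = index(pages, 100);

        // "fox jumps" straddles the page break
        System.out.println("gap 0:   " + phraseMatches(noGap, "fox", "jumps", 0));
        System.out.println("gap 100: " + phraseMatches(bigGap, "fox", "jumps", 0));
        // a phrase inside one page is unaffected by the gap
        System.out.println("in-page: " + phraseMatches(bigGap, "quick", "brown", 0));
    }
}
```

In this sketch the gap only helps as long as it is larger than any slop used at query time, which is presumably why a "really big" value is suggested above.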