Mailing-List: contact java-user-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: java-user@lucene.apache.org
Message-ID: <efe487cc0802150402m6e035b0o762f3dc04e3eff79@mail.gmail.com>
Date: Fri, 15 Feb 2008 13:02:02 +0100
From: "Adrian Smith" <adrian.m.smith@gmail.com>
To: java-user@lucene.apache.org
Subject: Re: Design questions
In-Reply-To: <Pine.LNX.4.62.0802142256530.25935@radix.cryptio.net>
MIME-Version: 1.0
Content-Type: multipart/alternative;
	boundary="----=_Part_4948_9905554.1203076922109"
References: <20080109213932.229980@gmx.net>
	 <359a92830801091449x5f11d031v4b20cfef23f1d579@mail.gmail.com>
	 <FA7F479DF26B4AC2B32EF68EE9311B18@msrvcn04>
	 <359a92830802141601t1cd4d753i82f1ea13813d191@mail.gmail.com>
	 <Pine.LNX.4.62.0802142256530.25935@radix.cryptio.net>

------=_Part_4948_9905554.1203076922109
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
Content-Disposition: inline

Hi,

I have a similar situation. I also considered using $. But to avoid
(potential) problems with Tokenisers, I just defined a string in a config
file that is certain never to occur in a document and will never be
searched for, e.g.

dfgjkjrkruigduhfkdgjrugr
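
Roughly like this, as a plain-Java sketch rather than my actual code. The
property key "page.boundary.token" and the class and method names are made
up for illustration, and it assumes the config file is in
java.util.Properties format. It only shows reading the string and refusing
pages that already contain it; whether the string survives your analyzer
unchanged still has to be checked separately.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.List;
import java.util.Properties;

final class PageBoundary {
    // Hypothetical property key; the real config file can use any name.
    static final String KEY = "page.boundary.token";

    /** Reads the boundary string from config text in Properties format. */
    static String loadToken(String configText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(configText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        String token = props.getProperty(KEY);
        if (token == null || token.trim().isEmpty()) {
            throw new IllegalStateException("missing property: " + KEY);
        }
        return token.trim();
    }

    /** Joins pages with the token, refusing input that already contains it. */
    static String joinPages(List<String> pages, String token) {
        for (String page : pages) {
            if (page.contains(token)) {
                throw new IllegalArgumentException(
                        "page already contains the boundary token");
            }
        }
        return String.join(" " + token + " ", pages);
    }
}
```

The containment check is there so that a collision with real document text
fails loudly at indexing time instead of silently creating a false page
boundary.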

Cheers, Adrian
--
Java Software Developer
http://www.databasesandlife.com/


On 15/02/2008, Chris Hostetter <hossman_lucene@fucit.org> wrote:
>
>
> I haven't really been following this thread that closely, but...
>
> : Why not just use $$$$$$$$? Check to ensure that it makes
> : it through whatever analyzer you choose though. For instance,
> : LetterTokenizer will remove it...
>
>
> 1) I'm 99% sure you can do something like this...
>
>   Document doc = new Document();
>   for (int i = 0; i < pages.length; i++) {
>     // page text goes through the analyzer as usual
>     doc.add(new Field("text", pages[i],
>                       Field.Store.NO, Field.Index.TOKENIZED));
>     // boundary token is indexed verbatim, bypassing the analyzer
>     doc.add(new Field("text", "$$",
>                       Field.Store.NO, Field.Index.UN_TOKENIZED));
>   }
>
> ...and you'll get your magic token regardless of whether it would normally
> make it through your analyzer. In fact, you want it to be something your
> analyzer could never produce, even if it appears in the original text, so
> you don't get false boundaries (i.e. if you use an Analyzer that lowercases
> everything, then "A" makes a perfectly fine boundary token).
>
> 2) If your goal is just to make sure you can query for phrases
> without crossing page boundaries, it's a lot simpler just to use a
> really big positionIncrementGap with your analyzer (and add each page as a
> separate Field instance). Boundary tokens like these are really only
> necessary if you want more complex queries (like "find X and Y on
> the same page but not in the same sentence").
>
>
>
>
> -Hoss
>
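
PS: on the second point above, here is a toy model of why a large position
gap keeps phrases from matching across pages. It is plain Java and not
Lucene code: the class and method names are invented, tokenising is a naive
whitespace split, and only exact phrases (no slop) are handled. In Lucene
itself the gap would, as far as I know, come from overriding
Analyzer.getPositionIncrementGap(String fieldName).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class PositionGapDemo {
    /** Maps each term to its positions, skipping `gap` positions between pages. */
    static Map<String, List<Integer>> index(List<String> pages, int gap) {
        Map<String, List<Integer>> positions = new HashMap<>();
        int pos = 0;
        for (String page : pages) {
            for (String term : page.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
                if (term.isEmpty()) {
                    continue;
                }
                positions.computeIfAbsent(term, k -> new ArrayList<>()).add(pos);
                pos++;
            }
            pos += gap; // the "position increment gap" between two pages
        }
        return positions;
    }

    /** True if the terms occur at consecutive positions (exact phrase). */
    static boolean phraseMatches(Map<String, List<Integer>> positions, String... terms) {
        if (terms.length == 0) {
            return false;
        }
        List<Integer> starts = positions.get(terms[0].toLowerCase(Locale.ROOT));
        if (starts == null) {
            return false;
        }
        for (int start : starts) {
            boolean ok = true;
            for (int i = 1; i < terms.length && ok; i++) {
                List<Integer> p = positions.get(terms[i].toLowerCase(Locale.ROOT));
                ok = p != null && p.contains(start + i);
            }
            if (ok) {
                return true;
            }
        }
        return false;
    }
}
```

With the pages "the quick brown" and "fox jumps", the phrase "brown fox"
matches when the gap is 0 and stops matching once the gap is positive,
while "quick brown" matches either way. For sloppy phrase queries the gap
would need to be larger than the biggest slop in use.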

------=_Part_4948_9905554.1203076922109--