lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Adrian Smith" <adrian.m.sm...@gmail.com>
Subject Re: Design questions
Date Fri, 15 Feb 2008 12:02:02 GMT
Hi,

I have a similar sitaution. I also considered using $. But for the sake of
not running into (potential) problems with Tokenisers, I just defined a
string in a config file which for sure is never going to occur in a document
and will never be searched for, e.g.

dfgjkjrkruigduhfkdgjrugr

Cheers, Adrian
--
Java Software Developer
http://www.databasesandlife.com/



On 15/02/2008, Chris Hostetter <hossman_lucene@fucit.org> wrote:
>
>
> I haven't really been following this thread that closely, but...
>
> : Why not just use $$$$$$$$? Check to insure that it makes
>
> : it through whatever analyzer you choose though. For instance,
> : LetterTokenizer will remove it...
>
>
> 1) i'm 99% sure you can do something like this...
>
>   Document doc = new Document()
>   for (int i = 0; i < pages.length; i++) {
>     doc.add(new Field("text", pages[i], Field.Store.NO,
> Field.Index.TOKENIZED));
>     doc.add(new Field("text", "$$", Field.Store.NO,
> Field.Index.UN_TOKENIZED));
>   }
>
> ...and you'll get your magic token regardless of whether it would normally
> make it through your analyzer. In fact: you want it to be something your
> analyzer could never produce, even if it appears in the orriginal text, so
> you don't get false boundaries (ie: if you use an Analzeer that lowercases
> everything, then "A" makes a perfectly fine boundary token.
>
> 2) if your goal is just to be able to make sure you can query for phrases
> without crossing page boundaries, it's a lot simpler just to use are
> really big positionIncimentGap with your analyzer (and add each page as a
> seperate Field instance).  boundary tokens like these are relaly only
> neccessary if you want more complex queries (like "find X and Y on
> the same page but not in the same sentence")
>
>
>
>
> -Hoss
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message