lucene-java-user mailing list archives

From Chris Hostetter <hossman_luc...@fucit.org>
Subject Re: Design questions
Date Fri, 15 Feb 2008 07:03:22 GMT

I haven't really been following this thread that closely, but...

: Why not just use $$$$$$$$? Check to insure that it makes
: it through whatever analyzer you choose though. For instance,
: LetterTokenizer will remove it...

1) I'm 99% sure you can do something like this...

  Document doc = new Document();
  for (int i = 0; i < pages.length; i++) {
    // run the page text through the normal analyzer
    doc.add(new Field("text", pages[i], Field.Store.NO, Field.Index.TOKENIZED));
    // add the boundary marker verbatim, bypassing the analyzer
    doc.add(new Field("text", "$$", Field.Store.NO, Field.Index.UN_TOKENIZED));
  }

...and you'll get your magic token regardless of whether it would normally 
make it through your analyzer. In fact: you want it to be something your 
analyzer could never produce, even if it appears in the original text, so 
you don't get false boundaries (ie: if you use an Analyzer that lowercases 
everything, then "A" makes a perfectly fine boundary token).
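
As an aside, once that boundary token is in the index, span queries can 
use it to reject matches that cross a page break. A rough sketch against 
the span API of the same vintage; the field name, terms, and slop here 
are just placeholders...

  import org.apache.lucene.index.Term;
  import org.apache.lucene.search.spans.SpanNearQuery;
  import org.apache.lucene.search.spans.SpanNotQuery;
  import org.apache.lucene.search.spans.SpanQuery;
  import org.apache.lucene.search.spans.SpanTermQuery;

  SpanQuery x = new SpanTermQuery(new Term("text", "apache"));
  SpanQuery y = new SpanTermQuery(new Term("text", "lucene"));
  // x and y within 50 positions of each other, in either order
  SpanQuery near = new SpanNearQuery(new SpanQuery[] { x, y }, 50, false);
  // ...minus any match whose span also covers the "$$" boundary token
  SpanQuery samePage = new SpanNotQuery(
      near, new SpanTermQuery(new Term("text", "$$")));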

2) If your goal is just to be able to make sure you can query for phrases 
without crossing page boundaries, it's a lot simpler just to use a 
really big positionIncrementGap with your analyzer (and add each page as a 
separate Field instance).  Boundary tokens like these are really only 
necessary if you want more complex queries (like "find X and Y on 
the same page but not in the same sentence").
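
For example, something like this (a minimal sketch assuming the 2.x-era 
Analyzer API; wrap whatever analyzer you're already using)...

  import java.io.Reader;
  import org.apache.lucene.analysis.Analyzer;
  import org.apache.lucene.analysis.TokenStream;
  import org.apache.lucene.analysis.standard.StandardAnalyzer;

  final Analyzer delegate = new StandardAnalyzer();
  Analyzer pageAnalyzer = new Analyzer() {
    public TokenStream tokenStream(String fieldName, Reader reader) {
      return delegate.tokenStream(fieldName, reader);
    }
    // leave a huge hole in the position sequence between successive
    // Field instances, so no phrase/span query with a sane slop can
    // match across a page boundary
    public int getPositionIncrementGap(String fieldName) {
      return 10000;
    }
  };

  Document doc = new Document();
  for (int i = 0; i < pages.length; i++) {
    doc.add(new Field("text", pages[i], Field.Store.NO, Field.Index.TOKENIZED));
  }

...pass pageAnalyzer to your IndexWriter and every page winds up in the 
same "text" field, just with a big gap in positions between pages.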




-Hoss



