lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Greg Shackles" <gshack...@gmail.com>
Subject Re: Looking for a way to customize how StandardAnalyzer handles punctuation
Date Thu, 11 Dec 2008 14:30:17 GMT
I still need all the normal benefits of the StandardAnalyzer as far as
punctuation and everything else goes, with just this one special exception.
Since I was on a limited schedule I ended up just doing the method where I
escape these cases myself in a way that makes them get tokenized.  Certainly
not the best solution but it works as far as I can tell.  If hacking the
StandardAnalyzer grammar is somewhat straightforward then I will try and
looking into doing it that way since I prefer to do things the right way if
possible : )  Thanks Grant!

- Greg

On Wed, Dec 10, 2008 at 8:52 PM, Grant Ingersoll <gsingers@apache.org>wrote:

> Let's take a quick step back and see if it helps.  Why do you feel you need
> the StandardAnalyzer to solve your problem?  What else are you gaining from
> it?  Would you be better served by a WhitespaceTokenizer?
>
> That being said, hacking up the grammar isn't as bad as you might think.
>  There are actually two examples of the "grammar" in Lucene, one is the
> StdTokenizer and the other is the WikipediaTokenizer.  They are similar, but
> maybe by looking at two examples it might also help.
>
>
>
> On Dec 9, 2008, at 10:14 AM, Greg Shackles wrote:
>
>  Hey everyone,
>>
>> I'm running into a problem where some punctuation that I would actually
>> want
>> to keep gets thrown out because they don't get tokenized.  By far the most
>> common case for this is ampersand, but it does happen with others as well.
>> My concern isn't even so much in that I need to be able to enforce that
>> punctuation in the search, but more that I need to know it was there when
>> I
>> get the results.  I am attaching important word data to the payload of
>> each
>> token, so if a "word" was just an ampersand, it disappears.  I took a
>> quick
>> look at the StandardAnalyzer classes and it looks like it would be a pain
>> to
>> try and modify that directly (I don't have much experience in
>> grammar/parsers).  A couple options come to mind, but I wanted to make
>> sure
>> there wasn't a better, more elegant solution before I did something that
>> felt a little hacky:
>>
>> 1) Add a couple fields to the payload saying whether the previous/next
>> word
>> is a single punctuation mark, and which it is.  Then the search can insert
>> the punctuation in the results.  The downside to this would be losing the
>> metadata that would have gone into the payload for that punctuation mark.
>>
>> 2) Do some sort of string replacement logic during indexing and searching
>> to
>> change it into something that will get made into a token, but should not
>> appear naturally on its own in the text.  I usually shy away from
>> solutions
>> like this, but sometimes they prove useful.
>>
>> Has anyone done anything like this?  I don't want to lose most of
>> StandardAnalyzer's punctuation logic, but mainly I want to tokenize
>> punctuation if it appears by itself (surrounded by whitespace).  Thanks!
>>
>> - Greg
>>
>
> --------------------------
> Grant Ingersoll
>
> Lucene Helpful Hints:
> http://wiki.apache.org/lucene-java/BasicsOfPerformance
> http://wiki.apache.org/lucene-java/LuceneFAQ
>
>
>
>
>
>
>
>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message