lucene-java-user mailing list archives

From Erick Erickson <erickerick...@gmail.com>
Subject Re: content disappears in the index
Date Thu, 15 Nov 2012 16:33:52 GMT
Oddly enough, I had the exact same thought. Although it's not obvious from the
name (and common usage) of trim-like functions that you'd also get a way to
specify a maximum length (applied after trimming, I'd assume).

And the other thought I had was that TrimFilter should optionally take a
list of characters to trim. Then I thought of regex, especially to specify
character classes like \w..... naaahhhhhh, we just went there......

But I think I'd prefer a separate filter, if for no other reason than that by
including a length option in the trim filter, you implicitly disallow having
spaces at the beginning or end of your tokens. I don't have a use-case for
why you'd want that, but there's no good reason I can think of to couple
these two different functions....
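
If it helps, here's a rough, untested sketch of what a factory for Geoff's
TruncatingFilter (quoted below) might look like, so you could declare it in
schema.xml. The class name, the package, and the maxLength attribute are just
placeholders, and I'm going from memory on the Lucene/Solr 4.x factory API
(init(Map) plus create(TokenStream)):

import java.util.Map;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.util.TokenFilterFactory;

public class TruncatingFilterFactory extends TokenFilterFactory {
    private int maxLength;

    @Override
    public void init(Map<String, String> args) {
        super.init(args);
        // Read maxLength="..." off the <filter/> element in schema.xml.
        String v = args.get("maxLength");
        maxLength = (v == null) ? Integer.MAX_VALUE : Integer.parseInt(v);
    }

    @Override
    public TokenStream create(TokenStream input) {
        return new TruncatingFilter(input, maxLength);
    }
}

and then in your fieldType's analyzer chain:

<filter class="com.example.TruncatingFilterFactory" maxLength="256"/>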

FWIW,
Erick


On Wed, Nov 14, 2012 at 2:05 AM, Bernd Fehling <bernd.fehling@uni-bielefeld.de> wrote:

> Hi Geoff,
> cool, that will eliminate possible regex pitfalls in schema.xml
>
> I was thinking about enhancing an existing filter into a multi-purpose
> filter.
> E.g. TrimFilter: if maxLength is set, then also limit the termAtt to
> maxLength.
> This would keep the number of available filters small, especially for
> simple tasks.
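>
> A rough sketch of what I mean (untested; TrimFilter's existing trim logic
> stays as it is, this would just run at the end of incrementToken()):
>
>     // after trimming leading/trailing whitespace:
>     if (maxLength > 0 && termAtt.length() > maxLength) {
>         termAtt.setLength(maxLength);
>     }
>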
> Any thoughts from the core developers about this idea?
>
> Regards
> Bernd
>
>
> On 13.11.2012 17:56, Geoff Cooney wrote:
> > Hi,
> >
> > I've been following this thread and happen to have a simple
> > TruncatingFilter class I wrote for the same purpose.  I think this should
> > do what you want:
> >
> > import java.io.IOException;
> >
> > import org.apache.lucene.analysis.TokenFilter;
> > import org.apache.lucene.analysis.TokenStream;
> > import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
> >
> > /** Truncates each token to at most maxLength characters. */
> > public class TruncatingFilter extends TokenFilter {
> >     private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
> >     private final int maxLength;
> >
> >     public TruncatingFilter(TokenStream input, int maxLength) {
> >         super(input);
> >         this.maxLength = maxLength;
> >     }
> >
> >     @Override
> >     public boolean incrementToken() throws IOException {
> >         if (input.incrementToken()) {
> >             // Shorten the term in place; offsets are left untouched.
> >             if (termAtt.length() > maxLength) {
> >                 termAtt.setLength(maxLength);
> >             }
> >             return true;
> >         } else {
> >             return false;
> >         }
> >     }
> > }
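> >
> > For example, to hang it off a tokenizer (again, untested as written here;
> > WhitespaceTokenizer and Version are stock Lucene 4.x, and reader is
> > whatever Reader you're analyzing):
> >
> >     TokenStream ts = new TruncatingFilter(
> >             new WhitespaceTokenizer(Version.LUCENE_40, reader), 50);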
> >
> > Cheers,
> > Geoff
> >
> >
> > On Tue, Nov 13, 2012 at 7:54 AM, Erick Erickson <erickerickson@gmail.com> wrote:
> >
> >> There's nothing in Solr that I know of that does this. It would be a
> >> pretty easy custom filter to create, though....
> >>
> >> FWIW,
> >> Erick
> >>
> >>
> >> On Tue, Nov 13, 2012 at 7:02 AM, Robert Muir <rcmuir@gmail.com> wrote:
> >>
> >>> On Mon, Nov 12, 2012 at 10:47 PM, Bernd Fehling
> >>> <bernd.fehling@uni-bielefeld.de> wrote:
> >>>> By the way, why does TrimFilter's updateOffset option default to false,
> >>>> just to keep it backwards compatible?
> >>>>
> >>>
> >>> In my opinion this option should be removed.
> >>>
> >>> TokenFilters shouldn't muck with offsets, for a lot of reasons, but
> >>> especially because it's too late to interact with any CharFilter.
> >>>
> >>> This is the tokenizer's job.
> >>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>
