lucene-solr-user mailing list archives

From Emir Arnautović <emir.arnauto...@sematext.com>
Subject Re: Filter Factory question
Date Wed, 27 Sep 2017 08:52:06 GMT
Hi Homer,
There is no need for a special filter. There is an existing one that, for some reason, is not
part of the documentation (I will ask why, so follow that thread if you decide to go this way).
You can use something like:
<filter class="solr.PatternCaptureGroupFilterFactory" pattern="([A-Z][a-z]?\d+)"
preserve_original="true" />

This will capture each atom count as a separate token.
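
To see which substrings that pattern picks out, here is a minimal sketch that uses plain java.util.regex rather than Lucene. The class name AtomCountDemo and the method atomCounts are made up for illustration, and this is not how the filter itself is implemented; it only mimics the capture step on a single token.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of Lucene or Solr.
final class AtomCountDemo {
    // Same pattern as in the filter definition: an element symbol
    // (one uppercase letter, optional lowercase letter) followed by a count.
    private static final Pattern ATOM_COUNT = Pattern.compile("([A-Z][a-z]?\\d+)");

    // Returns every substring of the formula captured by group 1, in order.
    static List<String> atomCounts(String formula) {
        List<String> result = new ArrayList<>();
        Matcher m = ATOM_COUNT.matcher(formula);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(atomCounts("C2H6O1")); // prints [C2, H6, O1]
        System.out.println(atomCounts("C2H6O"));  // prints [C2, H6]
    }
}
```

Note that the unnormalized formula C2H6O yields only C2 and H6, because the bare O has no digit after it. The normalization step therefore still has to run before this filter.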

HTH,
Emir

> On 26 Sep 2017, at 23:14, Webster Homer <webster.homer@sial.com> wrote:
> 
> I am trying to create a filter that normalizes an input token, but also
> splits it into multiple pieces, sort of like what the WordDelimiterFilter
> does.
> 
> It's meant to take a molecular formula like C2H6O and normalize it to C2H6O1
> 
> That part works. However I was also going to have it put out the individual
> atom counts as tokens.
> C2H6O1
> C2
> H6
> O1
> 
> When I enable this feature in the factory, I don't get any output at all.
> 
> I looked over a couple of filters that do what I want and it's not entirely
> clear what they're doing. So I have some questions:
> Looking at ShingleFilter and WordDelimiterFilter,
> they both set several attributes:
> CharTermAttribute: Seems to be the actual terms being set. Seemed
> straightforward, works fine when I only have one term to add.
> 
> PositionIncrementAttribute: What does this do? It appears that
> WordDelimiterFilter sets this to 0 most of the time. This has decent
> documentation.
> 
> OffsetAttribute: I think that this tracks offsets for each term being
> processed. Not really sure though. The documentation mentions tokens. So if
> I have multiple variations for a token, is this set for each variation?
> 
> TypeAttribute: default is "word". Don't know what this is for.
> 
> PositionLengthAttribute: WordDelimiterFilter doesn't use this but Shingle
> does. It defaults to 1. What is it good for, and when should I use it?
> 
> Here is my incrementToken method.
> 
>    @Override
>    public boolean incrementToken() throws IOException {
>        while (true) {
>            if (!hasSavedState) {
>                if (!input.incrementToken()) {
>                    return false;
>                }
>                if (!generateFragments) { // This part works fine!
>                    String normalizedFormula = molFormula.normalize(
>                            new String(termAttribute.buffer()));
>                    char[] newBuffer = normalizedFormula.toCharArray();
>                    termAttribute.setEmpty();
>                    termAttribute.copyBuffer(newBuffer, 0, newBuffer.length);
>                    return true;
>                }
>                formulas = molFormula.normalizeToList(
>                        new String(termAttribute.buffer()));
>                iterator = formulas.listIterator();
>                savedPositionIncrement += posIncAttribute.getPositionIncrement();
>                hasSavedState = true;
>                first = true;
>                saveState();
>            }
>            if (!iterator.hasNext()) {
>                posIncAttribute.setPositionIncrement(savedPositionIncrement);
>                savedPositionIncrement = 0;
>                hasSavedState = false;
>                continue;
>            }
>            String formula = iterator.next();
>            int startOffset = savedStartOffset;
>
>            if (first) {
>                termAttribute.setEmpty();
>            }
>            int endOffset = savedStartOffset + formula.length();
>            System.out.printf("Writing formula %s %d to %d%n", formula,
>                    startOffset, endOffset);
>            termAttribute.append(formula);
>            offsetAttribute.setOffset(startOffset, endOffset);
>            savedStartOffset = endOffset + 1;
>            if (first) {
>                posIncAttribute.setPositionIncrement(0);
>            } else {
>                first = false;
>                posIncAttribute.setPositionIncrement(0);
>            }
>            typeAttribute.setType(savedType);
>            return true;
>        }
>    }
> 

