lucene-solr-user mailing list archives

From Emir Arnautović <emir.arnauto...@sematext.com>
Subject Re: Filter Factory question
Date Fri, 29 Sep 2017 07:33:55 GMT
It is still on master: https://github.com/apache/lucene-solr/blob/master/lucene/analysis/common/src/java/org/apache/lucene/analysis/pattern/PatternCaptureGroupTokenFilter.java

Emir

> On 28 Sep 2017, at 17:32, Erick Erickson <erickerickson@gmail.com> wrote:
> 
> PatternCaptureGroupTokenFilter has been around since 2013 (at least
> that's the earliest revision in Git). I located it even in 5x so it
> should be there in
> ...lucene/analysis/common/src/java/org/apache/lucene/analysis/pattern
> 
> Best,
> Erick
> 
> On Thu, Sep 28, 2017 at 7:45 AM, Webster Homer <webster.homer@sial.com> wrote:
>> It's still buggy, so not ready to share.
>> 
>> I keep a copy of Solr source which I use for this type of development. I
>> don't see PatternCaptureGroupTokenFilterFactory in the Solr 6.2 code base
>> at all. I was thinking of seeing how it treated the positions etc...
>> 
>> My code now looks reasonable in the Analysis tool, but doesn't seem to
>> create searchable Lucene data. I've changed it considerably since my first
>> post, so I now see output in the tool, which was an improvement.
>> 
>> 
>> On Wed, Sep 27, 2017 at 10:30 AM, Stefan Matheis <matheis.stefan@gmail.com>
>> wrote:
>> 
>>>> In any case I figured out my problem. I was over thinking it.
>>> 
>>> Mind to share?
>>> 
>>> -Stefan
>>> 
>>> On Sep 27, 2017 4:34 PM, "Webster Homer" <webster.homer@sial.com> wrote:
>>> 
>>>> There is a need for a special filter since the input has to be normalized.
>>>> That is the main requirement, splitting into pieces is optional. As far as
>>>> I know there is nothing in Solr that knows about molecular formulas.
>>>> 
>>>> In any case I figured out my problem. I was over thinking it.
>>>> 
>>>> On Wed, Sep 27, 2017 at 3:52 AM, Emir Arnautović <
>>>> emir.arnautovic@sematext.com> wrote:
>>>> 
>>>>> Hi Homer,
>>>>> There is no need for a special filter. There is one that is for some reason
>>>>> not part of the documentation (I will ask why, so follow that thread if you
>>>>> decide to go this way). You can use something like:
>>>>> <filter class="solr.PatternCaptureGroupTokenFilterFactory"
>>>>> pattern="([A-Z][a-z]?\d+)" preserveOriginal="true" />
>>>>> 
>>>>> This will capture all atom counts as separate tokens.
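
To see what that pattern matches, the same regular expression can be run with plain java.util.regex. This is only a sketch of the regex behaviour, not the Lucene filter: the class and method names (AtomCountCapture, capture) are made up for illustration, and emitting the input first merely stands in for preserveOriginal.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AtomCountCapture {
    // Same expression as in the filter definition above: an element symbol
    // (one uppercase letter, optional lowercase letter) followed by digits.
    private static final Pattern ATOM_COUNT = Pattern.compile("([A-Z][a-z]?\\d+)");

    // Returns the input followed by every match of the capture group.
    static List<String> capture(String formula) {
        List<String> tokens = new ArrayList<>();
        tokens.add(formula); // illustrative stand-in for preserveOriginal="true"
        Matcher m = ATOM_COUNT.matcher(formula);
        while (m.find()) {
            tokens.add(m.group(1));
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(capture("C2H6O1")); // prints [C2H6O1, C2, H6, O1]
        System.out.println(capture("C2H6O"));  // prints [C2H6O, C2, H6]
    }
}
```

The second call shows that an atom with an implicit count of 1 carries no digits and is not matched, so the pattern alone only yields every atom count once the formula has been normalized.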
>>>>> 
>>>>> HTH,
>>>>> Emir
>>>>> 
>>>>>> On 26 Sep 2017, at 23:14, Webster Homer <webster.homer@sial.com> wrote:
>>>>>> 
>>>>>> I am trying to create a filter that normalizes an input token, but also
>>>>>> splits it into multiple pieces. Sort of like what the WordDelimiterFilter
>>>>>> does.
>>>>>> 
>>>>>> It's meant to take a molecular formula like C2H6O and normalize it to
>>>>>> C2H6O1
>>>>>> 
>>>>>> That part works. However I was also going to have it put out the individual
>>>>>> atom counts as tokens.
>>>>>> C2H6O1
>>>>>> C2
>>>>>> H6
>>>>>> O1
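
The normalize-then-split step described above can be sketched in plain Java. This is a hypothetical stand-in, not the poster's molFormula class: the name FormulaNormalizer and its regex are assumptions, and it deliberately ignores parentheses, charges, hydrates, and repeated elements.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FormulaNormalizer {
    // Element symbol followed by an optional count.
    private static final Pattern ATOM = Pattern.compile("([A-Z][a-z]?)(\\d*)");

    // Returns the normalized formula first, then one fragment per atom.
    static List<String> normalizeToList(String formula) {
        List<String> fragments = new ArrayList<>();
        StringBuilder whole = new StringBuilder();
        Matcher m = ATOM.matcher(formula);
        while (m.find()) {
            // A missing count means exactly one atom of that element.
            String count = m.group(2).isEmpty() ? "1" : m.group(2);
            String fragment = m.group(1) + count;
            fragments.add(fragment);
            whole.append(fragment);
        }
        List<String> result = new ArrayList<>();
        result.add(whole.toString());
        result.addAll(fragments);
        return result;
    }

    public static void main(String[] args) {
        System.out.println(normalizeToList("C2H6O")); // prints [C2H6O1, C2, H6, O1]
    }
}
```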
>>>>>> 
>>>>>> When I enable this feature in the factory, I don't get any output at all.
>>>>>> 
>>>>>> I looked over a couple of filters that do what I want and it's not entirely
>>>>>> clear what they're doing. So I have some questions:
>>>>>> Looking at ShingleFilter and WordDelimiterFilter
>>>>>> They both set several attributes:
>>>>>> CharTermAttribute: Seems to be the actual terms being set. Seemed
>>>>>> straightforward, works fine when I only have one term to add.
>>>>>> 
>>>>>> PositionIncrementAttribute: What does this do? It appears that
>>>>>> WordDelimiterFilter sets this to 0 most of the time. This has decent
>>>>>> documentation.
>>>>>> 
>>>>>> OffsetAttribute: I think that this tracks offsets for each term being
>>>>>> processed. Not really sure though. The documentation mentions tokens. So if
>>>>>> I have multiple variations for a token is this for each variation?
>>>>>> 
>>>>>> TypeAttribute: default is "word". Don't know what this is for.
>>>>>> 
>>>>>> PositionLengthAttribute: WordDelimiterFilter doesn't use this but Shingle
>>>>>> does. It defaults to 1. What's it good for, and when should I use it?
>>>>>> 
>>>>>> Here is my incrementToken method.
>>>>>> 
>>>>>>   @Override
>>>>>>   public boolean incrementToken() throws IOException {
>>>>>>     while (true) {
>>>>>>       if (!hasSavedState) {
>>>>>>         if (!input.incrementToken()) {
>>>>>>           return false;
>>>>>>         }
>>>>>>         if (!generateFragments) { // This part works fine!
>>>>>>           String normalizedFormula = molFormula.normalize(new String(termAttribute.buffer()));
>>>>>>           char[] newBuffer = normalizedFormula.toCharArray();
>>>>>>           termAttribute.setEmpty();
>>>>>>           termAttribute.copyBuffer(newBuffer, 0, newBuffer.length);
>>>>>>           return true;
>>>>>>         }
>>>>>>         formulas = molFormula.normalizeToList(new String(termAttribute.buffer()));
>>>>>>         iterator = formulas.listIterator();
>>>>>>         savedPositionIncrement += posIncAttribute.getPositionIncrement();
>>>>>>         hasSavedState = true;
>>>>>>         first = true;
>>>>>>         saveState();
>>>>>>       }
>>>>>>       if (!iterator.hasNext()) {
>>>>>>         posIncAttribute.setPositionIncrement(savedPositionIncrement);
>>>>>>         savedPositionIncrement = 0;
>>>>>>         hasSavedState = false;
>>>>>>         continue;
>>>>>>       }
>>>>>>       String formula = iterator.next();
>>>>>>       int startOffset = savedStartOffset;
>>>>>> 
>>>>>>       if (first) {
>>>>>>         termAttribute.setEmpty();
>>>>>>       }
>>>>>>       int endOffset = savedStartOffset + formula.length();
>>>>>>       System.out.printf("Writing formula %s %d to %d%n", formula, startOffset, endOffset);
>>>>>>       termAttribute.append(formula);
>>>>>>       offsetAttribute.setOffset(startOffset, endOffset);
>>>>>>       savedStartOffset = endOffset + 1;
>>>>>>       if (first) {
>>>>>>         posIncAttribute.setPositionIncrement(0);
>>>>>>       } else {
>>>>>>         first = false;
>>>>>>         posIncAttribute.setPositionIncrement(0);
>>>>>>       }
>>>>>>       typeAttribute.setType(savedType);
>>>>>>       return true;
>>>>>>     }
>>>>>>   }
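
Two things stand out in the method above: both branches of the final if set the position increment to 0, and new String(termAttribute.buffer()) reads the whole backing array rather than only the first length() characters. The sketch below models one common bookkeeping scheme for stacked tokens without any Lucene classes. Token and expand are hypothetical names, and reusing the original offsets for every variant is an assumption about what suits this case, not a statement of what Lucene requires.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class StackedTokens {
    // Plain stand-in for one token's term, position increment and offsets.
    static final class Token {
        final String term;
        final int posInc;
        final int start;
        final int end;

        Token(String term, int posInc, int start, int end) {
            this.term = term;
            this.posInc = posInc;
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return term + "(+" + posInc + "," + start + "-" + end + ")";
        }
    }

    // The first variant keeps the incoming position increment; the rest use 0
    // so they share its position. Every variant reuses the original offsets.
    static List<Token> expand(Token original, List<String> variants) {
        List<Token> out = new ArrayList<>();
        boolean first = true;
        for (String variant : variants) {
            int posInc = first ? original.posInc : 0;
            out.add(new Token(variant, posInc, original.start, original.end));
            first = false;
        }
        return out;
    }

    public static void main(String[] args) {
        Token input = new Token("C2H6O", 1, 0, 5);
        System.out.println(expand(input, Arrays.asList("C2H6O1", "C2", "H6", "O1")));
        // prints [C2H6O1(+1,0-5), C2(+0,0-5), H6(+0,0-5), O1(+0,0-5)]
    }
}
```

In a real filter the same idea would mean replacing the term for each variant rather than appending to it, and reading the incoming term with termAttribute.toString() or the buffer together with length().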
>>>>>> 
>>>>>> --
>>>>>> 
>>>>>> 
>>>>>> This message and any attachment are confidential and may be privileged or
>>>>>> otherwise protected from disclosure. If you are not the intended recipient,
>>>>>> you must not copy this message or attachment or disclose the contents to
>>>>>> any other person. If you have received this transmission in error, please
>>>>>> notify the sender immediately and delete the message and any attachment
>>>>>> from your system. Merck KGaA, Darmstadt, Germany and any of its
>>>>>> subsidiaries do not accept liability for any omissions or errors in this
>>>>>> message which may arise as a result of E-Mail-transmission or for damages
>>>>>> resulting from any unauthorized changes of the content of this message and
>>>>>> any attachment thereto. Merck KGaA, Darmstadt, Germany and any of its
>>>>>> subsidiaries do not guarantee that this message is free of viruses and does
>>>>>> not accept liability for any damages caused by any virus transmitted
>>>>>> therewith.
>>>>>> 
>>>>>> Click http://www.emdgroup.com/disclaimer to access the German, French,
>>>>>> Spanish and Portuguese versions of this disclaimer.
>>>>> 
>>>>> 
>>>> 
>>> 
>> 

