lucene-dev mailing list archives

From "Edans Sandes (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SOLR-11604) ShingleFilter should have an option to skip filler tokens (e.g. stop words)
Date Sat, 04 Nov 2017 14:22:03 GMT

     [ https://issues.apache.org/jira/browse/SOLR-11604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Edans Sandes updated SOLR-11604:
--------------------------------
    Description: 
ShingleFilterFactory should have an option to ignore filler tokens in the total shingle size.

For instance (adapted from [https://stackoverflow.com/questions/33193144/solr-stemming-stop-words-and-shingles-not-giving-expected-outputs]),
consider the text "A brown fox quickly jumps over the lazy dog". When we remove stopwords,
apply stemming, and run the ShingleFilter (shingle size = 3), we get the following result:

1. _ brown fox
2. brown fox quickly
3. fox quickly jump
4. quickly jump _
5. jump _ _
6. _ _ lazy
7. _ lazy dog
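
To make the behavior above concrete, here is a minimal plain-Java sketch that reproduces those seven shingles. It is not Lucene code: the class name FillerShingleSketch and the method shingles() are hypothetical, and I am assuming that each removed stopword leaves a position gap (modeled here as null) that gets rendered as the filler string.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-alone simulation, not the Lucene ShingleFilter itself.
class FillerShingleSketch {

    /**
     * Builds fixed-size shingles over a list of token positions.
     * A null entry stands for a removed stopword; it is rendered as the
     * filler string and still uses one slot of the shingle.
     */
    static List<String> shingles(List<String> positions, int size, String filler) {
        List<String> out = new ArrayList<>();
        for (int start = 0; start + size <= positions.size(); start++) {
            List<String> window = new ArrayList<>();
            for (String tok : positions.subList(start, start + size)) {
                window.add(tok == null ? filler : tok);
            }
            out.add(String.join(" ", window));
        }
        return out;
    }

    public static void main(String[] args) {
        // "A brown fox quickly jumps over the lazy dog" after stopword
        // removal and stemming; null marks each removed stopword.
        List<String> positions = Arrays.asList(
                null, "brown", "fox", "quickly", "jump", null, null, "lazy", "dog");
        shingles(positions, 3, "_").forEach(System.out::println);
    }
}
```

Three of the nine positions are gaps, so five of the seven shingles carry at least one filler.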

The filler token "_" clearly occupies one position in every shingle where it appears.
I would expect the returned shingles to be the following instead:
1. brown fox quickly
2. fox quickly jump
3. quickly jump lazy
4. jump lazy dog

To maintain backward compatibility, I suggest adding an option called "skipFillerTokens"
to implement this behavior. Note that this is different from setting fillerToken="", since
the empty string still occupies one token in the shingle.
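
A rough plain-Java sketch of the difference between the two settings, under the same modeling assumption that a removed stopword is a null position. The class SkipFillerSketch and its shingles() method are hypothetical, and the boolean parameter only mirrors the proposed "skipFillerTokens" option; this is not the attached patch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-alone simulation, not the Lucene ShingleFilter itself.
class SkipFillerSketch {

    /**
     * A null entry in positions stands for a removed stopword.
     * With skipFillerTokens=true the gaps are dropped before the windows
     * are formed; otherwise each gap becomes the filler string and still
     * uses one slot of the shingle.
     */
    static List<String> shingles(List<String> positions, int size,
                                 String filler, boolean skipFillerTokens) {
        List<String> tokens = new ArrayList<>();
        for (String tok : positions) {
            if (tok != null) {
                tokens.add(tok);
            } else if (!skipFillerTokens) {
                tokens.add(filler);
            }
        }
        List<String> out = new ArrayList<>();
        for (int start = 0; start + size <= tokens.size(); start++) {
            out.add(String.join(" ", tokens.subList(start, start + size)));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> positions = Arrays.asList(
                null, "brown", "fox", "quickly", "jump", null, null, "lazy", "dog");

        // Proposed behavior: gaps do not count toward the shingle size.
        System.out.println(shingles(positions, 3, "_", true));

        // Empty filler without skipping: still seven shingles, some of
        // them made up partly of empty strings.
        System.out.println(shingles(positions, 3, "", false).size());
    }
}
```

With skipping enabled the sketch yields the four shingles listed above, while an empty filler only hides the "_" and leaves the shingle count at seven.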

I will attach a patch for the ShingleFilter class (getNextToken() method).




> ShingleFilter should have an option to skip filler tokens (e.g. stop words)
> ---------------------------------------------------------------------------
>
>                 Key: SOLR-11604
>                 URL: https://issues.apache.org/jira/browse/SOLR-11604
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: Schema and Analysis
>    Affects Versions: 7.1
>            Reporter: Edans Sandes
>              Labels: ShingleFilter, StopFilter, StopWords
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>



