lucene-solr-user mailing list archives

From Charles Sanders <csand...@redhat.com>
Subject Re: Problem with solr.LengthFilterFactory
Date Mon, 18 May 2015 16:21:12 GMT
No, the field has always been text. And from the error, it's obviously passing a very large
token to the index, regardless of the tokenizer and filter. 

So I guess I will have to tokenize and filter the text before I send it to Solr, since Solr
is not able to properly handle a very large token. More custom code. 
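
A minimal Python sketch of that client-side step, for anyone taking the same route. It assumes plain whitespace splitting is good enough for the field and mirrors the min="1" max="300" settings from the LengthFilterFactory config quoted below. The helper name is made up for illustration and is not part of Solr or any client library. Python counts characters here, which may differ slightly from how Solr counts text outside the Basic Multilingual Plane.

```python
# Hypothetical helper; not part of Solr or any Solr client library.
def prefilter_tokens(text, min_len=1, max_len=300):
    """Split on whitespace and drop tokens outside [min_len, max_len] characters."""
    return [tok for tok in text.split() if min_len <= len(tok) <= max_len]


# Example document with one oversized token in the middle (made-up data).
doc = {"id": "test300", "portal_package": "short " + "x" * 40000 + " also-short"}
doc["portal_package"] = " ".join(prefilter_tokens(doc["portal_package"]))
print(doc["portal_package"])  # short also-short
```

Since UTF-8 uses at most 4 bytes per character, a token of 300 characters or fewer stays far below the 32766-byte limit in the error message.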

Thanks everyone for your help. 
Charles 


----- Original Message -----

From: "Jack Krupansky" <jack.krupansky@gmail.com> 
To: solr-user@lucene.apache.org 
Sent: Monday, May 18, 2015 12:00:37 PM 
Subject: Re: Problem with solr.LengthFilterFactory 

Sorry for not spotting that earlier. Lucene itself does have such a limit. 
No way around it - an individual term is limited to 32K-2 bytes. Lucene is 
designed for searching of terms, not large blob storage. 
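
For reference, 32K-2 works out to 32 * 1024 - 2 = 32766, which is the number in the "immense term" error. The limit applies to the UTF-8 encoded length, so it is measured in bytes rather than characters. A rough Python sketch of the check (the names are made up for illustration and are not Lucene API):

```python
# Illustrative only: these names are hypothetical, not part of Lucene or Solr.
MAX_TERM_BYTES = 32 * 1024 - 2  # 32766, the figure quoted in the error message


def fits_in_lucene_term(token):
    """Return True if the token's UTF-8 encoding is within the term limit."""
    return len(token.encode("utf-8")) <= MAX_TERM_BYTES


print(fits_in_lucene_term("a" * 32766))       # True: 32766 ASCII chars = 32766 bytes
print(fits_in_lucene_term("a" * 32767))       # False: one byte over
print(fits_in_lucene_term("\u00e9" * 16384))  # False: 16384 two-byte chars = 32768 bytes
```

So a term of non-ASCII text can hit the limit with far fewer than 32766 characters.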

Maybe you defined that field as a string originally and later updated your 
schema to a text field, so that Lucene still knows it as an unanalyzed 
string field? You need to delete the index and start over if you want to 
change the field types like that. 

-- Jack Krupansky 

On Mon, May 18, 2015 at 8:33 AM, Charles Sanders <csanders@redhat.com> 
wrote: 

> Jack, 
> Thanks for the information. If I understand this correctly, the White 
> space tokenizer will break a single token of size 300 into two tokens, one 
> of size 256 and the other of size 44. If this is true, then for the single 
> test document I have used, in the index in the portal_package field, I 
> should see two tokens rather than one large single token. 
> 
> If my understanding is correct, then why in my production system, where we 
> occasionally get a single very large token, do I see this error? 
> Caused by: java.lang.IllegalArgumentException: Document contains at least 
> one immense term in field="portal_package" (whose UTF8 encoding is longer 
> than the max length 32766) 
> 
> The existence of this error would lead me to conclude that a very large 
> single token is making its way through the white space tokenizer and 
> filters to the index where it is rejected. 
> 
> I'm afraid my understanding is not complete. Can you fill in the gaps? 
> 
> Thanks, 
> Charles 
> 
> 
> ----- Original Message ----- 
> 
> From: "Jack Krupansky" <jack.krupansky@gmail.com> 
> To: solr-user@lucene.apache.org 
> Sent: Friday, May 15, 2015 4:31:22 PM 
> Subject: Re: Problem with solr.LengthFilterFactory 
> 
> Sorry that my brain has turned to mush... the issue you are hitting is due 
> to a known, undocumented limit in the whitespace tokenizer: 
> 
> https://issues.apache.org/jira/browse/LUCENE-5785 
> "White space tokenizer has undocumented limit of 256 characters per token" 
> 
> If you look at the parsed query you will see that two query terms were 
> generated. This is because the whitespace tokenizer will simply split long 
> tokens every 256 characters. So, your filter will never see a long term. 
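
To illustrate the splitting, here is a toy Python model of the behaviour described above. It takes the 256-character figure from the Jira title at face value and is not Lucene's actual implementation. The function name is invented for the example.

```python
# Toy model only: mimics the behaviour described in this thread, not Lucene's actual code.
def whitespace_tokenize(text, max_token_len=256):
    """Split on whitespace, then cut any word longer than max_token_len into chunks."""
    tokens = []
    for word in text.split():
        for start in range(0, len(word), max_token_len):
            tokens.append(word[start:start + max_token_len])
    return tokens


tokens = whitespace_tokenize("1234567890" * 30)  # one 300-character "word"
print([len(t) for t in tokens])                  # [256, 44]
print(any(len(t) > 300 for t in tokens))         # False: a max="300" filter has nothing to drop
```

Under this model no chunk is ever longer than 256 characters, so a length filter with max="300" placed after the tokenizer passes everything through.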
> 
> There is a note on the Jira (evidently by me!) that you can use the pattern 
> tokenizer as a workaround. But... if your term is a string anyway, you 
> could just use the keyword tokenizer. 
> 
> 
> -- Jack Krupansky 
> 
> On Fri, May 15, 2015 at 4:06 PM, Charles Sanders <csanders@redhat.com> 
> wrote: 
> 
> > Shawn, 
> > Thanks a bunch for working with me on this. 
> > 
> > I have deleted all records from my index. Stopped solr. Made the schema 
> > changes as requested. Started solr. Then inserted the one test record. Then 
> > searched. Still see the same results. No, portal_package is not the unique 
> > key; it's uri, which is a string field. 
> > 
> > <field name="portal_package" type="text_std" indexed="true" stored="true" 
> > multiValued="true"/> 
> > 
> > <fieldType name="text_std" class="solr.TextField" 
> > positionIncrementGap="100"> 
> > <tokenizer class="solr.WhitespaceTokenizerFactory"/> 
> > <filter class="solr.LengthFilterFactory" min="1" max="300" /> 
> > </fieldType> 
> > 
> > { 
> > "documentKind": "test", 
> > "uri": "test300", 
> > "id": "test300", 
> > 
> "portal_package":"12345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890"

> > } 
> > 
> > 
> > { 
> > "responseHeader": { 
> > "status": 0, 
> > "QTime": 47, 
> > "params": { 
> > "spellcheck": "true", 
> > "enableElevation": "false", 
> > "df": "allText", 
> > "echoParams": "all", 
> > "spellcheck.maxCollations": "5", 
> > "spellcheck.dictionary": "andreasAutoComplete", 
> > "spellcheck.count": "5", 
> > "spellcheck.collate": "true", 
> > "spellcheck.onlyMorePopular": "true", 
> > "rows": "10", 
> > "indent": "true", 
> > "q": 
> > 
> "portal_package:12345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890",

> > "_": "1431719989047", 
> > "debug": "query", 
> > "wt": "json" 
> > } 
> > }, 
> > "response": { 
> > "numFound": 1, 
> > "start": 0, 
> > "docs": [ 
> > { 
> > "documentKind": "test", 
> > "uri": "test300", 
> > "id": "test300", 
> > "portal_package": [ 
> > 
> > 
> "12345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890"

> > ], 
> > "_version_": 1501267024421060600, 
> > "timestamp": "2015-05-15T19:56:43.247Z", 
> > "language": "en" 
> > } 
> > ] 
> > }, 
> > "debug": { 
> > "rawquerystring": 
> > 
> "portal_package:12345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890",

> > "querystring": 
> > 
> "portal_package:12345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890",

> > "parsedquery": 
> > 
> "portal_package:1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456

> > 
> portal_package:7890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890",

> > "parsedquery_toString": 
> > 
> "portal_package:1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456

> > 
> portal_package:7890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890",

> > "QParser": "LuceneQParser" 
> > } 
> > } 
> > 
> > 
> > 
> > 
> > 
> > ----- Original Message ----- 
> > 
> > From: "Shawn Heisey" <apache@elyograg.org> 
> > To: solr-user@lucene.apache.org 
> > Sent: Friday, May 15, 2015 3:29:19 PM 
> > Subject: Re: Problem with solr.LengthFilterFactory 
> > 
> > On 5/15/2015 1:23 PM, Shawn Heisey wrote: 
> > > Then I looked back at your fieldType definition and noticed that you 
> > > are only defining an index analyzer. Remove the 'type="index"' part of 
> > > the analyzer config so it happens at both index and query time, 
> > > reindex, then try again. 
> > 
> > The reindex may be very important here. I would actually completely 
> > delete your data directory and restart Solr before reindexing, to be 
> > sure you don't have old records from any previous reindexes. 
> > 
> > http://wiki.apache.org/solr/HowToReindex 
> > 
> > I think this next part is unlikely, but I'm going to ask it anyway: Is 
> > the portal_package field your schema uniqueKey? If it is, that might be 
> > an additional source of problems. Using a solr.TextField for a 
> > uniqueKey field causes Solr to behave in unexpected ways. 
> > 
> > Thanks, 
> > Shawn 
> > 
> > 
> > 
> 
> 

