lucene-solr-user mailing list archives

From Erick Erickson <erickerick...@gmail.com>
Subject Re: [Non-DoD Source] Re: Solr 6.1.0 issue (UNCLASSIFIED)
Date Sat, 06 Aug 2016 03:46:54 GMT
You also need to find out _why_ you're trying to index such huge
tokens. They indicate that something you're ingesting isn't
reasonable.

Just truncating the input will get things indexed, true. But a 32K
token is unexpected, and it suggests that what's in your index may not
be what you expect and may not be useful.

But you know what you're indexing best; this is just a general statement.
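
To make the "find out why" step concrete, here is a minimal sketch in Python of one way to look for such tokens in the text before it is sent to Solr. It is an illustration only: the function name is hypothetical, and splitting on whitespace is an assumption that merely approximates what a real analysis chain does (for a "string" field the whole value is a single term). The 32766 figure is the byte limit quoted in the error message.

```python
import re

# Byte limit quoted in the Solr error message for a single term.
MAX_TERM_BYTES = 32766


def find_immense_tokens(text, max_bytes=MAX_TERM_BYTES):
    """Report whitespace-delimited tokens whose UTF-8 encoding exceeds max_bytes.

    Returns a list of (char_offset, byte_length, preview) tuples.
    Whitespace splitting is only an approximation of Solr's analysis.
    """
    hits = []
    for match in re.finditer(r"\S+", text):
        token = match.group()
        byte_length = len(token.encode("utf-8"))
        if byte_length > max_bytes:
            # Keep a short preview so the source of the junk can be recognised.
            hits.append((match.start(), byte_length, token[:40]))
    return hits


if __name__ == "__main__":
    sample = "intro " + "A" * 40000 + " outro"
    for offset, size, preview in find_immense_tokens(sample):
        print(offset, size, preview)
```

Looking at the preview of each hit (for example a long base-64 run) is usually enough to tell where the unreasonable input comes from.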

Erick

On Fri, Aug 5, 2016 at 12:55 PM, Musshorn, Kris T CTR USARMY RDECOM
ARL (US) <kris.t.musshorn.ctr@mail.mil> wrote:
> CLASSIFICATION: UNCLASSIFIED
>
> What I did was force nutch to truncate content to 32765 max before indexing into solr
> and it solved my problem.
>
>
> Thanks,
> Kris
>
> ~~~~~~~~~~~~~~~~~~~~~~~~~~
> Kris T. Musshorn
> FileMaker Developer - Contractor – Catapult Technology Inc.
> US Army Research Lab
> Aberdeen Proving Ground
> Application Management & Development Branch
> 410-278-7251
> kris.t.musshorn.ctr@mail.mil
> ~~~~~~~~~~~~~~~~~~~~~~~~~~
>
>
> -----Original Message-----
> From: Erick Erickson [mailto:erickerickson@gmail.com]
> Sent: Friday, August 05, 2016 3:29 PM
> To: solr-user <solr-user@lucene.apache.org>
> Subject: [Non-DoD Source] Re: Solr 6.1.0 issue (UNCLASSIFIED)
>
> What that error is telling you is that you have an unanalyzed term that is, well, huge
> (i.e. > 32K). Is your "content" field by chance a "string" type? It's very rare that a
> term > 32K is actually useful.
> You can't search on it except with, say, wildcards; there's no stemming etc. So the first
> question is whether the "content" field is appropriately defined in your schema for your use
> case.
>
> If your content field is some kind of text-based field (i.e. solr.TextField), then the
> second issue may be that you just have wonky data coming in, say a base-64 encoded image or
> something scraped from somewhere. In that case you need to NOT index it. You can try
> LengthFilterFactory, see:
> https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.LengthFilterFactory
>
> This is a fundamental limitation enforced at the Lucene layer, so if that doesn't work,
> the only real solution is "don't do that". You'll have to intercept the doc and omit that
> data, perhaps write a custom update processor to throw out huge fields or the like.
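
The "intercept the doc and omit that data" idea can be sketched as follows. This is only an illustration of the logic in Python, not a Solr update processor (those are written in Java) and not a Nutch plugin. The function names, the dict-shaped document, and the whitespace tokenization are all assumptions made for the example.

```python
# Byte limit quoted in the Solr error message for a single term.
MAX_TERM_BYTES = 32766


def strip_immense_tokens(text, max_bytes=MAX_TERM_BYTES):
    """Drop whitespace-delimited tokens whose UTF-8 encoding exceeds max_bytes.

    Note: split()/join() also collapses runs of whitespace to single spaces.
    """
    kept = [tok for tok in text.split() if len(tok.encode("utf-8")) <= max_bytes]
    return " ".join(kept)


def clean_doc(doc, fields=("content",), max_bytes=MAX_TERM_BYTES):
    """Return a copy of doc with oversize tokens removed from the given fields.

    doc is assumed to be a plain dict of field name to value; only string
    values in the listed fields are touched.
    """
    cleaned = dict(doc)
    for name in fields:
        value = cleaned.get(name)
        if isinstance(value, str):
            cleaned[name] = strip_immense_tokens(value, max_bytes)
    return cleaned


if __name__ == "__main__":
    doc = {"id": "doc-1", "content": "keep " + "x" * 40000 + " this"}
    print(clean_doc(doc)["content"])
```

Unlike truncating the whole field, this keeps the ordinary text on both sides of the oversize token and discards only the token itself.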
>
> Best,
> Erick
>
>
> On Fri, Aug 5, 2016 at 10:59 AM, Musshorn, Kris T CTR USARMY RDECOM ARL (US)
> <kris.t.musshorn.ctr@mail.mil> wrote:
>> CLASSIFICATION: UNCLASSIFIED
>>
>> I am trying to index from nutch 1.12 to SOLR 6.1.0.
>> Got this error.
>> java.lang.Exception:
>> org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:
>> Error from server at http://localhost:8983/solr/ARLInside:
>> Exception writing document id
>> https://emcstage.arl.army.mil/inside/fellows/corner/research.vol.3.2/index.cfm
>> to the index; possible analysis error: Document
>> contains at least one immense term in field="content" (whose UTF8
>> encoding is longer than the max length 32766
>>
>> How to correct?
>>
>> Thanks,
>> Kris
>>
