From: Erick Erickson
Date: Fri, 5 Aug 2016 20:46:54 -0700
Subject: Re: [Non-DoD Source] Re: Solr 6.1.0 issue (UNCLASSIFIED)
To: solr-user

You also need to find out _why_ you're trying to index such huge tokens; they indicate that something you're ingesting isn't reasonable.

Just truncating the input will index things, true. But a 32K token is unexpected, and indicates that what's in your index may not be what you expect and may not be useful.

But you know what you're indexing best; this is just a general statement.

Erick

On Fri, Aug 5, 2016 at 12:55 PM, Musshorn, Kris T CTR USARMY RDECOM ARL (US) wrote:
> CLASSIFICATION: UNCLASSIFIED
>
> What I did was force Nutch to truncate content to 32765 max before indexing into Solr, and it solved my problem.
>
> Thanks,
> Kris
>
> ~~~~~~~~~~~~~~~~~~~~~~~~~~
> Kris T. Musshorn
> FileMaker Developer - Contractor – Catapult Technology Inc.
> US Army Research Lab
> Aberdeen Proving Ground
> Application Management & Development Branch
> 410-278-7251
> kris.t.musshorn.ctr@mail.mil
> ~~~~~~~~~~~~~~~~~~~~~~~~~~
>
>
> -----Original Message-----
> From: Erick Erickson [mailto:erickerickson@gmail.com]
> Sent: Friday, August 05, 2016 3:29 PM
> To: solr-user
> Subject: [Non-DoD Source] Re: Solr 6.1.0 issue (UNCLASSIFIED)
>
> What that error is telling you is that you have an unanalyzed term that is, well, huge (i.e. > 32K). Is your "content" field by chance a "string" type?
> It's very rare that a term > 32K is actually useful. You can't search on it except with, say, wildcards; there's no stemming, etc. So the first question is whether the "content" field is appropriately defined in your schema for your use case.
>
> If your content field is some kind of text-based field (i.e. solr.TextField), then the second issue may be that you just have wonky data coming in, say a base-64 encoded image or something scraped from somewhere. In that case you need to NOT index it. You can try LengthFilterFactory, see:
> https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.LengthFilterFactory
>
> This is a fundamental limitation enforced at the Lucene layer, so if that doesn't work, the only real solution is "don't do that". You'll have to intercept the doc and omit that data, perhaps write a custom update processor to throw out huge fields or the like.
>
> Best,
> Erick
>
>
> On Fri, Aug 5, 2016 at 10:59 AM, Musshorn, Kris T CTR USARMY RDECOM ARL (US) wrote:
>> CLASSIFICATION: UNCLASSIFIED
>>
>> I am trying to index from Nutch 1.12 to Solr 6.1.0.
>> Got this error.
>> java.lang.Exception:
>> org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:
>> Error from server at http://localhost:8983/solr/ARLInside:
>> Exception writing document id
>> https://emcstage.arl.army.mil/inside/fellows/corner/research.vol.3.2/index.cfm
>> to the index; possible analysis error: Document
>> contains at least one immense term in field="content" (whose UTF8
>> encoding is longer than the max length 32766
>>
>> How to correct?
>>
>> Thanks,
>> Kris
>>
>> CLASSIFICATION: UNCLASSIFIED
>
>
> CLASSIFICATION: UNCLASSIFIED
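
The advice in this thread (find out why the tokens are huge, then either omit or truncate that data before it reaches Solr) can be sketched client-side as follows. This is only a minimal illustration in Python, not Nutch or Solr code: all function names are hypothetical, whitespace splitting is a stand-in for whatever analysis chain the "content" field really uses, and 32766 is simply the byte limit quoted in the error message above.

```python
from typing import List, Tuple

# Byte limit taken from the error message quoted in this thread.
MAX_TERM_BYTES = 32766


def utf8_len(text: str) -> int:
    """Length of text in bytes once encoded as UTF-8."""
    return len(text.encode("utf-8"))


def find_immense_tokens(content: str,
                        max_bytes: int = MAX_TERM_BYTES) -> List[Tuple[str, int]]:
    """Report oversized tokens as (40-char preview, byte length) pairs.

    Assumption: tokens are whitespace-delimited. Useful for working out
    *why* the input contains such tokens (e.g. base-64 blobs).
    """
    return [(tok[:40], utf8_len(tok))
            for tok in content.split()
            if utf8_len(tok) > max_bytes]


def drop_immense_tokens(content: str, max_bytes: int = MAX_TERM_BYTES) -> str:
    """Omit oversized tokens and keep the rest (whitespace is normalized)."""
    return " ".join(tok for tok in content.split()
                    if utf8_len(tok) <= max_bytes)


def truncate_utf8(content: str, max_bytes: int = MAX_TERM_BYTES) -> str:
    """Cut content to at most max_bytes of UTF-8 without splitting a character.

    This is the blunt option for a field indexed as a single term.
    """
    encoded = content.encode("utf-8")
    if len(encoded) <= max_bytes:
        return content
    # A partial multi-byte sequence left at the cut point is discarded.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")
```

Running find_immense_tokens on a failing document before choosing between dropping and truncating follows the point made at the top of this message: truncation gets the document indexed, but the preview and byte length show whether the oversized data belonged in the index at all.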