From: Phillip Rhodes
Date: Thu, 21 Dec 2017 20:11:37 -0500
Subject: Re: Issue with Solr Cell mixing metadata and content together
To: solr-user@lucene.apache.org

Fair enough. I'm actually using ManifoldCF to manage the indexing, and
I see that they have a Tika Content Extraction transformer available,
so I'll look into wiring that into my pipeline and see if that gets me
the results I'm looking for.

Thanks,

Phil

This message optimized for indexing by NSA PRISM

On Thu, Dec 21, 2017 at 7:43 PM, Erick Erickson wrote:
> bq: Is there any way to get reasonable behavior using the
> ExtractingRequestHandler, or should I just dump that approach and plan
> to run Tika outside of Solr, and then send Solr the exact content I
> want?
>
> Actually, this is recommended for a bunch of reasons, so I'd just
> go there straightaway. Tika has all sorts of "interesting" things to
> cope with, and since the underlying file formats are more-or-less
> followed by this vendor or that, there's always the possibility
> that Tika will kill your Solr.
>
> Here's a place to start:
> https://lucidworks.com/2012/02/14/indexing-with-solrj/
>
> Best,
> Erick
>
> On Thu, Dec 21, 2017 at 4:31 PM, Phillip Rhodes
> wrote:
>> Hi all, I have been having an issue with Solr, using the
>> ExtractingRequestHandler. Basically, when indexing a PDF (for
>> example) I get all the metadata mixed into the "content" field along
>> with the content. See:
>>
>> for the gory details.
>>
>> I'm guessing this is the same basic issue as
>> which is still
>> unresolved. But I thought I'd ping the list just to see if anyone had
>> a workaround or any more information on this.
>>
>> Is there any way to get reasonable behavior using the
>> ExtractingRequestHandler, or should I just dump that approach and plan
>> to run Tika outside of Solr, and then send Solr the exact content I
>> want?
>>
>>
>> Thanks,
>>
>>
>>
>> This message optimized for indexing by NSA PRISM
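
For anyone finding this thread later, here is a rough sketch of the "run Tika outside of Solr, then send Solr exactly what you want" idea discussed above. It is not what ManifoldCF or SolrJ does internally. It assumes the body text and metadata have already been extracted by Tika in some separate step, and the collection name (`mycollection`), the field names (`content`, `meta_*`) and the helper functions are all made up for illustration. The only Solr-side assumption is the standard JSON update handler at `/update`.

```python
import json
import urllib.request

# Hypothetical Solr endpoint; adjust host, port and collection to your setup.
SOLR_UPDATE_URL = "http://localhost:8983/solr/mycollection/update?commit=true"


def build_solr_doc(doc_id, body_text, metadata, wanted_meta=("title", "author")):
    """Build one Solr document, keeping body text and metadata in separate fields."""
    doc = {"id": doc_id, "content": body_text.strip()}
    for key in wanted_meta:
        value = metadata.get(key)
        if value:
            # The "meta_" prefix is an illustrative convention, not a Solr requirement.
            doc["meta_" + key] = value
    return doc


def build_update_request(docs, url=SOLR_UPDATE_URL):
    """Prepare (but do not send) a JSON update request for Solr."""
    payload = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # Text and metadata as they might come back from a Tika run outside Solr.
    extracted_text = "  Quarterly report body text.\n"
    extracted_meta = {"title": "Q3 Report", "author": "Jane Doe", "producer": "SomePDFLib"}
    doc = build_solr_doc("doc-1", extracted_text, extracted_meta)
    request = build_update_request([doc])
    print(request.get_method(), request.full_url)
    print(json.dumps(doc, indent=2))
    # To actually index the document: urllib.request.urlopen(request)
```

The point of the split is that only the metadata keys you explicitly ask for reach the index, and they land in their own fields, so nothing extra ends up in "content".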