lucene-solr-user mailing list archives

From Phillip Rhodes <motley.crue....@gmail.com>
Subject Re: Issue with Solr Cell mixing metadata and content together
Date Fri, 22 Dec 2017 01:11:37 GMT
Fair enough.  I'm actually using ManifoldCF to manage the indexing,
and I see that they have a Tika Content Extraction transformer
available, so I'll look into wiring that into my pipeline and see if
that gets me the results I'm looking for.


Thanks,


Phil

This message optimized for indexing by NSA PRISM


On Thu, Dec 21, 2017 at 7:43 PM, Erick Erickson <erickerickson@gmail.com> wrote:
> bq: Is there any way to get reasonable behavior using the
> ExtractingRequestHandler, or should I just dump that approach and plan
> to run Tika outside of Solr, and then send Solr the exact content I
> want?
>
> Actually, this is recommended for a bunch of reasons, so I'd just
> go there straightaway. Tika has all sorts of "interesting" things to
> cope with, and since the underlying file formats are more-or-less
> followed by this vendor or that, there's always the possibility
> that Tika will kill your Solr.
>
> Here's a place to start:
> https://lucidworks.com/2012/02/14/indexing-with-solrj/
>
> Best,
> Erick
>
> On Thu, Dec 21, 2017 at 4:31 PM, Phillip Rhodes
> <motley.crue.fan@gmail.com> wrote:
>> Hi all, I have been having an issue with Solr, using the
>> ExtractingRequestHandler.  Basically, when indexing a PDF (for
>> example) I get all the metadata mixed into the "content" field along
>> with the content.  See:
>> <https://stackoverflow.com/questions/47934257/importing-files-with-solr-cell-tika-is-mixing-metadata-fields-with-content>
>> for the gory details.
>>
>> I'm guessing this is the same basic issue as
>> <https://issues.apache.org/jira/browse/SOLR-9178> which is still
>> unresolved.  But I thought I'd ping the list just to see if anyone had
>> a workaround or any more information on this.
>>
>> Is there any way to get reasonable behavior using the
>> ExtractingRequestHandler, or should I just dump that approach and plan
>> to run Tika outside of Solr, and then send Solr the exact content I
>> want?
>>
>>
>> Thanks,
>>
>>
>>
>> This message optimized for indexing by NSA PRISM
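
For readers landing on this thread: the approach recommended above (run Tika outside of Solr, then send Solr only the fields you want) can be sketched roughly as follows. This is a minimal illustration, not the code from the linked Lucidworks article, which uses SolrJ. It assumes a standalone Tika Server on localhost:9998 whose /rmeta/text endpoint returns a JSON list of metadata objects with the body text under "X-TIKA:content". The collection name "mycollection", the Solr field names, and the FIELD_MAP choices are all hypothetical.

```python
import json
import urllib.request

# Assumed endpoints: adjust to your own Tika Server and Solr collection.
TIKA_RMETA_URL = "http://localhost:9998/rmeta/text"
SOLR_UPDATE_URL = "http://localhost:8983/solr/mycollection/update?commit=true"

# Key under which Tika Server's /rmeta output carries the extracted body text.
CONTENT_KEY = "X-TIKA:content"

# Hypothetical mapping: Solr field name -> Tika metadata key to copy.
# Anything not listed here is dropped, so metadata never leaks into "content".
FIELD_MAP = {"title": "dc:title", "author": "dc:creator"}


def extract_with_tika(path):
    """Send a file to Tika Server and return the parsed /rmeta JSON list."""
    with open(path, "rb") as f:
        payload = f.read()
    req = urllib.request.Request(
        TIKA_RMETA_URL,
        data=payload,
        method="PUT",
        headers={"Accept": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def build_solr_doc(doc_id, rmeta):
    """Build a Solr document with body text and metadata in separate fields.

    Only the first entry (the container document) is used; entries for
    embedded attachments, if any, are ignored in this sketch.
    """
    parent = rmeta[0]
    doc = {"id": doc_id, "content": (parent.get(CONTENT_KEY) or "").strip()}
    for solr_field, tika_key in FIELD_MAP.items():
        if tika_key in parent:
            doc[solr_field] = parent[tika_key]
    return doc


def send_to_solr(docs):
    """POST a list of documents to Solr's JSON update handler."""
    body = json.dumps(docs).encode("utf-8")
    req = urllib.request.Request(
        SOLR_UPDATE_URL,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status
```

Typical use would be send_to_solr([build_solr_doc("doc-1", extract_with_tika("report.pdf"))]), where "report.pdf" is a placeholder. Because extraction runs in a separate process, a document that crashes or hangs Tika does not take Solr down with it, which is the main reason given above for preferring this setup.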
