manifoldcf-user mailing list archives

From Phillip Rhodes <motley.crue....@gmail.com>
Subject Re: MCF not indexing documents due to mime-type
Date Fri, 22 Dec 2017 00:21:02 GMT
OK, it looks like the root of the problem I was seeing, metadata
winding up mixed in with the content, is ultimately a bug in Solr.
<https://issues.apache.org/jira/browse/SOLR-9178>

It seems that if you use the "Tika built into Solr" approach this is
just what you get.  The answer seems to be "do the Tika processing
outside of Solr".
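
For illustration, here is a minimal sketch of what "Tika processing outside of Solr" could look like on the Solr side: text that has already been extracted is posted as JSON to the plain /update handler instead of /update/extract, with metadata kept in its own fields. The function name, the "content" field, the "meta_" prefix, and the base URL and collection name are all assumptions made for this example, not ManifoldCF or Solr defaults.

```python
import json
import urllib.request


def build_solr_update(base_url, collection, doc_id, text, metadata):
    """Build (but do not send) a POST to Solr's plain /update handler.

    Hypothetical helper: the text is assumed to have been extracted
    already (for example by Tika running outside Solr).
    """
    # Body text goes into one field; "content" is an assumed field name.
    doc = {"id": doc_id, "content": text}
    # Each metadata item gets its own field so it cannot end up mixed
    # into the content; the "meta_" prefix is an assumption.
    for key, value in metadata.items():
        doc["meta_" + key] = value
    body = json.dumps([doc]).encode("utf-8")
    url = f"{base_url}/{collection}/update?commit=true"
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    req = build_solr_update(
        "http://localhost:8983/solr",  # assumed local Solr instance
        "docs",                        # assumed collection name
        "example-1",
        "Plain text extracted from a PDF.",
        {"author": "unknown"},
    )
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```

Passing the returned request to urllib.request.urlopen would actually send it; the sketch stops short of that so it can be run without a Solr instance.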

So now my question vis-a-vis ManifoldCF is this: if I run Tika in MCF
directly, can I have MCF index everything and send it all to Solr
*without* using the ExtractingRequestHandler?  My naive understanding
is that the "Tika Content Extractor" should let me accomplish this.
Can anyone confirm whether that is correct?


Thanks


Phil

This message optimized for indexing by NSA PRISM


On Wed, Dec 20, 2017 at 7:53 AM, Karl Wright <daddywri@gmail.com> wrote:
> Hi Phil,
>
> Some output connectors accept *only* text documents.  That's why
> you need to run your documents through Tika first.  So your original setup
> was right.
>
> If you are still using ElasticSearch, you can make it accept non-text
> documents only by specifying the mapper attachment in the output connection
> configuration.
>
>
>
> Karl
>
>
> On Wed, Dec 20, 2017 at 4:25 AM, Phillip Rhodes <motley.crue.fan@gmail.com>
> wrote:
>>
>> MCF folks:
>>
>> I'm about to tear my hair out over this one... I just realized that
>> I've been running MCF with the "Use the Extract Update Handler:"
>> option checked.  Suspecting this might be related to another issue I
>> was having (content was not being stored in the field named in the
>> "Content field name:" option in MCF), I turned this option off.
>>
>> Now, MCF happily rejects nearly every document in my repository with this:
>>
>> Result Code: EXCLUDEDMIMETYPE
>> Result Description: Excluding document because of mime type
>> (application/pdf)
>> (and so on for many other mime types)
>>
>> So... this is *not* what I would expect to happen as I have nothing at
>> all listed in the "excluded mime types" setting for this output
>> connector.  With nothing explicitly excluded, I would (perhaps
>> naively) expect all mime types to be sent to Solr.
>>
>> But what makes it even worse is this: even when I explicitly add types
>> (for example, application/pdf) to the "included mime types" setting
>> and re-index, I *still* get the same message and no PDF files are
>> indexed.
>>
>> Any ideas?  Is this a bug, or is there something else I need to do?
>>
>>
>>
>> Thanks,
>>
>>
>> Phil
>> ~~~
>> This message optimized for indexing by NSA PRISM
>
>
