manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: MCF not indexing documents due to mime-type
Date Sat, 23 Dec 2017 00:27:24 GMT
Hi Phil,

Are these fields extracted by Tika from your document?  Just curious,
because if it's in MCF itself we could do something about it.

Anyhow, what you want is the metadata adjuster:

https://manifoldcf.apache.org/release/release-1.10/en_US/end-user-documentation.html#metadataadjuster


Karl


On Fri, Dec 22, 2017 at 1:47 AM, Phillip Rhodes <motley.crue.fan@gmail.com>
wrote:

> On Thu, Dec 21, 2017 at 8:35 PM, Karl Wright <daddywri@gmail.com> wrote:
> > Well, there are some differences; "Solr Cell" (as they used to call it)
> > generates a couple of fields that the standard Tika extractor in MCF
> > won't.  But other than that it should work.
>
> By and large I don't think I care about those fields, so that part
> shouldn't be an issue.
>
> > Note that you can still use the extracting update handler in the solr
> > connector; since the input will always be text/plain Tika shouldn't do
> > anything to the document on the Solr side.  If that doesn't happen to be
> > true, you can use the standard Solr input handler,
>
> FWIW, it appears that even when using the Tika connector in MCF, what
> gets sent to Solr still triggers some Tika behavior if you have the
> "use extract handler" option turned on.  When I did this I got all
> sorts of weird Tika parse exceptions and what-not from Solr.
>
> Fortunately, just sending everything to Solr using the standard handler
> worked, and I'm at a point now where *almost* everything works.
>
> The one issue I'm still seeing is this: when using the Tika
> connector, it seems that some date-oriented fields are being
> generated with a value that does not have the trailing 'Z' timezone
> flag.  This causes a Solr error if the corresponding field is
> date-typed, as Solr requires dates to be expressed in UTC with
> that suffix.
>
> Ex:
>
> dcterms:created: 2011-03-02T08:44:45
> found field: dcterms:modified: 2011-03-02T08:44:45
> Last-Save-Date: 2011-03-02T08:44:45
> meta:save-date: 2011-03-02T08:44:45
>
> Solr wants all of these to look like
>
>
> 2011-03-02T08:44:45Z
>
>
> Is there any way, using any built in MCF functionality, to forcibly
> munge the field values to correct this?  If not, could I accomplish
> that by writing a custom Transform connector?
>
>
> Thanks,
>
>
> Phil
>
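
For illustration, the string fix-up Phil describes can be sketched
outside of MCF.  This is only a minimal sketch, not MCF or Tika code:
the function names and the DATE_FIELDS set are hypothetical, and
appending 'Z' assumes the extracted value is already in UTC, which the
thread does not confirm.  If the values are really local times,
relabeling them as UTC would shift them.

```python
import re

# An ISO 8601 date-time with no zone designator, for example
# 2011-03-02T08:44:45 or 2011-03-02T08:44:45.123
_NAIVE_TS = re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?")

# Hypothetical list of fields to fix; taken from the examples in the thread.
DATE_FIELDS = {
    "dcterms:created",
    "dcterms:modified",
    "Last-Save-Date",
    "meta:save-date",
}


def ensure_utc_suffix(value: str) -> str:
    """Append 'Z' if value is a timestamp lacking a zone designator.

    ASSUMPTION: the timestamp is already in UTC. Values that carry a
    'Z' or an offset, or that are not timestamps, are returned unchanged.
    """
    candidate = value.strip()
    if _NAIVE_TS.fullmatch(candidate):
        return candidate + "Z"
    return value


def normalize_metadata(metadata: dict) -> dict:
    """Return a copy of metadata with the listed date fields normalized."""
    return {
        name: ensure_utc_suffix(val) if name in DATE_FIELDS else val
        for name, val in metadata.items()
    }
```

Restricting the rewrite to a named set of fields keeps any other
metadata value that merely looks like a timestamp from being altered.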
