manifoldcf-user mailing list archives

From Phillip Rhodes <motley.crue....@gmail.com>
Subject Re: MCF not indexing documents due to mime-type
Date Fri, 22 Dec 2017 06:47:04 GMT
On Thu, Dec 21, 2017 at 8:35 PM, Karl Wright <daddywri@gmail.com> wrote:
> Well, there are some differences; "Solr Cell" (as they used to call it)
> generates a couple of fields that the standard Tika extractor in MCF won't.
> But other than that it should work.

By and large I don't think I care about those fields, so that part
shouldn't be an issue.

> Note that you can still use the extracting update handler in the solr
> connector; since the input will always be text/plain Tika shouldn't do
> anything to the document on the Solr side.  If that doesn't happen to be
> true, you can use the standard Solr input handler,

FWIW, it appears that even when using the Tika connector in MCF, what
gets sent to Solr still triggers some Tika processing if the "use
extract handler" option is turned on. When I tried this, Solr threw
all sorts of odd Tika parse exceptions.

Fortunately, just sending everything to Solr using the standard
handler worked, and I'm at a point now where *almost* everything works.

The one issue I'm still seeing is this: when using the Tika connector,
some date-oriented fields are generated with a value that lacks the
trailing 'Z' timezone designator. This causes a Solr error if the
corresponding field is date-typed, because Solr requires date values
to be expressed in UTC and to end in 'Z'.

Ex:

dcterms:created: 2011-03-02T08:44:45
dcterms:modified: 2011-03-02T08:44:45
Last-Save-Date: 2011-03-02T08:44:45
meta:save-date: 2011-03-02T08:44:45

Solr wants all of these to look like


2011-03-02T08:44:45Z


Is there any way, using any built in MCF functionality, to forcibly
munge the field values to correct this?  If not, could I accomplish
that by writing a custom Transform connector?
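
To make the needed fix concrete, here is a minimal sketch of the string
normalization such a transformation step would have to perform. It does
not use any MCF API; the class and method names are hypothetical. It also
assumes the zone-less values are already in UTC, which the Tika output
above does not confirm, so appending 'Z' could shift the meaning of
timestamps that were really recorded in local time.

```java
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

// Hypothetical helper, not part of ManifoldCF or Solr.
public class DateFieldFixer {

    /**
     * If the value is a zone-less ISO-8601 date-time such as
     * 2011-03-02T08:44:45, return it with a trailing 'Z'.
     * ASSUMPTION: the zone-less value is already in UTC.
     * Anything else (including values that already end in 'Z')
     * is returned unchanged.
     */
    public static String toSolrDate(String value) {
        if (value == null) {
            return null;
        }
        try {
            // Fails if the text carries a zone/offset or is not a date-time.
            LocalDateTime local = LocalDateTime.parse(
                value.trim(), DateTimeFormatter.ISO_LOCAL_DATE_TIME);
            // Interpret as UTC and render in the instant form ending in 'Z'.
            return DateTimeFormatter.ISO_INSTANT.format(
                local.toInstant(ZoneOffset.UTC));
        } catch (DateTimeParseException e) {
            return value;
        }
    }
}
```

Values that do not parse as a zone-less date-time pass through untouched,
so the helper could be applied to every metadata value without first
checking which fields are date-typed.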


Thanks,


Phil
