manifoldcf-user mailing list archives

From Phillip Rhodes <motley.crue....@gmail.com>
Subject Re: MCF not indexing documents due to mime-type
Date Sun, 24 Dec 2017 05:31:28 GMT
As far as I know, the wonkiness I'm seeing in the data actually
reflects an underlying problem with digital images.  Apparently some
or all of the date-typed fields mandated by EXIF and XMP don't require
time-zone information, so an image can legitimately have a date/time
field like "created date" with no time-zone info.  But since Solr
requires date-typed fields to be in UTC, if you want to store that
value in a date field, you have to impute the time zone (or a
reasonable approximation).

In my case, I doubt anybody is ever going to care to search images in
a way where a difference of a few hours is going to matter, so I think
I'm just going to force everything to a time value of midnight UTC on
the date in question.
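
A minimal sketch of that normalization, assuming plain java.time and a
hypothetical helper named toMidnightUtc (it is not part of MCF and is
not the transformer code itself). It accepts a value with or without
zone information and returns the same calendar date at midnight UTC,
in the form Solr expects:

```java
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.OffsetDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeParseException;

// Hypothetical helper class for illustration only; not an MCF API.
class DateMunger {

    // Keeps only the calendar date of the input and pins it to 00:00:00 UTC.
    static String toMidnightUtc(String raw) {
        LocalDate date;
        try {
            // Value already carries 'Z' or an offset, e.g. 2011-03-02T08:44:45Z.
            // The date is taken as written, without converting to UTC first.
            date = OffsetDateTime.parse(raw).toLocalDate();
        } catch (DateTimeParseException e) {
            // Value has no zone information, e.g. 2011-03-02T08:44:45.
            date = LocalDateTime.parse(raw).toLocalDate();
        }
        // Instant.toString() emits ISO-8601 with a trailing 'Z'.
        return date.atStartOfDay(ZoneOffset.UTC).toInstant().toString();
    }

    public static void main(String[] args) {
        // prints 2011-03-02T00:00:00Z
        System.out.println(toMidnightUtc("2011-03-02T08:44:45"));
    }
}
```

Anything that is not an ISO-8601 date-time still throws
DateTimeParseException here, so a real transformer would have to
decide whether to drop or pass through such values.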

Right now I'm exploring writing my own custom transformer to do the
data munging.   It might be overkill, but I wanted to do it just to
learn that side of MCF if nothing else.  So far the transformer I
threw together seems to be working.


Thanks,


Phil

This message optimized for indexing by NSA PRISM


On Fri, Dec 22, 2017 at 7:27 PM, Karl Wright <daddywri@gmail.com> wrote:
> Hi Phil,
>
> Are these fields extracted by Tika from your document?  Just curious,
> because if it's in MCF itself we could do something about it.
>
> Anyhow, what you want is the metadata adjuster:
>
> https://manifoldcf.apache.org/release/release-1.10/en_US/end-user-documentation.html#metadataadjuster
>
>
> Karl
>
>
> On Fri, Dec 22, 2017 at 1:47 AM, Phillip Rhodes <motley.crue.fan@gmail.com>
> wrote:
>>
>> On Thu, Dec 21, 2017 at 8:35 PM, Karl Wright <daddywri@gmail.com> wrote:
>> > Well, there are some differences; "Solr Cell" (as they used to call it)
>> > generates a couple of fields that the standard Tika extractor in MCF
>> > won't.
>> > But other than that it should work.
>>
>> By and large I don't think I care about those fields, so that part
>> shouldn't be an issue.
>>
>> > Note that you can still use the extracting update handler in the solr
>> > connector; since the input will always be text/plain Tika shouldn't do
>> > anything to the document on the Solr side.  If that doesn't happen to be
>> > true, you can use the standard Solr input handler,
>>
>> FWIW, it appears that even when using the Tika connector in MCF, what
>> gets sent to
>> Solr still triggers some Tika behavior if you have the "use extract
>> handler" option turned on.
>> When I did this I got all sorts of weird Tika parse exceptions and
>> what-not from Solr.
>>
>> Fortunately just sending everything to Solr using the standard handler
>> worked and I'm
>> at a point now where *almost* everything works.
>>
>> The one issue I'm still seeing is this:  when using the Tika
>> connector, it seems that some date oriented
>> fields are being generated with a value that does not have the
>> trailing 'Z' timezone flag.  This causes
>> a Solr error if the corresponding field is date typed, as Solr
>> requires dates to be in that UTC timezone.
>>
>> Ex:
>>
>> dcterms:created: 2011-03-02T08:44:45
>> found field: dcterms:modified: 2011-03-02T08:44:45
>> Last-Save-Date: 2011-03-02T08:44:45
>> meta:save-date: 2011-03-02T08:44:45
>>
>> Solr wants all of these to look like
>>
>>
>> 2011-03-02T08:44:45Z
>>
>>
>> Is there any way, using any built in MCF functionality, to forcibly
>> munge the field values to correct this?  If not, could I accomplish
>> that by writing a custom Transform connector?
>>
>>
>> Thanks,
>>
>>
>> Phil
>
>
