manifoldcf-dev mailing list archives

From "Karl Wright (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CONNECTORS-1440) "Created date field name" is not honored for pdf filesystem to ElasticSearch
Date Tue, 04 Jul 2017 00:34:02 GMT

    [ https://issues.apache.org/jira/browse/CONNECTORS-1440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16072992#comment-16072992 ]

Karl Wright commented on CONNECTORS-1440:
-----------------------------------------

The problem is that the file system connector doesn't set the standard creation date at all:

{code}
        RepositoryDocument data = new RepositoryDocument();
        data.setFileName(fileName);
        data.setMimeType(mimeType);
        data.setModifiedDate(modifiedDate);
        if (convertPath != null) {
          // WGET-compatible input; convert back to external URI
          data.addField("uri",uri);
        } else {
          data.addField("uri",file.toString());
        }
        // MHL for other metadata
        
        // Ingest the document.
{code}

As we've discussed before, the reason for this omission is that the Java standard IO code
doesn't support creation date.  Instead, the creation date is (apparently) coming from fields
extracted by Tika.  So you have the following choices:

(1) If you want the creation date from the PDF metadata, you will need to map the Tika-extracted
fields to the field names you want using the Metadata Adjuster transformer.
(2) If you want the file system creation date, and your file system supports it, we can consider
using java.nio, as described here: https://stackoverflow.com/questions/2723838/determine-file-creation-date-in-java
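
For reference, option (2) could look roughly like the sketch below. It only shows the java.nio
call that reads a creation timestamp; it is not the connector's code. The class name
CreationDateSketch and the helper readCreationDate are hypothetical, and how the resulting date
would be wired into the RepositoryDocument is left open. Note that when the file system does not
record a creation time, creationTime() returns an implementation-specific default (typically the
last-modified time or the epoch), so the value is only meaningful where the file system supports it.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.Date;

// Hypothetical illustration only; not part of the file system connector.
class CreationDateSketch {

  /**
   * Reads the creation time of a file through java.nio.
   * On file systems without creation-time support, the returned value is an
   * implementation-specific default (often the last-modified time or the epoch).
   */
  public static Date readCreationDate(Path path) throws IOException {
    BasicFileAttributes attrs = Files.readAttributes(path, BasicFileAttributes.class);
    return new Date(attrs.creationTime().toMillis());
  }

  public static void main(String[] args) throws IOException {
    // Use a temporary file so the sketch runs anywhere.
    Path tmp = Files.createTempFile("creation-date-sketch", ".txt");
    try {
      System.out.println("Creation date: " + readCreationDate(tmp));
    } finally {
      Files.deleteIfExists(tmp);
    }
  }
}
```
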

Please let me know what you want to do.


> "Created date field name" is not honored for pdf filesystem to ElasticSearch
> ----------------------------------------------------------------------------
>
>                 Key: CONNECTORS-1440
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1440
>             Project: ManifoldCF
>          Issue Type: Bug
>          Components: Elastic Search connector
>    Affects Versions: ManifoldCF 2.7.1
>         Environment: Ubuntu 16.10
> ElasticSearch 5.4.1
>            Reporter: Steph van Schalkwyk
>            Assignee: Karl Wright
>            Priority: Minor
>             Fix For: ManifoldCF 2.8
>
>
> The "Created date field name" attribute name is not honored for pdf crawls to ES.
> The ES field created is "created", not the name entered on the ES parameters page, in
> my case "createdOn". BTW, I have a mapping in the index for "createdOn".



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
