manifoldcf-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Karl Wright (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (CONNECTORS-1433) Add CLI options to pipeline modules, e.g. allow Tika to export TEXT, not BASE64
Date Thu, 22 Jun 2017 21:56:00 GMT

    [ https://issues.apache.org/jira/browse/CONNECTORS-1433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16059970#comment-16059970
] 

Karl Wright edited comment on CONNECTORS-1433 at 6/22/17 9:55 PM:
------------------------------------------------------------------

In this scenario below, what would the equivalent of the "my_data" field name be in the ES
connector?

PUT my_index/my_type/my_id?pipeline=attachment
{code}
{
  "*+my_data+*": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="
}
{code}

GET my_index/my_type/my_id

{code}
{
  "found": true,
  "_index": "my_index",
  "_type": "my_type",
  "_id": "my_id",
  "_version": 1,
  "_source": {
    "*+my_data+*": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0=",
    "attachment": {
      "content_type": "application/rtf",
      "language": "ro",
      "content": "Lorem ipsum dolor sit amet",
      "content_length": 28
    }
  }
}
{code}

When it PUTs the document, what is the field name? I have this from ElasticSearchIndex.java
(line 202):

{code}
          // Since ES 1.0
          pw.print(" \"_content\" : \"");
          Base64 base64 = new Base64();
          base64.encodeStream(inputStream, pw);
          pw.print("\"}");
{code}

so I assumed it was the _content field, but that doesn't work in the pipeline. 
I'll investigate further.


was (Author: svanschalkwyk):
In this scenario below, what would the equivalent of the "my_data" field name be in the ES
connector?

PUT my_index/my_type/my_id?pipeline=attachment
{code}
{
  "*+my_data+*": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="
}
GET my_index/my_type/my_id
{
  "found": true,
  "_index": "my_index",
  "_type": "my_type",
  "_id": "my_id",
  "_version": 1,
  "_source": {
    "*+my_data+*": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0=",
    "attachment": {
      "content_type": "application/rtf",
      "language": "ro",
      "content": "Lorem ipsum dolor sit amet",
      "content_length": 28
    }
  }
}
{code}

When it PUTs the document, what is the field name? I have this from ElasticSearchIndex.java
(line 202):

{code}
          // Since ES 1.0
          pw.print(" \"_content\" : \"");
          Base64 base64 = new Base64();
          base64.encodeStream(inputStream, pw);
          pw.print("\"}");
{code}

so I assumed it was the _content field, but that doesn't work in the pipeline. 
I'll investigate further.

> Add CLI options to pipeline modules, e.g. allow Tika to export TEXT, not BASE64
> -------------------------------------------------------------------------------
>
>                 Key: CONNECTORS-1433
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1433
>             Project: ManifoldCF
>          Issue Type: Wish
>          Components: Tika extractor
>            Reporter: Steph van Schalkwyk
>            Assignee: Karl Wright
>         Attachments: CONNECTORS-1433.patch, image.png, image.png
>
>
> Would love to have Tika spout TEXT, not BASE64.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Mime
View raw message