manifoldcf-user mailing list archives

From msaunier <msaun...@citya.com>
Subject RE: OCR Tika to read PDF, txt and doc docx
Date Fri, 05 Jan 2018 17:46:35 GMT
Hi,

 

I used the Tika extractor today and it works, but it doesn't extract the text content of the documents.

 

What is the name of the field in which Tika returns the content text?

 

        "stream_name":"201801010200100000005782L.pdf",
        "createdon":"Fri Dec 22 10:37:04 CET 2017",
        "id":"file://///srvics01/ways_holding/gestion_ged/gerance/3004/3004100812019699/201801010200100000005782L.pdf",
        "pdf_docinfo_created":"2017-12-22T09:37:03Z",
        "pdf_docinfo_producer":"Apache FOP Version 1.1",
        "xmp_creatortool":"Apache FOP Version 1.1",
        "access_permission_fill_in_form":"true",
        "meta_creation_date":"2017-12-22T09:37:03Z",
        "content_type":["application/pdf",
          "text/plain; charset=UTF-8"],
        "stream_size":143674,
        "dcterms_created":"2017-12-22T09:37:03Z",
        "access_permission_can_print":"true",
        "access_permission_modify_annotations":"true",
        "pdf_pdfversion":"1.4",
        "dc_format":"application/pdf; version=1.4",
        "x_parsed_by":["org.apache.tika.parser.DefaultParser",
          "org.apache.tika.parser.DefaultParser",
          "org.apache.tika.parser.txt.TXTParser"],
        "access_permission_extract_for_accessibility":"true",
        "producer":"Apache FOP Version 1.1",
        "lastmodified":"Fri Dec 22 10:37:04 CET 2017",
        "pdf_docinfo_creator_tool":"Apache FOP Version 1.1",
        "created":"Fri Dec 22 10:37:03 CET 2017",
        "resourcename":["201801010200100000005782L.pdf",
          "201801010200100000005782L.pdf"],
        "filelastmodified":"2017-12-22T09:37:04.070Z",
        "creation_date":"2017-12-22T09:37:03Z",
        "xmptpg_npages":"1",
        "access_permission_can_print_degraded":"true",
        "filecreatedon":"2017-12-22T09:37:04.000Z",
        "access_permission_can_modify":"true",
        "access_permission_extract_content":"true",
        "attributes":"32",
        "access_permission_assemble_document":"true",
        "sharename":"ways_holding",
        "pdf_encrypted":"false",
        "stream_content_type":"application/pdf",
        "stream_source_info":"201801010200100000005782L.pdf",
        "content_encoding":["UTF-8"],
        "_version_":1588768212845068289}]
  }}

 

 

Regards,

 



 

 

 

From: Karl Wright [mailto:daddywri@gmail.com]
Sent: Friday, January 5, 2018 18:40
To: user@manifoldcf.apache.org
Subject: Re: OCR Tika to read PDF, txt and doc docx

 

Hi,

 

It's pretty straightforward.  EITHER you configure your Solr output connection to use the
extracting update handler and Solr Cell (the default), so that Tika is used on the Solr side,
OR you configure it to use the standard update handler and insert the Tika Extractor as a document
transformer in your job's pipeline.
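
To make the difference between the two options concrete, here is a minimal sketch of what each one sends to Solr. It is only an illustration, not ManifoldCF's actual code: the core URL, the `content` field name, and the function names are assumptions, and `/update/extract` and `/update` are the paths where the extracting and standard handlers are conventionally registered in solrconfig.xml.

```python
import json
from urllib.parse import urlencode

# Hypothetical Solr core URL; adjust to your installation.
SOLR_BASE = "http://localhost:8983/solr/mycore"


def extract_handler_url(doc_id, base=SOLR_BASE):
    """Option 1 (Solr Cell): the raw binary file is posted to this URL
    and Tika runs inside Solr to extract text and metadata."""
    params = {"literal.id": doc_id, "commit": "true"}
    return f"{base}/update/extract?{urlencode(params)}"


def standard_handler_url(base=SOLR_BASE):
    """Option 2: the standard update handler receives documents whose
    text was already extracted upstream (e.g. by the Tika Extractor
    transformer in the job's pipeline)."""
    return f"{base}/update?commit=true"


def standard_handler_payload(doc_id, text, content_field="content"):
    """Build the JSON body for option 2. `content_field` is an assumed
    name; it must match a field defined in your Solr schema."""
    return json.dumps([{"id": doc_id, content_field: text}])


if __name__ == "__main__":
    print(extract_handler_url("doc1"))
    print(standard_handler_url())
    print(standard_handler_payload("doc1", "extracted text"))
```

In the first case Solr decides which field the extracted body lands in; in the second, the text arrives as an ordinary field, so the schema has to define a field to receive it.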

 

Karl

 

On Fri, Jan 5, 2018 at 12:19 PM, msaunier <msaunier@citya.com> wrote:

Sorry, that was a mistake. I need the text content of PDF, txt, doc and docx files to index in Solr.

 

Thanks for your help.

 

 

From: msaunier [mailto:msaunier@citya.com]
Sent: Friday, January 5, 2018 18:05
To: user@manifoldcf.apache.org
Subject: OCR Tika to read PDF, txt and doc docx

 

Hello,

 

How can I install and use OCR to extract the content_html from files with ManifoldCF?

I need the HTML content.

 

Thanks for your help, 

 

 

 

 

 

