lucene-solr-user mailing list archives

From "Allison, Timothy B." <>
Subject RE: Japanese character is garbled when using TikaEntityProcessor
Date Mon, 10 Apr 2017 18:10:35 GMT
Please open an issue on Tika's JIRA and share the triggering file if possible. If we can touch
the file, we may be able to recommend alternate ways to configure Tika's encoding detectors.
We just added configurability to the encoding detectors and that will be available with Tika
1.15. [1]

We use a fallback chain of detectors, tried in this order: html, universalchardet, icu4j.
Whichever one first returns a non-null answer, we go with that. This is perhaps not the best
option, but it's what we've been doing for a while. We are in the process of reassessing our
current methods [2], but that will take some time.
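
As a rough illustration of the "first non-null answer wins" behaviour described above, here is a minimal sketch in Python. It is not Tika's Java API: the detector functions, the `detect_charset` helper, and the `CHAIN` list are hypothetical names made up for this example, and the UTF-8 check is only a stand-in for a real statistical detector such as universalchardet or icu4j.

```python
import re
from typing import Callable, Optional, Sequence

# A detector takes raw bytes and returns a charset name, or None if it has no opinion.
Detector = Callable[[bytes], Optional[str]]

# Matches both <meta charset="..."> and <meta ... content="text/html; charset=...">.
META_CHARSET = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9_\-]+)', re.IGNORECASE
)


def html_meta_detector(data: bytes) -> Optional[str]:
    """Hypothetical detector: return the charset declared in an HTML <meta> tag."""
    match = META_CHARSET.search(data[:8192])
    return match.group(1).decode("ascii") if match else None


def strict_utf8_detector(data: bytes) -> Optional[str]:
    """Stand-in for a statistical detector: answer only if the bytes are valid UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return None
    return "utf-8"


def detect_charset(data: bytes, detectors: Sequence[Detector]) -> Optional[str]:
    """Try each detector in order and return the first non-None answer."""
    for detector in detectors:
        answer = detector(data)
        if answer is not None:
            return answer
    return None


CHAIN = [html_meta_detector, strict_utf8_detector]

declared_page = (
    '<html><head><meta charset="Shift_JIS"></head><body>日本語</body></html>'
).encode("shift_jis")
undeclared_page = "<html><body>日本語</body></html>".encode("shift_jis")

print(detect_charset(declared_page, CHAIN))
print(detect_charset(undeclared_page, CHAIN))
```

In this sketch the page that declares its charset is resolved by the first detector, while the undeclared Shift_JIS page falls through every detector. With a chain like this, an early detector that returns a wrong but non-null answer would also win, which is one reason the order and choice of detectors matter.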


-----Original Message-----
From: Noriyuki TAKEI [] 
Sent: Monday, April 10, 2017 1:46 PM
Subject: Japanese character is garbled when using TikaEntityProcessor


I use TikaEntityProcessor to extract the text content from binary or text files.

But when I try to extract Japanese characters from an HTML file whose character encoding is
SJIS, the content is garbled. In the case of UTF-8, it works well.
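
The symptom can be reproduced outside Solr with a few lines of Python. This is only a sketch of one plausible cause, namely that the SJIS bytes are being decoded with a different charset; the sample string and the `decode_as` helper are made up for illustration and say nothing about what TikaEntityProcessor does internally.

```python
def decode_as(data: bytes, charset: str) -> str:
    """Decode bytes with the given charset, replacing undecodable bytes with U+FFFD."""
    return data.decode(charset, errors="replace")


text = "日本語のテキスト"
sjis_bytes = text.encode("shift_jis")

# Decoding with the charset the file was written in round-trips cleanly.
correct = decode_as(sjis_bytes, "shift_jis")

# Decoding the same bytes as UTF-8 produces replacement characters instead.
garbled = decode_as(sjis_bytes, "utf-8")

print(correct)
print(garbled)
```

If the extracted content looks like the second line, the charset chosen during extraction is the first thing to check.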

The setting of Data Import Handler is as below.

--- from here ---
  <dataSource name="ds-db"
  <dataSource name="ds-file" type="BinFileDataSource"/>

    <entity name="messages"
            query="select id,title from messages">
      <field column="id" name="id"/>
      <field column="title" name="title"/>

      <entity name="contents"
              query="select id,path from contents where id=${}">

        <entity name="file" dataSource="ds-file"
                processor="TikaEntityProcessor" url="${contents.path}" format="text">
          <field column="text" name="content" />
--- to here ---

How do I solve this?
