lucene-dev mailing list archives

From "Koji Sekiguchi (JIRA)" <j...@apache.org>
Subject [jira] Commented: (SOLR-2346) Non UTF-8 Text files having other than english texts(Japanese/Hebrew) are no getting indexed correctly.
Date Wed, 09 Mar 2011 02:25:59 GMT

    [ https://issues.apache.org/jira/browse/SOLR-2346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13004339#comment-13004339 ]

Koji Sekiguchi commented on SOLR-2346:
--------------------------------------

I've faced the same problem. I'm trying to index a Shift_JIS encoded text file through the
following request:

http://localhost:8983/solr/update/extract?literal.id=docA0000001&stream.file=/foo/bar/sjis.txt&commit=true&stream.contentType=text%2Fplain%3B+charset%3DShift_JIS

However, Tika's AutoDetectParser does not honor the charset given to Solr (or Solr does not
pass the content type on to the Tika parser; I need to dig into this further).
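
For reference, the stream.contentType value in the request above is just the URL-encoded form of "text/plain; charset=Shift_JIS". A minimal sketch of producing it with the JDK's URLEncoder (the class and method names here are hypothetical, for illustration only):

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class ContentTypeParam {
    // Hypothetical helper: URL-encodes a content type for use as a query parameter value.
    static String encode(String contentType) {
        try {
            return URLEncoder.encode(contentType, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required to be supported by every JVM, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String base = "http://localhost:8983/solr/update/extract"
                + "?literal.id=docA0000001&stream.file=/foo/bar/sjis.txt&commit=true";
        // Append the encoded content type, as in the request shown above.
        System.out.println(base + "&stream.contentType="
                + encode("text/plain; charset=Shift_JIS"));
    }
}
```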

I looked into ExtractingDocumentLoader.java, and it seemed that I could select an appropriate
parser by using the stream.type parameter:

{code:title=ExtractingDocumentLoader.java}
public void load(SolrQueryRequest req, SolrQueryResponse rsp, ContentStream stream) throws IOException {
  errHeader = "ExtractingDocumentLoader: " + stream.getSourceInfo();
  Parser parser = null;
  String streamType = req.getParams().get(ExtractingParams.STREAM_TYPE, null);
  if (streamType != null) {
    //Cache?  Parsers are lightweight to construct and thread-safe, so I'm told
    MediaType mt = MediaType.parse(streamType.trim().toLowerCase());
    parser = config.getParser(mt);
  } else {
    parser = autoDetectParser;
  }
  :
}
{code}

The request was:

http://localhost:8983/solr/update/extract?literal.id=docA0000001&stream.file=/foo/bar/sjis.txt&commit=true&stream.contentType=text%2Fplain%3B+charset%3DShift_JIS&stream.type=text%2Fplain

With this, TXTParser was selected rather than AutoDetectParser, but the problem remained.

I then looked at the Tika Javadoc for TXTParser, which says: "The text encoding of the document
stream is automatically detected based on the byte patterns found at the beginning of the
stream. The input metadata key HttpHeaders.CONTENT_ENCODING is used as an encoding hint if
the automatic encoding detection fails.":

http://tika.apache.org/0.8/api/org/apache/tika/parser/txt/TXTParser.html
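
To illustrate why the detected (or hinted) encoding matters, here is a small JDK-only sketch, independent of Tika and Solr, that decodes the same Shift_JIS bytes with the right and the wrong charset. The class name is hypothetical, and it assumes the JVM ships the Shift_JIS charset, as standard JDK builds do:

```java
import java.nio.charset.Charset;

class CharsetMismatchDemo {
    // Decodes raw bytes using the named charset.
    static String decode(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // "\u30de\u30a4" is the katakana text "マイ" from the sample document.
        String original = "\u30de\u30a4";
        byte[] sjis = original.getBytes(Charset.forName("Shift_JIS"));

        // Decoding with the charset the bytes were written in restores the text.
        System.out.println(decode(sjis, "Shift_JIS").equals(original));

        // Decoding the same bytes as UTF-8 yields replacement characters instead,
        // and the original bytes cannot be recovered from that string afterwards.
        System.out.println(decode(sjis, "UTF-8").equals(original));
    }
}
```

This is also why re-encoding already garbled text after the fact, as attempted in the issue description, cannot repair it: the information is lost at decode time.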

So I tried inserting the following hard-coded fix:

{code:title=ExtractingDocumentLoader.java}
Metadata metadata = new Metadata();
metadata.add(ExtractingMetadataConstants.STREAM_NAME, stream.getName());
metadata.add(ExtractingMetadataConstants.STREAM_SOURCE_INFO, stream.getSourceInfo());
metadata.add(ExtractingMetadataConstants.STREAM_SIZE, String.valueOf(stream.getSize()));
metadata.add(ExtractingMetadataConstants.STREAM_CONTENT_TYPE, stream.getContentType());
metadata.add(HttpHeaders.CONTENT_ENCODING, "Shift_JIS");   // <= temporary fix
{code}

and the problem went away (no more garbled characters were indexed).
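
The hard-coded value is only a diagnostic, of course. One possible direction, sketched here as an assumption rather than a proposed patch, would be to take the charset from the stream's content type instead. The helper below is hypothetical (it is not part of Solr or Tika) and uses only the JDK:

```java
import java.util.Locale;

class ContentTypeCharset {
    // Hypothetical helper: returns the charset parameter of a content type
    // such as "text/plain; charset=Shift_JIS", or null if there is none.
    static String charsetOf(String contentType) {
        if (contentType == null) {
            return null;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String value = p.substring("charset=".length()).trim();
                // Strip optional surrounding double quotes.
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value.isEmpty() ? null : value;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/plain; charset=Shift_JIS"));
    }
}
```

Under that assumption, the temporary line above would pass charsetOf(stream.getContentType()) as the HttpHeaders.CONTENT_ENCODING value whenever it is non-null, rather than the literal "Shift_JIS".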

> Non UTF-8 Text files having other than english texts(Japanese/Hebrew) are no getting
indexed correctly.
> -------------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-2346
>                 URL: https://issues.apache.org/jira/browse/SOLR-2346
>             Project: Solr
>          Issue Type: Bug
>          Components: contrib - Solr Cell (Tika extraction)
>    Affects Versions: 1.4.1
>         Environment: Solr 1.4.1, Packaged Jetty as servlet container, Windows XP SP1,
Machine was booted in Japanese Locale.
>            Reporter: Prasad Deshpande
>            Priority: Critical
>         Attachments: NormalSave.msg, UnicodeSave.msg, sample_jap_UTF-8.txt, sample_jap_non_UTF-8.txt
>
>
> I am able to successfully index and search non-English files (such as Hebrew or Japanese)
> that are encoded in UTF-8. However, when I tried to index data encoded in a local encoding
> such as Big5 for Japanese, I did not get the desired results. When I searched for all indexed
> documents, the content of the Big5-encoded document looked garbled. When I index the
> attached non-UTF-8 file, it is indexed as follows:
> - <result name="response" numFound="1" start="0">
> - <doc>
> - <arr name="attr_content">
>   <str>�� ������</str>
>   </arr>
> - <arr name="attr_content_encoding">
>   <str>Big5</str>
>   </arr>
> - <arr name="attr_content_language">
>   <str>zh</str>
>   </arr>
> - <arr name="attr_language">
>   <str>zh</str>
>   </arr>
> - <arr name="attr_stream_size">
>   <str>17</str>
>   </arr>
> - <arr name="content_type">
>   <str>text/plain</str>
>   </arr>
>   <str name="id">doc2</str>
>   </doc>
>   </result>
>   </response>
> You said that the file is indexed as UTF-8; however, it seems that the non-UTF-8 file gets
> indexed in Big5 encoding.
> I then tried fetching the indexed data as Big5 and converting it to UTF-8:
> String id = (String) resulDocument.getFirstValue("attr_content");
> byte[] bytearray = id.getBytes("Big5");
> String utf8String = new String(bytearray, "UTF-8");
> It does not give the expected results.
> When I index the UTF-8 file, it is indexed as follows:
> - <doc>
> - <arr name="attr_content">
>   <str>マイ ネットワーク</str>
>   </arr>
> - <arr name="attr_content_encoding">
>   <str>UTF-8</str>
>   </arr>
> - <arr name="attr_stream_content_type">
>   <str>text/plain</str>
>   </arr>
> - <arr name="attr_stream_name">
>   <str>sample_jap_unicode.txt</str>
>   </arr>
> - <arr name="attr_stream_size">
>   <str>28</str>
>   </arr>
> - <arr name="attr_stream_source_info">
>   <str>myfile</str>
>   </arr>
> - <arr name="content_type">
>   <str>text/plain</str>
>   </arr>
>   <str name="id">doc2</str>
>   </doc>
> So, I can index and search UTF-8 data.
> For reference, below is the discussion with Yonik.
>     Please find attached TXT file which I was using to index and search.
>     curl "http://localhost:8983/solr/update/extract?literal.id=doc1&uprefix=attr_&fmap.content=attr_content&fmap.div=foo_t&boost.foo_t=3&commit=true&charset=utf-8" -F "myfile=@sample_jap_non_UTF-8"
> One problem is that you are giving big5 encoded text to Solr and saying that it's UTF8.
> Here's one way to actually tell solr what the encoding of the text you are sending is:
> curl "http://localhost:8983/solr/update/extract?literal.id=doc1&uprefix=attr_&fmap.content=attr_content&fmap.div=foo_t&boost.foo_t=3&commit=true" --data-binary @sample_jap_non_UTF-8.txt -H 'Content-type:text/plain; charset=big5'
> Now the problem appears to be that, for some reason, this doesn't work...
> Could you open a JIRA issue and attach your two test files?
> -Yonik
> http://lucidimagination.com

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

