lucene-solr-user mailing list archives

From "Allison, Timothy B." <talli...@mitre.org>
Subject TIKA-2440 Remove Furigana/phonetic as default for xlsx?
Date Wed, 09 Aug 2017 12:43:39 GMT
Solrians,
  We have a request to drop phonetic strings from xlsx as the default in Tika.  I'm not familiar
enough with Japanese to know if users would generally expect to be able to search on these
as well as the original.  The current practice is to include them.
  Any recommendations?  Thank you!

           Best,

                     Tim

-----Original Message-----
From: Takahiro Ochi (JIRA) [mailto:jira@apache.org] 
Sent: Tuesday, August 8, 2017 2:28 AM
To: dev@tika.apache.org
Subject: [jira] [Created] (TIKA-2440) Phonetic strings handling for multilingual environments.

Takahiro Ochi created TIKA-2440:
-----------------------------------

             Summary: Phonetic strings handling for multilingual environments.
                 Key: TIKA-2440
                 URL: https://issues.apache.org/jira/browse/TIKA-2440
             Project: Tika
          Issue Type: Improvement
          Components: parser
            Reporter: Takahiro Ochi
            Priority: Minor


Hi there,

I would like to propose an idea to improve phonetic string handling for multilingual environments.
I believe Tika should not concatenate phonetic strings, because text with phonetic strings
appended is treated as noisy text in most natural language processing situations.

Excel files include phonetic strings in some languages such as Japanese, Chinese and so on.
Apache POI concatenates phonetic strings onto the shared strings when Tika extracts text from
Excel files.
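
As a rough illustration of what this concatenation means, the sketch below parses a simplified shared-string entry in which the base text 東京 is followed by a phonetic run (`rPh`) holding its reading トウキョウ. It uses only the JDK's XML parser and is not POI's actual code. The class name, the `extract` method and the sample XML are hypothetical, made up for this example; the strings are written as Unicode escapes to keep the source ASCII-only.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class; not part of Tika or POI.
public class PhoneticRunsDemo {

    // Simplified <si> entry: a base text of two kanji, then a phonetic run
    // (<rPh>) carrying the five-character katakana reading of that text.
    static final String SAMPLE =
        "<si>"
      + "<t>\u6771\u4EAC</t>"
      + "<rPh sb=\"0\" eb=\"2\"><t>\u30C8\u30A6\u30AD\u30E7\u30A6</t></rPh>"
      + "<phoneticPr fontId=\"1\"/>"
      + "</si>";

    // Collects the text of every <t> element in document order. Text that sits
    // inside an <rPh> element is kept only when includePhoneticRuns is true,
    // in which case it is separated from the preceding text by one space.
    static String extract(String siXml, boolean includePhoneticRuns) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(siXml)));
        StringBuilder out = new StringBuilder();
        NodeList tNodes = doc.getElementsByTagName("t");
        for (int i = 0; i < tNodes.getLength(); i++) {
            Node t = tNodes.item(i);
            boolean phonetic = "rPh".equals(t.getParentNode().getNodeName());
            if (phonetic && !includePhoneticRuns) {
                continue; // drop the reading, keep only the original text
            }
            if (phonetic && out.length() > 0) {
                out.append(' ');
            }
            out.append(t.getTextContent());
        }
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(extract(SAMPLE, true));  // base text plus reading
        System.out.println(extract(SAMPLE, false)); // base text only
    }
}
```

With the flag on, the extracted string carries both the original text and its reading, so a tokenizer or language model downstream sees the same word twice in two scripts. With the flag off, only the text the user typed into the cell remains.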

Recent versions of Apache POI have a switch flag for phonetic string concatenation, as follows:
https://poi.apache.org/apidocs/org/apache/poi/xssf/eventusermodel/ReadOnlySharedStringsTable.html#ReadOnlySharedStringsTable(org.apache.poi.openxml4j.opc.OPCPackage,%20boolean)

Tika should set the second argument, "includePhoneticRuns", to false. Here is a simple patch
for my idea.


{code:java}
$ diff -ru XSSFExcelExtractorDecorator.java ./tika/tika-1.15/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/XSSFExcelExtractorDecorator.java
--- XSSFExcelExtractorDecorator.java    2017-06-10 19:13:33.355412625 +0900
+++ ./tika/tika-1.15/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/XSSFExcelExtractorDecorator.java 2017-06-10 19:14:30.452411830 +0900
@@ -130,7 +130,7 @@
             styles = xssfReader.getStylesTable();

             iter = (XSSFReader.SheetIterator) xssfReader.getSheetsData();
-            strings = new ReadOnlySharedStringsTable(container);
+            strings = new ReadOnlySharedStringsTable(container,false);
         } catch (InvalidFormatException e) {
             throw new XmlException(e);
         } catch (OpenXML4JException oe) {

{code}



