lucene-dev mailing list archives

From "Jayendra Patil (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SOLR-2416) Solr Cell fails to index Zip file contents
Date Wed, 18 Jan 2012 18:54:39 GMT

    [ https://issues.apache.org/jira/browse/SOLR-2416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13188616#comment-13188616 ]

Jayendra Patil commented on SOLR-2416:
--------------------------------------

Tika parses the zip file and extracts the complete content of the files inside it.
It parses all the files in the zip, including any zip nested within the zip.
The metadata is that of the zip file rather than of the individual files.
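
To illustrate the behavior described above, here is a minimal sketch of recursive zip extraction. It is not Tika's API or implementation: it uses only java.util.zip, treats every non-zip entry as UTF-8 text, and detects nested zips by file extension, all of which are simplifying assumptions. The class and method names (ZipTextSketch, extractText, zip) are hypothetical. Like the behavior described, it produces one combined text body for the outer zip rather than one result per contained file.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Illustration only: plain java.util.zip, not Tika.
public class ZipTextSketch {

    // Returns the concatenated text of every file in the zip,
    // descending into zips nested inside it.
    static String extractText(byte[] zipBytes) throws IOException {
        StringBuilder out = new StringBuilder();
        extract(new ByteArrayInputStream(zipBytes), out);
        return out.toString();
    }

    private static void extract(InputStream in, StringBuilder out) throws IOException {
        ZipInputStream zin = new ZipInputStream(in);
        ZipEntry entry;
        while ((entry = zin.getNextEntry()) != null) {
            if (entry.isDirectory()) {
                continue;
            }
            // read() returns -1 at the end of the current entry, not the archive.
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            while ((n = zin.read(chunk)) != -1) {
                buf.write(chunk, 0, n);
            }
            byte[] data = buf.toByteArray();
            if (entry.getName().toLowerCase().endsWith(".zip")) {
                // Zip in zip: recurse (assumption: detected by extension).
                extract(new ByteArrayInputStream(data), out);
            } else {
                // Assumption: every other entry is UTF-8 text.
                out.append(new String(data, StandardCharsets.UTF_8)).append('\n');
            }
        }
    }

    // Helper that builds an in-memory zip from name -> content pairs.
    static byte[] zip(Map<String, byte[]> files) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(bos)) {
            for (Map.Entry<String, byte[]> f : files.entrySet()) {
                zos.putNextEntry(new ZipEntry(f.getKey()));
                zos.write(f.getValue());
                zos.closeEntry();
            }
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        Map<String, byte[]> inner = new LinkedHashMap<>();
        inner.put("inner.txt", "inner text".getBytes(StandardCharsets.UTF_8));

        Map<String, byte[]> outer = new LinkedHashMap<>();
        outer.put("outer.txt", "outer text".getBytes(StandardCharsets.UTF_8));
        outer.put("nested.zip", zip(inner));

        // Prints the text of outer.txt followed by the text of inner.txt.
        System.out.print(extractText(zip(outer)));
    }
}
```

A real extractor would detect each entry's type from its content and dispatch to a matching parser instead of assuming text, but the recursion into nested archives has the same shape.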

There would be no special handling required from the Solr side.
The metadata for the Zip and its contents would be indexed as well.

Also, Solr doesn't allow attaching multiple files to a single document.
Zip is a nice way of associating a document with multiple files.

And the current behavior of indexing a zip with just its file names doesn't have much value.
                
> Solr Cell fails to index Zip file contents
> ------------------------------------------
>
>                 Key: SOLR-2416
>                 URL: https://issues.apache.org/jira/browse/SOLR-2416
>             Project: Solr
>          Issue Type: Bug
>          Components: contrib - DataImportHandler, contrib - Solr Cell (Tika extraction)
>    Affects Versions: 1.4.1
>            Reporter: Jayendra Patil
>             Fix For: 3.6, 4.0
>
>         Attachments: SOLR-2416_ExtractingDocumentLoader.patch
>
>
> Working with the latest Solr trunk code, it seems the Tika handlers for Solr Cell (ExtractingDocumentLoader.java) and the Data Import Handler (TikaEntityProcessor.java) fail to index the zip file contents again.
> They just index the file names again.
> This issue was addressed some time back, late last year, but seems to have reappeared with the latest code.
> Jira for the Data Import handler part with the patch and the testcase - https://issues.apache.org/jira/browse/SOLR-2332.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

