lucene-solr-dev mailing list archives

From "Yonik Seeley (JIRA)" <j...@apache.org>
Subject [jira] Commented: (SOLR-284) Parsing Rich Document Types
Date Sat, 27 Jun 2009 14:10:47 GMT

    [ https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12724855#action_12724855 ]

Yonik Seeley commented on SOLR-284:
-----------------------------------

Not sure if I should open a new issue or keep improvements here.
I think we need to improve the OOTB experience with this...
http://search.lucidimagination.com/search/document/302440b8a2451908/solr_cell

Ideas for improvement:
- auto-mapping names of the form Last-Modified to a more solrish field name like last_modified
- drop "ext." from parameter names, and revisit naming to try and unify with other update handlers like CSV
  note: in the future, one could see generic functionality like boosting fields, setting field value defaults, etc., being handled by a generic component or update processor... all the better reason to drop the ext prefix.
- I imagine that metadata is normally useful, so we should
  1. predefine commonly used metadata fields in the example schema... there's really no cost to this
  2. use mappings to normalize any metadata names (if such normalization isn't already done in Tika)
  3. ignore or drop fields that have little use
  4. provide a way to handle new attributes w/o dropping them or throwing an error
- enable the handler by default, but load it lazily to avoid a dependency on having all the Tika libs available
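
To make the name-mapping ideas above concrete, here is a minimal sketch in Python (not Solr code, and not how the handler is implemented). The function names, the known_fields and ignored sets, and the "attr_" fallback prefix are all hypothetical, chosen only for illustration: names are normalized mechanically, low-value fields are dropped, and unknown attributes are kept under a prefix instead of causing an error.

```python
import re


def solr_field_name(metadata_name: str) -> str:
    """Turn a name like 'Last-Modified' into 'last_modified'."""
    # Replace every run of non-alphanumeric characters with one underscore.
    name = re.sub(r"[^0-9A-Za-z]+", "_", metadata_name.strip())
    return name.strip("_").lower()


def map_metadata(metadata, known_fields, ignored=(), fallback_prefix="attr_"):
    """Map raw metadata names to field names (all parameters are hypothetical).

    known_fields:    names assumed to be predefined in the schema
    ignored:         normalized names to drop because they have little use
    fallback_prefix: prefix for attributes the schema does not know about
    """
    fields = {}
    for raw_name, value in metadata.items():
        name = solr_field_name(raw_name)
        if name in ignored:
            continue  # drop fields that have little use
        if name not in known_fields:
            name = fallback_prefix + name  # keep new attributes, no error
        fields[name] = value
    return fields


if __name__ == "__main__":
    raw = {"Last-Modified": "2009-06-27", "X-Foo": "bar", "Junk": "zzz"}
    print(map_metadata(raw, known_fields={"last_modified"}, ignored={"junk"}))
```

With a catch-all prefix like this, a schema would only need one wildcard-style field definition to accept attributes nobody anticipated; whether that is the right default is exactly the kind of thing to settle in this issue.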


> Parsing Rich Document Types
> ---------------------------
>
>                 Key: SOLR-284
>                 URL: https://issues.apache.org/jira/browse/SOLR-284
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Eric Pugh
>            Assignee: Grant Ingersoll
>             Fix For: 1.4
>
>         Attachments: libs.zip, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, rich.patch, SOLR-284-no-key-gen.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, solr-word.pdf, source.zip, test-files.zip, test-files.zip, test.zip, un-hardcode-id.diff
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler that supports streaming a PDF, Word, PowerPoint, or Excel document into Solr.
> There is a wiki page with information here: http://wiki.apache.org/solr/UpdateRichDocuments
>  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

