lucene-dev mailing list archives

From "Hoss Man (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SOLR-3808) Extraction contrib to utilize Boilerpipe
Date Thu, 13 Sep 2012 23:02:08 GMT

     [ https://issues.apache.org/jira/browse/SOLR-3808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hoss Man updated SOLR-3808:
---------------------------

    Fix Version/s:     (was: 4.0)

Markus: thanks for the patch and test!

This looks cool, but i'm not overly familiar with ExtractingRequestHandler, so i'm not really
comfortable committing just yet because it's not clear to me if this kind of explicit registering
of extractors based on special params is really the direction we should go -- it seems like
a slippery slope in deciding what should be included and what shouldn't (and from what i understand
i *believe* there are other ways to use tika configuration files to control this sort of thing,
aren't there?)

For now i'm going to remove the fixVersion=4.0 since this is a new feature and probably shouldn't
impede momentum towards the (hopefully) rapidly approaching release.

(of course: if someone with more expertise than me wants to jump on it and commit it they
totally should)
                
> Extraction contrib to utilize Boilerpipe
> ----------------------------------------
>
>                 Key: SOLR-3808
>                 URL: https://issues.apache.org/jira/browse/SOLR-3808
>             Project: Solr
>          Issue Type: Improvement
>          Components: contrib - Solr Cell (Tika extraction)
>            Reporter: Markus Jelsma
>            Priority: Minor
>         Attachments: SOLR-3808-trunk-1.patch
>
>
> Solr's extraction contrib uses Tika for document parsing and should be able to use Boilerpipe.
> Tika comes with Boilerpipe, a library capable of removing boilerplate text from HTML pages.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

