lucene-solr-user mailing list archives

From Lance Norskog <goks...@gmail.com>
Subject Re: Extract footer/header text out of Word docs
Date Sat, 01 Sep 2012 01:17:58 GMT
Tika generates a block-structured stream of events for the document.
It would be cool to have an alternate Tika processor in the DIH that
generates this stream as XML. You could then use the XPath tools to
grab whatever you want.
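
As a rough sketch of the XPath idea, assuming the Tika event stream has
already been serialized to XHTML (for example on the client side): the
sample markup below is hand-written, and the "header"/"footer" class names
on the div elements, as well as the header_text/body_text/footer_text field
names, are assumptions for illustration, not confirmed Tika or Solr output.

```python
import xml.etree.ElementTree as ET

# Namespace used by XHTML documents.
NS = {"x": "http://www.w3.org/1999/xhtml"}

# Hand-written stand-in for serialized parser output (hypothetical markup).
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Report</title></head>
  <body>
    <div class="header"><p>Quarterly Report</p></div>
    <p>First body paragraph.</p>
    <p>Second body paragraph.</p>
    <div class="footer"><p>Page 1 of 3</p></div>
  </body>
</html>"""


def text_of(elements):
    """Join the text of each element, collapsing runs of whitespace."""
    parts = (" ".join("".join(e.itertext()).split()) for e in elements)
    return " ".join(p for p in parts if p)


def split_sections(xhtml):
    """Return header, body and footer text as separate (hypothetical) fields."""
    body = ET.fromstring(xhtml).find("x:body", NS)
    headers = body.findall(".//x:div[@class='header']", NS)
    footers = body.findall(".//x:div[@class='footer']", NS)
    # Everything directly under <body> that is not a header/footer div.
    main = [c for c in body if c.get("class") not in ("header", "footer")]
    return {
        "header_text": text_of(headers),
        "body_text": text_of(main),
        "footer_text": text_of(footers),
    }


if __name__ == "__main__":
    for field, value in split_sections(SAMPLE).items():
        print(field, "=", value)
```

Each value could then be sent to its own Solr field. Whether the real
output marks headers and footers this way depends on the parser and the
document format, so check the actual XHTML before relying on these paths.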

On Fri, Aug 31, 2012 at 4:25 AM, Erick Erickson <erickerickson@gmail.com> wrote:
> You can also move the Tika processing off Solr to the client and perhaps have
> more control there. I haven't tried this particular thing, so....
>
> see: http://searchhub.org/dev/2012/02/14/indexing-with-solrj/
>
> Best
> Erick
>
> On Thu, Aug 30, 2012 at 9:35 AM, Markus Jelsma
> <markus.jelsma@openindex.io> wrote:
>> Tika can do this but Solr doesn't use the BoilerpipeContentHandler. Perhaps it should
>> be made configurable which content handler Solr uses and, in the case of the
>> BoilerpipeContentHandler, which extractor implementation to use.
>>
>> -----Original message-----
>>> From:Otis Gospodnetic <otis_gospodnetic@yahoo.com>
>>> Sent: Thu 30-Aug-2012 15:30
>>> To: solr-user@lucene.apache.org
>>> Subject: Re: Extract footer/header text out of Word docs
>>>
>>> Hi Alex,
>>>
>>> I think you may get better help on the Tika mailing list - Solr uses Tika to
>>> parse rich text docs and extract text from them.  I don't know if Tika can figure
>>> out what's from a header and a footer...
>>>
>>> Otis
>>> ----
>>> Performance Monitoring for Solr / ElasticSearch / HBase - http://sematext.com/spm
>>>
>>>
>>>
>>> ----- Original Message -----
>>> > From: Alex Cougarman <acougarm@bwc.org>
>>> > To: "solr-user@lucene.apache.org" <solr-user@lucene.apache.org>
>>> > Cc:
>>> > Sent: Thursday, August 30, 2012 9:25 AM
>>> > Subject: Extract footer/header text out of Word docs
>>> >
>>> > Hi. Is it possible to specifically extract footer/header and body text out of a
>>> > Word document using Solr? In other words, we'd like to index/store those
>>> > items in different Solr fields.
>>> >
>>> > Also, is it possible to search on specific styles within a Word document? Can
>>> > these attributes be indexed? Thanks.
>>> >
>>> > Sincerely,
>>> > Alex
>>> >
>>>



-- 
Lance Norskog
goksron@gmail.com