lucene-solr-dev mailing list archives

From Zacarias <zacar...@linebee.com>
Subject Re: Solr Cell revamped as an UpdateProcessor?
Date Tue, 05 Jan 2010 18:53:24 GMT
I attached a file to the previous mail. Is there a filter for PDF files, or
is there some other reason it was not delivered?

On Tue, Jan 5, 2010 at 12:49 PM, Zacarias <zacarias@linebee.com> wrote:

> Here is my proposal
>
> Regards
>
>
>
>
> On Tue, Jan 5, 2010 at 12:48 PM, Zacarias <zacarias@linebee.com> wrote:
>
>> Hi, I'm developing a directory monitor to add to a Solr implementation.
>> Tell me if it would be of interest to you; we would be glad to share it
>> with the community. I would also like your opinion on the proposal: whether
>> it looks OK to you, and any changes or questions you have are very
>> welcome.
>>
>> Regards
>> Zacarias
>> www.linebee.com
>>
>>
>> 2009/12/8 Noble Paul നോബിള്‍ नोब्ळ् <noble.paul@corp.aol.com>
>>
>>> I was referring to SOLR-1358. Anyway, SolrCell as an UpdateProcessor
>>> is a good idea.
>>>
>>> On Tue, Dec 8, 2009 at 4:47 PM, Grant Ingersoll <gsingers@apache.org>
>>> wrote:
>>> >
>>> > On Dec 8, 2009, at 12:22 AM, Noble Paul നോബിള്‍ नोब्ळ् wrote:
>>> >
>>> >> Integrating Extraction w/ DIH is a better option. DIH makes it easier
>>> >> to do the mapping of fields etc.
>>> >
>>> > Which comment is this directed at?  I'm lacking context here.
>>> >
>>> >>
>>> >>
>>> >> On Tue, Dec 8, 2009 at 4:59 AM, Grant Ingersoll <gsingers@apache.org> wrote:
>>> >>>
>>> >>> On Dec 7, 2009, at 3:51 PM, Chris Hostetter wrote:
>>> >>>
>>> >>>>
>>> >>>> As someone with very little knowledge of Solr Cell and/or Tika, I
>>> >>>> find myself wondering if ExtractingRequestHandler would make more
>>> >>>> sense as an extractingUpdateProcessor -- where it could be configured
>>> >>>> to take either binary fields (or string fields containing URLs) out of
>>> >>>> the Documents, parse them with Tika, and add the various XPath-matching
>>> >>>> hunks of text back into the document as new fields.
>>> >>>>
>>> >>>> Then ExtractingRequestHandler just becomes a handler that slurps up
>>> >>>> its ContentStreams and adds them as binary data fields, and adds the
>>> >>>> other literal params as fields.
>>> >>>>
>>> >>>> Wouldn't that make things like SOLR-1358, and using Tika with
>>> >>>> URLs/filepaths in XML- and CSV-based updates, fairly trivial?
>>> >>>
>>> >>> It probably could, but I am not sure how it would work in a processor
>>> >>> chain. Then again, I'm not sure I understand processor chains all that
>>> >>> well either. I also plan on adding, BTW, a SolrJ client for Tika that
>>> >>> does the extraction on the client. The ExtrReqHandler is really only
>>> >>> designed for lighter-weight extraction cases; in many cases one would
>>> >>> simply not want to send that much rich content over the wire.
>>> >>
>>> >>
>>> >>
>>> >> --
>>> >> -----------------------------------------------------
>>> >> Noble Paul | Systems Architect| AOL | http://aol.com
>>> >
>>> > --------------------------
>>> > Grant Ingersoll
>>> > http://www.lucidimagination.com/
>>> >
>>> > Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
>>> using Solr/Lucene:
>>> > http://www.lucidimagination.com/search
>>> >
>>> >
>>>
>>>
>>>
>>> --
>>> -----------------------------------------------------
>>> Noble Paul | Systems Architect| AOL | http://aol.com
>>>
>>
>>
>
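
The update-processor idea Chris describes in the quoted thread can be sketched in a few lines. This is only a minimal, language-neutral illustration in Python, not Solr's or Tika's actual API: `fake_extract`, `extracting_processor`, `run_chain`, `extracting_handler`, and the field names `content_raw`, `title_t`, and `text_t` are all hypothetical, and the "extraction" is a stand-in for a real Tika parse.

```python
from typing import Callable, Dict, List

Document = Dict[str, object]
Processor = Callable[[Document], Document]


def fake_extract(raw: bytes) -> Dict[str, str]:
    """Hypothetical stand-in for a Tika parse.

    Treats the first line as a title and the remainder as the body.
    """
    text = raw.decode("utf-8", errors="replace")
    first_line, _, rest = text.partition("\n")
    return {"title": first_line.strip(), "body": rest.strip()}


def extracting_processor(source_field: str, field_map: Dict[str, str]) -> Processor:
    """Build a processor that swaps a binary field for mapped text fields."""

    def process(doc: Document) -> Document:
        raw = doc.get(source_field)
        if not isinstance(raw, (bytes, bytearray)):
            return doc  # no binary content: pass the document through unchanged
        # Copy every field except the binary one, then add the mapped extracts.
        out = {name: value for name, value in doc.items() if name != source_field}
        for extracted_name, value in fake_extract(bytes(raw)).items():
            target = field_map.get(extracted_name)
            if target is not None:
                out[target] = value
        return out

    return process


def run_chain(doc: Document, chain: List[Processor]) -> Document:
    """Apply each processor in order, feeding each result to the next."""
    for processor in chain:
        doc = processor(doc)
    return doc


def extracting_handler(stream: bytes, literals: Dict[str, str],
                       chain: List[Processor]) -> Document:
    """The handler only packs the stream and literal params into a document."""
    doc: Document = dict(literals)
    doc["content_raw"] = stream
    return run_chain(doc, chain)


if __name__ == "__main__":
    chain = [extracting_processor("content_raw", {"title": "title_t", "body": "text_t"})]
    print(extracting_handler(b"Annual report\nRevenue grew.", {"id": "doc-1"}, chain))
```

Under these assumptions the handler does nothing beyond wrapping its content stream as a binary field. Because the extraction lives in the chain, a document arriving through some other update path with the same binary field would get the same treatment, which is the point Chris raises about XML- and CSV-based updates.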
