manifoldcf-dev mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Solr Extracting request handler
Date Tue, 17 Jun 2014 23:02:37 GMT
Hi Abe-san,

So just to be sure -- you believe that no changes at all are required to
the Solr Connector as it stands now, other than to use the update handler
rather than the /update/extract handler?

Karl





On Tue, Jun 17, 2014 at 5:14 PM, Shinichiro Abe <shinichiro.abe.1@gmail.com>
wrote:

> > As for changing the Solr connector so that it doesn't go to the extracting
> > update handler
>
> I don't think the Solr connector needs to change with a new checkbox, because
> we can already change "/update/extract" to "/update" under 'Update Handler'
> on the Paths tab of the Solr connector UI. I confirmed I could post CSV,
> JSON and XML files to Solr by changing that and using the File connector. So I
> would like the Tika extractor transformation connector to be able to create
> the XML files that Solr expects to see.
>
> Regards,
> Shinichiro Abe
>
>
> 2014-06-18 2:55 GMT+09:00 Karl Wright <daddywri@gmail.com>:
>
> > The pipeline code itself is now "complete" in trunk.  Zaizi said they'd
> > contribute a Tika extractor transformation connector - and if they don't
> > get around to that in a month or so, I may take a crack at it myself.
> >
> > As for changing the Solr connector so that it doesn't go to the extracting
> > update handler, it would be great if:
> > (1) Someone created a ticket for this, and
> > (2) A patch was provided that maintains backwards compatibility with
> > previous versions of the connector (so a checkbox would probably need to go
> > into the UI somewhere).  Do either of you want to start this process?
> >
> > Thanks!
> > Karl
> >
> >
> >
> > On Mon, Jun 16, 2014 at 12:37 PM, Karl Wright <daddywri@gmail.com>
> > wrote:
> >
> > > Hi guys,
> > >
> > > You folks may not have looked at 1.7 yet, but it has a full pipeline, and
> > > is expected to have a Tika extractor as a transformation connector.
> > >
> > > Karl
> > >
> > >
> > >
> > > On Mon, Jun 16, 2014 at 11:14 AM, Matteo Grolla <m.grolla@sourcesense.com>
> > > wrote:
> > >
> > >> Thanks Alessandro,
> > >>         that explains the situation clearly.
> > >> And I agree that sending all the metadata as GET parameters can be
> > >> problematic.
> > >>
> > >> Cheers
> > >>
> > >> --
> > >> Matteo Grolla
> > >> Sourcesense - making sense of Open Source
> > >> http://www.sourcesense.com
> > >>
> > >> On 16 Jun 2014, at 17:09, Alessandro Benedetti wrote:
> > >>
> > >> > Mmm, the point is that right now ManifoldCF has no extractors.
> > >> > The repository connectors extract the binary directly and there is no
> > >> > "Extractor Processor" yet.
> > >> > But recently a pipeline processor architecture has been devised
> > >> > (https://issues.apache.org/jira/browse/CONNECTORS-959),
> > >> > so an extractor can fit there.
> > >> >
> > >> > Cheers
> > >> >
> > >> >
> > >> > 2014-06-16 15:59 GMT+01:00 Matteo Grolla <m.grolla@sourcesense.com>:
> > >> >
> > >> >> Since the Solr extracting request handler takes the binary and extracts
> > >> >> text, what is the point of not using the Manifold extractor and sending
> > >> >> text and binaries to Solr?
> > >> >> I mean the end result is the same: Solr indexes text and stores text.
> > >> >> So if Manifold supports text extraction, it seems to me this is the
> > >> >> place where it should be done.
> > >> >>
> > >> >> --
> > >> >> Matteo Grolla
> > >> >> Sourcesense - making sense of Open Source
> > >> >> http://www.sourcesense.com
> > >> >>
> > >> >> On 16 Jun 2014, at 16:51, Antonio David Perez Morales wrote:
> > >> >>
> > >> >>> Hi Matteo
> > >> >>>
> > >> >>> Manifold already handles the extraction, but the only way to send
> > >> >>> binary content and document metadata to Solr is using the
> > >> >>> update/extract handler, where the metadata is sent as query parameters
> > >> >>> and the binary content is sent in the body of the requests, allowing
> > >> >>> Solr to use Tika to obtain the raw content to be stored in Solr.
> > >> >>>
> > >> >>> Regards
> > >> >>>
> > >> >>>
> > >> >>> On Mon, Jun 16, 2014 at 4:35 PM, Matteo Grolla <m.grolla@sourcesense.com>
> > >> >>> wrote:
> > >> >>>
> > >> >>>> Hi, during my first indexing run I noticed that Manifold uses the Solr
> > >> >>>> extracting request handler to extract the content of an XML file.
> > >> >>>> For performance reasons it would be better if Manifold handled the
> > >> >>>> extraction, letting Solr act as the search engine.
> > >> >>>> Is this because of the connector design, the framework design, or is
> > >> >>>> it just still to be done?
> > >> >>>>
> > >> >>>> --
> > >> >>>> Matteo Grolla
> > >> >>>> Sourcesense - making sense of Open Source
> > >> >>>> http://www.sourcesense.com
> > >> >>>>
> > >> >>>>
> > >> >>>
> > >> >>> --
> > >> >>>
> > >> >>> ------------------------------
> > >> >>> This message should be regarded as confidential. If you have received
> > >> >>> this email in error please notify the sender and destroy it
> > >> >>> immediately. Statements of intent shall only become binding when
> > >> >>> confirmed in hard copy by an authorised signatory.
> > >> >>>
> > >> >>> Zaizi Ltd is registered in England and Wales with the registration
> > >> >>> number 6440931. The Registered Office is Brook House, 229 Shepherds
> > >> >>> Bush Road, London W6 7AN.
> > >> >>
> > >> >>
> > >> >
> > >> >
> > >> > --
> > >> > --------------------------
> > >> >
> > >> > Benedetti Alessandro
> > >> > Visiting card : http://about.me/alessandro_benedetti
> > >> >
> > >> > "Tyger, tyger burning bright
> > >> > In the forests of the night,
> > >> > What immortal hand or eye
> > >> > Could frame thy fearful symmetry?"
> > >> >
> > >> > William Blake - Songs of Experience -1794 England
> > >>
> > >>
> > >
> >
>
>
>
> --
> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
> Shinichiro Abe
> 阿部 慎一朗
>
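
For reference, the two request shapes discussed in this thread can be sketched roughly as follows. This is only an illustration, not the Solr connector's actual code. Solr's extracting handler does accept `literal.<field>` query parameters, but the function names, the core URL and the `id` and `content` field names are assumptions made up for the example.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET


def extract_handler_url(solr_core_url, doc_id, metadata):
    """Request URL for /update/extract.

    The binary document would go in the request body; every metadata value
    has to travel in the URL as a literal.<field> query parameter.
    """
    params = [("literal.id", doc_id)]
    params += [("literal." + name, value) for name, value in sorted(metadata.items())]
    return solr_core_url + "/update/extract?" + urlencode(params)


def solr_add_xml(doc_id, metadata, text):
    """Request body for the plain /update handler.

    Text is assumed to be extracted already (for example by a Tika
    transformation step), so metadata and content are all ordinary fields.
    The "id" and "content" field names are assumptions about the schema.
    """
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    fields = [("id", doc_id)] + sorted(metadata.items()) + [("content", text)]
    for name, value in fields:
        field = ET.SubElement(doc, "field", name=name)
        field.text = value
    return ET.tostring(add, encoding="unicode")


# Hypothetical values, for illustration only.
meta = {"author": "Jane Doe", "title": "Q2 report"}
url = extract_handler_url("http://localhost:8983/solr/collection1", "doc-1", meta)
body = solr_add_xml("doc-1", meta, "Text already extracted upstream")
```

In the first form the URL grows with every metadata field, which is the concern raised above about sending all the metadata as GET parameters. In the second form the metadata moves into the request body alongside the extracted text.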
