manifoldcf-dev mailing list archives

From Shinichiro Abe <shinichiro.ab...@gmail.com>
Subject Re: Solr Extracting request handler
Date Tue, 17 Jun 2014 21:14:29 GMT
>As for changing the Solr connector so that it doesn't go to the extracting
update handler

I don't think we need to change the Solr connector by adding a new checkbox,
because we can already change "/update/extract" to "/update" in the 'Update
Handler' field on the Paths tab of the Solr connector UI. I confirmed that,
with that change and the File connector, I could post CSV, JSON and XML
files to Solr. So I would like the Tika extractor transformation connector
to be able to create the XML files that Solr expects to see.
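
For illustration only, here is a minimal sketch of the XML update message that Solr's plain "/update" handler accepts: an <add> element wrapping a <doc> made of <field name="..."> entries. The function name build_solr_add_xml and the field names "id" and "content" are assumptions chosen for the example. The real field names depend on the Solr schema, and this is not code from the connector.

```python
import xml.etree.ElementTree as ET


def build_solr_add_xml(doc_id, text, metadata):
    """Build a Solr XML <add> message for one document.

    doc_id   -- value for the (assumed) unique key field "id"
    text     -- already-extracted plain text, stored in the (assumed) field "content"
    metadata -- dict mapping field name to a string or a list of strings
    """
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    ET.SubElement(doc, "field", name="id").text = doc_id
    ET.SubElement(doc, "field", name="content").text = text
    for name, values in metadata.items():
        # A multi-valued field is expressed by repeating the <field> element.
        if isinstance(values, str):
            values = [values]
        for value in values:
            ET.SubElement(doc, "field", name=name).text = value
    # ElementTree escapes characters such as "<" and "&" in the field text.
    return ET.tostring(add, encoding="unicode")


if __name__ == "__main__":
    print(build_solr_add_xml("doc-1", "Extracted body text",
                             {"author": "someone", "tag": ["a", "b"]}))
```

With a message like this, the metadata travels in the request body alongside the extracted text instead of in the query string.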

Regards,
Shinichiro Abe


2014-06-18 2:55 GMT+09:00 Karl Wright <daddywri@gmail.com>:

> The pipeline code itself is now "complete" in trunk.  Zaizi said they'd
> contribute a Tika extractor transformation connector - and if they don't
> get around to that in a month or so, I may take a crack at it myself.
>
> As for changing the Solr connector so that it doesn't go to the extracting
> update handler, it would be great if:
> (1) Someone created a ticket for this, and
> (2) A patch was provided that maintains backwards compatibility with
> previous versions of the connector (so a checkbox would probably need to go
> into the UI somewhere).  Do either of you want to start this process?
>
> Thanks!
> Karl
>
>
>
> On Mon, Jun 16, 2014 at 12:37 PM, Karl Wright <daddywri@gmail.com> wrote:
>
> > Hi guys,
> >
> > You folks may not have looked at 1.7 yet, but it has a full pipeline, and
> > is expected to have a Tika extractor as a transformation connector.
> >
> > Karl
> >
> >
> >
> > On Mon, Jun 16, 2014 at 11:14 AM, Matteo Grolla <m.grolla@sourcesense.com>
> > wrote:
> >
> >> Thanks Alessandro,
> >>         that explains the situation clearly.
> >> And I agree that sending all the metadata as GET parameters can be
> >> problematic.
> >>
> >> Cheers
> >>
> >> --
> >> Matteo Grolla
> >> Sourcesense - making sense of Open Source
> >> http://www.sourcesense.com
> >>
> >> On 16 Jun 2014, at 17:09, Alessandro Benedetti wrote:
> >>
> >> > Mmm, the point is that right now ManifoldCF has no extractors.
> >> > The repository connectors extract the binary directly and there is no
> >> > "Extractor Processor" yet.
> >> > But a pipeline processor architecture has recently been designed
> >> > (https://issues.apache.org/jira/browse/CONNECTORS-959),
> >> > so an extractor can fit there.
> >> >
> >> > Cheers
> >> >
> >> >
> >> > 2014-06-16 15:59 GMT+01:00 Matteo Grolla <m.grolla@sourcesense.com>:
> >> >
> >> >> Since the Solr extracting request handler takes the binary and
> >> >> extracts text, what is the point of not using the Manifold extractor
> >> >> and sending text and binaries to Solr?
> >> >> I mean the end result is the same: Solr indexes text and stores text.
> >> >> So if Manifold supports text extraction, it seems to me this is the
> >> >> place where it should be done.
> >> >>
> >> >> --
> >> >> Matteo Grolla
> >> >> Sourcesense - making sense of Open Source
> >> >> http://www.sourcesense.com
> >> >>
> >> >> On 16 Jun 2014, at 16:51, Antonio David Perez Morales wrote:
> >> >>
> >> >>> Hi Matteo
> >> >>>
> >> >>> Manifold already handles the extraction, but the only way to send
> >> >>> binary content and document metadata to Solr is using the
> >> >>> update/extract handler, where the metadata is sent as query
> >> >>> parameters and the binary content is sent in the body of the
> >> >>> request, allowing Solr to use Tika to obtain the raw content to be
> >> >>> stored in Solr.
> >> >>>
> >> >>> Regards
> >> >>>
> >> >>>
> >> >>> On Mon, Jun 16, 2014 at 4:35 PM, Matteo Grolla
> >> >>> <m.grolla@sourcesense.com> wrote:
> >> >>>
> >> >>>> Hi. During my first indexing I noticed that Manifold uses the Solr
> >> >>>> extracting request handler to extract the content of an XML file.
> >> >>>> For performance reasons it would be better if Manifold handled the
> >> >>>> extraction, letting Solr act purely as the search engine.
> >> >>>> Is this because of the connector design, the framework design, or
> >> >>>> is it just still to be done?
> >> >>>>
> >> >>>> --
> >> >>>> Matteo Grolla
> >> >>>> Sourcesense - making sense of Open Source
> >> >>>> http://www.sourcesense.com
> >> >>>>
> >> >>>>
> >> >>>
> >> >>> --
> >> >>>
> >> >>> ------------------------------
> >> >>> This message should be regarded as confidential. If you have
> >> >>> received this email in error please notify the sender and destroy it
> >> >>> immediately. Statements of intent shall only become binding when
> >> >>> confirmed in hard copy by an authorised signatory.
> >> >>>
> >> >>> Zaizi Ltd is registered in England and Wales with the registration
> >> >>> number 6440931. The Registered Office is Brook House, 229 Shepherds
> >> >>> Bush Road, London W6 7AN.
> >> >>
> >> >>
> >> >
> >> >
> >> > --
> >> > --------------------------
> >> >
> >> > Benedetti Alessandro
> >> > Visiting card : http://about.me/alessandro_benedetti
> >> >
> >> > "Tyger, tyger burning bright
> >> > In the forests of the night,
> >> > What immortal hand or eye
> >> > Could frame thy fearful symmetry?"
> >> >
> >> > William Blake - Songs of Experience -1794 England
> >>
> >>
> >
>



-- 
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Shinichiro Abe
阿部 慎一朗
