manifoldcf-dev mailing list archives

From Alessandro Benedetti <benedetti.ale...@gmail.com>
Subject Re: Solr Extracting request handler
Date Wed, 18 Jun 2014 13:21:16 GMT
But guys, why not simply switch to classic SolrJ: create a SolrDocument and
ingest it into the Solr server? Easy and straightforward!

In the end, at that point the RepositoryDocument will be only a map of
metadata and values.
Content will be part of that, so I guess the conversion to a SolrDocument
will be immediate.
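
To illustrate how direct that conversion could be, here is a minimal sketch,
assuming the RepositoryDocument has already been reduced to a map from field
names to values. The class and method names are hypothetical, not ManifoldCF
or SolrJ API. To stay dependency-free, the sketch builds the plain XML <add>
message that Solr's /update handler accepts rather than calling SolrJ; with
SolrJ, the inner loop would presumably call addField(name, value) on a
SolrInputDocument (the indexing-side class) instead.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper for illustration; not part of ManifoldCF or SolrJ. */
final class SolrAddXmlSketch {

    /** Escapes the characters that are unsafe in XML text and attribute values. */
    static String escapeXml(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                default: out.append(c);
            }
        }
        return out.toString();
    }

    /**
     * Builds one <add><doc>...</doc></add> message from a map of field
     * names to values. A multi-valued field repeats the <field> element.
     */
    static String toAddXml(Map<String, List<String>> fields) {
        StringBuilder xml = new StringBuilder("<add><doc>");
        for (Map.Entry<String, List<String>> entry : fields.entrySet()) {
            for (String value : entry.getValue()) {
                // Assumption: with SolrJ this would be doc.addField(name, value).
                xml.append("<field name=\"")
                   .append(escapeXml(entry.getKey()))
                   .append("\">")
                   .append(escapeXml(value))
                   .append("</field>");
            }
        }
        return xml.append("</doc></add>").toString();
    }

    public static void main(String[] args) {
        // Made-up metadata and content standing in for a RepositoryDocument.
        Map<String, List<String>> fields = new LinkedHashMap<>();
        fields.put("id", Arrays.asList("doc-1"));
        fields.put("author", Arrays.asList("A", "B"));
        fields.put("content", Arrays.asList("Tom & Jerry <draft>"));
        System.out.println(toAddXml(fields));
    }
}
```

The extracted content is treated as just another field here, which is the
point: once Tika has run upstream, nothing binary is left to send.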

Cheers


2014-06-18 3:26 GMT+01:00 Karl Wright <daddywri@gmail.com>:

> Hi Abe-san,
>
> Near as I can tell, the major consumer of disk space is the Maven target
> directories.  This is generating many tens of megabytes of temporary disk
> usage for every connector.  Luckily if you use ant, this is not a problem.
>
> Karl
>
>
> On Tue, Jun 17, 2014 at 9:55 PM, Karl Wright <daddywri@gmail.com> wrote:
>
> > Hi Abe-san,
> >
> > Tika jars are not very big:
> >
> > C:\wip\mcf\trunk\lib>dir tika*
> >  Volume in drive C has no label.
> >  Volume Serial Number is 002E-D1F0
> >
> >  Directory of C:\wip\mcf\trunk\lib
> >
> > 06/05/2014  08:21 AM           493,374 tika-core.jar
> > 06/05/2014  08:21 AM           523,677 tika-parsers.jar
> >                2 File(s)      1,017,051 bytes
> >                0 Dir(s)  140,792,315,904 bytes free
> >
> > The entire lib directory is 85M:
> >
> > 85,156,330 bytes
> >
> > The built binary image is still about 185Mb, I believe.  So I don't know
> > why you think it is >1Gb?  Temporary class files?  I don't think we can
> > avoid those.
> >
> > I'd rather not make things more complicated than they need to be by
> > adding a new required service - even though it would fit naturally with
> > the connector arrangement.
> >
> > Karl
> >
> >
> >
> >
> >
> > On Tue, Jun 17, 2014 at 9:42 PM, Shinichiro Abe <
> > shinichiro.abe.1@gmail.com> wrote:
> >
> >> Hi Karl,
> >>
> >> Okay, I had assumed the Tika connector outputs files.
> >> If we post the character data and metadata obtained from Tika, the
> >> "/update/extract" handler can handle this (by providing params such as
> >> literal.content=value&literal.metaField=foobar
> >> and using a NullInputStream for the binary data, as in CONNECTORS-936).
> >>
> >> BTW, the trunk build size is now too big (1G+), maybe because the
> >> CloudSearch connector uses Tika jars.
> >> The Tika connector and the CloudSearch connector should extract text via
> >> tika-server [1], and MCF should not carry so many Tika jars. What do you
> >> think?
> >>
> >> [1]
> >> http://wiki.apache.org/tika/TikaJAXRS
> >>
> >> Thanks,
> >> Shinichiro Abe
> >>
> >> On 2014/06/18, at 9:45, Karl Wright <daddywri@gmail.com> wrote:
> >>
> >> > Hi Abe-san,
> >> >
> >> > It sounds like you might be thinking that transformation connectors
> >> > are like output connectors.  Just so we are clear, transformation
> >> > connectors in 1.7 receive a RepositoryDocument as input, and then pass
> >> > a RepositoryDocument on to the next connector in the chain.  So I don't
> >> > know why .xml files would be involved.  I'd expect the Tika connector
> >> > to read a binary file from one RepositoryDocument object and convert
> >> > its contents to another RepositoryDocument object which would have
> >> > character data and metadata only.  Would this work for your case, do
> >> > you think?
> >> >
> >> > Karl
> >> >
> >> >
> >> >
> >> > On Tue, Jun 17, 2014 at 8:38 PM, Shinichiro Abe <shinichiro.abe.1@gmail.com> wrote:
> >> >
> >> >> Hi Karl,
> >> >>
> >> >> Yes. I thought the standard update handler met that requirement.
> >> >> For instance, the Tika extractor transformation connector creates two
> >> >> files:
> >> >> 1. addtoSolr.xml for add and update
> >> >> 2. deletetoSolr.xml for delete
> >> >> The File connector ingests these XML files, then the Solr connector
> >> >> posts them via the "/update" handler.
> >> >>
> >> >> In the Solr connector, no update handler functionality other than
> >> >> the "/update" handler should be necessary.
> >> >>
> >> >> Thanks,
> >> >> Shinichiro Abe
> >> >>
> >> >> On 2014/06/18, at 8:02, Karl Wright <daddywri@gmail.com> wrote:
> >> >>
> >> >>> Hi Abe-san,
> >> >>>
> >> >>> So just to be sure -- you believe that no changes at all are
> >> >>> required to the Solr Connector as it stands now, other than to use
> >> >>> the update handler rather than the /update/extract handler?
> >> >>>
> >> >>> Karl
> >> >>>
> >> >>>
> >> >>>
> >> >>>
> >> >>>
> >> >>> On Tue, Jun 17, 2014 at 5:14 PM, Shinichiro Abe <shinichiro.abe.1@gmail.com> wrote:
> >> >>>
> >> >>>>> As for changing the Solr connector so that it doesn't go to the
> >> >>>>> extracting update handler
> >> >>>>
> >> >>>> I don't think we need to change the Solr connector with a new
> >> >>>> checkbox, because currently we can change "/update/extract" to
> >> >>>> "/update" in 'Update Handler' on the Paths tab of the Solr connector
> >> >>>> UI. I confirmed that I could post CSV, JSON and XML files to Solr by
> >> >>>> changing that and using the File connector. So I would like the Tika
> >> >>>> extractor transformation connector to be allowed to create XML files
> >> >>>> that Solr expects to see.
> >> >>>>
> >> >>>> Regards,
> >> >>>> Shinichiro Abe
> >> >>>>
> >> >>>>
> >> >>>> 2014-06-18 2:55 GMT+09:00 Karl Wright <daddywri@gmail.com>:
> >> >>>>
> >> >>>>> The pipeline code itself is now "complete" in trunk.  Zaizi said
> >> >>>>> they'd contribute a Tika extractor transformation connector - and if
> >> >>>>> they don't get around to that in a month or so, I may take a crack at
> >> >>>>> it myself.
> >> >>>>>
> >> >>>>> As for changing the Solr connector so that it doesn't go to the
> >> >>>>> extracting update handler, it would be great if:
> >> >>>>> (1) Someone created a ticket for this, and
> >> >>>>> (2) A patch was provided that maintains backwards compatibility with
> >> >>>>> previous versions of the connector (so a checkbox would probably need
> >> >>>>> to go into the UI somewhere).  Do either of you want to start this
> >> >>>>> process?
> >> >>>>>
> >> >>>>> Thanks!
> >> >>>>> Karl
> >> >>>>>
> >> >>>>>
> >> >>>>>
> >> >>>>> On Mon, Jun 16, 2014 at 12:37 PM, Karl Wright <daddywri@gmail.com> wrote:
> >> >>>>>
> >> >>>>>> Hi guys,
> >> >>>>>>
> >> >>>>>> You folks may not have looked at 1.7 yet, but it has a full
> >> >>>>>> pipeline, and is expected to have a Tika extractor as a
> >> >>>>>> transformation connector.
> >> >>>>>>
> >> >>>>>> Karl
> >> >>>>>>
> >> >>>>>>
> >> >>>>>>
> >> >>>>>> On Mon, Jun 16, 2014 at 11:14 AM, Matteo Grolla <m.grolla@sourcesense.com> wrote:
> >> >>>>>>
> >> >>>>>>> Thanks Alessandro,
> >> >>>>>>>       that explains the situation clearly.
> >> >>>>>>> And I agree that sending all the metadata as GET parameters can be
> >> >>>>>>> problematic
> >> >>>>>>>
> >> >>>>>>> Cheers
> >> >>>>>>>
> >> >>>>>>> --
> >> >>>>>>> Matteo Grolla
> >> >>>>>>> Sourcesense - making sense of Open Source
> >> >>>>>>> http://www.sourcesense.com
> >> >>>>>>>
> >> >>>>>>> On 16 Jun 2014, at 17:09, Alessandro Benedetti wrote:
> >> >>>>>>>
> >> >>>>>>>> Mmm, the point is that right now ManifoldCF has no extractors.
> >> >>>>>>>> The repository connectors extract the binary directly and there is
> >> >>>>>>>> no "Extractor Processor" yet.
> >> >>>>>>>> But recently a pipeline processor architecture has been proposed
> >> >>>>>>>> (https://issues.apache.org/jira/browse/CONNECTORS-959),
> >> >>>>>>>> so it can fit there.
> >> >>>>>>>>
> >> >>>>>>>> Cheers
> >> >>>>>>>>
> >> >>>>>>>>
> >> >>>>>>>> 2014-06-16 15:59 GMT+01:00 Matteo Grolla <m.grolla@sourcesense.com>:
> >> >>>>>>>>
> >> >>>>>>>>> Since the Solr extracting request handler takes the binary and
> >> >>>>>>>>> extracts text, what is the point of not using the Manifold
> >> >>>>>>>>> extractor and sending text and binaries to Solr?
> >> >>>>>>>>> I mean, the end result is the same: Solr indexes text and stores
> >> >>>>>>>>> text. So if Manifold supports text extraction, it seems to me this
> >> >>>>>>>>> is the place where it should be done.
> >> >>>>>>>>>
> >> >>>>>>>>> --
> >> >>>>>>>>> Matteo Grolla
> >> >>>>>>>>> Sourcesense - making sense of Open Source
> >> >>>>>>>>> http://www.sourcesense.com
> >> >>>>>>>>>
> >> >>>>>>>>> On 16 Jun 2014, at 16:51, Antonio David Perez Morales wrote:
> >> >>>>>>>>>
> >> >>>>>>>>>> Hi Matteo
> >> >>>>>>>>>>
> >> >>>>>>>>>> Manifold already handles the extraction, but the only way to send
> >> >>>>>>>>>> binary content and document metadata to Solr is using the
> >> >>>>>>>>>> update/extract handler, where the metadata is sent as query
> >> >>>>>>>>>> parameters and the binary content is sent in the body of the
> >> >>>>>>>>>> requests, allowing Solr to use Tika to obtain the raw content to
> >> >>>>>>>>>> be stored in Solr.
> >> >>>>>>>>>>
> >> >>>>>>>>>> Regards
> >> >>>>>>>>>>
> >> >>>>>>>>>>
> >> >>>>>>>>>> On Mon, Jun 16, 2014 at 4:35 PM, Matteo Grolla <m.grolla@sourcesense.com> wrote:
> >> >>>>>>>>>>
> >> >>>>>>>>>>> Hi. During my first indexing I noticed that Manifold uses the Solr
> >> >>>>>>>>>>> extracting request handler to extract the content of an XML file.
> >> >>>>>>>>>>> For performance reasons it would be better if Manifold handled the
> >> >>>>>>>>>>> extraction, letting Solr do the search engine work.
> >> >>>>>>>>>>> Is this because of the connector design, the framework design, or
> >> >>>>>>>>>>> is it just still to be done?
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> --
> >> >>>>>>>>>>> Matteo Grolla
> >> >>>>>>>>>>> Sourcesense - making sense of Open Source
> >> >>>>>>>>>>> http://www.sourcesense.com
> >> >>>>>>>>>>>
> >> >>>>>>>>>>>
> >> >>>>>>>>>>
> >> >>>>>>>>>> --
> >> >>>>>>>>>>
> >> >>>>>>>>>> ------------------------------
> >> >>>>>>>>>> This message should be regarded as confidential. If you have
> >> >>>>>>>>>> received this email in error please notify the sender and destroy
> >> >>>>>>>>>> it immediately.
> >> >>>>>>>>>> Statements of intent shall only become binding when confirmed in
> >> >>>>>>>>>> hard copy by an authorised signatory.
> >> >>>>>>>>>>
> >> >>>>>>>>>> Zaizi Ltd is registered in England and Wales with the registration
> >> >>>>>>>>>> number 6440931. The Registered Office is Brook House, 229 Shepherds
> >> >>>>>>>>>> Bush Road, London W6 7AN.
> >> >>>>>>>>>
> >> >>>>>>>>>
> >> >>>>>>>>
> >> >>>>>>>>
> >> >>>>>>>> --
> >> >>>>>>>> --------------------------
> >> >>>>>>>>
> >> >>>>>>>> Benedetti Alessandro
> >> >>>>>>>> Visiting card : http://about.me/alessandro_benedetti
> >> >>>>>>>>
> >> >>>>>>>> "Tyger, tyger burning bright
> >> >>>>>>>> In the forests of the night,
> >> >>>>>>>> What immortal hand or eye
> >> >>>>>>>> Could frame thy fearful symmetry?"
> >> >>>>>>>>
> >> >>>>>>>> William Blake - Songs of Experience -1794 England
> >> >>>>>>>
> >> >>>>>>>
> >> >>>>>>
> >> >>>>>
> >> >>>>
> >> >>>>
> >> >>>>
> >> >>>> --
> >> >>>> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
> >> >>>> Shinichiro Abe
> >> >>>> 阿部 慎一朗
> >> >>>>
> >> >>
> >> >>
> >>
> >>
> >
>



-- 
--------------------------

Benedetti Alessandro
Visiting card : http://about.me/alessandro_benedetti

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England
