Mailing-List: contact dev-help@manifoldcf.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@manifoldcf.apache.org
Received-SPF: pass (athena.apache.org: domain of benedetti.alex85@gmail.com
 designates 209.85.220.172 as permitted sender)
MIME-Version: 1.0
In-Reply-To: 
 <CALUFAGCrysYNcQSAuEJZM5aOq1MH9fZeLnwULoorD2Ma5xQjDw@mail.gmail.com>
References: <2A48908A-E694-43D2-814F-88073F861716@sourcesense.com>
	<CAPvL4UDo3czAA_MPGS5CML0ppjC17URPYJ9QLzYhQ2R3OqwFpg@mail.gmail.com>
	<C154AE46-8619-48BB-8BA0-4BC5C18C4E9F@sourcesense.com>
	<CAB-fSby=Am4z=1yjrXfUEwWHNUdEm0V+CrVhzeJctY9ZYvzF8Q@mail.gmail.com>
	<B5FFF7A8-0FE0-40FA-BC1F-F448083CC7B0@sourcesense.com>
	<CALUFAGAQNjoYJDxLpBWsTk+66FcD-A58Hgd6LC1mEU-H2NQXDA@mail.gmail.com>
	<CALUFAGD3Jvouw-stRJW1yYNsF417GBPJzNgxkCMiRP_h6TidMg@mail.gmail.com>
	<CA+eTv_Xm48jGmSrwbOTVgrJ640m35mmAhM=Av=KwFmwUT58xng@mail.gmail.com>
	<CALUFAGA0t79981cFPzQWeUgzsmoqa-fzokWKdzn7NZo3rrrxGg@mail.gmail.com>
	<2484FEE3-C0D3-4D37-BF16-510AD03A7C2D@gmail.com>
	<CALUFAGDE9dOGzioZCDL_cKcg_N6cz+tJfJT4Kzs5vK0sUB-uYw@mail.gmail.com>
	<8F0BE598-69E2-41CA-85C4-6CB5EDC49AD9@gmail.com>
	<CALUFAGAVXMmwmbs5F2zY8qTR1nCS-eCtz2Zu74vUiPVFd05C_A@mail.gmail.com>
	<CALUFAGCrysYNcQSAuEJZM5aOq1MH9fZeLnwULoorD2Ma5xQjDw@mail.gmail.com>
Date: Wed, 18 Jun 2014 14:21:16 +0100
Message-ID: 
 <CAB-fSbxRRrnuTFPTm8LMZbHphRBP6zcMLoCeGGMsvVqzJLGOag@mail.gmail.com>
Subject: Re: Solr Extracting request handler
From: Alessandro Benedetti <benedetti.alex85@gmail.com>
To: dev <dev@manifoldcf.apache.org>
Content-Type: multipart/alternative; boundary=001a11c249269bb33104fc1c224c

--001a11c249269bb33104fc1c224c
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

But guys, why not simply switch to classic SolrJ SolrInputDocument creation
and ingestion into the Solr server? Easy and straightforward!

In the end, at that point the RepositoryDocument will be only a Map of
metadata and values.
Content will be part of that, so I guess the conversion to a
SolrInputDocument will be immediate.
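
As a rough illustration of that conversion (this is not SolrJ; it is a small
Python sketch, and the function name and the field names "id" and "content"
are assumptions that depend on the Solr schema), a map of metadata plus the
already-extracted text could be turned into the XML add message that the
plain "/update" handler accepts:

```python
import xml.etree.ElementTree as ET


def to_solr_add_xml(doc_id, content, metadata):
    """Build a Solr <add> message from extracted text and a metadata map.

    Hypothetical helper: "id" and "content" are assumed schema field names.
    """
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")

    def add_field(name, value):
        # Each value becomes one <field name="..."> element.
        field = ET.SubElement(doc, "field", name=name)
        field.text = str(value)

    add_field("id", doc_id)
    add_field("content", content)
    for name, values in metadata.items():
        # A metadata entry may hold one value or several (multi-valued field).
        if isinstance(values, (list, tuple)):
            for value in values:
                add_field(name, value)
        else:
            add_field(name, values)

    # ElementTree escapes characters such as "&" and "<" in the field text.
    return ET.tostring(add, encoding="unicode")
```

The resulting string would be posted to "/update" with an XML content type.
With SolrJ the same loop would call addField on a SolrInputDocument instead
of building XML by hand.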

Cheers


2014-06-18 3:26 GMT+01:00 Karl Wright <daddywri@gmail.com>:

> Hi Abe-san,
>
> Near as I can tell, the major consumer of disk space is the Maven target
> directories.  This is generating many tens of megabytes of temporary disk
> usage for every connector.  Luckily if you use ant, this is not a problem.
>
> Karl
>
>
> On Tue, Jun 17, 2014 at 9:55 PM, Karl Wright <daddywri@gmail.com> wrote:
>
> > Hi Abe-san,
> >
> > Tika jars are not very big:
> >
> > C:\wip\mcf\trunk\lib>dir tika*
> >  Volume in drive C has no label.
> >  Volume Serial Number is 002E-D1F0
> >
> >  Directory of C:\wip\mcf\trunk\lib
> >
> > 06/05/2014  08:21 AM           493,374 tika-core.jar
> > 06/05/2014  08:21 AM           523,677 tika-parsers.jar
> >                2 File(s)      1,017,051 bytes
> >                0 Dir(s)  140,792,315,904 bytes free
> >
> > The entire lib directory is 85M:
> >
> > 85,156,330 bytes
> >
> > The built binary image is still about 185Mb, I believe.  So I don't know
> > why you think it is >1Gb?  Temporary class files?  I don't think we can
> > avoid those.
> >
> > I'd rather not make things more complicated than they need to be by adding
> > a new required service - even though it would fit naturally with the
> > connector arrangement.
> >
> > Karl
> >
> >
> >
> >
> >
> > On Tue, Jun 17, 2014 at 9:42 PM, Shinichiro Abe <
> > shinichiro.abe.1@gmail.com> wrote:
> >
> >> Hi Karl,
> >>
> >> Okay, I assumed Tika connector outputs files.
> >> If we post the character data and metadata obtained from Tika, the
> >> "/update/extract" handler can handle this (provide the params
> >> literal.content=value&literal.metaField=foobar
> >> and use a NullInputStream for the binary data, as in CONNECTORS-936).
> >>
> >> BTW, now trunk built size is too big(1G+). Maybe because CloudSearch
> >> connector uses Tika jars.
> >> Tika connector and CloudSearch connector should extract text via
> >> tika-server[1]
> >> and MCF should not have many Tika jars, do you think?
> >>
> >> [1]
> >> http://wiki.apache.org/tika/TikaJAXRS
> >>
> >> Thanks,
> >> Shinichiro Abe
> >>
> >> On 2014/06/18, at 9:45, Karl Wright <daddywri@gmail.com> wrote:
> >>
> >> > Hi Abe-san,
> >> >
> >> > It sounds like you might be thinking that transformation connectors are
> >> > like output connectors.  Just so we are clear, transformation connectors in
> >> > 1.7 receive a RepositoryDocument as input, and then pass a
> >> > RepositoryDocument on to the next connector in the chain.  So I don't know
> >> > why .xml files would be involved.  I'd expect the Tika connector to read a
> >> > binary file from one RepositoryDocument object and convert its contents to
> >> > another RepositoryDocument object which would have character data and
> >> > metadata only.  Would this work for your case, do you think?
> >> >
> >> > Karl
> >> >
> >> >
> >> >
> >> > On Tue, Jun 17, 2014 at 8:38 PM, Shinichiro Abe <
> >> shinichiro.abe.1@gmail.com>
> >> > wrote:
> >> >
> >> >> Hi Karl,
> >> >>
> >> >> Yes. I thought the standard update handler met that requirement.
> >> >> For instance, Tika extractor transformation connector creates two
> >> files.
> >> >> 1. addtoSolr.xml for add and update
> >> >> 2. deletetoSolr.xml for delete
> >> >> File connector ingests these xml files, then Solr connector posts these
> >> >> files by "/update" handler.
> >> >>
> >> >> In the Solr Connector, update handlers other than "/update"
> >> >> might then not be necessary.
> >> >>
> >> >> Thanks,
> >> >> Shinichiro Abe
> >> >>
> >> >> On 2014/06/18, at 8:02, Karl Wright <daddywri@gmail.com> wrote:
> >> >>
> >> >>> Hi Abe-san,
> >> >>>
> >> >>> So just to be sure -- you believe that no changes at all are required to
> >> >>> the Solr Connector as it stands now, other than to use the update handler
> >> >>> rather than the /update/extract handler?
> >> >>>
> >> >>> Karl
> >> >>>
> >> >>>
> >> >>>
> >> >>>
> >> >>>
> >> >>> On Tue, Jun 17, 2014 at 5:14 PM, Shinichiro Abe <
> >> >> shinichiro.abe.1@gmail.com>
> >> >>> wrote:
> >> >>>
> >> >>>>> As for changing the Solr connector so that it doesn't go to the extracting
> >> >>>>> update handler
> >> >>>>
> >> >>>> I don't think we need to change the Solr connector with a new checkbox,
> >> >>>> because currently we can change "/update/extract" into "/update" at 'Update
> >> >>>> Handler' on the Paths tab in the Solr connector UI. I confirmed I could post
> >> >>>> CSV, JSON and XML files to Solr by changing that and using the File connector.
> >> >>>> So I wish we would allow the Tika extractor transformation connector to create
> >> >>>> XML files that Solr expects to see.
> >> >>>>
> >> >>>> Regards,
> >> >>>> Shinichiro Abe
> >> >>>>
> >> >>>>
> >> >>>> 2014-06-18 2:55 GMT+09:00 Karl Wright <daddywri@gmail.com>:
> >> >>>>
> >> >>>>> The pipeline code itself is now "complete" in trunk.  Zaizi said they'd
> >> >>>>> contribute a Tika extractor transformation connector - and if they don't
> >> >>>>> get around to that in a month or so, I may take a crack at it myself.
> >> >>>>>
> >> >>>>> As for changing the Solr connector so that it doesn't go to the extracting
> >> >>>>> update handler, it would be great if:
> >> >>>>> (1) Someone created a ticket for this, and
> >> >>>>> (2) A patch was provided that maintains backwards compatibility with
> >> >>>>> previous versions of the connector (so a checkbox would probably need to go
> >> >>>>> into the UI somewhere).  Do either of you want to start this process?
> >> >>>>>
> >> >>>>> Thanks!
> >> >>>>> Karl
> >> >>>>>
> >> >>>>>
> >> >>>>>
> >> >>>>> On Mon, Jun 16, 2014 at 12:37 PM, Karl Wright <daddywri@gmail.com> wrote:
> >> >>>>>
> >> >>>>>> Hi guys,
> >> >>>>>>
> >> >>>>>> You folks may not have looked at 1.7 yet, but it has a full pipeline, and
> >> >>>>>> is expected to have a Tika extractor as a transformation connector.
> >> >>>>>>
> >> >>>>>> Karl
> >> >>>>>>
> >> >>>>>>
> >> >>>>>>
> >> >>>>>> On Mon, Jun 16, 2014 at 11:14 AM, Matteo Grolla <
> >> >>>>> m.grolla@sourcesense.com>
> >> >>>>>> wrote:
> >> >>>>>>
> >> >>>>>>> Thanks Alessandro,
> >> >>>>>>>       that explains the situation clearly.
> >> >>>>>>> And I agree that sending all the metadata as GET parameters can be
> >> >>>>>>> problematic.
> >> >>>>>>>
> >> >>>>>>> Cheers
> >> >>>>>>>
> >> >>>>>>> --
> >> >>>>>>> Matteo Grolla
> >> >>>>>>> Sourcesense - making sense of Open Source
> >> >>>>>>> http://www.sourcesense.com
> >> >>>>>>>
> >> >>>>>>> On 16 Jun 2014, at 17:09, Alessandro Benedetti wrote:
> >> >>>>>>>
> >> >>>>>>>> Hmm, the point is that right now ManifoldCF has no extractors.
> >> >>>>>>>> The repository connectors extract the binary directly and there is no
> >> >>>>>>>> "Extractor Processor" yet.
> >> >>>>>>>> But recently a pipeline processor architecture has been proposed
> >> >>>>>>>> (https://issues.apache.org/jira/browse/CONNECTORS-959),
> >> >>>>>>>> so an extractor can fit there.
> >> >>>>>>>>
> >> >>>>>>>> Cheers
> >> >>>>>>>>
> >> >>>>>>>>
> >> >>>>>>>> 2014-06-16 15:59 GMT+01:00 Matteo Grolla <m.grolla@sourcesense.com>:
> >> >>>>>>>>
> >> >>>>>>>>> Since the Solr extracting request handler takes the binary and extracts
> >> >>>>>>>>> the text, what is the point of not using a Manifold extractor and sending
> >> >>>>>>>>> text and binaries to Solr?
> >> >>>>>>>>> I mean the end result is the same: Solr indexes text and stores text.
> >> >>>>>>>>> So if Manifold supports text extraction, it seems to me this is the place
> >> >>>>>>>>> where it should be done.
> >> >>>>>>>>>
> >> >>>>>>>>> --
> >> >>>>>>>>> Matteo Grolla
> >> >>>>>>>>> Sourcesense - making sense of Open Source
> >> >>>>>>>>> http://www.sourcesense.com
> >> >>>>>>>>>
> >> >>>>>>>>> On 16 Jun 2014, at 16:51, Antonio David Perez Morales wrote:
> >> >>>>>>>>>
> >> >>>>>>>>>> Hi Matteo
> >> >>>>>>>>>>
> >> >>>>>>>>>> Manifold already handles the extraction, but the only way to send binary
> >> >>>>>>>>>> content and document metadata to Solr is using the update/extract handler,
> >> >>>>>>>>>> where the metadata is sent as query parameters and the binary content is
> >> >>>>>>>>>> sent in the body of the requests, allowing Solr to use Tika to obtain the
> >> >>>>>>>>>> raw content to be stored in Solr.
> >> >>>>>>>>>>
> >> >>>>>>>>>> Regards
> >> >>>>>>>>>>
> >> >>>>>>>>>>
> >> >>>>>>>>>> On Mon, Jun 16, 2014 at 4:35 PM, Matteo Grolla <
> >> >>>>>>> m.grolla@sourcesense.com
> >> >>>>>>>>>>
> >> >>>>>>>>>> wrote:
> >> >>>>>>>>>>
> >> >>>>>>>>>>> Hi, during my first indexing I noticed that Manifold uses the Solr
> >> >>>>>>>>>>> extracting request handler to extract the content of an xml file.
> >> >>>>>>>>>>> For performance reasons it would be better if Manifold handled the
> >> >>>>>>>>>>> extraction, letting Solr do the search engine work.
> >> >>>>>>>>>>> Is this because of the connector design, the framework design, or is
> >> >>>>>>>>>>> it just still to be done?
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> --
> >> >>>>>>>>>>> Matteo Grolla
> >> >>>>>>>>>>> Sourcesense - making sense of Open Source
> >> >>>>>>>>>>> http://www.sourcesense.com
> >> >>>>>>>>>>>
> >> >>>>>>>>>>>
> >> >>>>>>>>>>
> >> >>>>>>>>>
> >> >>>>>>>>>
> >> >>>>>>>>
> >> >>>>>>>>
> >> >>>>>>>> --
> >> >>>>>>>> --------------------------
> >> >>>>>>>>
> >> >>>>>>>> Benedetti Alessandro
> >> >>>>>>>> Visiting card : http://about.me/alessandro_benedetti
> >> >>>>>>>>
> >> >>>>>>>> "Tyger, tyger burning bright
> >> >>>>>>>> In the forests of the night,
> >> >>>>>>>> What immortal hand or eye
> >> >>>>>>>> Could frame thy fearful symmetry?"
> >> >>>>>>>>
> >> >>>>>>>> William Blake - Songs of Experience -1794 England
> >> >>>>>>>
> >> >>>>>>>
> >> >>>>>>
> >> >>>>>
> >> >>>>
> >> >>>>
> >> >>>>
> >> >>>> --
> >> >>>> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
> >> >>>> Shinichiro Abe
> >> >>>> 阿部 慎一朗
> >> >>>>
> >> >>
> >> >>
> >>
> >>
> >
>


-- 
--------------------------

Benedetti Alessandro
Visiting card : http://about.me/alessandro_benedetti

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England

--001a11c249269bb33104fc1c224c--