Mailing-List: contact dev-help@manifoldcf.apache.org; run by ezmlm
Reply-To: dev@manifoldcf.apache.org
MIME-Version: 1.0
References: <2A48908A-E694-43D2-814F-88073F861716@sourcesense.com> <2484FEE3-C0D3-4D37-BF16-510AD03A7C2D@gmail.com> <8F0BE598-69E2-41CA-85C4-6CB5EDC49AD9@gmail.com>
Date: Wed, 18 Jun 2014 09:32:02 -0400
Subject: Re: Solr Extracting request handler
From: Karl Wright
To: dev
Content-Type: multipart/alternative; boundary=089e011615fc12bf6c04fc1c4900

--089e011615fc12bf6c04fc1c4900
Content-Type: text/plain; charset=UTF-8

Hi Alessandro,

What is your concrete proposal to change the Solr connector? Bear in mind
that we do need to maintain backwards compatibility. If you list your
specific changes, not in any huge detail, but with enough detail that we
understand your proposal, that would help. What happens to the UI? What
happens to the internals?

Thanks,
Karl


On Wed, Jun 18, 2014 at 9:21 AM, Alessandro Benedetti <benedetti.alex85@gmail.com> wrote:

> But guys, why not simply pass to a classic SolrJ SolrDocument creation and
> ingestion in the Solr Server? Easy and straightforward!
>
> In the end, at that point the RepositoryDocument will be only a Map of
> metadata and values.
> Content will be part of that, so I guess the conversion to a SolrDocument
> will be immediate.
>
> Cheers
>
>
> 2014-06-18 3:26 GMT+01:00 Karl Wright :
>
> > Hi Abe-san,
> >
> > Near as I can tell, the major consumer of disk space is the Maven target
> > directories. This is generating many tens of megabytes of temporary disk
> > usage for every connector. Luckily if you use ant, this is not a problem.
> >
> > Karl
> >
> >
> > On Tue, Jun 17, 2014 at 9:55 PM, Karl Wright wrote:
> >
> > > Hi Abe-san,
> > >
> > > Tika jars are not very big:
> > >
> > > C:\wip\mcf\trunk\lib>dir tika*
> > >  Volume in drive C has no label.
> > >  Volume Serial Number is 002E-D1F0
> > >
> > >  Directory of C:\wip\mcf\trunk\lib
> > >
> > > 06/05/2014  08:21 AM           493,374 tika-core.jar
> > > 06/05/2014  08:21 AM           523,677 tika-parsers.jar
> > >                2 File(s)      1,017,051 bytes
> > >                0 Dir(s)  140,792,315,904 bytes free
> > >
> > > The entire lib directory is 85 MB:
> > >
> > > 85,156,330 bytes
> > >
> > > The built binary image is still about 185 MB, I believe. So I don't know
> > > why you think it is >1 GB? Temporary class files? I don't think we can
> > > avoid those.
> > >
> > > I'd rather not make things more complicated than they need to be by
> > > adding a new required service - even though it would fit naturally with
> > > the connector arrangement.
> > >
> > > Karl
> > >
> > >
> > > On Tue, Jun 17, 2014 at 9:42 PM, Shinichiro Abe <shinichiro.abe.1@gmail.com> wrote:
> > >
> > >> Hi Karl,
> > >>
> > >> Okay, I assumed the Tika connector outputs files.
> > >> If we post the character data and metadata obtained from Tika, the
> > >> "/update/extract" handler can handle this (provide params:
> > >> literal.content=value&literal.metaField=foobar
> > >> using a NullInputStream for the binary data, like CONNECTORS-936).
> > >>
> > >> BTW, the trunk build size is now too big (1G+). Maybe because the
> > >> CloudSearch connector uses Tika jars.
> > >> The Tika connector and the CloudSearch connector should extract text via
> > >> tika-server[1]
> > >> and MCF should not have many Tika jars, do you think?
> > >>
> > >> [1]
> > >> http://wiki.apache.org/tika/TikaJAXRS
> > >>
> > >> Thanks,
> > >> Shinichiro Abe
> > >>
> > >> On 2014/06/18, at 9:45, Karl Wright wrote:
> > >>
> > >> > Hi Abe-san,
> > >> >
> > >> > It sounds like you might be thinking that transformation connectors are
> > >> > like output connectors. Just so we are clear, transformation connectors
> > >> > in 1.7 receive a RepositoryDocument as input, and then pass a
> > >> > RepositoryDocument on to the next connector in the chain. So I don't
> > >> > know why .xml files would be involved. I'd expect the Tika connector to
> > >> > read a binary file from one RepositoryDocument object and convert its
> > >> > contents to another RepositoryDocument object which would have character
> > >> > data and metadata only. Would this work for your case, do you think?
> > >> >
> > >> > Karl
> > >> >
> > >> >
> > >> >
> > >> > On Tue, Jun 17, 2014 at 8:38 PM, Shinichiro Abe <shinichiro.abe.1@gmail.com>
> > >> > wrote:
> > >> >
> > >> >> Hi Karl,
> > >> >>
> > >> >> Yes. I thought the standard update handler met that requirement.
> > >> >> For instance, the Tika extractor transformation connector creates two
> > >> >> files:
> > >> >> 1. addtoSolr.xml for add and update
> > >> >> 2. deletetoSolr.xml for delete
> > >> >> The File connector ingests these XML files, then the Solr connector
> > >> >> posts them via the "/update" handler.
> > >> >>
> > >> >> In the Solr connector, handler functionality other than the "/update"
> > >> >> handler might not be necessary.
> > >> >>
> > >> >> Thanks,
> > >> >> Shinichiro Abe
> > >> >>
> > >> >> On 2014/06/18, at 8:02, Karl Wright wrote:
> > >> >>
> > >> >>> Hi Abe-san,
> > >> >>>
> > >> >>> So just to be sure -- you believe that no changes at all are required
> > >> >>> to the Solr Connector as it stands now, other than to use the update
> > >> >>> handler rather than the /update/extract handler?
> > >> >>>
> > >> >>> Karl
> > >> >>>
> > >> >>>
> > >> >>> On Tue, Jun 17, 2014 at 5:14 PM, Shinichiro Abe <shinichiro.abe.1@gmail.com>
> > >> >>> wrote:
> > >> >>>
> > >> >>>>> As for changing the Solr connector so that it doesn't go to the
> > >> >>>>> extracting update handler
> > >> >>>>
> > >> >>>> I don't think we need to change the Solr connector with a new checkbox,
> > >> >>>> because currently we can change "/update/extract" into "/update" at
> > >> >>>> 'Update Handler' on the Paths tab in the Solr connector UI. I confirmed
> > >> >>>> I could post CSV, JSON and XML files to Solr by changing that and using
> > >> >>>> the File connector. So I wish we would allow the Tika extractor
> > >> >>>> transformation connector to create XML files that Solr expects to see.
> > >> >>>>
> > >> >>>> Regards,
> > >> >>>> Shinichiro Abe
> > >> >>>>
> > >> >>>>
> > >> >>>> 2014-06-18 2:55 GMT+09:00 Karl Wright :
> > >> >>>>
> > >> >>>>> The pipeline code itself is now "complete" in trunk. Zaizi said they'd
> > >> >>>>> contribute a Tika extractor transformation connector - and if they
> > >> >>>>> don't get around to that in a month or so, I may take a crack at it
> > >> >>>>> myself.
> > >> >>>>>
> > >> >>>>> As for changing the Solr connector so that it doesn't go to the
> > >> >>>>> extracting update handler, it would be great if:
> > >> >>>>> (1) Someone created a ticket for this, and
> > >> >>>>> (2) A patch was provided that maintains backwards compatibility with
> > >> >>>>> previous versions of the connector (so a checkbox would probably need
> > >> >>>>> to go into the UI somewhere). Do either of you want to start this
> > >> >>>>> process?
> > >> >>>>>
> > >> >>>>> Thanks!
> > >> >>>>> Karl
> > >> >>>>>
> > >> >>>>>
> > >> >>>>>
> > >> >>>>> On Mon, Jun 16, 2014 at 12:37 PM, Karl Wright <daddywri@gmail.com>
> > >> >>>>> wrote:
> > >> >>>>>
> > >> >>>>>> Hi guys,
> > >> >>>>>>
> > >> >>>>>> You folks may not have looked at 1.7 yet, but it has a full pipeline,
> > >> >>>>>> and is expected to have a Tika extractor as a transformation connector.
> > >> >>>>>>
> > >> >>>>>> Karl
> > >> >>>>>>
> > >> >>>>>>
> > >> >>>>>>
> > >> >>>>>> On Mon, Jun 16, 2014 at 11:14 AM, Matteo Grolla <m.grolla@sourcesense.com>
> > >> >>>>>> wrote:
> > >> >>>>>>
> > >> >>>>>>> Thanks Alessandro,
> > >> >>>>>>> that explains the situation clearly.
> > >> >>>>>>> And I agree that sending all the metadata as GET parameters can be
> > >> >>>>>>> problematic.
> > >> >>>>>>>
> > >> >>>>>>> Cheers
> > >> >>>>>>>
> > >> >>>>>>> --
> > >> >>>>>>> Matteo Grolla
> > >> >>>>>>> Sourcesense - making sense of Open Source
> > >> >>>>>>> http://www.sourcesense.com
> > >> >>>>>>>
> > >> >>>>>>> On 16 Jun 2014, at 17:09, Alessandro Benedetti wrote:
> > >> >>>>>>>
> > >> >>>>>>>> mmmm the point is that right now ManifoldCF has no extractors.
> > >> >>>>>>>> The repository connectors extract the binary directly and there is
> > >> >>>>>>>> no "Extractor Processor" yet.
> > >> >>>>>>>> But recently a pipeline processor architecture has been proposed
> > >> >>>>>>>> (https://issues.apache.org/jira/browse/CONNECTORS-959).
> > >> >>>>>>>> So it can fit there.
> > >> >>>>>>>>
> > >> >>>>>>>> Cheers
> > >> >>>>>>>>
> > >> >>>>>>>>
> > >> >>>>>>>> 2014-06-16 15:59 GMT+01:00 Matteo Grolla <m.grolla@sourcesense.com>:
> > >> >>>>>>>>
> > >> >>>>>>>>> Since the Solr extracting request handler takes the binary and
> > >> >>>>>>>>> extracts text, what is the point of not using the Manifold extractor
> > >> >>>>>>>>> and send text and binaries to Solr?
> > >> >>>>>>>>> I mean the end result is the same: Solr indexes text and stores text.
> > >> >>>>>>>>> So if Manifold supports text extraction, it seems to me this is the
> > >> >>>>>>>>> place where it should be done.
> > >> >>>>>>>>>
> > >> >>>>>>>>> --
> > >> >>>>>>>>> Matteo Grolla
> > >> >>>>>>>>> Sourcesense - making sense of Open Source
> > >> >>>>>>>>> http://www.sourcesense.com
> > >> >>>>>>>>>
> > >> >>>>>>>>> On 16 Jun 2014, at 16:51, Antonio David Perez Morales wrote:
> > >> >>>>>>>>>
> > >> >>>>>>>>>> Hi Matteo
> > >> >>>>>>>>>>
> > >> >>>>>>>>>> Manifold already handles the extraction, but the only way to send
> > >> >>>>>>>>>> binary content and document metadata to Solr is using the
> > >> >>>>>>>>>> update/extract handler, where the metadata is sent as query
> > >> >>>>>>>>>> parameters and the binary content is sent in the body of the
> > >> >>>>>>>>>> requests, allowing Solr to use Tika to obtain the raw content to be
> > >> >>>>>>>>>> stored in Solr.
> > >> >>>>>>>>>>
> > >> >>>>>>>>>> Regards
> > >> >>>>>>>>>>
> > >> >>>>>>>>>>
> > >> >>>>>>>>>> On Mon, Jun 16, 2014 at 4:35 PM, Matteo Grolla <m.grolla@sourcesense.com>
> > >> >>>>>>>>>> wrote:
> > >> >>>>>>>>>>
> > >> >>>>>>>>>>> Hi,
> > >> >>>>>>>>>>> During my first indexing I noticed that Manifold uses the Solr
> > >> >>>>>>>>>>> extracting request handler to extract the content of an XML file.
> > >> >>>>>>>>>>> For performance reasons it would be better if Manifold handled the
> > >> >>>>>>>>>>> extraction, letting Solr do the search engine.
> > >> >>>>>>>>>>> Is this because of the connector design, the framework design, or
> > >> >>>>>>>>>>> is it simply still to be done?
> > >> >>>>>>>>>>>
> > >> >>>>>>>>>>> --
> > >> >>>>>>>>>>> Matteo Grolla
> > >> >>>>>>>>>>> Sourcesense - making sense of Open Source
> > >> >>>>>>>>>>> http://www.sourcesense.com
> > >> >>>>>>>>>>>
> > >> >>>>>>>>>>>
> > >> >>>>>>>>>>
> > >> >>>>>>>>>> --
> > >> >>>>>>>>>>
> > >> >>>>>>>>>> ------------------------------
> > >> >>>>>>>>>> This message should be regarded as confidential. If you have
> > >> >>>>>>>>>> received this email in error please notify the sender and destroy
> > >> >>>>>>>>>> it immediately.
> > >> >>>>>>>>>> Statements of intent shall only become binding when confirmed in
> > >> >>>>>>>>>> hard copy by an authorised signatory.
> > >> >>>>>>>>>>
> > >> >>>>>>>>>> Zaizi Ltd is registered in England and Wales with the registration
> > >> >>>>>>>>>> number 6440931. The Registered Office is Brook House, 229 Shepherds
> > >> >>>>>>>>>> Bush Road, London W6 7AN.
> > >> >>>>>>>>>
> > >> >>>>>>>>>
> > >> >>>>>>>>
> > >> >>>>>>>>
> > >> >>>>>>>> --
> > >> >>>>>>>> --------------------------
> > >> >>>>>>>>
> > >> >>>>>>>> Benedetti Alessandro
> > >> >>>>>>>> Visiting card : http://about.me/alessandro_benedetti
> > >> >>>>>>>>
> > >> >>>>>>>> "Tyger, tyger burning bright
> > >> >>>>>>>> In the forests of the night,
> > >> >>>>>>>> What immortal hand or eye
> > >> >>>>>>>> Could frame thy fearful symmetry?"
> > >> >>>>>>>>
> > >> >>>>>>>> William Blake - Songs of Experience -1794 England
> > >> >>>>>>>
> > >> >>>>>>>
> > >> >>>>>>
> > >> >>>>>
> > >> >>>>
> > >> >>>>
> > >> >>>>
> > >> >>>> --
> > >> >>>> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
> > >> >>>> Shinichiro Abe
> > >> >>>> 阿部 慎一朗
> > >> >>>>
> > >> >>
> > >> >>
> > >>
> > >>
> > >
> >
> >
>
>
> --
> --------------------------
>
> Benedetti Alessandro
> Visiting card : http://about.me/alessandro_benedetti
>
> "Tyger, tyger burning bright
> In the forests of the night,
> What immortal hand or eye
> Could frame thy fearful symmetry?"
>
> William Blake - Songs of Experience -1794 England
>

--089e011615fc12bf6c04fc1c4900--