Date: Wed, 18 Jun 2014 14:21:16 +0100
Subject: Re: Solr Extracting request handler
From: Alessandro Benedetti
To: dev

But guys, why not simply switch to a classic SolrJ SolrDocument creation and
ingestion into the Solr server? Easy and straightforward!
In the end, at that point the RepositoryDocument will be only a Map of
metadata and values.
Content will be part of that, so I guess the conversion to a SolrDocument
will be immediate.

Cheers


2014-06-18 3:26 GMT+01:00 Karl Wright:

> Hi Abe-san,
>
> Near as I can tell, the major consumer of disk space is the Maven target
> directories. This is generating many tens of megabytes of temporary disk
> usage for every connector. Luckily, if you use ant, this is not a problem.
>
> Karl
>
>
> On Tue, Jun 17, 2014 at 9:55 PM, Karl Wright wrote:
>
> > Hi Abe-san,
> >
> > Tika jars are not very big:
> >
> > C:\wip\mcf\trunk\lib>dir tika*
> >  Volume in drive C has no label.
> >  Volume Serial Number is 002E-D1F0
> >
> >  Directory of C:\wip\mcf\trunk\lib
> >
> > 06/05/2014  08:21 AM           493,374 tika-core.jar
> > 06/05/2014  08:21 AM           523,677 tika-parsers.jar
> >                2 File(s)      1,017,051 bytes
> >                0 Dir(s)  140,792,315,904 bytes free
> >
> > The entire lib directory is 85M:
> >
> > 85,156,330 bytes
> >
> > The built binary image is still about 185 MB, I believe. So I don't know
> > why you think it is >1 GB? Temporary class files? I don't think we can
> > avoid those.
> >
> > I'd rather not make things more complicated than they need to be by
> > adding a new required service - even though it would fit naturally with
> > the connector arrangement.
> >
> > Karl
> >
> >
> > On Tue, Jun 17, 2014 at 9:42 PM, Shinichiro Abe
> > <shinichiro.abe.1@gmail.com> wrote:
> >
> >> Hi Karl,
> >>
> >> Okay, I assumed the Tika connector outputs files.
> >> If we post the character data and metadata obtained from Tika, the
> >> "/update/extract" handler can handle this (it accepts params such as
> >> literal.content=value&literal.metaField=foobar
> >> while using a NullInputStream for the binary data, as in CONNECTORS-936).
> >>
> >> BTW, the trunk build is now too big (1G+), maybe because the CloudSearch
> >> connector uses Tika jars.
> >> The Tika connector and the CloudSearch connector should extract text via
> >> tika-server [1], and MCF should not carry many Tika jars. What do you
> >> think?
> >>
> >> [1] http://wiki.apache.org/tika/TikaJAXRS
> >>
> >> Thanks,
> >> Shinichiro Abe
> >>
> >> On 2014/06/18, at 9:45, Karl Wright wrote:
> >>
> >> > Hi Abe-san,
> >> >
> >> > It sounds like you might be thinking that transformation connectors
> >> > are like output connectors. Just so we are clear, transformation
> >> > connectors in 1.7 receive a RepositoryDocument as input, and then pass
> >> > a RepositoryDocument on to the next connector in the chain. So I don't
> >> > know why .xml files would be involved.
> >> > I'd expect the Tika connector to read a binary file from one
> >> > RepositoryDocument object and convert its contents to another
> >> > RepositoryDocument object which would have character data and
> >> > metadata only. Would this work for your case, do you think?
> >> >
> >> > Karl
> >> >
> >> >
> >> > On Tue, Jun 17, 2014 at 8:38 PM, Shinichiro Abe
> >> > <shinichiro.abe.1@gmail.com> wrote:
> >> >
> >> >> Hi Karl,
> >> >>
> >> >> Yes. I thought the standard update handler met that requirement.
> >> >> For instance, the Tika extractor transformation connector creates
> >> >> two files:
> >> >> 1. addtoSolr.xml for add and update
> >> >> 2. deletetoSolr.xml for delete
> >> >> The File connector ingests these XML files, then the Solr connector
> >> >> posts them via the "/update" handler.
> >> >>
> >> >> In the Solr Connector, no update-handler functionality other than
> >> >> the "/update" handler should be necessary.
> >> >>
> >> >> Thanks,
> >> >> Shinichiro Abe
> >> >>
> >> >> On 2014/06/18, at 8:02, Karl Wright wrote:
> >> >>
> >> >>> Hi Abe-san,
> >> >>>
> >> >>> So just to be sure -- you believe that no changes at all are
> >> >>> required to the Solr Connector as it stands now, other than to use
> >> >>> the update handler rather than the /update/extract handler?
> >> >>>
> >> >>> Karl
> >> >>>
> >> >>>
> >> >>> On Tue, Jun 17, 2014 at 5:14 PM, Shinichiro Abe
> >> >>> <shinichiro.abe.1@gmail.com> wrote:
> >> >>>
> >> >>>>> As for changing the Solr connector so that it doesn't go to the
> >> >>>>> extracting update handler
> >> >>>>
> >> >>>> I don't think the Solr connector needs to change by adding a new
> >> >>>> checkbox, because we can currently change "/update/extract" to
> >> >>>> "/update" under 'Update Handler' on the Paths tab of the Solr
> >> >>>> connector UI. I confirmed I could post CSV, JSON and XML files to
> >> >>>> Solr by changing that and using the File connector.
> >> >>>> So I would like the Tika extractor transformation connector to be
> >> >>>> allowed to create the XML files that Solr expects to see.
> >> >>>>
> >> >>>> Regards,
> >> >>>> Shinichiro Abe
> >> >>>>
> >> >>>>
> >> >>>> 2014-06-18 2:55 GMT+09:00 Karl Wright:
> >> >>>>
> >> >>>>> The pipeline code itself is now "complete" in trunk. Zaizi said
> >> >>>>> they'd contribute a Tika extractor transformation connector - and
> >> >>>>> if they don't get around to that in a month or so, I may take a
> >> >>>>> crack at it myself.
> >> >>>>>
> >> >>>>> As for changing the Solr connector so that it doesn't go to the
> >> >>>>> extracting update handler, it would be great if:
> >> >>>>> (1) Someone created a ticket for this, and
> >> >>>>> (2) A patch was provided that maintains backwards compatibility
> >> >>>>> with previous versions of the connector (so a checkbox would
> >> >>>>> probably need to go into the UI somewhere).
> >> >>>>> Do either of you want to start this process?
> >> >>>>>
> >> >>>>> Thanks!
> >> >>>>> Karl
> >> >>>>>
> >> >>>>>
> >> >>>>> On Mon, Jun 16, 2014 at 12:37 PM, Karl Wright wrote:
> >> >>>>>
> >> >>>>>> Hi guys,
> >> >>>>>>
> >> >>>>>> You folks may not have looked at 1.7 yet, but it has a full
> >> >>>>>> pipeline, and is expected to have a Tika extractor as a
> >> >>>>>> transformation connector.
> >> >>>>>>
> >> >>>>>> Karl
> >> >>>>>>
> >> >>>>>>
> >> >>>>>> On Mon, Jun 16, 2014 at 11:14 AM, Matteo Grolla
> >> >>>>>> <m.grolla@sourcesense.com> wrote:
> >> >>>>>>
> >> >>>>>>> Thanks Alessandro,
> >> >>>>>>> that explains the situation clearly.
> >> >>>>>>> And I agree that sending all the metadata as GET parameters
> >> >>>>>>> can be problematic.
> >> >>>>>>>
> >> >>>>>>> Cheers
> >> >>>>>>>
> >> >>>>>>> --
> >> >>>>>>> Matteo Grolla
> >> >>>>>>> Sourcesense - making sense of Open Source
> >> >>>>>>> http://www.sourcesense.com
> >> >>>>>>>
> >> >>>>>>> On 16 Jun 2014, at 17:09, Alessandro Benedetti wrote:
> >> >>>>>>>
> >> >>>>>>>> Mmm, the point is that right now ManifoldCF has no extractors.
> >> >>>>>>>> The Repository connectors extract the binary directly and
> >> >>>>>>>> there is no "Extractor Processor" yet.
> >> >>>>>>>> But recently a pipeline processor architecture has been
> >> >>>>>>>> proposed (https://issues.apache.org/jira/browse/CONNECTORS-959),
> >> >>>>>>>> so it can fit there.
> >> >>>>>>>>
> >> >>>>>>>> Cheers
> >> >>>>>>>>
> >> >>>>>>>>
> >> >>>>>>>> 2014-06-16 15:59 GMT+01:00 Matteo Grolla
> >> >>>>>>>> <m.grolla@sourcesense.com>:
> >> >>>>>>>>
> >> >>>>>>>>> Since the Solr extracting request handler takes the binary
> >> >>>>>>>>> and extracts the text, what is the point of not using the
> >> >>>>>>>>> Manifold extractor and sending text and binaries to Solr?
> >> >>>>>>>>> I mean, the end result is the same: Solr indexes text and
> >> >>>>>>>>> stores text.
> >> >>>>>>>>> So if Manifold supports text extraction, it seems to me this
> >> >>>>>>>>> is the place where it should be done.
> >> >>>>>>>>>
> >> >>>>>>>>> --
> >> >>>>>>>>> Matteo Grolla
> >> >>>>>>>>> Sourcesense - making sense of Open Source
> >> >>>>>>>>> http://www.sourcesense.com
> >> >>>>>>>>>
> >> >>>>>>>>> On 16 Jun 2014, at 16:51, Antonio David Perez Morales wrote:
> >> >>>>>>>>>
> >> >>>>>>>>>> Hi Matteo
> >> >>>>>>>>>>
> >> >>>>>>>>>> Manifold already handles the extraction, but the only way to
> >> >>>>>>>>>> send binary content and document metadata to Solr is using
> >> >>>>>>>>>> the update/extract handler, where the metadata is sent as
> >> >>>>>>>>>> query parameters and the binary content is sent in the body
> >> >>>>>>>>>> of the request, allowing Solr to use Tika to obtain the raw
> >> >>>>>>>>>> content to be stored in Solr.
> >> >>>>>>>>>>
> >> >>>>>>>>>> Regards
> >> >>>>>>>>>>
> >> >>>>>>>>>>
> >> >>>>>>>>>> On Mon, Jun 16, 2014 at 4:35 PM, Matteo Grolla
> >> >>>>>>>>>> <m.grolla@sourcesense.com> wrote:
> >> >>>>>>>>>>
> >> >>>>>>>>>>> Hi,
> >> >>>>>>>>>>> During my first indexing I noticed that Manifold uses the
> >> >>>>>>>>>>> Solr extracting request handler to extract the content of
> >> >>>>>>>>>>> an XML file.
> >> >>>>>>>>>>> For performance reasons it would be better if Manifold
> >> >>>>>>>>>>> handled the extraction, letting Solr act purely as the
> >> >>>>>>>>>>> search engine.
> >> >>>>>>>>>>> Is this because of the connector design, the framework
> >> >>>>>>>>>>> design, or is it simply not done yet?
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> --
> >> >>>>>>>>>>> Matteo Grolla
> >> >>>>>>>>>>> Sourcesense - making sense of Open Source
> >> >>>>>>>>>>> http://www.sourcesense.com
> >> >>>>>>>>>>>
> >> >>>>>>>>
> >> >>>>>>>> --
> >> >>>>>>>> --------------------------
> >> >>>>>>>>
> >> >>>>>>>> Benedetti Alessandro
> >> >>>>>>>> Visiting card : http://about.me/alessandro_benedetti
> >> >>>>>>>>
> >> >>>>>>>> "Tyger, tyger burning bright
> >> >>>>>>>> In the forests of the night,
> >> >>>>>>>> What immortal hand or eye
> >> >>>>>>>> Could frame thy fearful symmetry?"
> >> >>>>>>>>
> >> >>>>>>>> William Blake - Songs of Experience -1794 England
> >> >>>>
> >> >>>> --
> >> >>>> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
> >> >>>> Shinichiro Abe
> >> >>>> 阿部 慎一朗
> >> >>>>

--
--------------------------

Benedetti Alessandro
Visiting card : http://about.me/alessandro_benedetti

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England
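
The conversion suggested at the top of this message (treating the
RepositoryDocument as a map of metadata names to values, with the extracted
content as just another field) could look roughly like the sketch below. It is
only an illustration under assumptions: it uses plain JDK classes rather than
SolrJ, it builds the XML add message accepted by Solr's standard "/update"
handler, and the names SolrAddXmlSketch, toAddXml and escapeXml, as well as the
example field names, are hypothetical and not part of ManifoldCF or Solr. With
SolrJ, the same loop would presumably fill a SolrInputDocument instead of a
string.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not ManifoldCF or SolrJ code.
final class SolrAddXmlSketch {

    // Escape the five XML special characters so that field names and values
    // are safe inside attribute values and element text.
    static String escapeXml(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                case '\'': out.append("&apos;"); break;
                default: out.append(c);
            }
        }
        return out.toString();
    }

    // Turn a map of field name -> values into a single-document <add> message.
    // Multi-valued metadata becomes one <field> element per value.
    static String toAddXml(Map<String, List<String>> fields) {
        StringBuilder xml = new StringBuilder("<add><doc>");
        for (Map.Entry<String, List<String>> entry : fields.entrySet()) {
            for (String value : entry.getValue()) {
                xml.append("<field name=\"")
                   .append(escapeXml(entry.getKey()))
                   .append("\">")
                   .append(escapeXml(value))
                   .append("</field>");
            }
        }
        return xml.append("</doc></add>").toString();
    }

    public static void main(String[] args) {
        // Assumed stand-in for a RepositoryDocument after text extraction:
        // character data and metadata only, no binary stream.
        Map<String, List<String>> fields = new LinkedHashMap<>();
        fields.put("id", Arrays.asList("doc-1"));
        fields.put("content", Arrays.asList("Extracted text & more"));
        fields.put("author", Arrays.asList("First Author", "Second Author"));
        System.out.println(toAddXml(fields));
    }
}
```

Sending the metadata in the request body this way also sidesteps the concern
raised earlier in the thread about passing all metadata as GET parameters. The
sketch does not filter characters that are illegal in XML (for example most
control characters), which real extracted text may contain.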