Mailing-List: contact users-help@nifi.apache.org; run by ezmlm
Reply-To: users@nifi.apache.org
Message-ID: <57683b00.8fa4370a.129c7.7ff6@mx.google.com>
In-Reply-To: <52FDFCFA-3280-41EC-AD8D-CAADD5417748@simonellistonball.com>
From: Sven Davison
To: Simon Elliston Ball, "users@nifi.apache.org"
Cc: Lee Laim
Subject: RE: GetHTTP->ExtractText (Regex/User problem?)
Date: Mon, 20 Jun 2016 14:50:42 -0400

Awesome. It’s coming along, and the overhead it might have is worth the stress-free setup. NICE!

Now I want to take the content of it and put it into a variable so it’s easier for me to understand how to use it. I’m TRYING to get the variable (joke) filled with the content of the tag. I can put it out to a file just fine, but I’m trying to avoid a bunch of file I/O overhead.

http://prntscr.com/bisfy9

-Sven

Sent from Mail for Windows 10

From: Simon Elliston Ball
Sent: Monday, June 20, 2016 1:52 PM
To: users@nifi.apache.org
Cc: Lee Laim
Subject: Re: GetHTTP->ExtractText (Regex/User problem?)

http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 is something of a classic on this subject.

I would recommend using the ExtractXPath/XQuery or GetHTMLElement (https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi.GetHTMLElement/index.html). These may be a little heavier on the processing, but will certainly save you a lot of problems with parsing. GetHTMLElement lets you use CSS selectors against HTML, which is a more intuitive and robust way to parse it.

Simon

On 20 Jun 2016, at 18:43, Sven Davison wrote:

I had tried that but got a NULL value result. Is there a setting w/in the extractor that I need to change too?

-Sven

Sent from Mail for Windows 10

From: Lee Laim
Sent: Monday, June 20, 2016 12:56 PM
To: users@nifi.apache.org
Subject: Re: GetHTTP->ExtractText (Regex/User problem?)

Hi Sven,

give this a try:

<div class="content">(.*?)<\/div>

On Mon, Jun 20, 2016 at 10:25 AM, Sven Davison wrote:

I have looked at the example for extracting text. I saw that the example pulls the content between the <title> tags. I’ve changed it to pull from the <h3> tags w/o problem. The problem I’m having is pulling from something a bit more specific. I’m sure the problem is with my understanding/usage of regex.

I’m trying to pull the content from this example:

<div class="content">this is the content I want to pull</div>

Any help would be super awesome. I’ve been banging my head for a bit here.

-Sven

Sent from Mail for Windows 10
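
For anyone reading this thread later, here is a minimal sketch of the two approaches discussed above: the regex capture Lee suggested and the parser-based extraction Simon recommended. It is written in plain Python rather than NiFi, so `re` and `html.parser` only stand in for ExtractText and GetHTMLElement, and nothing here is NiFi's API. The names `SAMPLE_HTML`, `extract_with_regex`, `DivContentParser` and `extract_with_parser` are made up for illustration, and only the `<div class="content">` line of the sample page comes from the thread.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample page; only the <div class="content"> line is from the thread.
SAMPLE_HTML = """<html><head><title>demo</title></head>
<body>
<h3>heading</h3>
<div class="content">this is the content I want to pull</div>
</body></html>"""

# The pattern suggested in the thread, written with straight quotes.
# re.DOTALL lets (.*?) span line breaks inside the div.
DIV_PATTERN = re.compile(r'<div class="content">(.*?)</div>', re.DOTALL)


def extract_with_regex(html):
    """Return the first capture group, or None when nothing matches."""
    match = DIV_PATTERN.search(html)
    return match.group(1) if match else None


class DivContentParser(HTMLParser):
    """Collect the text inside the first <div> whose class list includes 'content'."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # greater than 0 while inside the target div
        self.done = False   # True once the target div has been closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth > 0:
            # Track nested divs so the matching close tag is found.
            if tag == "div":
                self.depth += 1
        elif tag == "div":
            classes = (dict(attrs).get("class") or "").split()
            if "content" in classes:
                self.depth = 1

    def handle_endtag(self, tag):
        if self.depth > 0 and tag == "div":
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def extract_with_parser(html):
    """Return the text of the target div, or None if no such div was closed."""
    parser = DivContentParser()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks) if parser.done else None


if __name__ == "__main__":
    print(extract_with_regex(SAMPLE_HTML))
    print(extract_with_parser(SAMPLE_HTML))
```

Both calls print the div text for the sample page. One thing worth checking when a pattern like this returns nothing: if a mail client has turned the quotes in the pattern into typographic quotes (”), it will no longer match markup that uses straight quotes, and the reverse is also true. Whether that is what caused the NULL result reported above is only a guess. The parser version also copes with extra class names and nested divs, which the simple regex does not.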