Mailing-List: contact modules-dev-help@httpd.apache.org; run by ezmlm
Precedence: bulk
Reply-To: modules-dev@httpd.apache.org
Received-SPF: pass (nike.apache.org: domain of sindhi.for@gmail.com designates
 209.85.223.174 as permitted sender)
MIME-Version: 1.0
In-Reply-To: 
 <CAGKR+ECATQiw21G-01Vs1Jx4nPykGdm4No7PL4dqXtpH+R=HOg@mail.gmail.com>
References: 
 <CANOjuGEBEJr+uTH=1WoA+7+n+xsMEs8S9uEauHisE8tyi30mVQ@mail.gmail.com>
	<CAGKR+ECATQiw21G-01Vs1Jx4nPykGdm4No7PL4dqXtpH+R=HOg@mail.gmail.com>
Date: Wed, 1 May 2013 19:18:50 +0530
Message-ID: 
 <CANOjuGEq2Fb9+McDnar8eCN=fkNfiXN16jThUb56O7wrJ1xEXw@mail.gmail.com>
Subject: Re: Apache Buckets and Brigade
From: Sindhi Sindhi <sindhi.for@gmail.com>
To: modules-dev@httpd.apache.org
Content-Type: multipart/alternative; boundary=e89a8f642e40bb1cf104dba860d4

--e89a8f642e40bb1cf104dba860d4
Content-Type: text/plain; charset=ISO-8859-1

Thanks.
I'd definitely be interested in discussing further.

There's one more thing: I doubt I can use ModPagespeedSubstitute, because
our string replacement involves business logic. For example, in
"oldString", if I find the string "old" at offset 0 I'll replace it with
"new"; otherwise I'll replace it with "temp". The case I mentioned in my
previous email was just a very simple and straightforward example. When
our business logic runs over our huge HTML file, it executes many more
rules to decide whether to replace "oldString" with "newString", with
"tempString", or with some other string. So for me it's critical that the
HTML tags are read completely, not partially, when the string replacement
function is called.
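
A minimal sketch of the "complete tags only" requirement, in plain C++ with
no Apache dependency. TagSafeBuffer is a hypothetical helper, not part of
Apache or mod_pagespeed. It assumes a tag runs from '<' to the next '>',
which ignores comments, scripts, and '>' inside attribute values, and it
does not protect a plain-text word such as "oldString" that is split
across two chunks.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: buffers incoming chunk data so that an HTML tag is
// never passed on in two pieces.
class TagSafeBuffer {
public:
    // Appends a chunk and returns the prefix that ends outside any tag.
    // Text from an unclosed '<' onward is held back for the next call.
    std::string feed(const std::string& chunk) {
        pending_ += chunk;
        std::size_t cut = pending_.size();
        std::size_t lastOpen = pending_.rfind('<');
        if (lastOpen != std::string::npos &&
            pending_.find('>', lastOpen) == std::string::npos) {
            cut = lastOpen;  // the last tag is still open: keep it for later
        }
        std::string ready = pending_.substr(0, cut);
        pending_.erase(0, cut);
        return ready;
    }

    // Returns whatever is still held back once the input has ended.
    std::string flush() {
        std::string rest;
        rest.swap(pending_);
        return rest;
    }

private:
    std::string pending_;
};
```

In a filter, each bucket's data would go through feed(), the returned text
would be handed to the replacement function, and flush() would be called
when the end of the stream is reached.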

The HTML-centric fetch of data you mentioned suits me best. But I don't
want mod_pagespeed to actually modify anything in my HTML file. If it can
give me either the entire HTML file or an HTML-centric fetch of data, that
will solve my problem :)
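
A minimal sketch of the "entire HTML file" option, again in plain C++.
WholeDocumentCollector and replaceAll are hypothetical names used only for
illustration; replaceAll stands in for processHTML() without any business
rules. In a real Apache output filter, the chunks would presumably come
from reading each bucket in the brigade (for example with
apr_bucket_read()), and finish() would run once the EOS bucket is seen.
Holding a huge file in memory this way has an obvious cost.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <utility>

// Hypothetical stand-in for the per-request state a filter would keep.
class WholeDocumentCollector {
public:
    using Processor = std::function<std::string(const std::string&)>;

    explicit WholeDocumentCollector(Processor process)
        : process_(std::move(process)) {}

    // Called once per data chunk, in arrival order. Nothing is emitted yet.
    void append(const char* data, std::size_t len) {
        document_.append(data, len);
    }

    // Called once at end of stream: runs the processor over the complete
    // document, so a string split across chunks is seen in one piece.
    std::string finish() {
        std::string result = process_(document_);
        document_.clear();
        return result;
    }

private:
    Processor process_;
    std::string document_;
};

// Simplified stand-in for processHTML(): replaces every occurrence of
// `from` with `to`, skipping past each replacement to avoid rescanning it.
std::string replaceAll(std::string text, const std::string& from,
                       const std::string& to) {
    if (from.empty()) return text;
    for (std::size_t pos = text.find(from); pos != std::string::npos;
         pos = text.find(from, pos + to.size())) {
        text.replace(pos, from.size(), to);
    }
    return text;
}
```

With this shape, the "case2" split ("oldStr" in one chunk, "ing" in the
next) no longer matters, because the replacement only runs after both
chunks have been appended.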


On Wed, May 1, 2013 at 6:52 PM, Joshua Marantz <jmarantz@google.com> wrote:

> I have a crazy idea for you.  Maybe this is overkill but this sounds like
> it'd be natural to add to mod_pagespeed <http://modpagespeed.com> as a new
> filter.
>
> Here's some code you might use as a template
>
>
> https://code.google.com/p/modpagespeed/source/browse/trunk/src/net/instaweb/rewriter/collapse_whitespace_filter.cc
>
> One thing we've thought of doing is providing a generic text-substitution
> filter that would take strings in character-blocks and do arbitrary
> substitutions in them, that could be specified in the .conf file:
>   ModPagespeedSubstitute "oldString" "newString"
>
> You are right that text-blocks in Apache output filters can be split
> arbitrarily across buckets, but mod_pagespeed takes care of that in an
> HTML-centric way, breaking up blocks on html tokens. A block of free-format
> text would be treated as a single atomic token independent of the structure
> of the incoming bucket brigade.
>
> Let me know if you'd like to discuss this further.
>
> -Josh
>
>
> On Wed, May 1, 2013 at 8:54 AM, Sindhi Sindhi <sindhi.for@gmail.com>
> wrote:
>
> > Hello,
> >
> > Thanks a lot for providing answers to my earlier emails with subject
> > "Apache C++ equivalent of javax.servlet.Filter". I really appreciate your
> > help.
> >
> > I had another question. My requirement is something like this -
> >
> > I have a huge html file that I have copied into the Apache htdocs folder.
> > In my C++ Apache module, I want to get this html file contents and
> > remove/replace some strings.
> >
> > Say I have an HTML file that has the string "oldString" appearing 3
> > times in the file. My requirement is to replace "oldString" with the new
> > string "newString". I have already written a C++ function that has a
> > signature like this -
> >
> > char* processHTML(char* inHTMLString) {
> > //
> > char* newHTMLWithNewString = <code to replace all occurrences of
> > "oldString" with "newString">
> > return newHTMLWithNewString;
> > }
> >
> > The above function does a lot more than just string replacement; it has
> > a lot of business logic implemented and finally returns the new HTML
> > string.
> >
> > I want to call processHTML() inside my C++ Apache module. As far as I
> > know, Apache maintains internal data structures called buckets and
> > brigades, which actually contain the HTML file data. My question is: does
> > the entire HTML file content (in my case the HTML file is huge) reside in
> > a single bucket? That is, when I fetch one bucket at a time from a
> > brigade, can I be sure that the entire HTML file data from <html> to
> > </html> can be found in a single bucket? For example, if my HTML file
> > looks like this -
> > <html>
> > ..
> > ..
> > oldString
> > ... oldString...........oldString..
> > ..
> > </html>
> >
> > When I iterate through all buckets of a brigade, will I find my entire
> > HTML file content in a single bucket, or can the HTML file content be
> > present in multiple buckets, say like this -
> >
> > case1:
> > bucket-1 contents =
> > "<html>
> > ..
> > ..
> > oldString
> > ... oldString...........oldString..
> > ..
> > </html>"
> >
> > case2:
> > bucket-1 contents =
> > "<html>
> > ..
> > ..
> > oldStr"
> >
> > bucket-2 contents =
> > "ing
> > ... oldString...........oldString..
> > ..
> > </html>"
> >
> > If it's case2, then the function processHTML() I have written will not
> > work, because it searches for the entire string "oldString", and in case2
> > "oldString" is found only partially.
> >
> > Thanks a lot.
> >
>

--e89a8f642e40bb1cf104dba860d4--