Mailing-List: contact esme-dev-help@incubator.apache.org; run by ezmlm
Precedence: bulk
Reply-To: esme-dev@incubator.apache.org
Received-SPF: pass (athena.apache.org: domain of vdichev@gmail.com designates
 209.85.219.220 as permitted sender)
MIME-Version: 1.0
Sender: vdichev@gmail.com
In-Reply-To: <cdbebedf0911260536p238979pa6505cb9a9f546ed@mail.gmail.com>
References: <fa2d9f450911260455g4ba32debu502e49598f7f856e@mail.gmail.com>
	 <cdbebedf0911260536p238979pa6505cb9a9f546ed@mail.gmail.com>
Date: Thu, 26 Nov 2009 16:41:09 +0200
Message-ID: <c7fc82820911260641k762d3acbxfe989cf542966731@mail.gmail.com>
Subject: Re: Removing textile from code base
From: Vassil Dichev <vdichev@apache.org>
To: esme-dev@incubator.apache.org
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable

> We're not looking at the root cause of the problem. The Textile stuff is a
> hit if we run it on each message for each user. This is no different than
> having an SQL query in the code that's a Cartesian product and throwing out
> SQL because of it.
>
> Let's find out where and why we keep loading the same message from the RDBMS
> rather than going to the message cache.
>
> Let's find out why we're hitting the RDBMS in general... there are
> abstractions in the system (or at least were) that make RDBMS access a local
> thing rather than a global thing.
>
> I'll have time on Monday to look at this, but running around chopping off
> pieces of code and changing functionality isn't going to get us any closer
> to solving the problem... it's just going to cause the problem to be
> manifest elsewhere.

I did not remove the Textile parser only because it potentially causes
problems. I think it doesn't fit very well and it's a bit of an
overkill. First of all, headings, tables and paragraphs are not a good
conceptual fit for messages.

Second, some elements from MsgParser clash with the Textile parser
ones. For instance, links to images cannot be parsed because MsgParser
runs first and converts them to URL elements.

Third, the way parsing with Textile is currently done is inefficient
anyway. I parse every text element separately. Since text can be
separated by URLs, tags and usernames, that means I could invoke the
Textile parser several times per message. For instance, this message
has 4 text elements => 4 Textile invocations:

    message with #tag and @username and http://blog.esme.us url in text
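
To make the count concrete, here is a rough sketch of the splitting.
It is not the actual MsgParser code: the Element types, the regular
expression and the SplitSketch object are all made up for
illustration, and the real grammar is certainly richer.

```scala
// Hypothetical element types; the real MsgParser model may differ.
sealed trait Element
case class Text(s: String) extends Element
case class Special(s: String) extends Element // #tag, @username or URL

object SplitSketch {
  // Simplified pattern: a hashtag, a username mention or an http(s) URL.
  private val special = """#\w+|@\w+|https?://\S+""".r

  // Cut the message into alternating Text and Special elements.
  def split(msg: String): List[Element] = {
    val out = List.newBuilder[Element]
    var last = 0
    for (m <- special.findAllMatchIn(msg)) {
      if (m.start > last) out += Text(msg.substring(last, m.start))
      out += Special(m.matched)
      last = m.end
    }
    if (last < msg.length) out += Text(msg.substring(last))
    out.result()
  }

  // With one Textile call per Text element, this is the invocation count.
  def textileInvocations(msg: String): Int =
    split(msg).count(_.isInstanceOf[Text])
}
```

Under these assumptions the example message splits into 7 elements, of
which 4 are Text, so the Textile parser would run 4 times for it.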

Yes, if the performance analysis is correct, the Textile parser is not
the cause of the problem. It might be easier to solve the problem
without it. We even intended to include pluggable parser
implementations some day.

AFAICT, the problem was not that the RDBMS is queried every time
(although that's how the PublicTimeline has worked from day 1 if I
remember correctly). The problem, as explained by Markus, was that the
message was formatted from the raw string every time it's accessed for
rendering a timeline. The RDBMS was mentioned tangentially by Michael
Bechauf (or someone else?). Markus, did I get this right?

I still don't see how the message could be parsed several times, since
digestedXHTML is lazy and so will be cached (this alone should make it
*way* easier to write efficient implementations in Scala than in Java).
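
A minimal sketch of what I mean by "lazy and so will be cached". Only
the name digestedXHTML comes from our code; the Message class, the
format function and the counter below are simplified stand-ins, not
the real implementation.

```scala
object LazySketch {
  // Counts how often the expensive formatting step actually runs.
  var parseCount = 0

  // Hypothetical stand-in for the real parsing/formatting pipeline.
  def format(raw: String): String = {
    parseCount += 1
    "<span>" + raw + "</span>"
  }

  class Message(val raw: String) {
    // Computed on first access, then stored in this instance.
    lazy val digestedXHTML: String = format(raw)
  }
}
```

Accessing digestedXHTML twice on the same Message runs format only
once. The cached value lives in the instance though, so a second
Message object built from the same raw string would be formatted
again.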

I want to profile the application to find the stack traces where most
strings are allocated. This should answer some questions.

I also plan to remove rendering the public timeline on each user's
timeline page. First of all because it's not cached, and second
because it's not updated in real-time like the friends' timeline, but
only after an explicit refresh of the browser. So the public timeline
is not only slow, but might be confusing for the user, as they will
expect it to work similarly to the personal timeline (as the layout is
the same).

Vassil