harmony-dev mailing list archives

From "Alex Blewitt" <alex.blew...@gmail.com>
Subject Re: [classlib][pack200] Interested in Pack200
Date Thu, 30 Aug 2007 23:15:20 GMT
> Thanks very much for your reply.  I hadn't found HARMONY-3290, so I will
> have a look at that.  Do you happen to remember what the main changes were
> that you made for EclipseCon?
>
> I would definitely be interested in talking about the current
> implementation and getting up to speed with it.  I've read most of your past
> e-mails and had a look at some of the code, so that's where I am at the
> moment.

The state of play in the Harmony codebase at the moment is that I
haven't got around to decoding the bytecode stored in the pack200
file, so it can extract interfaces and fully abstract classes (e.g.
those with native parts) but nothing that has any code or
initialisation (e.g. constant expressions or method calls). I started
putting together something to represent the bytecode/class structure
in the .bytecode. package, and in the dump in 3290 I added something
which helps to decode some of the bytecode instructions themselves.

IIRC the bytecode fields are stored as a variable-sized byte array at
the end of the segment, and you essentially iterate over them (with
0x0 terminators? or was it 0xff?), one for each non-abstract member in
the code. The difference with them is that the bytecode sequence
doesn't have any argument values; instead, they're references into the
appropriate constant pool. So 'ldc 5' would actually mean load
constant pool reference 5, which might turn out to be a string or
something. Secondly, whilst Java bytecode instructions are weakly
typed, they're strongly typed in the packed bytecode, so a load of an
int is different from a load of a double, because they come from
different locations in the segment's constant pools. Thus there's a
mapping such that (say) 486, 586 and 686 all map to the instruction
'86' but with different arg types. (The real numbers are different and
are in the pack200 spec; I forget exactly what they are, but that's
the idea).
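
To make the mapping idea concrete, here is a minimal sketch of such a
lookup table. It reuses the made-up numbers 486/586/686 from the
paragraph above, so none of the values are the real ones from the
pack200 spec, and the class and enum names (PackedOpcodeTable, Pool,
Mapping) are invented for illustration rather than taken from the
Harmony code.

```java
import java.util.HashMap;
import java.util.Map;

final class PackedOpcodeTable {
    /** Which segment constant pool the operand indexes into (illustrative subset). */
    enum Pool { INT, FLOAT, STRING }

    /** A class-file opcode plus the pool its operand is drawn from. */
    static final class Mapping {
        final int classFileOpcode;
        final Pool pool;

        Mapping(int classFileOpcode, Pool pool) {
            this.classFileOpcode = classFileOpcode;
            this.pool = pool;
        }
    }

    private static final Map<Integer, Mapping> TABLE = new HashMap<Integer, Mapping>();

    static {
        // Placeholder numbers from the text above, NOT the real spec values.
        TABLE.put(486, new Mapping(86, Pool.INT));
        TABLE.put(586, new Mapping(86, Pool.FLOAT));
        TABLE.put(686, new Mapping(86, Pool.STRING));
    }

    /** Returns the mapping for a packed opcode, or null if it is not a typed variant. */
    static Mapping lookup(int packedOpcode) {
        return TABLE.get(packedOpcode);
    }
}
```

A decoder would call lookup() for each packed opcode, emit
classFileOpcode, and remember which pool the operand has to be resolved
against later.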

In addition, some common constructs are condensed into a single byte.
So the default constructor super() is usually init(), which is usually
represented as 'aload_0, invokespecial n' where n is the entry
java.lang.Object#<init> or some such. That gets boiled down to a
single code (231?) and so when decoding, you not only replace '231'
with the codes for aload_0/invokespecial, but you also potentially
have to infer the method/object reference for the superclass'
constructor as well.

In the EclipseCon demo, I bodged the ability to put the <init> in
place whilst the bytecode was being extracted:

--- 8< ---
        protected ClassFileEntry[] getNestedClassFileEntries() {
                if (opcode == 231) // TODO HACK
                        return new ClassFileEntry[] { new CPMethodRef(
                                        "java/lang/Object", "<init>:()V") };
                else
                        return nested;
        }
--- 8< ---

Clearly, that wouldn't work when the class wasn't a direct subtype of
java.lang.Object or had different arguments.
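
As a rough illustration of what a less hacky version might look like,
the sketch below passes the superclass name and constructor descriptor
in rather than hard-coding them. MethodRef is a hypothetical stand-in
for CPMethodRef, the constant 231 is only the value I half-remember
above, and how the decoder would actually learn the superclass and
descriptor is left open.

```java
final class SuperInitSketch {
    // Value recalled in the text above; unverified against the spec.
    static final int PACKED_SUPER_INIT = 231;

    /** Hypothetical stand-in for a constant pool method reference. */
    static final class MethodRef {
        final String owner;
        final String nameAndType;

        MethodRef(String owner, String nameAndType) {
            this.owner = owner;
            this.nameAndType = nameAndType;
        }
    }

    /**
     * Returns the entries referenced by an opcode. For the packed
     * super-constructor shorthand, the reference is built from the
     * supplied superclass and descriptor instead of assuming
     * java/lang/Object and ()V.
     */
    static MethodRef[] nestedEntries(int opcode, String superClassName,
            String descriptor, MethodRef[] nested) {
        if (opcode == PACKED_SUPER_INIT) {
            return new MethodRef[] {
                new MethodRef(superClassName, "<init>:" + descriptor)
            };
        }
        return nested;
    }
}
```

This only covers the method reference; pushing any constructor
arguments onto the stack would still have to come from the rest of the
decoded bytecode.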

Once the bytecode extraction is done, then looking at the exception
handlers is probably the next thing that would make it slightly
useful. The debug symbols are also not handled, and nor is any of the
annotation code that's used by the Java 5 stuff.

I seem to recall that when I was working out the parsing of a simple
class, I had an off-by-one error in the number of bytes that the
packed file contained versus what I was expecting. I didn't get that
when I had interfaces. I never really found out what the solution was
for that one :-(

I don't know if this gives you any more of an idea where the state of
play is, but if you were to compile the following:

--- 8< ---
public interface Foo {
  public abstract void foo();
}
--- 8< ---
and then pack it, it should be possible to extract the contents with
the current implementation. That would be a start finding out where
the code paths lie and what's going on. You'll need to compile with
debug symbols disabled (i.e. javac -g:none) and I can't remember
whether the current simple implementation assumes the pack file isn't
gzipped, or whether I'd fixed that. (By default, the Sun pack200 tool
will auto-gzip the pack200 output.) The next stage would be to get:

--- 8< ---
public abstract class Foo {
  public Foo() {
    super();
  }
  public abstract void foo();
}
--- 8< ---

working, since that will contain the implicit call to the constructor.
The remainder of the bytecodes are either going to have no args (e.g.
'return') or some args (e.g. 'getstatic') and the ones with args will
need to be mapped to the appropriate pool entry. If I recall, the arg
values are specific to the per-class pool, rather than the global
pool, but you'd have to re-read the spec to know for sure. Once that's
done, you might be able to start decoding more interesting classes
and/or have ones with 'try/catch' in place.
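
On the gzip question mentioned earlier, one way to sidestep it is to
sniff the first two bytes of the input and only wrap it in a
GZIPInputStream when they match the gzip magic number (0x1f 0x8b).
This is a sketch rather than what the current code does; PackStreams
and maybeGunzip are names made up for the example.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;

final class PackStreams {
    /**
     * Returns a stream of pack data, transparently gunzipping it if
     * the input starts with the gzip magic bytes 0x1f 0x8b.
     */
    static InputStream maybeGunzip(InputStream raw) throws IOException {
        BufferedInputStream in = new BufferedInputStream(raw);
        in.mark(2);           // remember the start so the peek can be undone
        int b0 = in.read();
        int b1 = in.read();
        in.reset();           // rewind to the first byte
        if (b0 == 0x1f && b1 == 0x8b) {
            return new GZIPInputStream(in);
        }
        return in;
    }
}
```

The caller can then hand the returned stream to the segment parser
without caring whether the file on disk was compressed.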

BTW the code in Segment is ugly and could certainly use a good dose of
refactoring; and I'm not sure that the flyweight pattern in the
ByteCode was doing much good. To be honest, the biggest problem I had
when decoding the bytecode packed values was how much size to allocate
for the resulting stream, and where to fill the values from. I suspect
rather than attempting to do it in one pass (like I did) it might be
better to do a multi-pass, first extracting the real bytecodes (and
any extra additions to the constant pool) and then afterwards post
filling the argument values in. There's also the knotty problem that
the constant pool that should get written to the output .class should
be sorted using some fairly weird sorting rules (see cp.resolve() in
buildClassFile of Segment.java) that will affect how the values get
written to the final .class file. It doesn't make any difference from
an execution perspective, but the pack200 spec is clear that they need
to be sorted to a canonical order, so that unpacking always produces
the same binary class file and any signatures over the files stay
valid.
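
To show what I mean by multi-pass, here is a small sketch that sizes
the output first and fills the operands in afterwards. It assumes the
real opcodes have already been extracted and the pool indexes already
resolved; TwoPassSketch and its methods are invented names, and
operandBytes() only knows about the few opcodes used here, so it is
nowhere near a full instruction table.

```java
final class TwoPassSketch {
    // Real JVM opcode values.
    static final int ALOAD_0 = 0x2a;
    static final int RETURN = 0xb1;
    static final int GETSTATIC = 0xb2;
    static final int INVOKESPECIAL = 0xb7;

    /** Operand bytes that follow an opcode; only covers the opcodes above. */
    static int operandBytes(int opcode) {
        switch (opcode) {
            case GETSTATIC:
            case INVOKESPECIAL:
                return 2; // two-byte constant pool index
            default:
                return 0; // aload_0 and return take no operand
        }
    }

    /**
     * opcodes: class-file opcodes recovered from the packed stream.
     * poolIndexes: resolved constant pool indexes, consumed in order
     * by the opcodes that take an operand.
     */
    static byte[] assemble(int[] opcodes, int[] poolIndexes) {
        // Pass 1: work out how many bytes the method body needs.
        int size = 0;
        for (int op : opcodes) {
            size += 1 + operandBytes(op);
        }
        // Pass 2: write the opcodes and fill in the operands.
        byte[] out = new byte[size];
        int pos = 0;
        int next = 0;
        for (int op : opcodes) {
            out[pos++] = (byte) op;
            if (operandBytes(op) == 2) {
                int index = poolIndexes[next++];
                out[pos++] = (byte) (index >> 8); // high byte first
                out[pos++] = (byte) index;
            }
        }
        return out;
    }
}
```

Because the pool indexes are only needed in the second pass, the
constant pool can be sorted into its final order before any operand is
written.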

That's your starter ... you might want to download the snapshot I made
for the bug and/or commit some of it; it was ugly, and had some hacks,
but it never really got worked on post EclipseCon so it might be a
better place to start from.

By the way, the pack200 spec is mind bending enough the first few
hundred times you read it. If you want to pick my brains on how
something works, feel free to drop me a line and I'll see if I can
help out.

Alex.
