hadoop-common-user mailing list archives

From Chris White <chriswhite...@gmail.com>
Subject Re: what is the code for WritableComparator.readVInt and WritableUtils.decodeVIntSize doing?
Date Sat, 31 Mar 2012 17:17:11 GMT
A Text object is written out as a vint holding the number of bytes,
followed by the byte array contents of the Text object.

Because a vint can be 1 to 5 bytes long, the decodeVIntSize method
examines the first byte of the vint to work out how many bytes to
skip over before the text bytes start.

readVInt then actually reads the vint bytes to get the length of the
following byte array.
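
To make the difference between the two calls concrete, here is a rough
stand-alone sketch. It does not use Hadoop; VIntSketch, writeVInt and
serializeText are hypothetical names, and the helpers only imitate the
encoding as I understand it from WritableUtils, so treat them as an
illustration rather than the library's code.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Stand-alone sketch: imitates Hadoop's vint layout, not Hadoop's actual classes.
class VIntSketch {

    // How many bytes the whole vint occupies, judged from its first byte only.
    static int decodeVIntSize(byte first) {
        if (first >= -112) {
            return 1;               // the value itself fits in this single byte
        } else if (first < -120) {
            return -119 - first;    // marker byte + payload of a negative value
        }
        return -111 - first;        // marker byte + payload of a positive value
    }

    // The value stored in the vint that starts at bytes[start].
    static int readVInt(byte[] bytes, int start) {
        byte first = bytes[start];
        if (first >= -112) {
            return first;
        }
        boolean negative = first < -120;
        int payload = decodeVIntSize(first) - 1;
        long value = 0;
        for (int i = 0; i < payload; i++) {
            value = (value << 8) | (bytes[start + 1 + i] & 0xFF);
        }
        return (int) (negative ? ~value : value);
    }

    // Hypothetical writer used only to produce sample bytes for this sketch.
    static void writeVInt(ByteArrayOutputStream out, int value) {
        if (value >= -112 && value <= 127) {
            out.write(value);
            return;
        }
        long v = value;
        int marker = -112;
        if (v < 0) {
            v = ~v;
            marker = -120;
        }
        for (long tmp = v; tmp != 0; tmp >>= 8) {
            marker--;               // one step per payload byte needed
        }
        out.write(marker);
        int payload = (marker < -120) ? -(marker + 120) : -(marker + 112);
        for (int idx = payload; idx != 0; idx--) {
            out.write((int) ((v >> ((idx - 1) * 8)) & 0xFF));
        }
    }

    // Length prefix (vint) followed by the UTF-8 bytes, as described above.
    static byte[] serializeText(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeVInt(out, utf8.length);
        out.write(utf8, 0, utf8.length);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] raw = serializeText("hadoop");
        System.out.println("prefix bytes:  " + decodeVIntSize(raw[0]));
        System.out.println("content bytes: " + readVInt(raw, 0));
    }
}
```

So decodeVIntSize answers "how wide is the length prefix" and readVInt
answers "what number is stored in that prefix". For short strings the
prefix is a single byte, which is why the two can look like the same
thing at first glance.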

So when you call the compareBytes method you need to pass in where the
actual bytes start (s1 + vIntLen) and how many bytes to compare (the
value returned by readVInt).
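
Put together, the comparison might look like the sketch below. Again
this is stand-alone Java, not Hadoop: RawTextCompareSketch, vIntSize,
compareText and serializedLength are hypothetical names, and
compareBytes here is my own unsigned lexicographic comparison standing
in for the real one.

```java
// Stand-alone sketch: the helpers imitate Hadoop's behaviour but are not Hadoop code.
class RawTextCompareSketch {

    // Number of bytes the vint occupies, judged from its first byte.
    static int vIntSize(byte first) {
        if (first >= -112) {
            return 1;
        }
        return (first < -120) ? -119 - first : -111 - first;
    }

    // Value of the vint starting at bytes[start] (here: the text's byte length).
    static int readVInt(byte[] bytes, int start) {
        byte first = bytes[start];
        if (first >= -112) {
            return first;
        }
        long value = 0;
        for (int i = 1; i < vIntSize(first); i++) {
            value = (value << 8) | (bytes[start + i] & 0xFF);
        }
        return (int) (first < -120 ? ~value : value);
    }

    // Lexicographic comparison of two byte ranges, treating bytes as unsigned.
    static int compareBytes(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        int n = Math.min(l1, l2);
        for (int i = 0; i < n; i++) {
            int x = b1[s1 + i] & 0xFF;
            int y = b2[s2 + i] & 0xFF;
            if (x != y) {
                return x - y;
            }
        }
        return l1 - l2;
    }

    // Compares the text starting at b1[s1] with the text starting at b2[s2].
    static int compareText(byte[] b1, int s1, byte[] b2, int s2) {
        int n1 = vIntSize(b1[s1]);      // bytes to skip before the content
        int n2 = vIntSize(b2[s2]);
        int len1 = readVInt(b1, s1);    // content length in bytes
        int len2 = readVInt(b2, s2);
        return compareBytes(b1, s1 + n1, len1, b2, s2 + n2, len2);
    }

    // Prefix bytes plus content bytes: the whole serialized size of one text.
    static int serializedLength(byte[] bytes, int start) {
        return vIntSize(bytes[start]) + readVInt(bytes, start);
    }

    public static void main(String[] args) {
        byte[] k1 = {3, 'a', 'b', 'c', 1, 'z'};
        byte[] k2 = {3, 'a', 'b', 'd', 1, 'a'};
        System.out.println("first fields:  " + compareText(k1, 0, k2, 0));
        int next = serializedLength(k1, 0);  // offset of the second field
        System.out.println("second fields: " + compareText(k1, next, k2, next));
    }
}
```

The sum in serializedLength is the same sum as in the book's firstL1:
prefix width plus content length is the total number of bytes the first
text occupies, which is also the offset at which the next field begins.
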
On Mar 31, 2012 12:38 AM, "Jane Wayne" <jane.wayne2978@gmail.com> wrote:

> in tom white's book, Hadoop, The Definitive Guide, in the second edition,
> on page 99, he shows how to compare the raw bytes of a key with Text
> fields. he shows an example like the following.
> int firstL1 = WritableUtils.decodeVIntSize(b1[s1]) + readVInt(b1, s1);
> int firstL2 = WritableUtils.decodeVIntSize(b2[s2]) + readVInt(b2, s2);
> his explanation is that firstL1 is the length of the first String/Text in
> b1, and firstL2 is the length of the first String/Text in b2. but i'm
> unsure of what the code is actually doing.
> what is WritableUtils.decodeVIntSize(...) doing?
> what is WritableComparator.readVInt(...) doing?
> why do we have to add the outputs of these 2 methods to get the length of
> the String/Text?
> could someone please explain in plain terms what's happening here? it seems
> WritableComparator.readVInt(...) is already getting the length of the
> byte[] corresponding to the string. it seems
> WritableUtils.decodeVIntSize(...) is also doing the same thing (from
> reading the javadoc).
> when i look at WritableUtils.writeString(...), two things happen. the
> length of the byte[] is written, followed by writing the byte[] itself. why
> can't we simply do something like the following to get the length?
> int firstL1 = readInt(b1[s1]);
> int firstL2 = readInt(b2[s2]);
