Reply-To: avro-user@hadoop.apache.org
In-Reply-To: <34fd060d1001221820n33babef0qaf6e6089e077e06c@mail.gmail.com>
References: <34fd060d1001221820n33babef0qaf6e6089e077e06c@mail.gmail.com>
From: Philip Zeyliger <philip@cloudera.com>
Date: Fri, 22 Jan 2010 18:38:48 -0800
Message-ID: <15da8a101001221838r40bdf4ddna5c679df7b53d3fe@mail.gmail.com>
Subject: Re: lazy deserialization?
To: avro-user@hadoop.apache.org
Content-Type: text/plain; charset=ISO-8859-1

Not with any of today's APIs.  "SELECT col1, col3 FROM t" is handled
easily: you construct a reader schema that has only those columns, and
col2 is skipped at read time.
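
To make "skipped at read time" concrete, here is a rough sketch in
plain Python.  It does not use the Avro library at all: it hand-rolls
the Avro binary encoding for strings (a zigzag varint length followed
by that many UTF-8 bytes) and assumes a record is just its fields
written back to back.  All function names are made up for
illustration; in Avro itself the skipping is driven by the schema you
read with, not by hand-written code like this.

```python
import io

def write_long(out, n):
    """Write n as an Avro long: zigzag-encode, then base-128 varint."""
    n = (n << 1) ^ (n >> 63)
    while n & ~0x7F:
        out.write(bytes([(n & 0x7F) | 0x80]))  # low 7 bits, "more" flag set
        n >>= 7
    out.write(bytes([n]))

def read_long(buf):
    """Read one varint from buf and undo the zigzag encoding."""
    shift = result = 0
    while True:
        b = buf.read(1)[0]
        result |= (b & 0x7F) << shift
        if not b & 0x80:
            break
        shift += 7
    return (result >> 1) ^ -(result & 1)

def write_string(out, s):
    data = s.encode("utf-8")
    write_long(out, len(data))
    out.write(data)

def read_string(buf):
    return buf.read(read_long(buf)).decode("utf-8")

def skip_string(buf):
    """Read only the length prefix, then jump over the payload bytes."""
    buf.seek(read_long(buf), io.SEEK_CUR)

def encode_row(col1, col2, col3):
    """Hypothetical writer for table t: three strings back to back."""
    out = io.BytesIO()
    for value in (col1, col2, col3):
        write_string(out, value)
    return out.getvalue()

def read_col1_col3(data):
    """SELECT col1, col3 FROM t: col2 is skipped, never decoded."""
    buf = io.BytesIO(data)
    col1 = read_string(buf)
    skip_string(buf)
    col3 = read_string(buf)
    return col1, col3
```

The point is that skipping col2 costs one length read and a seek; the
payload is never turned into a string.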

Does Hive have a use case for this that you're interested in?  If you
don't mind paying for the buffer copy, you could probably write a
"DeferredFoo" class that doesn't deserialize certain structures until
they are actually needed...
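
Something like the following is what I have in mind, again as a
plain-Python sketch rather than anything in the Avro API.  It assumes
the same layout as your example table (three Avro-encoded strings per
record), and "DeferredString" and "select_col2_where_col3" are
hypothetical names.  The deferred field copies its raw bytes out of
the stream and only decodes them if somebody asks.

```python
import io

def read_long(buf):
    """Read an Avro long: base-128 varint, then undo zigzag."""
    shift = result = 0
    while True:
        b = buf.read(1)[0]
        result |= (b & 0x7F) << shift
        if not b & 0x80:
            break
        shift += 7
    return (result >> 1) ^ -(result & 1)

def read_string(buf):
    return buf.read(read_long(buf)).decode("utf-8")

def skip_string(buf):
    buf.seek(read_long(buf), io.SEEK_CUR)

class DeferredString:
    """Hypothetical: holds a field's raw bytes, decodes on first use."""

    def __init__(self, raw):
        self._raw = raw
        self._value = None

    def get(self):
        if self._value is None:
            self._value = self._raw.decode("utf-8")
        return self._value

def read_deferred_string(buf):
    # This read is the buffer copy: bytes are copied, not decoded.
    return DeferredString(buf.read(read_long(buf)))

def select_col2_where_col3(data, wanted):
    """SELECT col2 FROM t WHERE col3 = wanted, for one encoded record."""
    buf = io.BytesIO(data)
    skip_string(buf)                    # col1 is not needed at all
    col2 = read_deferred_string(buf)    # copied, not yet decoded
    col3 = read_string(buf)             # decoded to test the predicate
    return col2.get() if col3 == wanted else None
```

If the whole record is already sitting in memory, as it is with
BytesIO here, you could instead remember buf.tell() before col2 and
seek back to it after checking col3, which avoids the copy.  That
depends on the input being seekable, which is an assumption about
your setup.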

-- Philip

On Fri, Jan 22, 2010 at 6:20 PM, Zheng Shao <zshao9@gmail.com> wrote:
> I noticed that Avro has "skip" functions that can help skip a
> field when deserializing data.
> This is good for column pruning in most cases, but we might be able to
> do better in the following case.
>
>
> Let's say we have a query like this:
>
> CREATE TABLE t (col1 STRING, col2 STRING, col3 STRING);
> SELECT col2 FROM t WHERE col3 = 'abcde';
>
> We want to read field col3 first; if it matches what we want, then
> we want to go back and read field col2.
>
>
> Is there any way to "remember" the current location of deserialization,
> so that we can "resume" from that point?
>
>
> --
> Yours,
> Zheng
>