avro-user mailing list archives

From "Peterson, Michael" <michael.d.peter...@truvenhealth.com>
Subject RE: Building skip table for Avro data
Date Mon, 08 Dec 2014 15:07:35 GMT
Ken,

Did you get this working? If so, I'd love to get some pointers (or a blog post) on how you
did it.  I'm interested in the same functionality in the near future, but don't have time
right now to dive in and figure it out.

-Michael

From: Ken Krugler [mailto:kkrugler_lists@transpac.com]
Sent: Thursday, December 04, 2014 11:05 PM
To: user@avro.apache.org
Subject: RE: Building skip table for Avro data

Hi Doug,

________________________________

From: Doug Cutting

Sent: December 4, 2014 7:41:05am PST

To: user@avro.apache.org<mailto:user@avro.apache.org>

Subject: Re: Building skip table for Avro data


Have you looked at SortedKeyValueFile?

https://avro.apache.org/docs/current/api/java/org/apache/avro/hadoop/file/SortedKeyValueFile.html

This may already provide what you need.
Seems pretty darn close.

So to use this in a regular Hadoop job, it sounds like I'd essentially not emit anything via
the reducer's standard context.write() call; instead I'd set up a SortedKeyValueFile.Writer
that writes to the job's output directory (using a sub-directory Path based on the task number),
yes?

Which means I'd wind up with <job output path>/part-xxxxx/data and <job output path>/part-xxxxx/index

However I'm a little confused about the /data file being "an ordinary Avro container file".
The doc says:

     Each record has exactly two fields, 'key' and 'value'. The keys are sorted lexicographically

But both the key and value fields have their own Avro Schema, right? Or at least that's what
I assume from the withValueSchema<https://avro.apache.org/docs/current/api/java/org/apache/avro/hadoop/file/SortedKeyValueFile.Writer.Options.html#getValueSchema()>
and withKeySchema<https://avro.apache.org/docs/current/api/java/org/apache/avro/hadoop/file/SortedKeyValueFile.Writer.Options.html#withKeySchema(org.apache.avro.Schema)>
calls for setting up the writer options.
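If it helps, here's a rough sketch of what that reducer-side setup might look like. This is an untested illustration, not working job code: it assumes avro-mapred and hadoop-client on the classpath, the output path and schemas are placeholders, and in a real reducer you'd derive the task number from the context rather than hard-coding it.

```java
// Sketch: writing reducer output through SortedKeyValueFile.Writer instead of
// context.write(). The output path, schemas, and task id below are placeholders.
import org.apache.avro.Schema;
import org.apache.avro.hadoop.file.SortedKeyValueFile;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

public class SkvfSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        int taskId = 0;  // in a real reducer: context.getTaskAttemptID().getTaskID().getId()
        // One sub-directory per reduce task, mirroring part-xxxxx naming.
        Path outDir = new Path("/tmp/job-output", String.format("part-r-%05d", taskId));

        SortedKeyValueFile.Writer.Options options =
            new SortedKeyValueFile.Writer.Options()
                .withConfiguration(conf)
                .withPath(outDir)
                .withKeySchema(Schema.create(Schema.Type.STRING))
                .withValueSchema(Schema.create(Schema.Type.LONG))
                .withIndexInterval(128);  // every 128th key also lands in the index file

        try (SortedKeyValueFile.Writer<CharSequence, Long> writer =
                 new SortedKeyValueFile.Writer<>(options)) {
            // Keys must be appended in sorted order -- the reduce-side sort gives you this.
            writer.append("key-a", 1L);
            writer.append("key-b", 2L);
        }
        // Result: /tmp/job-output/part-r-00000/data and /tmp/job-output/part-r-00000/index
    }
}
```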

So in that case:

a. how is it sorted lexicographically (as per the SortedKeyValueFile<https://avro.apache.org/docs/current/api/java/org/apache/avro/hadoop/file/SortedKeyValueFile.html>
JavaDocs)?

b. How would a reader who's expecting a regular Avro file read the records? Would they get
records that were the union of fields in the key + value schemas?

Thanks again,

-- Ken



On Dec 3, 2014 10:14 PM, "Joey Echeverria" <joey@cloudera.com<mailto:joey@cloudera.com>>
wrote:
It sounds feasible to me. You can certainly seek to a specific sync
marker and so long as you're periodically calling sync to get the last
position, then you can save those offsets in a separate file(s) that
you load into memory or search sequentially.

This sounds very similar to MapFiles which used a pair of
SequenceFiles, one with the data and one with an index of every Nth
key to speed up lookups of sorted data.

-Joey

On Wed, Dec 3, 2014 at 6:06 PM, Ken Krugler <kkrugler_lists@transpac.com<mailto:kkrugler_lists@transpac.com>>
wrote:
> Hi all,
>
> I'm looking for suggestions on how to optimize a number of Hadoop jobs
> (written using Cascading) that only need a fraction of the records stored in
> Avro files.
>
> Essentially I have a small number (let's say 10K) of random keys
> out of a total of 100M unique values, and I need to select & process all and
> only those records in my Avro files where the key field matches. The set of
> keys of interest changes with each run.
>
> I have about 1TB of compressed data to scan through, saved as roughly 200
> files of 5GB each. This represents about 10B records.
>
> The data format has to stay as Avro, for interchange with various groups.
>
> As I'm building the Avro files, I could sort by the key field.
>
> I'm wondering if it's feasible to build a skip table that would let me seek
> to a sync position in the Avro file and read from it. If the default sync
> interval is 16K, then I'd have 65M of these that I could use, and even if
> every key of interest had 100 records that were each in a separate block,
> this would still dramatically cut down on the amount of data I'd have to
> scan over.
>
> But is that possible? Any input would be appreciated.
>
> Thanks,
>
> -- Ken
>


--
Joey Echeverria


--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr





