pig-user mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: record numbers
Date Sat, 24 Jan 2009 20:32:33 GMT
I can't say for sure, but I think that this will only work for relatively
small input files or small clusters.  If Pig parallelizes the loading, then
you are likely to have multiple lines with each count.  This happens because
you will have multiple JVMs running, each with its own counter.
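
To make the effect concrete, here is a minimal Java sketch. It does not use
the Pig API: SplitCounterDemo and numberSplit are hypothetical names that
stand in for one loader instance reading one input split, and the two-way
split of the input is assumed purely for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SplitCounterDemo {
    /**
     * Hypothetical stand-in for one loader instance working on one split.
     * Every instance starts counting from the same configured value.
     */
    public static List<String> numberSplit(List<String> lines, int seqStart) {
        List<String> out = new ArrayList<String>();
        int seq = seqStart; // private counter, not shared with other instances
        for (String line : lines) {
            out.add(seq + "\t" + line);
            seq++;
        }
        return out;
    }

    public static void main(String[] args) {
        // One input of four records, assumed to be split across two instances.
        List<String> combined = new ArrayList<String>();
        combined.addAll(numberSplit(Arrays.asList("A", "B"), 1));
        combined.addAll(numberSplit(Arrays.asList("C", "D"), 1));
        for (String record : combined) {
            System.out.println(record);
        }
    }
}
```

Both calls start at 1, so the combined output contains two records numbered
1 and two records numbered 2.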

This is a good example of a program that will probably pass simple unit
tests but then fail in production because it implicitly depends on the way
that the computation is laid out in the cluster.
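
If unique but non-contiguous numbers are acceptable, one way to avoid
depending on the layout is to interleave the id space between tasks. The
sketch below is only an illustration under assumptions: InterleavedIds,
uniqueId, taskIndex and numTasks are hypothetical names, and how a loader
would learn its task index and the total number of tasks is not addressed
here.

```java
public class InterleavedIds {
    /**
     * Builds an id from a per-task record index and the task's own index.
     * Task t only ever produces ids congruent to t modulo numTasks, so two
     * different tasks can never produce the same id.
     */
    public static long uniqueId(long localIndex, int taskIndex, int numTasks) {
        if (numTasks <= 0 || taskIndex < 0 || taskIndex >= numTasks) {
            throw new IllegalArgumentException("taskIndex must be in [0, numTasks)");
        }
        return localIndex * numTasks + taskIndex;
    }

    public static void main(String[] args) {
        int numTasks = 2; // assumed number of parallel tasks
        for (int task = 0; task < numTasks; task++) {
            for (long local = 0; local < 2; local++) {
                System.out.println("task " + task + " record " + local
                        + " -> id " + uniqueId(local, task, numTasks));
            }
        }
    }
}
```

The ids are unique across tasks but leave gaps when tasks see different
numbers of records, so this does not give true record numbers.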

On Sat, Jan 24, 2009 at 3:48 AM, Baldo Faieta <baldofaieta@yahoo.com> wrote:

> What I did is create a loader that adds line numbers as it loads:
>
> So, I do the following call:
>
> H = LOAD '/tmp/users' USING com.adobe.okotto.utils.SeqLoader('\t',
> '2000000');
>
> and the result is that the first column is a sequence of numbers that
> starts with 2000000
>
> the code is as follows.  It probably can be done better, but it does the
> job
>
> I hope it helps.
>
> Baldo
>
> --------
>
> package com.adobe.okotto.utils;
>
> import java.io.IOException;
> import java.nio.charset.Charset;
>
> import org.apache.pig.LoadFunc;
> import org.apache.pig.data.Tuple;
> import org.apache.pig.impl.io.BufferedPositionedInputStream;
>
>
> /**
>  * A load function that parses a line of input into fields using a
>  * delimiter to set the fields. The delimiter is given as a regular
>  * expression. See String.split(delimiter) and
>  * http://java.sun.com/j2se/1.5.0/docs/api/java/util/regex/Pattern.html
>  * for more information.
>  */
> public class SeqLoader implements LoadFunc
> {
>    protected BufferedPositionedInputStream in = null;
>    long end = Long.MAX_VALUE;
>    private byte recordDel = (byte)'\n';
>    private String fieldDel = "\t";
>    private int seqStart = 0;
>
>    final private static Charset utf8 = Charset.forName("UTF8");
>
>    public SeqLoader() {
>    }
>
>    /**
>     * Constructs a Pig loader that uses the specified regex as a field
>     * delimiter.
>     *
>     * @param delimiter
>     *            the regular expression that is used to separate fields
>     *            ("\t" is the default). See
>     *            http://java.sun.com/j2se/1.5.0/docs/api/java/util/regex/Pattern.html
>     *            for a complete explanation.
>     */
>    public SeqLoader(String delimiter) {
>        this.fieldDel = delimiter;
>    }
>
>    /**
>     * Constructs a Pig loader that uses the specified regex as a field
>     * delimiter and where the first column is a sequence number.
>     *
>     * @param delimiter
>     *            the regular expression that is used to separate fields
>     *            ("\t" is the default). See
>     *            http://java.sun.com/j2se/1.5.0/docs/api/java/util/regex/Pattern.html
>     *            for a complete explanation.
>     * @param seq
>     *            the number to start the sequence with, given as a
>     *            decimal string
>     */
>    public SeqLoader(String delimiter, String seq)
>    {
>        this.fieldDel = delimiter;
>        this.seqStart = Integer.parseInt(seq);
>    }
>
>
>    public Tuple getNext() throws IOException
>    {
>        if (in == null || in.getPosition() > end) {
>            return null;
>        }
>        String line;
>        if((line = in.readLine(utf8, recordDel)) != null) {
>            if (line.length()>0 && line.charAt(line.length()-1)=='\r') {
>                line = line.substring(0, line.length()-1);
>            }
>            String seqLine = seqStart + fieldDel + line;
>            seqStart++;
>            return new Tuple(seqLine, fieldDel);
>        }
>        return null;
>    }
>
>    public void bindTo(String fileName, BufferedPositionedInputStream in,
> long offset, long end) throws IOException
>    {
>        this.in = in;
>        this.end = end;
>
>        // Since we are not block aligned, we throw away the first
>        // record and count on a different instance to read it
>        if (offset != 0) {
>            getNext();
>        }
>    }
>
>    public void finish() throws IOException {
>    }
>
>    public boolean equals(Object obj)
>    {
>        // Guard against null and unrelated types before casting
>        return obj instanceof SeqLoader && equals((SeqLoader)obj);
>    }
>
>    public boolean equals(SeqLoader other)
>    {
>        return other != null && this.fieldDel.equals(other.fieldDel)
>                && this.seqStart == other.seqStart;
>    }
>
> }
>
>
>
> On 24 Jan 2009, at 01:04, Vadim Zaliva wrote:
>
>> I need to add a column to a data file, with a unique integer value for
>> each record.
>> In simplest case it could be a record number in a dataset. For example:
>>
>> (A)
>> (B)
>> (C)
>>
>> should become
>>
>> (1,A)
>> (2,B)
>> (3,C)
>>
>> It looks like there is no way to do it in pure Pig. I am even unsure how
>> to do it via a UDF.
>> Has anybody solved a similar problem?
>>
>> Sincerely,
>> Vadim
>>
>> --
>> "Hated by fools, and fools to hate, be this my motto and my fate"
>> (Jonathan Swift)
>>
>> --
>> "La perfection est atteinte non quand il ne reste rien a ajouter, mais
>> quand il ne reste rien a enlever."  (Antoine de Saint-Exupery)
>


-- 
Ted Dunning, CTO
DeepDyve
4600 Bohannon Drive, Suite 220
Menlo Park, CA 94025
www.deepdyve.com
650-324-0110, ext. 738
858-414-0013 (m)
