hadoop-common-user mailing list archives

From Rob Stewart <robstewar...@googlemail.com>
Subject Join Hadoop Example problem
Date Tue, 26 Jan 2010 01:43:45 GMT
Hi there, I'm using Hadoop 0.20.1 and I'm trying to use the Join application
within the hadoop-*examples.jar. I can't figure out where I'm going wrong:
it isn't grouping the keys together as I would expect.
------------------------
> bin/hadoop dfs -cat join/a.txt
AAAAAAAA,a0
BBBBBBBB,a1
CCCCCCCC,a2
CCCCCCCC,a3

> bin/hadoop dfs -cat join/b.txt
AAAAAAAA,b0
BBBBBBBB,b1
BBBBBBBB,b2
BBBBBBBB,b3

> bin/hadoop dfs -cat join/c.txt
AAAAAAAA,c0
BBBBBBBB,c1
DDDDDDDD,c2
DDDDDDDD,c3

>

-----*RESULT*-----
> bin/hadoop dfs -text theOutputs/part-00000
AAAAAAAA        [a0]
AAAAAAAA        [b0]
AAAAAAAA        [c0]
BBBBBBBB        [c1]
BBBBBBBB        [a1]
BBBBBBBB        [b1]
BBBBBBBB        [b2]
BBBBBBBB        [b3]
CCCCCCCC        [a2]
CCCCCCCC        [a3]
DDDDDDDD        [c2]
DDDDDDDD        [c3]
-----------------------


So why has it not grouped the values for each key (all the AAAAAAAA's, etc.)
so that the output instead looks like this:

AAAAAAAA        [a0,b0,c0]
BBBBBBBB        [a1,b1,c1]
BBBBBBBB        [a1,b2,c1]
BBBBBBBB        [a1,b3,c1]
CCCCCCCC        [a2,,]
CCCCCCCC        [a3,,]
DDDDDDDD        [,,c2]
DDDDDDDD        [,,c3]

?
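
To make the expected result concrete: the grouping above corresponds to a
full outer join on the key, with an empty slot wherever a file has no value
for that key. The following is a minimal Python sketch of that semantics
only. It runs outside Hadoop, it is not how the Join example is implemented,
and the function name outer_join is hypothetical.

```python
from collections import defaultdict
from itertools import product


def outer_join(*sources):
    """Full outer join of several lists of (key, value) pairs.

    Hypothetical helper for illustration; not part of Hadoop.
    """
    n = len(sources)
    # For each key, keep one list of values per input source.
    grouped = defaultdict(lambda: [[] for _ in range(n)])
    for i, source in enumerate(sources):
        for key, value in source:
            grouped[key][i].append(value)

    rows = []
    for key in sorted(grouped):
        # A source with no values for this key contributes one empty slot.
        slots = [values or [""] for values in grouped[key]]
        # Emit one row per combination of values across the sources.
        for combo in product(*slots):
            rows.append((key, list(combo)))
    return rows


# The contents of join/a.txt, join/b.txt and join/c.txt, split on the comma.
a = [("AAAAAAAA", "a0"), ("BBBBBBBB", "a1"), ("CCCCCCCC", "a2"), ("CCCCCCCC", "a3")]
b = [("AAAAAAAA", "b0"), ("BBBBBBBB", "b1"), ("BBBBBBBB", "b2"), ("BBBBBBBB", "b3")]
c = [("AAAAAAAA", "c0"), ("BBBBBBBB", "c1"), ("DDDDDDDD", "c2"), ("DDDDDDDD", "c3")]

for key, values in outer_join(a, b, c):
    print(f"{key}\t[{','.join(values)}]")
```

With the three sample files as input, this sketch produces the same eight
rows as the expected listing above.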

---------------------

I have another question. Instead of these key/value pairs, what if I
have two input files, list1.txt and list2.txt, each containing a list
of names, one name per line? I want to JOIN these input files BY the
names in each list, i.e. I want to create an output file containing
the names that appear in both input lists. Is it possible to adapt the
Join example packaged with Hadoop to implement this?
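
As a statement of the result being asked for, and not of how Hadoop would
compute it, the desired output is the set intersection of the two lists. A
minimal Python sketch, assuming each file holds one name per line; the
function name names_in_both and the sample names are made up for
illustration.

```python
def names_in_both(list1_lines, list2_lines):
    """Return the sorted names that occur in both inputs.

    Hypothetical helper for illustration; blank lines are ignored and
    surrounding whitespace is stripped before comparing.
    """
    names1 = {line.strip() for line in list1_lines if line.strip()}
    names2 = {line.strip() for line in list2_lines if line.strip()}
    return sorted(names1 & names2)


# Made-up stand-ins for the lines of list1.txt and list2.txt.
list1 = ["alice\n", "bob\n", "carol\n"]
list2 = ["bob\n", "dave\n", "alice\n", "\n"]

for name in names_in_both(list1, list2):
    print(name)
```

In join terms this is an inner join on the name itself, where only the key
matters and there is no value to carry along.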


Many thanks,

Rob Stewart
