hadoop-mapreduce-user mailing list archives

From Mapred Learn <mapred.le...@gmail.com>
Subject Re: Delimiter selection for Sequence Files
Date Wed, 15 Jun 2011 21:25:14 GMT
Hi Harsh,
I am also trying to do something like:
 hadoop fs -text /user/cloudera/staging/test_file_0.seq|cut -f 3 -d '\x01'

And the file contents are:
0       340\x01234\x010067\x010.00\x01
1       230\x01454\x010045\x010.00\x01

But I get :
cut: the delimiter must be a single character
Try `cut --help' for more information.
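
The error arises because the single quotes pass the four characters \x01 to cut unchanged, so cut sees a multi-character delimiter. A minimal sketch of one workaround, assuming a POSIX shell: let printf produce the single 0x01 byte and hand that to cut. The function name cut_ctrl_a and the sample line are illustrative, not from the thread; in Bash/Zsh, $'\x01' (the quoting Harsh mentions) should expand to the same byte.

```shell
# Hypothetical helper: print the Nth field of Ctrl-A (0x01) delimited input.
cut_ctrl_a() {
    # printf '\001' emits the single byte 0x01; cut receives it as a one-character delimiter.
    cut -d "$(printf '\001')" -f "$1"
}

# Sample line modelled on the file contents above: key, tab, then Ctrl-A separated values.
third=$(printf '0\t340\001234\0010067\0010.00\001\n' | cut_ctrl_a 3)
echo "$third"
```

With the real file this would be: hadoop fs -text /user/cloudera/staging/test_file_0.seq | cut_ctrl_a 3. Note that field 1 then holds the key, the tab, and the first value together, since only 0x01 is treated as a separator.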




On Wed, Jun 15, 2011 at 11:16 AM, Harsh J <harsh@cloudera.com> wrote:

> For Hive, it's best if the delimiter character's value is also < 127 in
> decimal. I think Hive uses a signed byte to represent the delimiter,
> and that may lead to issues if a greater value is chosen.
>
> I've seen Hive take ASCII and octal representations in its statements
> for delimiters. You can use a hex value in your shell simply by
> passing it as a literal.
>
> For example, on Bash/ZSH I do:
> $ echo $'\x1B' # For the 'escape' character.
>
> On Wed, Jun 15, 2011 at 11:37 PM, Mapred Learn <mapred.learn@gmail.com>
> wrote:
> > If I use the hex value of a delimiter, e.g. \x01 for Ctrl-A, can
> > I use it as a delimiter in Hive/Unix cut commands?
> >
> >
> > On Tue, Jun 14, 2011 at 7:10 AM, Mapred Learn <mapred.learn@gmail.com>
> > wrote:
> >>
> >> Thanks Joe for the reply!
> >> "@@##@@" looks like a big value for a delimiter.
> >> I will also choose something like a hex number so that it does not
> >> appear in the data.
> >> Sent from my iPhone
> >> On Jun 13, 2011, at 5:33 PM, Joe Stein <joe.stein@medialets.com> wrote:
> >>
> >> I have had quite a few data sets where I had no idea if my delimiter
> >> was in there, so what I did was replace my delimiter with a string I
> >> knew would not be in there during map, and then in the reducer replace
> >> it back again.
> >>
> >> e.g.
> >>
> >> replace("^","@@##@@") for each line
> >>
> >> then use ^ as your delimiter
> >>
> >> and in the reducer replace("@@##@@","^") for each line
> >>
> >> and in your reducer output qualify things appropriately for how you
> >> want/need to deal with the output
> >>
> >> now if your problem is splitting each line during your map and not
> >> knowing what to split on... well that is very related to your context
> >>
> >> you could JOIN map side a list of all possible characters with your data
> >> set and then reduce output only characters not found and use that as
> >> your delimiter.... who knows maybe you will find out that ~ is not in
> >> your data...
> >>
> >> On Mon, Jun 13, 2011 at 8:18 PM, Mapred Learn <mapred.learn@gmail.com>
> >> wrote:
> >>>
> >>> Hi,
> >>> I was thinking of using CTRL A as delimiter but data that I am loading
> >>> to Hadoop already has CTRL A in it. What are other good choices of
> >>> delimiters that anybody might have used in this kind of scenario,
> >>> considering that I also want to query this data using Hive?
> >>>
> >>> Thanks in advance
> >>> -JJ
> >>
> >>
> >> --
> >> /*
> >> Joe Stein, 973-944-0094
> >> http://www.medialets.com
> >> Twitter: @allthingshadoop
> >> */
> >
> >
>
>
>
> --
> Harsh J
>
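
The escape-and-restore idea Joe describes in the quoted thread can be sketched outside MapReduce with sed. This is only an illustration under assumptions: the helper names escape_caret and restore_caret and the sample fields are hypothetical, and the placeholder @@##@@ is assumed never to occur in the real data.

```shell
# Hypothetical helpers; the placeholder @@##@@ is the one suggested in the thread.
escape_caret()  { sed 's/\^/@@##@@/g'; }   # "map" side: hide literal carets in the data
restore_caret() { sed 's/@@##@@/^/g'; }    # "reduce" side: put the carets back

# Two sample fields, the first of which contains the delimiter character itself.
field1='a^b'
field2='c'

# Escape each field, then join the fields with ^ as the delimiter.
record="$(printf '%s\n' "$field1" | escape_caret)^$(printf '%s\n' "$field2" | escape_caret)"

# Splitting on ^ is now unambiguous; restore the original caret afterwards.
first=$(printf '%s\n' "$record" | cut -d '^' -f 1 | restore_caret)
echo "$record"
echo "$first"
```

The trade-off is that the placeholder has to be chosen so it cannot collide with real data, which is the same problem as choosing the delimiter, only with a longer and therefore less likely string.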
