hbase-user mailing list archives

From Andrey Stepachev <oct...@gmail.com>
Subject Re: Sorting columns
Date Mon, 21 Jun 2010 17:18:54 GMT
2010/6/21 Jonathan Gray <jgray@facebook.com>

> Yes, when using Scan, even on 0.20, everything will be sorted.
>

Good. And is this also the case for intra-row scans? (As I understand it,
sorting is achieved by a merge scan of the stores.)
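
To make the merge-scan idea concrete, here is a minimal Python sketch. It is
not HBase code: it only assumes that each store (the memstore and every store
file) already yields its qualifiers in sorted order, and it reuses the
"1 3 5 from stores and 4 from memstore" example from earlier in the thread.
The name merge_scan and the sample lists are hypothetical.

```python
import heapq

def merge_scan(*sorted_stores):
    """Lazily merge already-sorted per-store sequences into one sorted stream."""
    # heapq.merge does a k-way merge; each input must already be sorted.
    return heapq.merge(*sorted_stores)

# Hypothetical qualifiers: 1, 3, 5 sit in store files, 4 is still in the memstore.
store_file_a = [1, 5]
store_file_b = [3]
memstore = [4]

merged = list(merge_scan(store_file_a, store_file_b, memstore))
print(merged)
```

Because every input is sorted and the merge always emits the smallest head
element, the combined stream is sorted without a separate sort step, which is
why no client-side sort would be needed under this assumption.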


>
> Re: OOM, you'll need more memory or you'll need to break stuff up across
> rows.  Not much else to be done about that :)
>

But with an intra-row scan I can avoid the OOM (and it works) :). My question
was about the ordering of intra-row scanning.
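
As an illustration of the "detect a real key change" step mentioned further
down the thread, here is a rough Python sketch. It is not the HBase client
API: it assumes a batched intra-row scan hands back one wide row as several
consecutive partial results, modelled here as hypothetical (row_key, cells)
tuples, and that all pieces of one row arrive next to each other.

```python
from itertools import groupby
from operator import itemgetter

def group_partial_results(partials):
    """Yield (row_key, chunks); chunks iterates over that row's partial results.

    `partials` is an iterable of (row_key, cells) tuples in scan order, so all
    pieces of one row are adjacent. Each row's chunks must be consumed before
    moving on to the next row, which keeps memory use bounded by one chunk.
    """
    for row_key, group in groupby(partials, key=itemgetter(0)):
        yield row_key, (cells for _, cells in group)

# Hypothetical scan output: "index-1" is wide and arrives in three pieces.
partials = [
    ("index-1", ["c1", "c2"]),
    ("index-1", ["c3", "c4"]),
    ("index-1", ["c5"]),
    ("index-2", ["c1"]),
]

counts = {}
for row_key, chunks in group_partial_results(partials):
    counts[row_key] = sum(len(cells) for cells in chunks)
print(counts)
```

Only one chunk of a wide row is held at a time, which is the point of scanning
piecemeal instead of fetching the whole row at once.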


>
> > -----Original Message-----
> > From: Andrey Stepachev [mailto:octo47@gmail.com]
> > Sent: Monday, June 21, 2010 6:40 AM
> > To: user@hbase.apache.org
> > Subject: Re: Sorting columns
> >
> > 2010/6/19 Jonathan Gray <jgray@facebook.com>
> >
> > > So there is no confusion, everything is sorted in HBase.  All columns in
> > > each family are sorted, always.
> > >
> >
> > That's good news! Thanks. I have had no time (and not enough knowledge of
> > HBase) to check this myself. Now it's clear (and I always use Scan for now).
> >
> >
> > >
> > > There are optimizations for Get queries (in 0.20 but gone in trunk) that
> > > make it so that what gets returned to the client is not completely sorted
> > > though it would be mostly sorted.
> >
> > Is it true that if I use Scan (even when the scan is really a get) in 0.20,
> > I'll get everything sorted?
> >
> >
> > > Are you returning millions of columns at once?  Otherwise it shouldn't be
> > > too expensive to do the sorted() call in the client.
> > >
> > I got an OOM when I tried to build an index (I have 1 index key which
> > points to 5 million other keys, so I got the OOM in the server). With the
> > intra-row scan I can scan these columns (mostly in an MR job) to do some
> > work.
> > After I got the OOM, I changed the schema to use compound keys. It is a bit
> > complicated to build such keys (instead of a simple LongWritable and
> > friends). Maybe Avro can help, but I haven't tried it yet. With the
> > intra-row scan I get a slightly more complicated Result scan (I need to
> > detect a real key change), but this way is less complicated than compound
> > keys.
> >
> >
> >
> > >
> > > > -----Original Message-----
> > > > From: Andrey Stepachev [mailto:octo47@gmail.com]
> > > > Sent: Saturday, June 19, 2010 5:45 AM
> > > > To: user@hbase.apache.org
> > > > Subject: Re: Sorting columns
> > > >
> > > > 2010/6/19 Stack <stack@duboce.net>
> > > >
> > > > > > On Thu, Jun 17, 2010 at 12:18 PM, Andrey Stepachev <octo47@gmail.com>
> > > > > > wrote:
> > > > > > > As far as I can see in the sources, there is no place where KVs
> > > > > > > are sorted (except the client-side Result.sorted() method). So
> > > > > > > we can get KeyValues from the stores and from the memstore (for
> > > > > > > example, 1 3 5 from the stores and 4 from the memstore) in
> > > > > > > incorrect order.
> > > > > >
> > > > > > Or I miss something?
> > > > > >
> > > > >
> > > > > > Data is sorted in hbase.  Scanning, we'll be running a scanner
> > > > > > against each data store element -- memstore and one for each
> > > > > > store file -- and we'll pop off the elements in order.  That's
> > > > > > the general story.  There may once have been a legitimate reason
> > > > > > for the client-side sort -- perhaps when our Get and Scan code
> > > > > > paths differed it was needed -- but as to whether it is still
> > > > > > required, I'm not sure.  I'd have to dig.  Anyone else?
> > > > >
> > > >
> > > > It would be very interesting to know whether HBase guarantees the
> > > > ordering of columns. If someone uses very wide rows, they are not very
> > > > useful without sorting (and of course one should be aware of the
> > > > partitioning problem for wide rows).
> > > > Suppose we want to work with time data. In that case we can use dates
> > > > as qualifiers and expect the data in sorted order; we can't sort it
> > > > somewhere else, because we would lose most of HBase's advantage.
> > > >
> > > >
> > > >
> > > > >
> > > > > >
> > > > > > >> > The rest of the data needs to be accessed occasionally. We
> > > > > > >> > want to avoid getting it shipped to the client as it makes
> > > > > > >> > our map reduce job go out of memory.
> > > > > >> >
> > > > > >>
> > > > > > >> You are not using incremental get on a row?  You should be
> > > > > > >> able to get your big rows piecemeal.
> > > > > >>
> > > > > > > This scanner API change (the intra-row scanner) was not included
> > > > > > > in 0.20.4 :(
> > > > > >
> > > > >
> > > > > Oh.
> > > > >
> > > > > > Sorry about that Andrey.  Somehow we missed your backport of
> > > > > > HBASE-1537.  I just applied it.  It'll appear in the 0.20.5RC4 I'm
> > > > > > rolling now.  Please excuse our bungling.
> > > > >
> > > >
> > > > Not a problem. I'll wait for 0.20.5. But I should warn that with this
> > > > patch 0.20.5 will not be wire compatible with 0.20.4 (because the
> > > > patch adds an additional field to Scan, which makes Scan binary
> > > > incompatible).
> > > >
> > > > Personally, I'm not using the intra-row scanner now because of the
> > > > unknown ordering; I use compound keys.
> > > > Moreover, intra-row scanning should use a separate API (giving Result
> > > > the ability to fetch additional KVs for a given row) to be more usable
> > > > and easier to use.
> > >
>
