hbase-user mailing list archives

From Taeyun Kim <taeyun....@innowireless.com>
Subject RE: Get addColumn + ColumnRangeFilter
Date Fri, 16 Jan 2015 07:47:28 GMT
Some more.

The files cannot be physically merged (that is, each file must retain its identity), since
there is a requirement that each individual file group can be deleted.
And since the files are postprocessed individually, there is no need to scan through all the
file groups, so HBase's 'slow' scan speed relative to HDFS sequential reads is not a concern.

-----Original Message-----
From: Taeyun Kim [mailto:taeyun.kim@innowireless.com] 
Sent: Friday, January 16, 2015 4:36 PM
To: 'user@hbase.apache.org'
Subject: RE: Get addColumn + ColumnRangeFilter

It's a somewhat long story.
Maybe I am using HBase in a somewhat unusual way.

My use case is as follows:

I didn't want to put many small files into HDFS (since that is bad for HDFS, both for scalability
and performance).

The small files are grouped by test log, since the files are the many facets of the result of
the analysis of one test log. So, they could be the members of one SequenceFile.
But I did not find SequenceFile (or similar formats) attractive, since I would still end up with
many not-so-big (about 20MB, except for rare cases) SequenceFiles: the analysis result
files are not so big and the test log files are continually generated.
So some manual file management and merging would be required.

So, I decided to use an HBase record as a kind of 'directory' to avoid the manual file management
(directory = file group). This way, the 'files' are automatically 'merged' into appropriately
sized HFiles, and as a bonus the 'files' can be automatically deleted when their lifetime
is over.

The 'directory' has the following files.

- 'm': meta file. (to check the version of the 'directory' format)
- 'Result.csv.0'
- 'Result.csv.1'
- ...
- 'Result.csv.p': parts file. (has the split count and each size. 'p' is for 'parts')
- 'AnotherResultA.csv.0'
- 'AnotherResultA.csv.1'
- ...
- 'AnotherResultA.csv.p'
- 'TestEnvironment.txt'

Each 'file' is saved as a column.

Result files are split for the following reasons:
- To handle the extreme case where a file is too big to be processed by one task.
- To save task process memory: the split size is actually smaller than 64MB (the size for one
task) and each split is individually compressed. This way, a task process holds at most one uncompressed column at a time.
A task is assigned multiple 'splits'.

For this, I've written an InputFormat class.

Now, the InputFormat class first Gets both 'm' and a parts file to obtain the InputSplit information.
This is not a problem: a single Get with 2 addColumn() calls is sufficient.
But when the whole content of a file must be read (like Files.readAllBytes()), I must Get 'm'
and an unknown number of splits whose names fall in a range (Result.csv.0 ~ Result.csv.7), ideally with a
single Get (addColumn() + ColumnRangeFilter). But with the current HBase behavior,
it seems that I have to issue 2 Gets, or disable the version check. (Maybe not a big deal?)
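
To make the name range concrete, here is a minimal sketch in plain Java (no HBase dependency) of which 'files' in the 'directory' fall strictly between `name` and `name + "~"`, the two exclusive bounds used in the original snippet. It assumes ASCII qualifiers, for which Java's String ordering agrees with the byte-wise ordering HBase uses for column qualifiers. The class and method names (`ColumnRangeDemo`, `select`, `sampleDirectory`) are hypothetical and exist only for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

final class ColumnRangeDemo {
    // Returns the qualifiers strictly between `name` and `name + "~"`,
    // modelling a column range with both bounds exclusive.
    static List<String> select(TreeSet<String> qualifiers, String name) {
        return new ArrayList<>(qualifiers.subSet(name, false, name + "~", false));
    }

    // A sample 'directory' row using the column names listed in this mail.
    static TreeSet<String> sampleDirectory() {
        TreeSet<String> row = new TreeSet<>();
        row.add("m");
        row.add("Result.csv.0");
        row.add("Result.csv.1");
        row.add("Result.csv.p");
        row.add("AnotherResultA.csv.0");
        row.add("AnotherResultA.csv.1");
        row.add("AnotherResultA.csv.p");
        row.add("TestEnvironment.txt");
        return row;
    }

    public static void main(String[] args) {
        // prints [Result.csv.0, Result.csv.1, Result.csv.p]
        System.out.println(select(sampleDirectory(), "Result.csv"));
    }
}
```

Two things are visible in this model: the range also picks up the parts file 'Result.csv.p', and it never contains 'm', so 'm' has to be requested by some other means.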

That's all.

If you think that this record layout is not efficient, or that there is a better solution, please let me
know.
 
BTW, in the current implementation, when both addColumn() and ColumnRangeFilter are applied, they
are effectively combined with an 'AND' operator. Right?

-----Original Message-----
From: Ted Yu [mailto:yuzhihong@gmail.com]
Sent: Friday, January 16, 2015 3:39 PM
To: user@hbase.apache.org
Subject: Re: Get addColumn + ColumnRangeFilter

I reproduced the failed test (testAddColumnWithColumnRangeFilter) after modifying your test
case to fit master branch.

The reason for one Cell being returned is that ExplicitColumnTracker is used by ScanQueryMatcher
to first check if the column is part of the requested columns (f:fc in your case). The other
columns don't pass this check, hence they're not included in the result.
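
The effect described above can be modelled in plain Java without any HBase classes. This is only a sketch of the observed semantics (explicit-column check first, then the range filter), not the actual ScanQueryMatcher code. The class `ExplicitColumnModel`, its methods, and the qualifiers `c0` .. `c9` are hypothetical stand-ins for the columns in the attached tests.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

final class ExplicitColumnModel {
    // Models a Get: `explicit` plays the role of addColumn() calls,
    // (min, max) plays the role of a column range with exclusive bounds.
    static List<String> get(TreeSet<String> row, Set<String> explicit,
                            String min, String max) {
        List<String> cells = new ArrayList<>();
        for (String qualifier : row) {
            // Step 1: explicit-column check, applied only if columns were requested.
            if (!explicit.isEmpty() && !explicit.contains(qualifier)) continue;
            // Step 2: the range filter sees only what survived step 1.
            if (qualifier.compareTo(min) <= 0 || qualifier.compareTo(max) >= 0) continue;
            cells.add(qualifier);
        }
        return cells;
    }

    // Hypothetical row: one meta column 'm' plus ten columns c0 .. c9.
    static TreeSet<String> sampleRow() {
        TreeSet<String> row = new TreeSet<>();
        row.add("m");
        for (int i = 0; i < 10; i++) row.add("c" + i);
        return row;
    }

    public static void main(String[] args) {
        TreeSet<String> row = sampleRow();
        // Range filter only: prints 10
        System.out.println(get(row, Collections.<String>emptySet(), "c", "c~").size());
        // Explicit column inside the range: prints [c3]
        System.out.println(get(row, Collections.singleton("c3"), "c", "c~"));
        // Explicit column outside the range: prints []
        System.out.println(get(row, Collections.singleton("m"), "c", "c~"));
    }
}
```

Under this model the two conditions behave like an intersection, which matches both reported failures: one Cell when the explicit column lies inside the range, and an empty Result when it lies outside.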

Before this part of the code is changed, can I ask why you need to call
g.addColumn() when g has a ColumnRangeFilter associated with it?

Cheers

On Thu, Jan 15, 2015 at 6:22 PM, Taeyun Kim <taeyun.kim@innowireless.com>
wrote:

> (Sorry if this mail is a duplicate)
>
> Hi Ted,
>
> I've attached 2 unit test classes.
>
> Both have one failed test.
>
> - HBaseAddColumnWithColumnRangeFilterTest1.testAddColumnWithColumnRangeFilter():
>   Expected: 10, Actual: 1
> - HBaseAddColumnWithColumnRangeFilterTest2.testAddColumnWithColumnRangeFilter():
>   Result is empty
>
> If the tests have problems, please let me know.
>
>
> -----Original Message-----
> From: Ted Yu [mailto:yuzhihong@gmail.com]
> Sent: Thursday, January 15, 2015 6:59 PM
> To: user@hbase.apache.org
> Subject: Re: Get addColumn + ColumnRangeFilter
>
> Can you write a unit test which shows this behavior?
>
> Thanks
>
>
>
> > On Jan 14, 2015, at 9:09 PM, Taeyun Kim <
> taeyun.kim.innowireless@gmail.com> wrote:
> >
> > Hi,
> >
> >
> >
> > I have a situation where both Get.addColumn() and Get.setFilter(new
> > ColumnRangeFilter(…)) are needed on one Get.
> >
> > The source code snippet is as follows:
> >
> >
> >
> >        Get g = new Get(getRowKey(lfileId));
> >        g.addColumn(Schema.ColumnFamilyNameBytes, MetaColumnNameBytes);
> >        g.setFilter(new ColumnRangeFilter(Bytes.toBytes(name), false,
> >            Bytes.toBytes(name + "~"), false));
> >        Result r = table.get(g);
> >
> >        if (r.isEmpty())
> >            throw new FileNotFoundException(
> >                String.format("%d:%d:%s", projectId, lfileId, name));
> >
> >
> >
> > When g.addColumn() is commented out, the Result is not empty, while
> > with g.addColumn() the Result is empty (FileNotFoundException is thrown).
> >
> > Is it illegal to use both methods?
> >
> >
> >
> > BTW, the version of HBase used is 0.98 (Hortonworks HDP 2.1).
> >
> >
> >
> > Thanks.
>

