hbase-issues mailing list archives

From "stack (JIRA)" <j...@apache.org>
Subject [jira] [Created] (HBASE-16425) [Operability] Autohandling 'bad data'
Date Tue, 16 Aug 2016 22:03:20 GMT
stack created HBASE-16425:
-----------------------------

             Summary: [Operability] Autohandling 'bad data'
                 Key: HBASE-16425
                 URL: https://issues.apache.org/jira/browse/HBASE-16425
             Project: HBase
          Issue Type: Brainstorming
          Components: Operability
            Reporter: stack


This is a brainstorming issue. It came up chatting w/ a couple of operators talking about
'bad data'; i.e. no matter how you control your clients, someone by mistake or under a misconception
will load an out-of-spec Cell or Row. In this particular case, two types of 'bad data' were
talked about:

(1) The Big Cell: A 'big cell' came in via bulk load, and it so happened that the operator's
frontends all arrived at the malignant Cell at the same time, so hundreds of threads were
requesting the big Cell at once. The RS OOME'd. Then, when the region opened on a new RS, that
RS OOME'd too, and so on. Could we switch to chunking when a server sees that it has a large
Cell on its hands? I suppose bulk load could defeat any Put chunking we had in place, but it
would be good to have this too. Chatting w/ Matteo, we probably want to just move to the
streaming interface that we've talked of at various times in the past; the Get would chunk out
the big Cell for assembly on the client, or just give back the Cell in pieces -- an OutputStream
for the application to suck on. The new API and/or the old API could use it when Cells are big.
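
To make the chunking idea concrete, here is a minimal sketch in plain Java of handing a big value
back in bounded pieces and reassembling it on the client. Nothing in it is HBase API: the class
CellChunker, its methods, and the chunkSize parameter are all hypothetical names made up for
illustration, and a real version would stream from the HFile rather than from an in-memory array.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;
import java.util.Iterator;
import java.util.NoSuchElementException;

/** Hypothetical sketch only; none of these names exist in HBase. */
public class CellChunker {

    /** Yields successive slices of the value, each at most chunkSize bytes long. */
    public static Iterator<byte[]> chunks(final byte[] value, final int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        return new Iterator<byte[]>() {
            private int offset = 0;

            @Override
            public boolean hasNext() {
                return offset < value.length;
            }

            @Override
            public byte[] next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                int end = Math.min(offset + chunkSize, value.length);
                byte[] slice = Arrays.copyOfRange(value, offset, end);
                offset = end;
                return slice;
            }
        };
    }

    /** Client side: concatenates the pieces back into one value, in arrival order. */
    public static byte[] assemble(Iterator<byte[]> pieces) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while (pieces.hasNext()) {
            byte[] piece = pieces.next();
            out.write(piece, 0, piece.length);
        }
        return out.toByteArray();
    }
}
```

The point of the iterator shape is that whoever serves the value only has to hold one slice per
request at a time; a client that does not need the whole Cell in memory can consume the pieces
directly instead of calling assemble.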

(2) The user had a row with 29M columns in it because the default entity had id=-1.... In
this case chunking the Scan (v1.1+) helps, but the operator was having trouble finding the
problem row. How could we surface anomalies like this for operators? On flush, we could add even
more metadata to the HFile (Yahoo! Data Sketches, as [~jleach] has been suggesting) and then have
an offline tool read the metadata and run it through a few simple rules. Data Sketches are
mergeable, so we could build up a region-view or store-view....
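
As a rough illustration of "mergeable metadata plus a few simple rules", here is a plain-Java
sketch. It is a stand-in, not the Data Sketches library: it tracks exact per-file maxima rather
than probabilistic summaries, and the class StoreFileStats, its methods, and the thresholds are
all hypothetical. It also assumes a row's columns live in one file; a row spread across several
files would be undercounted.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical per-file summary, written at flush time; not an HBase or Data Sketches API. */
public class StoreFileStats {
    private long maxColumnsInRow = 0;
    private String widestRow = null;
    private long maxCellBytes = 0;
    private String biggestCellRow = null;

    /** Records one row seen during flush: its column count and its largest cell size. */
    public void observeRow(String rowKey, long columnCount, long largestCellBytes) {
        if (columnCount > maxColumnsInRow) {
            maxColumnsInRow = columnCount;
            widestRow = rowKey;
        }
        if (largestCellBytes > maxCellBytes) {
            maxCellBytes = largestCellBytes;
            biggestCellRow = rowKey;
        }
    }

    /** Combines two summaries into a store-level or region-level view. */
    public static StoreFileStats merge(StoreFileStats a, StoreFileStats b) {
        StoreFileStats merged = new StoreFileStats();
        for (StoreFileStats s : new StoreFileStats[] {a, b}) {
            if (s.widestRow != null) {
                merged.observeRow(s.widestRow, s.maxColumnsInRow, 0);
            }
            if (s.biggestCellRow != null) {
                merged.observeRow(s.biggestCellRow, 0, s.maxCellBytes);
            }
        }
        return merged;
    }

    /** Offline rule check: reports the offending row keys so an operator can find them. */
    public List<String> violations(long columnLimit, long cellBytesLimit) {
        List<String> found = new ArrayList<>();
        if (maxColumnsInRow > columnLimit) {
            found.add("row '" + widestRow + "' has " + maxColumnsInRow + " columns");
        }
        if (maxCellBytes > cellBytesLimit) {
            found.add("row '" + biggestCellRow + "' has a cell of " + maxCellBytes + " bytes");
        }
        return found;
    }
}
```

An offline tool would read one such summary per HFile, merge them up to the store or region, and
call violations with whatever limits the operator considers in-spec. Because the summary carries
the row key along with the maximum, the 29M-column row would be named directly rather than hunted
for.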

This is sketchy, and I'm pretty sure it repeats stuff in old issues, but I'm parking this note
here while the encounter is still fresh.



