hive-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Sushanth Sowmyan (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-5020) HCat reading null-key map entries causes NPE
Date Tue, 01 Jul 2014 19:47:26 GMT

    [ https://issues.apache.org/jira/browse/HIVE-5020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049238#comment-14049238
] 

Sushanth Sowmyan commented on HIVE-5020:
----------------------------------------

Sorry for the late response to this jira, and thanks for the input, all. I'd initially wanted
to give it time for more people to respond, and then this fell by the wayside.

Thrift structures do not support map null keys. I agree that sortedness is not important for
maps, and in fact, we should not guarantee it for something that's just called a map.

And while I'd like to see a usecase for nulls in keys supported, it looks like the conventional
hive semantics for maps ignores null keys, and changing rcfile users so that they suddenly
start getting null keys is a recipe for trouble for a lot of users. So having orc map to rc
behaviour, and make that the standard "hive" behaviour might make more sense. [~owen.omalley]/[~prasanth_j],
could you comment on what you think the impact of changing orc behaviour that way might be?

HCat should adopt whatever behaviour we standardize on for hive, and can follow after that.

> HCat reading null-key map entries causes NPE
> --------------------------------------------
>
>                 Key: HIVE-5020
>                 URL: https://issues.apache.org/jira/browse/HIVE-5020
>             Project: Hive
>          Issue Type: Bug
>          Components: HCatalog
>            Reporter: Sushanth Sowmyan
>            Assignee: Sushanth Sowmyan
>
> Currently, if someone has a null key in a map, HCatInputFormat will terminate with an
NPE while trying to read it.
> {noformat}
> java.lang.NullPointerException
> at java.lang.String.compareTo(String.java:1167)
> at java.lang.String.compareTo(String.java:92)
> at java.util.TreeMap.put(TreeMap.java:545)
> at org.apache.hcatalog.data.HCatRecordSerDe.serializeMap(HCatRecordSerDe.java:222)
> at org.apache.hcatalog.data.HCatRecordSerDe.serializeField(HCatRecordSerDe.java:198)
> at org.apache.hcatalog.data.LazyHCatRecord.get(LazyHCatRecord.java:53)
> at org.apache.hcatalog.data.LazyHCatRecord.get(LazyHCatRecord.java:97)
> at org.apache.hcatalog.mapreduce.HCatRecordReader.nextKeyValue(HCatRecordReader.java:203)
> {noformat}
> This is because we use a TreeMap to preserve order of elements in the map when reading
from the underlying storage/serde.
> This problem is easily fixed in a number of ways:
> a) Switch to HashMap, which allows null keys. That does not preserve order of keys, which
should not be important for map fields, but if we desire that, we have a solution for that
too - LinkedHashMap, which would both retain order and allow us to insert null keys into the
map.
> b) Ignore null keyed entries - check if the field we read is null, and if it is, then
ignore that item in the record altogether. This way, HCat is robust in what it does - it does
not terminate with an NPE, and it does not allow null keys in maps that might be problematic
to layers above us that are not used to seeing nulls as keys in maps.
> Why do I bring up the second fix? First, I bring it up because of the way we discovered
this bug. When reading from an RCFile, we do not notice this bug. If the same query that produced
the RCFile instead produces an Orcfile, and we try reading from it, we see this problem.
> RCFile seems to be quietly stripping any null key entries, whereas Orc retains them.
This is why we didn't notice this problem for a long while, and suddenly, now, we are. Now,
if we fix our code to allow nulls in map keys through to layers above, we expose layers above
to this change, which may then cause them to break. (Technically, this is stretching the case
because we already break now if they care) More importantly, though, we have a case now, where
the same data will be exposed differently if it were stored as orc or if it were stored as
rcfile. And as a layer that is supposed to make storage invisible to the end user, HCat should
attempt to provide some consistency in how data behaves to the end user.
> Secondly, whether or not nulls should be supported as keys in Maps seems to be almost
a religious view. Some people see it from a perspective of a "mapping", which lends itself
to a "Sure, if we encounter a null, we map to this other value" kind of a view, whereas other
people view it from a "lookup index" kind of view, which lends itself to a "null as a key
makes no sense - What kind of lookup do you expect to perform?" kind of view. Both views have
their points, and it makes sense to see if we need to support it.
> That said...
> There is another important concern at hand here: nulls in map keys might be due to bad
data(corruption or loading error), and by stripping them, we might be silently hiding that
from the user. So "silent stripping" is bad. This is an important point that does steer me
towards the former approach, of passing it on to layers above, and standardize on an understanding
that null keys in maps are acceptable data that layers above us have to handle. After that,
it could be taken on as a further consistency fix, to fix RCFile so that it allows nulls in
map keys.
> Having gone through this discussion of standardization, another important question is
whether or not there is actually a use-case for null keys in maps in data. If there isn't,
maybe we shouldn't allow writing that in the first place, and both orc and rcfile must simply
error out to the end user if they try to write a  null map key? Well, it is true that it is
possible that data errors lead to null keys, but it's also possible that the user wants to
store a mapping for value transformations, and they might have a transformation for null as
well. In the case I encountered it, they were writing out an intermediate table after having
read from a sparse table using a custom input format that generated an arbitrary number of
columns, and were using the map to store column name mappings that would eventually be written
out to another table. That seems a valid use, and we shouldn't prevent users from this sort
of usage.
> Another reason for not allowing null keys from a java perspective is locking and concurrency
concerns, where locking on a null is a pain, per disagreements between Joshua Bloch and Doug
Lea in the design of HashMap and ConcurrentHashMap. However, given that HCatalog reads are
happening in a thread on a drone where there should be no parallel access of that record,
and more importantly, this should strictly be used in a read-only kind of usage, we should
not have to worry about that.
> Increasingly, my preference is to change to LinkedHashMaps to allow null keys, and for
consistency's sake, after this is tackled, to see if we should be fixing RCFile to allow null
keys(this might be trickier since RCFile has a lot of other users that are probably currently
working.)
> Another option is to change to LinkedHashMap, but also add a conf key to hcat to allow
the user to specify whether or not we want to strip nulls. That way, a user can specify what
behaviour they like. That's more cruft though, and I don't want to go down that path unless
there is a user that explicitly wants/needs that.
> Anyone have any other thoughts on the matter?



--
This message was sent by Atlassian JIRA
(v6.2#6252)

Mime
View raw message