arrow-dev mailing list archives

From "Wes McKinney (JIRA)" <j...@apache.org>
Subject [jira] [Created] (ARROW-3408) [C++] Add option to CSV reader to dictionary encode individual columns or all string / binary columns
Date Tue, 02 Oct 2018 17:49:00 GMT
Wes McKinney created ARROW-3408:
-----------------------------------

             Summary: [C++] Add option to CSV reader to dictionary encode individual columns or all string / binary columns
                 Key: ARROW-3408
                 URL: https://issues.apache.org/jira/browse/ARROW-3408
             Project: Apache Arrow
          Issue Type: New Feature
          Components: C++
            Reporter: Wes McKinney
             Fix For: 0.12.0


For many datasets, dictionary encoding everything can result in drastically lower memory usage
and, consequently, better performance when doing analytics.
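
As a rough illustration of the requested feature, the option surface on the CSV reader's
conversion options might look something like the following; the struct and field names
(auto_dict_encode, dictionary_columns) are assumptions made for illustration, not a committed API.

{code:cpp}
#include <string>
#include <unordered_set>

// Hypothetical shape of the new options -- illustrative only, not the actual
// arrow::csv::ConvertOptions definition.
struct CsvDictionaryOptions {
  // If true, dictionary-encode every string / binary column that is read.
  bool auto_dict_encode = false;
  // Otherwise, dictionary-encode only the columns named here.
  std::unordered_set<std::string> dictionary_columns;
};
{code}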

One difficulty with dictionary encoding during multithreaded conversion is that, ideally, you
end up with a single dictionary at the end. That leaves two options:

* Implement a concurrent hashing scheme -- for low-cardinality dictionaries, the overhead
associated with mutex contention will not be meaningful; for high-cardinality dictionaries it
can be more of a problem

* Hash each chunk separately, then normalize the per-chunk dictionaries at the end (see the sketch below)
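
A minimal sketch of the second option, assuming plain standard-library containers rather than
Arrow's actual hashing machinery; the helper names (EncodeChunk, UnifyDictionaries) are made up
for illustration. Each thread hashes its own chunk with no synchronization, and a single pass at
the end remaps every chunk's indices into one global dictionary.

{code:cpp}
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Per-chunk dictionary: distinct string -> chunk-local index.
using ChunkDict = std::unordered_map<std::string, int32_t>;

// Encode one chunk independently; each thread owns its chunk, so no locking is needed.
std::vector<int32_t> EncodeChunk(const std::vector<std::string>& values, ChunkDict* dict) {
  std::vector<int32_t> indices;
  indices.reserve(values.size());
  for (const auto& v : values) {
    auto it = dict->find(v);
    if (it == dict->end()) {
      it = dict->emplace(v, static_cast<int32_t>(dict->size())).first;
    }
    indices.push_back(it->second);
  }
  return indices;
}

// Normalization pass: build one global dictionary and remap each chunk's indices into it.
void UnifyDictionaries(const std::vector<ChunkDict>& chunk_dicts,
                       std::vector<std::vector<int32_t>>* chunk_indices,
                       std::unordered_map<std::string, int32_t>* global_dict) {
  for (size_t i = 0; i < chunk_dicts.size(); ++i) {
    // Map chunk-local index -> global index.
    std::vector<int32_t> remap(chunk_dicts[i].size());
    for (const auto& kv : chunk_dicts[i]) {
      auto it = global_dict->find(kv.first);
      if (it == global_dict->end()) {
        it = global_dict->emplace(kv.first, static_cast<int32_t>(global_dict->size())).first;
      }
      remap[kv.second] = it->second;
    }
    for (auto& idx : (*chunk_indices)[i]) {
      idx = remap[idx];
    }
  }
}
{code}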

My guess is that a crude concurrent hash table, with a mutex to protect mutations and resizes,
is going to outperform the latter approach (see the sketch below).
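
A minimal sketch of what such a crude mutex-protected table might look like, again using plain
standard-library types rather than Arrow's internal hash tables; in this simplest form the lock
also covers lookups, which is why contention matters more as cardinality (and therefore
insert/miss traffic) grows.

{code:cpp}
#include <cstdint>
#include <mutex>
#include <string>
#include <unordered_map>

// Shared dictionary for all conversion threads: one mutex serializes inserts
// and resizes (and, in this crude form, lookups as well).
class SharedDictionary {
 public:
  // Returns the dictionary index for `value`, inserting it if not yet present.
  int32_t GetOrInsert(const std::string& value) {
    std::lock_guard<std::mutex> lock(mutex_);
    auto it = index_.find(value);
    if (it == index_.end()) {
      it = index_.emplace(value, static_cast<int32_t>(index_.size())).first;
    }
    return it->second;
  }

 private:
  std::mutex mutex_;
  std::unordered_map<std::string, int32_t> index_;
};
{code}

With this approach every thread writes globally consistent indices directly, so no normalization
pass is needed at the end; for low-cardinality columns almost every call is a short critical
section around a hash lookup.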



