mxnet-dev mailing list archives

From "Subramanian, Anirudh" <ani...@amazon.com>
Subject Re: UTF-8 Support for TextParser
Date Tue, 27 Feb 2018 01:00:42 GMT
Hi Tianqi,

UTF-8 support would make other formats like CSV more usable. Otherwise, users have to
normalize their data in some way before using mxnet.
I understand that there is a tradeoff here because of the efficiency gains from the parser,
but the expectation of having to normalize their UTF-8
files may turn users away.

Anirudh

On 2/26/18, 3:54 PM, "workcrow@gmail.com on behalf of Tianqi Chen" <workcrow@gmail.com
on behalf of tqchen@cs.washington.edu> wrote:

    Since the LibSVM format is only going to involve numbers and possibly ASCII
    characters, is there any reason to add UTF-8 support? Note that
    generalization always comes at a cost in efficiency, and some
    effort has been spent on making the parser fast.
    
    Tianqi
    
    On Mon, Feb 26, 2018 at 3:38 PM, Anirudh <anirudh2290@gmail.com> wrote:
    
    > Hi all,
    >
    > Currently there is no UTF-8 support for the LibSVM, LibFM or CSV text parsers.
    > I am working on adding UTF-8 support for the text parsers. Since C++
    > doesn't have great built-in support for UTF-8, I am looking at
    > third-party libraries which provide Unicode support. I am considering ICU
    > currently. Any comments, suggestions, past experience, or gotchas about
    > Unicode third-party libraries or adding Unicode support in general are
    > highly appreciated.
    >
    > I have created an issue about the same:
    > https://github.com/dmlc/dmlc-core/issues/372
    > Please feel free to reply to this email or comment on the GitHub issue if
    > you have any input.
    >
    > Anirudh
    >
    
