thrift-dev mailing list archives

From "Jens Geyer (Jira)" <>
Subject [jira] [Commented] (THRIFT-5231) Improve Haskell parsing performance
Date Wed, 10 Jun 2020 20:37:00 GMT


Jens Geyer commented on THRIFT-5231:

The C++ headers seem to need some cleanup. I could track it back to commit d42a2c2bf9630cfb4d9d49cbee1fc812e5e5777d,
when the various string type constants were added.

These have never been used AFAIK:

 * T_UTF8       = 16,
 * T_UTF16      = 17

This is the right type for strings:

 * T_STRING     = 11,   ...

And this seems plain wrong. As per the whitepaper, all strings in Thrift are transmitted as UTF-8
across the wire, not UTF-7.

 * T_UTF7       = 11,
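
To make the aliasing concrete, here is a minimal C++ sketch that uses only the four constants quoted above. The enum name `TTypeSketch` and the helper `isWireString` are hypothetical and exist only for this illustration; the real header defines more enumerators, which are omitted here.

```cpp
#include <cstdint>

// Reduced, illustrative copy of the constants quoted above.
// Not the actual header: all other enumerators are left out.
enum TTypeSketch : std::int8_t {
  T_STRING = 11,
  T_UTF7   = 11,  // same numeric value as T_STRING, so only an alias
  T_UTF8   = 16,
  T_UTF16  = 17
};

// Hypothetical helper: does this type id denote a string on the wire?
constexpr bool isWireString(TTypeSketch t) {
  return t == T_STRING;
}

// Two enumerators with the same value compare equal, so a reader of the
// wire format cannot tell T_UTF7 and T_STRING apart.
static_assert(T_UTF7 == T_STRING, "T_UTF7 is an alias of T_STRING");
static_assert(isWireString(T_UTF7), "alias is treated as a string");
static_assert(!isWireString(T_UTF8), "16 is a distinct, unused id");
```

Because both names map to 11, dropping `T_UTF7` from such an enum would not change any value on the wire; it would only remove a misleading name.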

> Improve Haskell parsing performance
> -----------------------------------
>                 Key: THRIFT-5231
>                 URL:
>             Project: Thrift
>          Issue Type: Improvement
>          Components: Haskell - Library
>    Affects Versions: 0.13.0
>            Reporter: Philipp Hausmann
>            Priority: Major
>         Attachments: Main.hs, parse_benchmark.html
> We are using Thrift for (de-)serializing some Kafka messages and noticed that even
> at low throughput (1000 messages / second) a lot of CPU is used.
> I did a small benchmark just parsing a single T_BINARY value and if I use `readVal` for
> that it takes ~3ms per iteration. If instead I directly run the attoparsec parser, it only
> takes ~300ns. This is a difference of 4 orders of magnitude! Some difference is reasonable,
> as using `readVal` involves some IO and shuffling around of bytestrings, but the difference
> looks huge.
> I strongly suspect the implementation of `runParser` is not optimal. Basically it runs
> the parser with 1 byte, and until it succeeds it appends 1 byte and retries. This means that
> for a value of size 1024 bytes, we e.g. try to parse it 1023 times. This seems rather inefficient.
> I am not really sure how best to fix this. In principle, it makes sense to feed bigger
> chunks to attoparsec and store the left-overs somewhere for the next parse. However, if we
> store them in the transport or protocol, we have to implement that for each transport/protocol.
> Maybe an API change is necessary?
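
The retry pattern described in the report, and the chunk-plus-leftover alternative it suggests, can be modelled in a few lines. The following is a toy sketch in C++17, not the Haskell library's code: the wire format (a 4-byte big-endian length followed by the payload) and the names `tryParse`, `failedAttemptsByteWise`, `ChunkedReader` and `encode` are all assumptions made for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <utility>

// Result of a successful parse in the toy format.
struct Parsed {
  std::string payload;   // decoded value
  std::size_t consumed;  // number of input bytes used
};

// Returns the value if `buf` holds a complete frame, std::nullopt if more
// input is needed. Every call starts again at the beginning of `buf`.
std::optional<Parsed> tryParse(const std::string& buf) {
  if (buf.size() < 4) return std::nullopt;
  std::uint32_t len = 0;
  for (int i = 0; i < 4; ++i)
    len = (len << 8) | static_cast<unsigned char>(buf[i]);
  if (buf.size() - 4 < len) return std::nullopt;
  return Parsed{buf.substr(4, len), 4 + static_cast<std::size_t>(len)};
}

// Pattern from the report: append one byte, re-run the parser, repeat.
// Returns how many attempts failed before the parse succeeded.
std::size_t failedAttemptsByteWise(const std::string& wire) {
  std::string buf;
  std::size_t failures = 0;
  for (char c : wire) {
    buf.push_back(c);
    if (tryParse(buf)) break;
    ++failures;
  }
  return failures;
}

// Alternative: hand over whole chunks and keep the unconsumed bytes for
// the next value.
class ChunkedReader {
 public:
  // Appends a chunk; returns a value once a complete frame is buffered.
  std::optional<std::string> feed(const std::string& chunk) {
    leftover_ += chunk;
    ++attempts_;
    auto r = tryParse(leftover_);
    if (!r) return std::nullopt;
    leftover_.erase(0, r->consumed);  // keep bytes belonging to the next frame
    return std::move(r->payload);
  }
  std::size_t attempts() const { return attempts_; }
  const std::string& leftover() const { return leftover_; }

 private:
  std::string leftover_;
  std::size_t attempts_ = 0;
};

// Encodes a payload in the toy format (4-byte big-endian length prefix).
std::string encode(const std::string& payload) {
  assert(payload.size() <= 0xFFFFFFFFu);
  const std::uint32_t n = static_cast<std::uint32_t>(payload.size());
  std::string out;
  for (int shift = 24; shift >= 0; shift -= 8)
    out.push_back(static_cast<char>((n >> shift) & 0xFF));
  return out + payload;
}
```

In this model, a frame of 1024 bytes in total (4 length bytes plus 1020 payload bytes) fails 1023 times before the byte-wise loop succeeds, and each failed attempt rescans the buffer from the start. Feeding the same bytes to `ChunkedReader` in one chunk needs a single attempt, and any trailing bytes stay in `leftover()` for the next value. Where that leftover state should live (transport, protocol, or a new API layer) is exactly the open question raised in the report; the sketch does not answer it.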

