logging-log4j-dev mailing list archives

From "Remko Popma (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (LOG4J2-1305) Binary Layout
Date Thu, 03 Mar 2016 06:57:18 GMT

     [ https://issues.apache.org/jira/browse/LOG4J2-1305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Remko Popma updated LOG4J2-1305:
--------------------------------
         Labels: binary  (was: )
    Description: 
Logging in a binary format instead of in text can give large performance improvements. 

Logging text means going from a LogEvent object to formatted text, and then converting this
text to bytes. Performance investigations with text-based logging formats like PatternLayout
(see LOG4J2-930), and encoding Strings to bytes (LOG4J2-935, LOG4J2-1151) suggest that formatting
and encoding text is expensive and imposes limits on the performance that can be achieved.


A different approach would be to convert the LogEvent to a binary representation directly
without creating a text representation first. This would result in extremely compact log files
that are fast to write. The trade-off is that a binary log cannot easily be read in a general-purpose
editor like vi or Notepad. A specialized tool would be necessary to either display the log or
convert it to human-readable form.

Note: custom Messages that implement the {{Encoder}} interface (introduced with LOG4J2-1274)
can be written in binary form directly without first being converted to text (LOG4J2-506).
Any specialized tool for reading binary log files should handle messages of type "text" out
of the box, but could have some plugin mechanism for decoding custom messages.

This ticket proposes a simple BinaryLayout, where each LogEvent is logged in a binary format.

*Example BinaryLayout format*
||Offset||Type||Description||
|0|long|TimeMillis|
|8|long|NanoTime|
|16|int|Level|
|20|int|Logger name index - string value in separate file|
|24|int|Thread name index - string value in separate file|
|28|long|Thread ID|
|36|int|Marker index - value & hierarchy in separate file|
|40|int|Message length|
|44|int|Message type|
|48|byte[]|Message data - offsets below assume 12 bytes of message data|
|60|int|Throwable data length|
|64|byte[]|Throwable data - offsets below assume 16 bytes of Throwable data|
|80|int|ThreadContext key/value pair count|
|84|int|ThreadContext key index - string value in separate file|
|88|int|ThreadContext value index - string value in separate file|
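
As a rough illustration of the example format above, here is a minimal Java sketch that writes one record into a {{ByteBuffer}}. The class name, method name and parameters are hypothetical, big-endian byte order is assumed (byte order is still TBD, see below), and the Level is passed as a plain int because the ticket does not say how it is encoded.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of the example record format; not part of Log4j.
final class BinaryRecordSketch {

    /** Encodes one event; string values are passed as indexes into the string-data file. */
    static ByteBuffer encode(long timeMillis, long nanoTime, int level,
                             int loggerNameIdx, int threadNameIdx, long threadId,
                             int markerIdx, int messageType, byte[] message,
                             byte[] throwable, int[][] contextPairs) {
        // 48 bytes of fixed fields precede the message data.
        int size = 48 + message.length + 4 + throwable.length
                 + 4 + 8 * contextPairs.length;
        // Assumption: big-endian, which is also ByteBuffer's default.
        ByteBuffer buf = ByteBuffer.allocate(size).order(ByteOrder.BIG_ENDIAN);
        buf.putLong(timeMillis);          // offset 0
        buf.putLong(nanoTime);            // offset 8
        buf.putInt(level);                // offset 16
        buf.putInt(loggerNameIdx);        // offset 20
        buf.putInt(threadNameIdx);        // offset 24
        buf.putLong(threadId);            // offset 28
        buf.putInt(markerIdx);            // offset 36
        buf.putInt(message.length);       // offset 40
        buf.putInt(messageType);          // offset 44
        buf.put(message);                 // offset 48
        buf.putInt(throwable.length);     // offset 60 for a 12-byte message
        buf.put(throwable);
        buf.putInt(contextPairs.length);  // key/value pair count
        for (int[] pair : contextPairs) {
            buf.putInt(pair[0]);          // key index
            buf.putInt(pair[1]);          // value index
        }
        buf.flip();                       // make the record readable from position 0
        return buf;
    }

    public static void main(String[] args) {
        byte[] msg = "Hello, world".getBytes(StandardCharsets.UTF_8); // 12 bytes
        ByteBuffer rec = encode(1L, 2L, 400, 0, 1, 7L, 2, 0, msg,
                                new byte[16], new int[][] {{3, 4}});
        System.out.println(rec.remaining()); // prints 92
        System.out.println(rec.getInt(60));  // prints 16 (Throwable data length)
    }
}
```

With 12 bytes of message data, 16 bytes of Throwable data and one ThreadContext pair, the record is 92 bytes long, matching the offsets in the table.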

*Versioning*
The binary file must start with a header that carries version information and perhaps schema
information describing the log record. Schema information may make it possible
to include/exclude fields. For version 1.0, the schema can either be fixed like the above
example, or it could be a simple bitmask for the fields mentioned above.
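
To make the bitmask idea concrete, here is a minimal Java sketch of one possible header. The header layout (two shorts for the version followed by an int field mask), the bit assignments and all names are assumptions for illustration only; the ticket does not define them.

```java
import java.nio.ByteBuffer;

// Hypothetical header sketch: major version, minor version, field bitmask.
final class HeaderSketch {
    // Assumed bit assignments, one bit per field of the example format.
    static final int TIME_MILLIS    = 1 << 0;
    static final int NANO_TIME      = 1 << 1;
    static final int LEVEL          = 1 << 2;
    static final int LOGGER_NAME    = 1 << 3;
    static final int THREAD_NAME    = 1 << 4;
    static final int THREAD_ID      = 1 << 5;
    static final int MARKER         = 1 << 6;
    static final int MESSAGE        = 1 << 7;
    static final int THROWABLE      = 1 << 8;
    static final int THREAD_CONTEXT = 1 << 9;

    /** Writes an 8-byte header: short major, short minor, int field mask. */
    static ByteBuffer writeHeader(int major, int minor, int fieldMask) {
        ByteBuffer buf = ByteBuffer.allocate(8);
        buf.putShort((short) major).putShort((short) minor).putInt(fieldMask);
        buf.flip();
        return buf;
    }

    /** Returns true if the given field bit is set in the mask. */
    static boolean hasField(int mask, int field) {
        return (mask & field) != 0;
    }

    public static void main(String[] args) {
        ByteBuffer header = writeHeader(1, 0, TIME_MILLIS | LEVEL | MESSAGE);
        int mask = header.getInt(4);
        System.out.println(hasField(mask, LEVEL));     // prints true
        System.out.println(hasField(mask, NANO_TIME)); // prints false
    }
}
```

A reader would check the mask once per file and skip any field whose bit is not set.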

*Byte Order*
TBD: Are multi-byte values like ints and longs written in big-endian or little-endian order? This
could be specified in the header, or we could fix it to either one. Exchange protocols like
ITCH tend to select a fixed byte order (ITCH uses big-endian, i.e. network byte order). I like
the simplicity of this approach.
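
For reference, this small Java sketch shows how the same int is laid out under each byte order using {{java.nio.ByteBuffer}}. The class and method names are illustrative only.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

// Illustrative only: shows the byte layout of one int under both byte orders.
final class ByteOrderSketch {

    /** Encodes a single int into 4 bytes using the given byte order. */
    static byte[] encodeInt(int value, ByteOrder order) {
        return ByteBuffer.allocate(4).order(order).putInt(value).array();
    }

    public static void main(String[] args) {
        int value = 0x01020304;
        // Most significant byte first (network byte order, ByteBuffer's default).
        System.out.println(Arrays.toString(encodeInt(value, ByteOrder.BIG_ENDIAN)));
        // prints [1, 2, 3, 4]
        // Least significant byte first.
        System.out.println(Arrays.toString(encodeInt(value, ByteOrder.LITTLE_ENDIAN)));
        // prints [4, 3, 2, 1]
    }
}
```

Fixing the order in the format means a reader never has to branch on a header flag before decoding a field.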

*Multiple Files*
Repeating String data like thread names, logger names, marker names and ThreadContextMap keys
and values is saved to a separate string-data file. The main log file contains an index (the
line number, zero-based) into the string-data file instead of the full string. The format
of this file can simply be: each unique string on a separate line (separated by a '\n' (0x0A)
character). Any '\n' characters embedded in the string value are Unicode escaped and written
as "\u000A".
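
The string-data file described above could be modelled roughly as follows. This is a minimal in-memory Java sketch with hypothetical names; it keeps the lines in a list rather than writing a real file, and it is not thread-safe.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the string-data file: one unique string per line.
final class StringTableSketch {
    private final Map<String, Integer> indexByValue = new HashMap<>();
    private final List<String> lines = new ArrayList<>();

    /** Returns the zero-based line number for the value, adding it if it is new. */
    int indexOf(String value) {
        Integer existing = indexByValue.get(value);
        if (existing != null) {
            return existing;
        }
        int index = lines.size();
        indexByValue.put(value, index);
        lines.add(escape(value));
        return index;
    }

    /** Replaces each embedded newline with the six-character Unicode escape text. */
    static String escape(String value) {
        return value.replace("\n", "\\u000A");
    }

    /** Returns what would be written to the string-data file. */
    String fileContents() {
        StringBuilder sb = new StringBuilder();
        for (String line : lines) {
            sb.append(line).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        StringTableSketch table = new StringTableSketch();
        System.out.println(table.indexOf("main"));            // prints 0
        System.out.println(table.indexOf("com.example.Foo")); // prints 1
        System.out.println(table.indexOf("main"));            // prints 0 again
    }
}
```

One open point this sketch does not resolve: a string that already contains the literal escape text cannot be told apart from an escaped newline when the file is read back, so a real implementation would probably need to escape the backslash as well.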

TBD: as Matt points out in the comments, Markers are special since they are hierarchical. One
way to deal with this is to manage a separate file to save the Marker hierarchy. Another way
is to do something similar to PatternLayout: treat it as a String value, where the string
includes hierarchy information. I like the simplicity of the latter approach.

  was:
Logging in a binary format instead of in text can give large performance improvements. Text-based
logging formats are supported by layouts like PatternLayout, and performance investigations
like done in LOG4J2-930 suggest that it may be difficult to achieve good performance when
logging text. 

Logging text means going from a LogEvent object to formatted text, and then converting this
text to bytes. A different approach would be to convert the LogEvent to a binary representation
directly without creating a text representation first. This may make fast synchronous logging
possible, although that would additionally require a lock-free appender (LOG4J2-928).

This proposes a simple BinaryLayout, where each LogEvent is logged in a binary record like
this:
||Offset||Type||Description||
|0|long|TimeMillis|
|8|long|NanoTime|
|16|int|Level|
|20|int|Logger name index - string value in separate file|
|24|int|Thread name index - string value in separate file|
|28|long|Thread ID|
|36|short|Marker count|
|38|int|marker name index - string value in separate file ... (below offset assumes only one marker)|
|42|int|Message length|
|46|int|Message type|
|50|byte[]|Message data - below offset assumes 10 bytes of message data|
|60|int| Throwable data length|
|64|byte[]|Throwable data - below offset assumes 10 bytes of Throwable data|
|74|int|ThreadContext key/value pair count|
|78|int|ThreadContext key index - string value in separate file|
|82|int|ThreadContext value index - string value in separate file|





> Binary Layout
> -------------
>
>                 Key: LOG4J2-1305
>                 URL: https://issues.apache.org/jira/browse/LOG4J2-1305
>             Project: Log4j 2
>          Issue Type: New Feature
>          Components: Layouts
>            Reporter: Remko Popma
>              Labels: binary
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
