Mailing-List: contact jira-help@kafka.apache.org; run by ezmlm
Precedence: bulk
Reply-To: jira@kafka.apache.org
Date: Wed, 5 Jul 2017 11:04:00 +0000 (UTC)
From: "Neil Avery (JIRA)" <jira@apache.org>
To: jira@kafka.apache.org
Message-ID: <JIRA.13082492.1498480789000.172400.1499252640067@Atlassian.JIRA>
In-Reply-To: <JIRA.13082492.1498480789000@Atlassian.JIRA>
References: <JIRA.13082492.1498480789000@Atlassian.JIRA> <JIRA.13082492.1498480789186@jira-lw-us.apache.org>
Subject: [jira] [Comment Edited] (KAFKA-5515) Consider removing date
 formatting from Segments class
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
archived-at: Wed, 05 Jul 2017 11:04:05 -0000


    [ https://issues.apache.org/jira/browse/KAFKA-5515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16068546#comment-16068546 ] 

Neil Avery edited comment on KAFKA-5515 at 7/5/17 11:03 AM:
------------------------------------------------------------

*Investigation:*
Looking at the usage, SimpleDateFormat (SDF) is used for *parsing* segment file names during initialisation and for *formatting* them at runtime. I presume the reported problem lies in the formatting.

*Micro benchmark SDF*
Formatting 1,000,000 items takes about 250 ms once the JIT (HotSpot) has warmed up. Successive runs, in ms per million items: [707, 572, 543, 591, 546, 545, 363, 250, ...]
Parsing is slow: about 2,500 ms per 1,000,000 items.
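
For anyone wanting to reproduce numbers of this kind, a minimal sketch of such a micro-benchmark might look like the following. This is not the code behind the figures above; the class name, the pattern "yyyyMMddHHmm" and the UTC time zone are assumptions, and a proper harness such as JMH would give more trustworthy results.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

final class SdfMicroBench {
    // Assumed pattern and time zone; the ones used by Segments may differ.
    static final String PATTERN = "yyyyMMddHHmm";

    static SimpleDateFormat newFormatter() {
        SimpleDateFormat sdf = new SimpleDateFormat(PATTERN);
        sdf.setTimeZone(TimeZone.getTimeZone("UTC"));
        return sdf;
    }

    // Formats `count` timestamps spaced one minute apart; returns elapsed ms.
    static long timeFormat(int count) {
        SimpleDateFormat sdf = newFormatter();
        long sink = 0; // checked below so the JIT cannot drop the loop body
        long start = System.nanoTime();
        for (int i = 0; i < count; i++) {
            sink += sdf.format(new Date(i * 60_000L)).length();
        }
        long elapsedMs = (System.nanoTime() - start) / 1_000_000L;
        if (sink != 12L * count) {
            throw new IllegalStateException("unexpected output length");
        }
        return elapsedMs;
    }

    // Parses the same formatted timestamp `count` times; returns elapsed ms.
    static long timeParse(int count) {
        SimpleDateFormat sdf = newFormatter();
        long sink = 0;
        long start = System.nanoTime();
        try {
            for (int i = 0; i < count; i++) {
                sink += sdf.parse("201707051104").getTime();
            }
        } catch (ParseException e) {
            throw new IllegalStateException(e);
        }
        long elapsedMs = (System.nanoTime() - start) / 1_000_000L;
        if (count > 0 && sink == 0) {
            throw new IllegalStateException("parse produced no result");
        }
        return elapsedMs;
    }

    public static void main(String[] args) {
        // Repeated rounds show the effect of JIT warm-up on the timings.
        for (int round = 0; round < 8; round++) {
            System.out.println("format: " + timeFormat(1_000_000)
                    + " ms, parse: " + timeParse(1_000_000) + " ms");
        }
    }
}
```

Running main prints one line per round; the first rounds are expected to be slower than the later ones.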

Commons-lang3 FastDateFormat (FDF) is available in the project but is not a dependency of this particular module. The FDF micro-benchmark starts at 400 ms/million and then gets down to 350 ms (not very convincing).

Calendar usage is what costs performance, and both implementations do some caching internally.

Looking at this a different way, "Segments" is a time-series slicing/bucketing function used to group, allocate and look up segments etc.

I've knocked together a simple arithmetic alternative that breaks time into slices in which all months/years are of equal size, i.e. without using a calendar. It gives an approximate idea of the achievable performance: 150-200 ms before any JIT warm-up. The problem is that a real calendar is still used on initialisation to extract segment ids, so there will be inconsistencies and likely breakage.
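
To illustrate the idea (this is a sketch, not the actual prototype), a calendar-free formatter could look roughly like this. The class name, the 30-day month, the 360-day year and the "yyyyMMddHHmm"-style output are all assumptions made for the example.

```java
final class FixedSliceFormatter {
    static final long MINUTE_MS = 60_000L;
    static final long HOUR_MS = 60 * MINUTE_MS;
    static final long DAY_MS = 24 * HOUR_MS;
    static final long MONTH_MS = 30 * DAY_MS;  // assumption: every month has 30 days
    static final long YEAR_MS = 12 * MONTH_MS; // assumption: every year has 360 days

    // Produces a 12-character yyyyMMddHHmm-style string using division only.
    // The result is NOT a real calendar date once 30 days have elapsed.
    static String format(long epochMs) {
        if (epochMs < 0) {
            throw new IllegalArgumentException("epochMs must be non-negative");
        }
        long rem = epochMs;
        long year = 1970 + rem / YEAR_MS;
        rem %= YEAR_MS;
        long month = 1 + rem / MONTH_MS;
        rem %= MONTH_MS;
        long day = 1 + rem / DAY_MS;
        rem %= DAY_MS;
        long hour = rem / HOUR_MS;
        rem %= HOUR_MS;
        long minute = rem / MINUTE_MS;

        StringBuilder sb = new StringBuilder(12);
        pad(sb, year, 4);
        pad(sb, month, 2);
        pad(sb, day, 2);
        pad(sb, hour, 2);
        pad(sb, minute, 2);
        return sb.toString();
    }

    // Appends `value` left-padded with zeros to at least `width` digits.
    private static void pad(StringBuilder sb, long value, int width) {
        String s = Long.toString(value);
        for (int i = s.length(); i < width; i++) {
            sb.append('0');
        }
        sb.append(s);
    }
}
```

With these assumptions, 31 days after the epoch formats as "197002020000", whereas a real UTC calendar gives 1970-02-01. That is the kind of mismatch that would break parsing of existing segment names with a calendar-based parser.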

*Best performance*
The best alternative would be to ditch calendars for both parsing and formatting and to truncate/floor Unix time to minutes/hours etc. (at the cost of segment-filename readability). I'm not sure whether an operational upgrade path would be needed to make the change seamless.
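
A rough sketch of what flooring Unix time could look like is below. The class name, the method names and the "<store>-<segment start in ms>" naming scheme are hypothetical and only meant to show the shape of the approach, not a proposal for the final file-name format.

```java
final class FlooredSegments {
    // Maps a timestamp to the index of the fixed-size interval containing it.
    static long segmentId(long timestampMs, long intervalMs) {
        if (timestampMs < 0 || intervalMs <= 0) {
            throw new IllegalArgumentException("timestamp must be >= 0 and interval > 0");
        }
        return timestampMs / intervalMs;
    }

    // Hypothetical naming scheme: store name plus the segment's start time in ms.
    static String segmentName(String storeName, long segmentId, long intervalMs) {
        return storeName + "-" + (segmentId * intervalMs);
    }

    // Recovers the segment id from a name produced by segmentName(); no calendar needed.
    static long parseSegmentId(String segmentName, long intervalMs) {
        int sep = segmentName.lastIndexOf('-');
        long startMs = Long.parseLong(segmentName.substring(sep + 1));
        return startMs / intervalMs;
    }
}
```

For example, with a one-minute interval a timestamp of 125,000 ms falls into segment 2, which would be named "store-120000" and parses back to segment id 2. Formatting and parsing both reduce to integer arithmetic and a single Long.parseLong call.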


was (Author: neil.avery):
I've taken a look at dropping SimpleDateFormat and replacing it with commons-lang3-FastDateFormat (available in project but not a dependency on this module). 

Micro-benchmark comparisons show SDF starts at 800 ms/million and then drops to 250 ms once the JIT has warmed up. Interestingly, FDF starts at 400 ms/million and then gets down to 350 ms (not very convincing). Calendar usage is what costs performance, and both implementations do some caching internally. Looking at this a different way, "Segments" is a time-series slicing/bucketing function used to group, allocate and look up segments etc.

Does a real-world calendar matter? I've knocked together a simple arithmetic alternative that breaks time into slices in which all months/years are of equal size. The time format is identical, but the day/month will be incorrect because no calendar is used. This gets down to 150 ms pretty much straight away (SDF is still used for parsing).

All tests pass, system runs fine etc - but I'm not sure of the gravity of this as a possible change - will it break things - any advice or feedback?

> Consider removing date formatting from Segments class
> -----------------------------------------------------
>
>                 Key: KAFKA-5515
>                 URL: https://issues.apache.org/jira/browse/KAFKA-5515
>             Project: Kafka
>          Issue Type: Improvement
>          Components: streams
>            Reporter: Bill Bejeck
>            Assignee: Neil Avery
>              Labels: performance
>
> Currently the {{Segments}} class uses a date when calculating the segment id and uses {{SimpleDateFormat}} for formatting the segment id.  However this is a high volume code path and creating a new {{SimpleDateFormat}} and formatting each segment id is expensive.  We should look into removing the date from the segment id or at a minimum use a faster alternative to {{SimpleDateFormat}}.  We should also consider keeping a lookup of existing segments to avoid as many string operations as possible.


--
This message was sent by Atlassian JIRA
(v6.4.14#64029)