incubator-kafka-dev mailing list archives

From "Neha Narkhede (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (KAFKA-521) Refactor Log subsystem
Date Wed, 28 Nov 2012 02:46:59 GMT

    [ https://issues.apache.org/jira/browse/KAFKA-521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13505183#comment-13505183 ]

Neha Narkhede commented on KAFKA-521:
-------------------------------------

Pretty good cleanup, and great use of Java data structures for the segment list stuff.

1. Some new unit tests fail -
[error] Test Failed: testRecoveryFixesCorruptIndex(kafka.log.LogSegmentTest)
java.lang.NullPointerException
        at kafka.log.LogSegmentTest$$anonfun$testRecoveryFixesCorruptIndex$2.apply$mcVI$sp(LogSegmentTest.scala:175)
        at scala.collection.immutable.Range$ByOne$class.foreach$mVc$sp(Range.scala:282)
        at scala.collection.immutable.Range$$anon$2.foreach$mVc$sp(Range.scala:265)
        at kafka.log.LogSegmentTest.testRecoveryFixesCorruptIndex(LogSegmentTest.scala:174)

2. FileMessageSet
The param end says it is the absolute position in the file at which the message set ends.
However, when isSlice is false, it is passed in as Int.MaxValue. Can we change the comment
to reflect that? The API docs should also explain isSlice.
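To make the suggestion concrete, the updated doc comment could look something like this (the exact wording below is my own sketch, not from the patch):

```scala
/**
 * An on-disk message set, optionally restricted to a slice of the file.
 *
 * @param end The absolute position in the file at which this message set ends,
 *            or Int.MaxValue when isSlice is false (the set then spans the
 *            whole file and can grow with appends).
 * @param isSlice True if this instance is a fixed view over a sub-range of the
 *                file; false if it covers the entire file.
 */
```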

3. KafkaApis
3.1 This is not introduced by your code, but will be good to optimize -

            if (allOffsets.exists(_ > hw))
              hw +: allOffsets.dropWhile(_ > hw)
            else
              allOffsets
This code does two O(n) operations when one suffices. Just the dropWhile should be enough
here: it returns all offsets when none is greater than hw, and it filters out the leading
ones that are greater than hw.
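To illustrate with a standalone sketch (the hw and offset values here are made up; allOffsets is sorted in descending order, and the hw +: prepend in the original snippet is a separate concern):

```scala
object DropWhileSketch {
  def main(args: Array[String]): Unit = {
    val hw = 50L

    // Some offsets exceed the high watermark: dropWhile removes exactly
    // the leading elements greater than hw, in a single pass.
    val withLarger = List(90L, 70L, 50L, 30L)
    println(withLarger.dropWhile(_ > hw)) // List(50, 30)

    // No offset exceeds hw: the same single pass returns everything,
    // so the separate exists() pre-scan is redundant.
    val noneLarger = List(50L, 30L)
    println(noneLarger.dropWhile(_ > hw)) // List(50, 30)
  }
}
```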
3.2 Both LogManager and ReplicaManager take in the Time object that KafkaServer was created
with. This helps unit testing, since we can pass in MockTime. Since we moved getOffsetsBefore
to KafkaApis, does it make sense to have it use the same time object as well?
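The pattern being suggested, as a minimal sketch (the MockTime and Expirer names below are hypothetical stand-ins with the rough shape of Kafka's test utilities, not code from the patch):

```scala
// A Time abstraction lets production code use the system clock while
// tests substitute a controllable clock.
trait Time {
  def milliseconds: Long
}

object SystemTime extends Time {
  def milliseconds: Long = System.currentTimeMillis
}

// Hypothetical mock: tests advance the clock explicitly instead of sleeping.
class MockTime(var currentMs: Long = 0L) extends Time {
  def milliseconds: Long = currentMs
  def sleep(ms: Long): Unit = currentMs += ms
}

// A component that takes Time in its constructor, as LogManager and
// ReplicaManager do, becomes deterministic to test.
class Expirer(time: Time, ttlMs: Long) {
  private val createdMs = time.milliseconds
  def expired: Boolean = time.milliseconds - createdMs >= ttlMs
}
```

A test then constructs the component with a MockTime, advances it, and checks behavior without real sleeps.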

4. Log
4.1 To be consistent, it probably makes sense to change the comments on the public APIs to use
javadoc conventions instead. Right now, each param is documented but without the @param tag.
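For example, a read API documented in that style could look like this (method and wording hypothetical, just to show the convention):

```scala
/**
 * Read messages from the log.
 *
 * @param startOffset The offset to begin reading at.
 * @param maxLength The maximum number of bytes to read.
 * @return The messages read, as a message set.
 */
```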
4.2 Typo above logSegments() - oldes
4.3 In deleteOldSegments, it is probably better to acquire the lock before iterating over the
segments. If not, the list of segments to delete can be stale, which can lead to us
inadvertently deleting segments that we newly rolled over.
4.4 We talked about this offline, but the actual delete happening outside the synchronized block
is another bug.
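Both 4.3 and 4.4 come down to doing the scan and the removal under one lock, so a concurrent roll can never add a segment to an already-stale deletion snapshot. A hedged sketch of that shape (class and names hypothetical, strings standing in for real segments):

```scala
class SegmentStore {
  private val lock = new Object
  private var segments = List("seg-0", "seg-100") // oldest first

  // roll() and deleteOldSegments() synchronize on the same lock.
  def roll(name: String): Unit = lock.synchronized {
    segments = segments :+ name
  }

  def deleteOldSegments(shouldDelete: String => Boolean): List[String] =
    lock.synchronized {
      // Decide what to delete AND remove it while still holding the lock;
      // the actual file deletion would also happen here, not after release.
      val (doomed, kept) = segments.partition(shouldDelete)
      segments = kept
      doomed
    }
}
```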
4.5 It is probably a good idea to defensively check for empty segments in truncateTo, since
segments.firstEntry will throw an NPE.
4.6 Maybe activeSegment should protect against an NPE as well? Although I'm not entirely sure
when this can happen.
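4.5 and 4.6 are the same guard: ConcurrentSkipListMap's firstEntry/lastEntry return null on an empty map, so unconditional dereferencing NPEs. A small sketch, assuming segments is a ConcurrentSkipListMap keyed by base offset as the refactor proposes (Option return types are my own choice for illustration):

```scala
import java.util.concurrent.ConcurrentSkipListMap

object EntryGuards {
  val segments = new ConcurrentSkipListMap[Long, String]

  // truncateTo-style access: firstEntry is null on an empty map,
  // so wrap it before dereferencing.
  def firstSegment: Option[String] =
    Option(segments.firstEntry).map(_.getValue)

  // activeSegment-style access: same null guard on lastEntry.
  def activeSegment: Option[String] =
    Option(segments.lastEntry).map(_.getValue)
}
```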


5. LogSegment
5.1 The lastUpdateTime has changed to a create time, so we now use the create time rather than
the first append time to decide if a new segment should be rolled. So if no new data has been
added to a segment, we still roll it?
5.2 touch() API is unused.

6. ProducerConfig
Was the change to drop the number of retries intended ?

                
> Refactor Log subsystem
> ----------------------
>
>                 Key: KAFKA-521
>                 URL: https://issues.apache.org/jira/browse/KAFKA-521
>             Project: Kafka
>          Issue Type: Improvement
>            Reporter: Jay Kreps
>         Attachments: KAFKA-521-v1.patch, KAFKA-521-v2.patch, KAFKA-521-v3.patch, KAFKA-521-v4.patch
>
>
> There are a number of items it would be nice to cleanup in the log subsystem:
> 1. Misc. funky apis in Log and LogManager
> 2. Much of the functionality in Log should move into LogSegment along with corresponding tests
> 3. We should remove SegmentList and instead use a ConcurrentSkipListMap
> The general idea of the refactoring falls into two categories. First, improve and thoroughly document the public APIs. Second, have a clear delineation of responsibility between the various layers:
> 1. LogManager is responsible for the creation and deletion of logs as well as the retention of data in log segments. LogManager is the only layer aware of partitions and topics. LogManager consists of a bunch of individual Log instances and interacts with them only through their public API (mostly true today).
> 2. Log represents a totally ordered log. Log is responsible for reading, appending, and truncating the log. A log consists of a bunch of LogSegments. Much of the functionality in Log should move into LogSegment, with Log interacting with its segments only through the LogSegment interface. Currently we reach around this a lot to call into FileMessageSet and OffsetIndex.
> 3. A LogSegment consists of an OffsetIndex and a FileMessageSet. It supports largely the same APIs as Log, but now localized to a single segment.
> This cleanup will simplify testing and debugging because it will make the responsibilities and guarantees at each layer more clear.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
