pulsar-commits mailing list archives

From GitBox <...@apache.org>
Subject [GitHub] [pulsar] Jennifer88huang commented on a change in pull request #3844: [Doc]Fix typo and language issues in the faq.md file
Date Thu, 28 Mar 2019 02:06:38 GMT
Jennifer88huang commented on a change in pull request #3844: [Doc]Fix typo and language issues
in the faq.md file
URL: https://github.com/apache/pulsar/pull/3844#discussion_r269831891
 
 

 ##########
 File path: faq.md
 ##########
 @@ -112,162 +111,136 @@ It’s a component that was introduced recently. Essentially it’s
a stateless
 Yes, you can split a given bundle manually.
 
 ### Is the producer kafka wrapper thread-safe?
-The producer wrapper should be thread-safe.
+The producer wrapper is thread-safe.
 
 ### Can I just remove a subscription?
-Yes, you can use the cli tool `bin/pulsar-admin persistent unsubscribe $TOPIC -s $SUBSCRIPTION`.
+Yes, you can remove a subscription by using the cli tool `bin/pulsar-admin persistent unsubscribe
$TOPIC -s $SUBSCRIPTION`.
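For reference, a minimal sketch of the same operation through the Java admin client; the admin service URL and the topic/subscription names are illustrative:

```java
import org.apache.pulsar.client.admin.PulsarAdmin;

public class RemoveSubscription {
    public static void main(String[] args) throws Exception {
        // Illustrative admin service URL; adjust to your deployment
        try (PulsarAdmin admin = PulsarAdmin.builder()
                .serviceHttpUrl("http://localhost:8080")
                .build()) {
            // Deletes the subscription; fails if consumers are still connected to it
            admin.topics().deleteSubscription(
                    "persistent://sample/standalone/ns1/my-topic", "my-sub");
        }
    }
}
```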
 
-### How are subscription modes set? Can I create new subscriptions over the WebSocket API?
-Yes, you can set most of the producer/consumer configuration option in websocket, by passing
them as HTTP query parameters like:
+### How do I set subscription modes? Can I create new subscriptions over the WebSocket API?
+Yes, you can set most of the producer/consumer configuration options over WebSocket by passing them as HTTP query parameters, as follows:
 `ws://localhost:8080/ws/consumer/persistent/sample/standalone/ns1/my-topic/my-sub?subscriptionType=Shared`
 
-see [the doc](http://pulsar.apache.org/docs/latest/clients/WebSocket/#RunningtheWebSocketservice-1fhsvp).
+You can create new subscriptions over the WebSocket API. For details, see [Pulsar's WebSocket
API](http://pulsar.apache.org/docs/latest/clients/WebSocket/#RunningtheWebSocketservice-1fhsvp).
 
 ### Is there any sort of order of operations or best practices on the upgrade procedure for
a geo-replicated Pulsar cluster?
-In general, updating the Pulsar brokers is an easy operation, since the brokers don't have
local state. The typical rollout is a rolling upgrade, either doing 1 broker at a time or
some percentage of them in parallel.
+In general, it is easy to update the Pulsar brokers, since the brokers don't have local state. The typical rollout is a rolling upgrade, either updating one broker at a time or some percentage of brokers in parallel.
 
-There are not complicated requirements to upgrade geo-replicated clusters, since we take
particular care in ensuring backward and forward compatibility.
+There are no complicated requirements to upgrade geo-replicated clusters, since we ensure backward and forward compatibility.
 
-Both the client and the brokers are reporting their own protocol version and they're able
to disable newer features if the other side doesn't support them yet.
+Both clients and brokers report their own protocol versions, and they can disable newer features
if the other side does not support them yet.
 
-Additionally, when making metadata breaking format changes (if the need arises), we make
sure to spread the changes along at least 2 releases.
+Additionally, when making metadata breaking format changes (if the need arises), we make
sure to spread the changes along at least two releases.
 
-This is to always allow the possibility to downgrade a running cluster to a previous version,
in case any server problem is identified in production.
+So, one release understands the new format, and the next release actually starts using it.
 
-So, one release will understand the new format while the next one will actually start using
it.
+This allows you to downgrade a running cluster to the previous version, in case any server problem is identified in production.
 
-### Since Pulsar has configurable retention per namespace, can I set a "forever" value, ie.,
always keep all data in the namespaces?
-So, retention applies to "consumed" messages. Ones, for which the consumer has already acknowledged
the processing. By default, retention is 0, so it means data is deleted as soon as all consumers
acknowledge. You can set retention to delay the retention.
+### Since Pulsar has configurable retention per namespace, can I set a "forever" value, and
keep all data in the namespaces forever?
+Retention applies to "consumed" messages, for which the consumer has already acknowledged the processing. By default, retention is set to 0, which means data is deleted as soon as all consumers acknowledge. You can set a retention policy to delay that deletion.
 
-That also means, that data is kept forever, by default, if the consumers are not acknowledging.
+It also means that, by default, data is kept forever if consumers do not acknowledge messages.
 
-There is no currently "infinite" retention, other than setting to very high value.
-
-### How can a consumer "replay" a topic from the beginning, ie., where can I set an offset
for the consumer?
+### How can a consumer "replay" a topic from the beginning? Where can I set an offset for
the consumer?
 1. Use admin API (or CLI tool):
     - Reset to a specific point in time (3h ago)
     - Reset to a message id
 2. You can use the client API `seek`, as shown in the sketch below.
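A minimal sketch of option 2 with the Java client; the service URL and names are illustrative, and the timestamp overload of `seek` assumes a client version that supports seeking by time:

```java
import java.util.concurrent.TimeUnit;

import org.apache.pulsar.client.api.Consumer;
import org.apache.pulsar.client.api.MessageId;
import org.apache.pulsar.client.api.PulsarClient;

public class ReplayFromBeginning {
    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")
                .build();
        Consumer<byte[]> consumer = client.newConsumer()
                .topic("persistent://sample/standalone/ns1/my-topic")
                .subscriptionName("my-sub")
                .subscribe();

        // Rewind the subscription to the first available message ...
        consumer.seek(MessageId.earliest);
        // ... or, on clients that support it, to a point in time (e.g. 3 hours ago)
        consumer.seek(System.currentTimeMillis() - TimeUnit.HOURS.toMillis(3));

        consumer.close();
        client.close();
    }
}
```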
 
-### When create a consumer, does this affect other consumers ?
-The key is that you should use different subscriptions for each consumer. Each subscription
is completely independent from others.
-
-### The default when creating a consumer, is it to "tail" from "now" on the topic, or from
the "last acknowledged" or something else?
-So when you spin up a consumer, it will try to subscribe to the topic, if the subscription
doesn't exist, a new one will be created, and it will be positioned at the end of the topic
("now"). 
-
-Once you reconnect, the subscription will still be there and it will be positioned on the
last acknowledged messages from the previous session.
-
-### I want some produce lock, i.e., to pessimistically or optimistically lock a specified
topic so only one producer can write at a time and all further producers know they have to
reprocess data before trying again to write a topic.
-To ensure only one producer is connected, you just need to use the same "producerName", the
broker will ensure that no 2 producers with same name are publishing on a given topic.
-
-### I tested the performance using PerformanceProducer between two server node with 10,000Mbits
NIC(and I tested tcp throughput can be larger than 1GB/s). I saw that the max msg throughput
is around 1000,000 msg/s when using little msg_size(such as 64/128Bytes), when I increased
the msg_size to 1028 or larger , then the msg/s will decreased sharply to 150,000msg/s, and
both has max throughput around   1600Mbit/s, which is far from 1GB/s.  And I'm curious that
the throughput between producer and broker why can't excess 1600Mbit/s ?  It seems that the
Producer executor only use one thread, is this the reason?Then I start two producer client
jvm, the throughput increased not much, just about little beyond 1600Mbit/s. Any other reasons?
-Most probably, when increasing the payload size, you're reaching the disk max write rate
on a single bookie.
-
-There are few tricks that can be used to increase throughput (other than just partitioning)
-
-1. Enable striping in BK, by setting ensemble to bigger than write quorum. E.g. e=5 w=2 a=2.
Write 2 copies of each message but stripe them across 5 bookies
-
-2. If there are already multiple topics/partitions, you can try to configure the bookies
with multiple journals (e.g. 4). This should increase the throughput when the journal is on
SSDs, since the controller has multiple IO queues and can efficiently sustain multiple threads
each doing sequential writes
-
-- Option (1) you just configure it on a given pulsar namespace, look at "namespaces set-persistence"
command
-
-- Option (2) needs to be configured in bookies
-
-### Is there any work on a Mesos Framework for Pulsar/Bookkeeper this point? Would this be
useful?
-We don’t have anything ready available for Mesos/DCOS though there should be nothing preventing
it
-
-It would surely be useful.
-
+### When creating a consumer, is the default set to "tail" from "now" on the topic, or from
the "last acknowledged" or something else?
+When you spin up a consumer, it tries to subscribe to the topic. If the subscription doesn't
exist, a new one is created, and it is positioned at the end of the topic ("now"). 
 
-### Is there an HDFS like interface?
-Not for Pulsar.There was some work in BK / DistributedLog community to have it but not at
the messaging layer.
+Once you reconnect, the subscription is still there and it is positioned on the last acknowledged
messages from the previous session.
 
-### Where can I find information about `receiveAsync` parameters? In particular, is there
a timeout as in `receive`?
-There’s no other info about `receiveAsync()`. The method doesn’t take any parameters.
Currently there’s no timeout on it. You can always set a timeout on the `CompletableFuture`
itself, but the problem is how to cancel the future and avoid “getting” the message.
+### I want some produce lock, i.e., to pessimistically or optimistically lock a specified topic, so only one producer can write at a time and all further producers know that they have to reprocess data before trying again to write to the topic.
+To ensure only one producer is connected, you just need to use the same "producerName". The
broker ensures that no two producers with the same name are publishing on a given topic.
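As an illustration, a sketch with the Java client (names are hypothetical); while the first producer is connected, a second producer claiming the same name is rejected by the broker:

```java
import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.PulsarClientException;

public class ExclusiveProducer {
    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")
                .build();

        Producer<byte[]> first = client.newProducer()
                .topic("persistent://sample/standalone/ns1/locked-topic")
                .producerName("single-writer")   // every writer uses the same name
                .create();

        try {
            // While "first" is connected, the broker rejects a second producer
            // that claims the same producer name on the same topic.
            client.newProducer()
                    .topic("persistent://sample/standalone/ns1/locked-topic")
                    .producerName("single-writer")
                    .create();
        } catch (PulsarClientException e) {
            System.out.println("Producer name already taken: " + e.getMessage());
        }

        first.close();
        client.close();
    }
}
```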
 
-What’s your use case for timeout on the `receiveAsync()`? Could that be achieved more easily
by using the `MessageListener`?
+### Is there any work on a Mesos Framework for Pulsar/Bookkeeper at this point? Would this
be useful?
+We don’t have anything available for Mesos/DCOS yet, though nothing prevents it.
+It would surely be useful.
 
-### Why do we choose to use bookkeeper to store consumer offset instead of zookeeper? I mean
what's the benefits?
-ZooKeeper is a “consensus” system that while it exposes a key/value interface is not
meant to support a large volume of writes per second.
+### Where can I find information about `receiveAsync` parameters? In particular, is there a timeout as in `receive`?
+There’s no other information about the `receiveAsync()` method. The method doesn’t take
any parameters. Currently there’s no timeout on it. You can always set a timeout on the
`CompletableFuture` instance, but the problem is how to cancel the future and avoid “getting”
the message.
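For illustration only, a sketch of putting a timeout on the returned `CompletableFuture`, with the caveat above that cancelling does not hand the message back; names and URLs are illustrative:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

import org.apache.pulsar.client.api.Consumer;
import org.apache.pulsar.client.api.Message;
import org.apache.pulsar.client.api.PulsarClient;

public class ReceiveAsyncWithTimeout {
    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")
                .build();
        Consumer<byte[]> consumer = client.newConsumer()
                .topic("persistent://sample/standalone/ns1/my-topic")
                .subscriptionName("my-sub")
                .subscribe();

        CompletableFuture<Message<byte[]>> future = consumer.receiveAsync();
        try {
            Message<byte[]> msg = future.get(1, TimeUnit.SECONDS);
            consumer.acknowledge(msg);
        } catch (TimeoutException e) {
            // Caveat from the answer above: cancelling the future does not return
            // the message to the broker; it may still be delivered to this future
            // and never be seen by the application.
            future.cancel(true);
        }

        consumer.close();
        client.close();
    }
}
```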
 
-ZK is not an “horizontally scalable” system, because every node receive every transaction
and keeps the whole data set. Effectively, ZK is based on a single “log” that is replicated
consistently across the participants. 
+### I'm facing some issues using `.receiveAsync` that seem to be related to `UnAckedMessageTracker` and `PartitionedConsumerImpl`. We are consuming messages with `receiveAsync`, doing an instant `acknowledgeAsync` when a message is received, and then the process delays its next execution. In this scenario we are consuming a lot more messages (repeated) than the number of messages produced. We are using partitioned topics with `setAckTimeout` 30 seconds, and I believe this issue could be related to `PartitionedConsumerImpl`, because the same test on a non-partitioned topic does not generate any repeated messages.
+PartitionedConsumer is composed of a set of regular consumers, one per partition. To have
a single `receive()` abstraction, messages from all partitions are pushed into a shared queue.

 
-The max throughput we have observed on a well configured ZK on good hardware was around ~10K
writes/s. If you want to do more than that, you would have to shard it.. 
+The thing is that the unacked message tracker works at the partition level. So when the timeout happens, it is able to request redelivery for the messages and clear them from the queue. But if the messages were already pushed into the shared queue, the “clearing” part will not happen.
 
-To store consumers cursor positions, we need to write potentially a large number of updates
per second. Typically we persist the cursor every 1 second, though the rate is configurable
and if you want to reduce the amount of potential duplicates, you can increase the persistent
frequency.
-
-With BookKeeper it’s very efficient to have a large throughput across a huge number of
different “logs”. In our case, we use 1 log per cursor, and it becomes feasible to persist
every single cursor update.
-
-### I'm facing some issue using `.receiveAsync` that it seems to be related with `UnAckedMessageTracker`
and `PartitionedConsumerImpl`. We are consuming messages with `receiveAsync`, doing instant
`acknowledgeAsync` when message is received, after that the process will delay the next execution
of itself. In such scenario we are consuming a lot more messages (repeated) than the num of
messages produced. We are using Partitioned topics with setAckTimeout 30 seconds and I believe
this issue could be related with `PartitionedConsumerImpl` because the same test in a non-partitioned
topic does not generate any repeated message.
-PartitionedConsumer is composed of a set of regular consumers, one per partition. To have
a single `receive()` abstraction, messages from all partitions are then pushed into a shared
queue. 
-
-The thing is that the unacked message tracker works at the partition level.So when the timeout
happens, it’s able to request redelivery for the messages and clear them from the queue
when that happens,
-but if the messages were already pushed into the shared queue, the “clearing” part will
not happen.
-
-- the only quick workaround that I can think of is to increase the “ack-timeout” to a
level in which timeout doesn’t occur in processing
-- another option would be to reduce the receiver queue size, so that less messages are sitting
in the queue
+- a quick workaround is to increase the “ack-timeout” to a level at which the timeout doesn’t occur during processing
+- another option is to reduce the receiver queue size, so that fewer messages are sitting in the queue (see the sketch below)
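A sketch of both workarounds on the Java consumer builder; the values shown are illustrative, not recommendations:

```java
import java.util.concurrent.TimeUnit;

import org.apache.pulsar.client.api.Consumer;
import org.apache.pulsar.client.api.PulsarClient;

public class PartitionedConsumerWorkarounds {
    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")
                .build();

        Consumer<byte[]> consumer = client.newConsumer()
                .topic("persistent://sample/standalone/ns1/partitioned-topic")
                .subscriptionName("my-sub")
                // Workaround 1: raise the ack timeout well above the worst-case
                // processing delay so redelivery doesn't kick in prematurely.
                .ackTimeout(10, TimeUnit.MINUTES)
                // Workaround 2: keep fewer messages sitting in the shared queue.
                .receiverQueueSize(10)
                .subscribe();

        // ... consume with receiveAsync()/acknowledgeAsync() as usual ...
        consumer.close();
        client.close();
    }
}
```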
 
 ### Can I use bookkeeper newer v3 wire protocol in Pulsar? How can I enable it?
-The answer is currently not, because we force the broker to use v2 protocol and that's not
configurable at the moment.
+Currently, you cannot use the BookKeeper v3 wire protocol in Pulsar, because the broker is forced to use the v2 protocol and that is not configurable at the moment.
 
 ### Is "kubernetes/generic/proxy.yaml" meant to be used whenever we want to expose a Pulsar
broker outside the Kubernetes cluster?
-Yes, the “proxy” is an additional component to deploy a stateless proxy frontend that
can be exposed through a load balancer and that doesn’t require direct connectivity to the
actual brokers. No need to use it from within Kubernetes cluster. Also in some cases it’s
simpler to have expose the brokers through `NodePort` or `ClusterIp` for other outside producer/consumers.
-
-### Is there a way of having both authentication and the Pulsar dashboard working at same
time?
-The key is that with authorization, the stats collector needs to access the APIs that require
the credentials. That’s not a problem for stats collected through Prometheus but it is for
the “Pulsar dashboard” which is where the per-topic stats are shown. I think that should
be quite easy to fix.
+Yes, the “proxy” is an additional component to deploy a stateless proxy frontend that can be exposed through a load balancer, and that doesn’t require direct connectivity to the actual brokers. You do not need to use the proxy from within the Kubernetes cluster. In some cases, it’s simpler to expose the brokers through `NodePort` or `ClusterIp` for producers/consumers outside the cluster.
 
 ### How can I know when I've reached the end of the stream during replay?
-There is no direct way because messages can still be published in the topic, and relying
on the `readNext(timeout)` is not precise because the client might be temporarily disconnected
from broker in that moment.
-
-One option is to use `publishTimestamp` of messages. When you start replaying you can check
current "now", then you replay util you hit a message with timestamp >= now.
+There is no direct way, because messages can still be published in the topic, and relying on `readNext(timeout)` is not precise, because the client might be temporarily disconnected from the broker at that moment.
 
-Another option is to "terminate" the topic. Once a topic is "terminated", no more message
can be published on the topic, a reader/consumer can check the `hasReachedEndOfTopic()` condition
to know when that happened.
+One option is to use the `publishTimestamp` of messages. When you start replaying, record the current time ("now"), and then replay until you hit a message with a publish timestamp >= now.
 
-A final option is to check the topic stats. This is a tiny bit involved, because it requires
the admin client (or using REST) to get the stats for the topic and checking the "backlog".
If the backlog is 0, it means we've hit the end.
+Another option is to "terminate" the topic. Once a topic is "terminated", no more messages can be published on the topic. A reader/consumer can check the `hasReachedEndOfTopic()` condition to know when that has happened.
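A short sketch of that check, again with illustrative names:

```java
import java.util.concurrent.TimeUnit;

import org.apache.pulsar.client.api.Message;
import org.apache.pulsar.client.api.MessageId;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.Reader;

public class ReadTerminatedTopic {
    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")
                .build();
        Reader<byte[]> reader = client.newReader()
                .topic("persistent://sample/standalone/ns1/terminated-topic")
                .startMessageId(MessageId.earliest)
                .create();

        // Once the topic has been terminated and all of its messages have been
        // read, hasReachedEndOfTopic() returns true and the loop ends.
        while (!reader.hasReachedEndOfTopic()) {
            Message<byte[]> msg = reader.readNext(5, TimeUnit.SECONDS);
            if (msg != null) {
                // ... process the message ...
            }
        }

        reader.close();
        client.close();
    }
}
```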
 
-### How can I prevent an inactive topic to be deleted under any circumstance? I want to set
no time or space limit for a certain namespace.
-There’s not currently an option for “infinite” (though it sounds a good idea! maybe
we could use `-1` for that). The only option now is to use INT_MAX for `retentionTimeInMinutes`
and LONG_MAX for `retentionSizeInMB`. It’s not “infinite” but 4085 years of retention
should probably be enough!
+A third option is to check the topic stats. This is a tiny bit involved, because it requires using the admin client (or the REST API) to get the stats for the topic and check the "backlog". If the backlog is 0, you've hit the end.
 
-### Is there a profiling option in Pulsar, so that we can breakdown the time costed in every
stage? For instance, message A stay in queue 1ms, bk writing time 2ms(interval between sending
to bk and receiving ack from bk) and so on.
-There are latency stats at different stages. In the client (eg: reported every 1min in info
logs). 
-In the broker: accessible through the broker metrics, and finally in bookies where there
are several different latency metrics. 
+### How can I prevent an inactive topic from being deleted under any circumstance? I do not want to set a time or space limit for a certain namespace.
+You can set *infinite* retention time or size, by setting `-1` for the time and/or size retention.
+For more details, see [Pulsar retention policy](http://pulsar.apache.org/docs/en/cookbooks-retention-expiry/#retention-policies).
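For example, a sketch of setting infinite retention through the Java admin client (the namespace name and admin URL are illustrative); the same can be done with the `pulsar-admin namespaces set-retention` CLI command:

```java
import org.apache.pulsar.client.admin.PulsarAdmin;
import org.apache.pulsar.common.policies.data.RetentionPolicies;

public class InfiniteRetention {
    public static void main(String[] args) throws Exception {
        try (PulsarAdmin admin = PulsarAdmin.builder()
                .serviceHttpUrl("http://localhost:8080")
                .build()) {
            // -1 minutes and -1 MB: retain acknowledged messages forever,
            // with no size cap, for every topic in the namespace
            admin.namespaces().setRetention("sample/standalone/ns1",
                    new RetentionPolicies(-1, -1));
        }
    }
}
```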
 
-In broker there’s just the write latency on BK, because there is no other queuing involved
in the write path.
+### Is there a profiling option in Pulsar, so that we can break down the time spent in every stage? For instance, message A stays in the queue for 1 ms, the BookKeeper write takes 2 ms (the interval between sending to BK and receiving the ack from BK), and so on.
+There are latency stats at different stages: client, bookie and broker.
+For example, the client reports every minute in its info logs.
+Broker-side stats are accessible through the broker metrics, and bookies expose several different latency metrics.
 
-### How can I have multiple readers that each get all the messages from a topic from the
beginning concurrently? I.e., no round-robin, no exclusivity
-you can create reader with `MessageId.earliest`
+In the broker, there’s just the write latency on BookKeeper, because there is no other queuing involved in the write path.
 
+### How can multiple readers get all the messages from a topic from the beginning concurrently?
I.e., no round-robin, no exclusivity.
+You can create each reader with `MessageId.earliest`.
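A small sketch, assuming two independent readers on the same illustrative topic:

```java
import org.apache.pulsar.client.api.MessageId;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.Reader;

public class IndependentReaders {
    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")
                .build();

        // Each reader independently gets the full topic from the first available
        // message; readers do not share position and do not affect each other.
        Reader<byte[]> readerA = client.newReader()
                .topic("persistent://sample/standalone/ns1/my-topic")
                .startMessageId(MessageId.earliest)
                .create();
        Reader<byte[]> readerB = client.newReader()
                .topic("persistent://sample/standalone/ns1/my-topic")
                .startMessageId(MessageId.earliest)
                .create();

        // ... call readNext() on each reader, typically from separate threads ...
        readerA.close();
        readerB.close();
        client.close();
    }
}
```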
 
 ### Does broker validate if a property exists or not when producer/consumer connects ?
-yes, broker performs auth&auth while creating producer/consumer and this information
presents under namespace policies.. so, if auth is enabled then broker does validation
+Yes, the broker performs authentication and authorization while creating a producer/consumer, and this information is present under namespace policies. So, if auth is enabled, the broker performs validation.
+
+### To subscribe to a large number of topics, would I need to call `subscribe` for each one individually, or is there some sort of wildcard capability?
 
 Review comment:
  Shall we answer it as follows:
  Yes, you can use a regular expression to subscribe to a large number of topics.
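  If it helps, a minimal sketch of the regex subscription with the Java client (the pattern and names are illustrative):

```java
import java.util.regex.Pattern;

import org.apache.pulsar.client.api.Consumer;
import org.apache.pulsar.client.api.PulsarClient;

public class RegexSubscription {
    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")
                .build();

        // A single subscribe() call covers every topic in the namespace that
        // matches the pattern, instead of one subscribe() per topic.
        Consumer<byte[]> consumer = client.newConsumer()
                .topicsPattern(Pattern.compile("persistent://sample/standalone/ns1/finance-.*"))
                .subscriptionName("my-sub")
                .subscribe();

        // ... consume as usual ...
        consumer.close();
        client.close();
    }
}
```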

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services
