
Category: Big Data

2016-06-04 11:11:21

1. Kafka's message read/write time complexity is O(1).

2. One node, multiple brokers (the number of brokers should preferably be greater than the largest partition count of any single topic).

3. Multiple ZooKeepers (a ZooKeeper ensemble).

4. One consumer is one process.

5. Brokers are stateless; state is kept by the consumer. Consumers usually store this state (their offsets) in ZooKeeper, but it can equally be kept by some other component or by the consumer itself; for example, Spark Streaming's direct mode tracks its own offsets.
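The "consumer keeps its own state" idea can be sketched as a tiny offset book-keeper (all names below are invented for illustration; this is not Spark's or Kafka's actual code):

```java
import java.util.HashMap;
import java.util.Map;

// Toy sketch of consumer-side state: the consumer records, per partition,
// the next offset it should read, instead of relying on ZooKeeper.
public class SelfTrackingConsumer {
    private final Map<Integer, Long> offsets = new HashMap<>(); // partition -> next offset

    // Record that everything up to 'offset' in 'partition' has been processed.
    public void commit(int partition, long offset) {
        offsets.put(partition, offset);
    }

    // Where to resume reading for 'partition'; 0 if never read before.
    public long position(int partition) {
        return offsets.getOrDefault(partition, 0L);
    }
}
```

After a restart, the consumer would simply resume each partition from `position(p)`, which is exactly why the broker itself can stay stateless.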

6. A Kafka cluster has no notion of a master: all brokers are equal, and their metadata is kept by ZooKeeper.

7. By default, Kafka distributes messages across partitions in round-robin fashion, but the user can assign each message a key; Kafka then hashes the key to decide which partition the message goes to. Messages with the same key therefore end up in the same partition, which makes retrieval convenient (the partitioning rule is similar to routing in Elasticsearch; see its documentation).
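The key-hash rule can be sketched as hash(key) mod numPartitions (an illustrative sketch; the class and method names are invented, and this is not Kafka's actual partitioner code):

```java
// Sketch of key-based partitioning: hashing the key modulo the partition
// count sends every message with the same key to the same partition.
// This mirrors the idea behind Kafka's default key partitioner, not its
// exact implementation.
public class KeyPartitioner {
    public static int partitionFor(String key, int numPartitions) {
        // mask the sign bit so the result of % is never negative
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }
}
```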

8. In asynchronous mode, you can set queue.time or batch.size to write messages to the broker in batches (reducing the number of network I/O round trips).
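For instance, an old-style async producer configuration using those knobs might look roughly like this (property names differ across Kafka producer versions; later releases call them queue.buffering.max.ms and batch.num.messages, so treat this fragment as illustrative and check the docs for your release):

```properties
producer.type=async
# flush the queue after at most this many milliseconds...
queue.time=5000
# ...or once this many messages have accumulated, whichever comes first
batch.size=200
```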

9. The danger of dynamically adding consumers to a consumer group  -- page 64
  The consumer group name is unique and global across the Kafka cluster and any new consumers with an in-use consumer group name may cause ambiguous behavior in the system. When a new process is started with the existing consumer group name, Kafka triggers a rebalance between the new and existing process threads for the consumer group. After the rebalance, some messages that are intended for a new process may go to an old process, causing unexpected results. To avoid this ambiguous behavior, any existing consumers should be shut down before starting new consumers for an existing consumer group name.

10. Multithreaded connections -- page 65
   In the create call from the ConsumerConnector class, clients can specify the number of desired streams, where each stream object is used for single-threaded processing. These stream objects may represent the merging of multiple unique partitions.
 Code example   -- page 70
 // Define a single stream (one thread) for the topic
 topicMap.put(topic, new Integer(1));


11. Setting the number of threads for the high-level Kafka consumer -- page 72
   A multithreaded, high-level, consumer-API-based design is usually based on the number of partitions in the topic and follows a one-to-one mapping approach between the thread and the partitions within the topic. For example, if four partitions are defined for any topic, as a best practice, only four threads should be initiated with the consumer application to read the data; otherwise, some conflicting behavior, such as threads never receiving a message or a thread receiving messages from multiple partitions, may occur. Also, receiving multiple messages will not guarantee that the messages will be placed in order. For example, a thread may receive two messages from the first partition and three from the second partition, then three more from the first partition, followed by some more from the first partition, even if the second partition has data available.
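The one-thread-per-partition rule can be illustrated without Kafka at all by modeling each partition as a list of messages drained by exactly one dedicated worker thread; with the real high-level consumer, the same shape would be built on the streams returned by createMessageStreams. All names below are invented for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustrative shape of a one-thread-per-partition consumer: each
// "partition" (a plain list here) gets exactly one dedicated worker,
// so ordering within a partition is preserved.
public class ThreadPerPartition {
    public static List<Integer> consume(List<List<Integer>> partitions) {
        // pool size equals partition count, matching the book's best practice
        ExecutorService pool = Executors.newFixedThreadPool(partitions.size());
        List<Integer> processed = new CopyOnWriteArrayList<>();
        List<Future<?>> workers = new ArrayList<>();
        for (List<Integer> partition : partitions) {
            workers.add(pool.submit(() -> {
                for (Integer msg : partition) {
                    processed.add(msg); // drain strictly in partition order
                }
            }));
        }
        try {
            for (Future<?> w : workers) {
                w.get(); // wait for every partition's worker to finish
            }
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
        pool.shutdown();
        return processed;
    }
}
```

Note how the output interleaves messages from different partitions in an unpredictable order, while the relative order within each partition is kept, exactly the behavior the quote above describes.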

12. At present, Kafka supports increasing the number of partitions of a topic, but does not support decreasing partitions or changing the replication factor.


13. A low-level Kafka consumer that bypasses ZooKeeper (for example, to implement exactly-once message consumption with Spark)
     Use cases:
  1. Read a message multiple times
  2. Consume only a subset of the partitions in a topic in a process
  3. Manage transactions to make sure a message is processed once and only once
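For use case 3, one common approach is to record the last applied offset together with the processing result, so that replays after a failure can be detected and skipped. A self-contained sketch of that idempotent-sink idea (all names invented; not Spark's or Kafka's actual API):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of exactly-once processing: the sink remembers the highest
// offset it has applied per partition and silently drops replays.
public class ExactlyOnceSink {
    private final Map<Integer, Long> applied = new HashMap<>(); // partition -> last applied offset
    private long total = 0; // the "result" being computed

    // Returns true if the message was applied, false if it was a duplicate replay.
    public boolean apply(int partition, long offset, long value) {
        long last = applied.getOrDefault(partition, -1L);
        if (offset <= last) {
            return false;              // already processed: skip the replay
        }
        total += value;                // apply the effect...
        applied.put(partition, offset); // ...and record the offset with it
        return true;
    }

    public long total() { return total; }
}
```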
   


15. Estimating Kafka disk space requirements:    http://vilkeliskis.com/blog/2014/11/10/infrastructure_for_data_streams.html
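The arithmetic behind such an estimate reduces to: required disk = incoming bytes per second * retention seconds * replication factor (plus headroom). A minimal sketch of that formula (names invented):

```java
// Back-of-the-envelope Kafka disk sizing:
//   bytes needed = incoming bytes/sec * retention seconds * replication factor
public class DiskSizing {
    public static long bytesNeeded(long bytesPerSecond, long retentionSeconds, int replicationFactor) {
        return bytesPerSecond * retentionSeconds * replicationFactor;
    }
}
```

For example, 1 MB/s with 7-day retention and replication factor 3 comes to roughly 1.8 TB before headroom.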
