thrift — a painful tale of "Read a frame size of"
thrift has been open source for many years and is widely recognized; I think it is excellent as well. Our CTO wanted an RPC framework customized for the company. We initially built one on top of netty with a JSON protocol, but for generality and practicality we eventually settled on the lightweight thrift as the communication framework. Once thrift was chosen came a series of problems around tuning, extension, and so on. This post is about a bug introduced by one such "optimization", and the painful process of tracking it down and fixing it.
The exception message:
Read a frame size of 16842752, which is bigger than the maximum allowable buffer size for ALL connections
This exception was caused by a parameter change. To limit the size of data sent by clients, MAX_READ_BUFFER_BYTES was set to 1 MB, which is plenty for most interfaces. The code:
protected static final int DEFAULT_MAX_LENGTH = 1024 * 1024; // 1 MB
TThreadedSelectorServer.Args tArgs = new TThreadedSelectorServer.Args(serverTransport);
tArgs.maxReadBufferBytes = DEFAULT_MAX_LENGTH;
At first I didn't pay much attention: it wasn't affecting the business, and I was busy with other things. Recently, with some spare time, I went through the production logs and saw that this exception had grown rather than shrunk (as traffic grew), and so began the painful troubleshooting.
1. Read the thrift source carefully to understand where the exception comes from. The thrift protocol is not that complicated: when receiving data the server calls FrameBuffer.read(), which first reads 4 bytes,
ByteBuffer buffer_ = ByteBuffer.allocate(4);

private boolean internalRead() {
    try {
        if (trans_.read(buffer_) < 0) {
            return false;
        }
        return true;
    } catch (IOException e) {
        LOGGER.warn("Got an IOException in internalRead!", e);
        return false;
    }
}
then switches the state machine field state_ to FrameBufferState.READING_FRAME,
state_ = FrameBufferState.READING_FRAME;
and reads the frame data. The exception is raised exactly at this point:
// if this frame will always be too large for this server,
// log the error and close the connection.
if (frameSize > MAX_READ_BUFFER_BYTES) {
    LOGGER.error("Read a frame size of " + frameSize
            + ", which is bigger than the maximum allowable buffer size for ALL connections.");
    return false;
}
The frame length exceeded the limit; here MAX_READ_BUFFER_BYTES is 1024 × 1024.
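For context: a framed transport prefixes every message with its payload length as a 4-byte big-endian integer, and the server decodes that prefix into frameSize before deciding whether to read the payload. Below is a minimal standalone sketch of this framing (an illustration only, not the actual thrift classes):

import java.nio.ByteBuffer;

public class FrameDemo {
    // Encode a payload the way a framed transport does: a 4-byte big-endian
    // length prefix followed by the payload bytes.
    static byte[] frame(byte[] payload) {
        ByteBuffer out = ByteBuffer.allocate(4 + payload.length);
        out.putInt(payload.length); // ByteBuffer is big-endian by default
        out.put(payload);
        return out.array();
    }

    // Decode the prefix the way the server does before comparing it
    // against MAX_READ_BUFFER_BYTES.
    static int readFrameSize(byte[] framed) {
        return ByteBuffer.wrap(framed, 0, 4).getInt();
    }

    public static void main(String[] args) {
        byte[] framed = frame(new byte[47]);
        System.out.println(readFrameSize(framed)); // prints 47
    }
}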
Knowing the cause, I started investigating.
Add logging
First I needed to know which connections were hitting this exception, so I added the socket channel information to the error log.
LOGGER.error("[" + Thread.currentThread().getName() + "] Read a frame size of " + frameSize + ", which is bigger than the maximum allowable buffer size for ALL connections. client[" + socketChannel.socket().getRemoteSocketAddress() + "] will be closed in future. server[" + socketChannel.socket().getLocalSocketAddress() + "]");
The logged addresses showed that connections from another IDC (data center) hit the error most often, so I wondered whether it was caused by calls in a loop or by a network problem, which led to the next step.
2. Unit test
I wrote a client demo, moved it to that IDC, and ran it in a loop a thousand-plus times without seeing the exception. I raised the complexity and called from multiple threads; still nothing. At this point I felt things were not as simple as I had thought.
3. Add detailed logging
I added logging both where the client sends data and where the server receives it. The client-side code is in TServiceClient.sendBase(), as follows:
private void sendBase(String methodName, TBase<?, ?> args, byte type) throws TException {
    oprot_.writeMessageBegin(new TMessage(methodName, type, ++seqid_));
    args.write(oprot_);
    oprot_.writeMessageEnd();
    try {
        TTransport trans_ = oprot_.getTransport();
        if (trans_ instanceof TFramedTransport) {
            TFramedTransport framedTrans_ = (TFramedTransport) trans_;
            // getWriteBuffer() returns the whole internal buffer, so anything
            // past the written length is just unused zeros.
            byte[] data_ = framedTrans_.getWriteBuffer();
            if (data_ != null && data_.length > 0) {
                int len = framedTrans_.getWriteBufferLength();
                StringBuffer sb = new StringBuffer("method name: " + methodName + ", data len: " + len + ",");
                // The 4-byte big-endian length prefix that flush() will send ahead of the payload.
                byte[] buf = new byte[4];
                buf[0] = (byte) (0xff & (len >> 24));
                buf[1] = (byte) (0xff & (len >> 16));
                buf[2] = (byte) (0xff & (len >> 8));
                buf[3] = (byte) (0xff & (len));
                for (byte lbyte : buf) {
                    sb.append(lbyte).append(",");
                }
                sb.append(",data:");
                for (int i = 0; i < data_.length; i++) {
                    sb.append(data_[i]).append(",");
                }
                LOGGER.error(sb.toString());
            }
        }
    } catch (Exception e) {
        LOGGER.error(e.getMessage(), e);
    }
    oprot_.getTransport().flush();
}
To capture the same length that thrift's flush() sends, I added a getWriteBufferLength() method to TFramedTransport:
int len = framedTrans_.getWriteBufferLength();
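getWriteBuffer() and getWriteBufferLength() are not part of stock thrift. Here is a guess at what the added accessors look like inside TFramedTransport, whose private writeBuffer_ field is the TByteArrayOutputStream that flush() reads via get() and len():

// Added to org.apache.thrift.transport.TFramedTransport (not in stock thrift).
public byte[] getWriteBuffer() {
    // the backing array of the internal TByteArrayOutputStream
    return writeBuffer_.get();
}

public int getWriteBufferLength() {
    // number of bytes actually written into the pending frame
    return writeBuffer_.len();
}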
On the server side I added a dump of the received bytes at the exception point, as follows:
if (frameSize > MAX_READ_BUFFER_BYTES) {
    SocketChannel socketChannel = (SocketChannel) selectionKey_.channel();
    try {
        ByteBuffer dst = ByteBuffer.allocate(100);
        // Copy the 4 bytes already read as the "frame size", then read a bit
        // more of whatever follows on the connection.
        buffer_.flip();
        dst.put(buffer_);
        int numberOfBytes = socketChannel.read(dst);
        dst.flip();
        LOGGER.error("number of bytes read: " + numberOfBytes);
        byte[] bytes = dst.array();
        if (bytes != null && bytes.length > 0) {
            StringBuffer sb = new StringBuffer();
            for (int i = 0; i < bytes.length; i++) {
                sb.append(bytes[i] + ",");
            }
            LOGGER.error(sb.toString() + ", byteString: " + new String(bytes, "utf8"));
        } else {
            LOGGER.error("bytes is empty.");
        }
    } catch (IOException e) {
        LOGGER.error(e.getMessage(), e);
    }
    LOGGER.error("[" + Thread.currentThread().getName() + "] Read a frame size of " + frameSize
            + ", which is bigger than the maximum allowable buffer size for ALL connections. client["
            + socketChannel.socket().getRemoteSocketAddress() + "] will be closed in future. server["
            + socketChannel.socket().getLocalSocketAddress() + "]");
    return false;
}
Client log output:
2017-07-21 01:56:31,993 ERROR [org.apache.thrift.TServiceClient]:107 - method name: xxx, data len: 47,0,0,0,47,,data:-128,1,0,1,0,0,0,27,113,117,101,114,121,65,100,1
18,101,114,116,105,115,101,66,121,83,101,99,111,110,100,68,101,112,73,100,0,0,4,-41,8,0,1,0,0,0,57,0,0,3,0,0,0,18,-27,-116,-105,-28,-70,-84,-27,-115,-113,-27,-110,-116,-27,-116,-69,-23,-103
,-94,0,0,-94,-24,-126,-65,-25,-104,-92,-27,-116,-69,-23,-103,-94,0,0,-69,-23,-103,-94,0,0,0,-114,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
Server log output:
2017-07-21 01:56:31,893 ERROR [org.apache.thrift.server.AbstractNonblockingServer$FrameBuffer]:376 - 1,0,0,0,47,-128,1,0,1,0,0,0,27,113,117,101,114,121,65,100,118,101,114,116,105,115,101,66
,121,83,101,99,111,110,100,68,101,112,73,100,0,0,43,-21,8,0,1,0,0,0,57,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,, byteString: ^A^@^@
^@/�^A^@^A^@^@^@ESCxxx^@^@+^@^A^@^@^@9^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@
2017-07-21 01:56:31,893 ERROR [org.apache.thrift.server.AbstractNonblockingServer$FrameBuffer]:384 - [Thread-15] Read a frame size of 16777216, which is bigger than the maximum allowable buffer size for ALL connections. client[/20.1.41.18:43196] will be closed in future. server[/192.168.11.150:10083]
The logs show that the server read one wrong byte when reading the frame size, which threw the protocol out of sync. Normally the server should read the 4 bytes 0,0,0,47 and convert them to an int, the frame size; instead it read 1,0,0,0. Where did that extra leading 1 come from?
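A quick check of the arithmetic confirms what the log shows: big-endian decoding of the expected prefix yields 47, while the stream shifted by one stray byte yields 16777216, exactly the frame size the server reported.

import java.nio.ByteBuffer;

public class FrameSizeCheck {
    public static void main(String[] args) {
        // The prefix the client actually wrote for this call.
        byte[] expected = {0, 0, 0, 47};
        // What the server read after one stray byte (value 1) landed in front.
        byte[] shifted = {1, 0, 0, 0};

        System.out.println(ByteBuffer.wrap(expected).getInt()); // 47
        System.out.println(ByteBuffer.wrap(shifted).getInt());  // 16777216 (1 << 24)
    }
}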
Analysis
After seeing this I nearly lost it: where could an extra byte come from? With so many people using thrift, was I really the only one hitting this? I searched Bing and Baidu and looked on the thrift site for anyone with the same problem, but never found a similar post. I grew anxious and had no idea where to start; I was chewing on the problem over meals and in bed.
Capturing packets with tcpdump
I ran tcpdump on both the client and the server and found that the client had many "cksum incorrect" packets. I started to suspect the checksums; what I could find said that NIC settings can cause "cksum incorrect". Not knowing much about hardware, I asked our ops engineer, whose answer was:
You see the "incorrect" checksums due to a feature called TCP checksum offloading. The checksum fields for outgoing TCP packets are not pre-calculated by the operating system but set to 0 and left for calculation by the NIC processor. The Wireshark FAQ has a more detailed explanation.
You should only see incorrect checksums on packets sent by your machine. This is not specific to Wireshark, any packet capture software working at this level will produce similar results.
If you see checksum fails on received packets, there might be a problem indeed - these packets are discarded by the TCP stack and trigger a retransmission as well as congestion control algorithms - which will have a direct impact on data throughput.
In short: "cksum incorrect" on packets your own machine sends is harmless (TCP checksum offloading), but checksum failures on received packets would mean the TCP stack is discarding them and triggering retransmissions. In my captures the incorrect checksums appeared only on the client side, never on the server, so this did not match the symptom I was chasing.
Ops thought the NICs might still be to blame: the server has several NICs configured, and they adjusted them for me, but nothing changed and the exception kept coming, so hardware was ruled out. Ops can't sit with you forever, so it was time for another angle. By this point I was close to giving up: I had thought through and verified every possibility I could, and none of it helped.
- One last shot: tcpdump again
Besides the ideas above, I had also checked whether the buffer wasn't being cleared and the server was picking up trailing bytes from the previous request, and so on; none of it went anywhere.
Out of options, with the problem having haunted me for two days, I fell back on the dumbest method: capture the traffic of one newly established connection on both the client and the server, and compare them; surely that would show something different. I noted the client port and started capturing.
Server:
tcpdump -i bond0 -vv -nn port 10083 and src port 43196 > srcport.log
Client:
tcpdump -i eth0 -nn -vv port 43196 and dst port 10083 > client_dump.log
On the server, in another terminal:
netstat -atnp | grep 10083 | grep 43196
Then it was a matter of waiting for port 43196 to close. I took a shower, brushed my teeth, came back, still connected; three cigarettes later it finally dropped. It was 3 a.m. by then, but there was no way I could sleep without solving this, so I pressed on.
Starting from the three-way handshake, I compared the two captures packet by packet and found nothing. Then it struck me that the problem should be in the last requests before the disconnect, so I went straight there, and sure enough, there was a difference.
Server:
01:56:31.882846 IP (tos 0x0, ttl 62, id 7895, offset 0, flags [DF], proto: TCP (6), length: 53) 20.1.41.18.43196 > 192.168.11.150.10083: P, cksum 0x9877 (correct), 97260:97261(1) ack 1031355 win 501 urg 2 <nop,nop,timestamp 1308631990 636616015>
01:56:31.896903 IP (tos 0x0, ttl 62, id 7896, offset 0, flags [DF], proto: TCP (6), length: 53) 20.1.41.18.43196 > 192.168.11.150.10083: P, cksum 0x9759 (correct), 97313:97314(1) ack 1031356 win 501 urg 1 <nop,nop,timestamp 1308632003 636616235>
Client:
20.1.41.18.43196 > 20.1.41.61.10083: Flags [P.U], cksum 0x7a78 (incorrect -> 0x2778), seq 92506:92507, ack 975450, win 501, urg 2, options [nop,nop,TS val 1308631990 ecr 636616015], length 1
01:56:32.004995 IP (tos 0x0, ttl 64, id 7896, offset 0, flags [DF], proto TCP (6), length 53)
20.1.41.18.43196 > 20.1.41.61.10083: Flags [P.U], cksum 0x7a78 (incorrect -> 0x265a), seq 92559:92560, ack 975451, win 501, urg 1, options [nop,nop,TS val 1308632003 ecr 636616235], length 1
Comparing them, the packets just before the disconnect differ in the urg field. What is urg? A quick search: it's the TCP urgent pointer. At that moment everything became clear. Damn it, isn't that exactly the call I added when I "optimized" the client?
Back when I first started working with thrift
The thrift client is built on plain sockets; the main class is TSocket, which wraps a socket instance. Experience had taught me that when the server closes a connection the client socket doesn't notice, leaving threads stuck in read() and wasting resources. After skimming the thrift source I found TSocket.isOpen() unreliable: it only calls socket_.isConnected(). So the first thing that came to mind was sendUrgentData(int data), which looked like a neat way to tell whether the socket was still usable.
Full of myself, I added a sendUrgentData() call to verify that the connection was healthy. I never imagined that this very call would cost me two days, two very painful days.
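The post doesn't show the exact patch, but the check presumably looked something like the sketch below (the class and method names are illustrative). Socket.sendUrgentData() writes one out-of-band byte, and in this environment that byte was delivered in the server's normal data stream, right in front of the next frame's length prefix, which is the stray 1 seen above.

import java.io.IOException;
import java.net.Socket;

public class ConnectionProbe {
    // A guess at the client-side "is this connection still alive?" check:
    // sendUrgentData() sends one urgent byte (0x01 here, matching the stray
    // byte value in the server log) and throws if the peer is gone.
    public static boolean isAlive(Socket socket) {
        try {
            socket.sendUrgentData(0x01);
            return true;
        } catch (IOException e) {
            return false;
        }
    }
}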
Summary
Don't modify open-source code just because you think you know better.
Attack problems from multiple angles, and when nothing else works, fall back on the dumbest method.
All right, now I can finally get some sleep.