linux - TCP connections fail randomly under high load
Our application uses non-blocking sockets with connect and select (C code). The pseudocode is as follows:
    unsigned int ConnectToServer(struct sockaddr_in *pSelfAddr, struct sockaddr_in *pDestAddr)
    {
        int sktConnect = -1;
        sktConnect = socket(AF_INET, SOCK_STREAM, 0);
        if (sktConnect == INVALID_SOCKET)
            return -1;

        /* Switch the socket to non-blocking mode. */
        fcntl(sktConnect, F_SETFL, fcntl(sktConnect, F_GETFL) | O_NONBLOCK);

        /* Optionally bind to a specific local address. */
        if (pSelfAddr != 0)
        {
            if (bind(sktConnect, (const struct sockaddr*)(void *)pSelfAddr, sizeof(*pSelfAddr)) != 0)
            {
                closesocket(sktConnect);
                return -1;
            }
        }

        errno = 0;
        int nRc = connect(sktConnect, (const struct sockaddr*)(void *)pDestAddr, sizeof(*pDestAddr));
        if (nRc != -1)
        {
            return sktConnect;
        }
        if (errno != EINPROGRESS)
        {
            int savedError = errno;
            closesocket(sktConnect);
            return -1;
        }

        /* Wait up to 2 seconds for the socket to become writable. */
        fd_set scanSet;
        FD_ZERO(&scanSet);
        FD_SET(sktConnect, &scanSet);
        struct timeval waitTime;
        waitTime.tv_sec = 2;
        waitTime.tv_usec = 0;
        int tmp;
        tmp = select(sktConnect + 1, (fd_set*)0, &scanSet, (fd_set*)0, &waitTime);
        if (tmp == -1 || !FD_ISSET(sktConnect, &scanSet))
        {
            int savedErrorNo = errno;
            writeLog("Connect %s failed after select,cause %d,error %s", inet_ntoa(pDestAddr->sin_addr), savedErrorNo, strerror(savedErrorNo));
            closesocket(sktConnect);
            return -1;
        }
        . . . . .
    }

There are 80 such nodes, and the application connects to all of its peers in round-robin fashion.
The tcpdump log is:

    387937 2012-07-05 07:45:30.646514 10.18.92.173 10.137.165.136 TCP 33728 > 8441 [SYN] Seq=0 Ack=0 Win=5792 Len=0 MSS=1460 TSV=1414450402 TSER=912308224 WS=8
    387947 2012-07-05 07:45:30.780762 10.137.165.136 10.18.92.173 TCP 8441 > 33728 [SYN,ACK] Seq=0 Ack=1 Win=5792 Len=0 MSS=1460 TSV=912309754 TSER=1414450402 WS=8
    387948 2012-07-05 07:45:30.780773 10.18.92.173 10.137.165.136 TCP 33728 > 8441 [ACK] Seq=1 Ack=1 Win=5888 Len=0 TSV=1414450435 TSER=912309754

The three events above indicate a successful connection.

    387949 2012-07-05 07:45:30.782652 10.18.92.173 10.137.165.136 TCP 33728 > 8441 [PSH,ACK] Seq=1 Ack=1 Win=5888 Len=320 TSV=1414450436 TSER=912309754
    387967 2012-07-05 07:45:30.915615 10.137.165.136 10.18.92.173 TCP 8441 > 33728 [ACK] Seq=1 Ack=321 Win=6912 Len=0 TSV=912309788 TSER=1414450436
    388011 2012-07-05 07:45:31.362712 10.18.92.173 10.137.165.136 TCP 33728 > 8441 [PSH,ACK] Seq=321 Ack=1 Win=5888 Len=320 TSV=1414450581 TSER=912309788
    388055 2012-07-05 07:45:31.495558 10.137.165.136 10.18.92.173 TCP 8441 > 33728 [ACK] Seq=1 Ack=641 Win=7936 Len=0 TSV=912309933 TSER=1414450581
    388080 2012-07-05 07:45:31.702336 10.137.165.136 10.18.92.173 TCP 8441 > 33728 [PSH,ACK] Seq=1 Ack=641 Win=7936 Len=712 TSV=912309985 TSER=1414450581
    388081 2012-07-05 07:45:31.702350 10.18.92.173 10.137.165.136 TCP 33728 > 8441 [ACK] Seq=641 Ack=713 Win=7424 Len=0 TSV=1414450666 TSER=912309985
    388142 2012-07-05 07:45:32.185612 10.137.165.136 10.18.92.173 TCP 8441 > 33728 [PSH,ACK] Seq=713 Ack=641 Win=7936 Len=320 TSV=912310106 TSER=1414450666
    388143 2012-07-05 07:45:32.185629 10.18.92.173 10.137.165.136 TCP 33728 > 8441 [ACK] Seq=641 Ack=1033 Win=8704 Len=0 TSV=1414450786 TSER=912310106
    388169 2012-07-05 07:45:32.362622 10.18.92.173 10.137.165.136 TCP 33728 > 8441 [PSH,ACK] Seq=641 Ack=1033 Win=8704 Len=320 TSV=1414450831 TSER=912310106
    388212 2012-07-05 07:45:32.494833 10.137.165.136 10.18.92.173 TCP 8441 > 33728 [ACK] Seq=1033 Ack=961 Win=9216 Len=0 TSV=912310183 TSER=1414450831
    388219 2012-07-05 07:45:32.501613 10.137.165.136 10.18.92.173 TCP 8441 > 33728 [PSH,ACK] Seq=1033 Ack=961 Win=9216 Len=356 TSV=912310185 TSER=1414450831
    388220 2012-07-05 07:45:32.501624 10.18.92.173 10.137.165.136 TCP 33728 > 8441 [ACK] Seq=961 Ack=1389 Win=10240 Len=0 TSV=1414450865 TSER=912310185

The application log reports a connection error (that is, from the connect and select API calls):

    [5258: 2012-07-05 07:45:30]Connect [10.137.165.136 <- 10.18.92.173] success.
    [5258: 2012-07-05 07:45:32]Connect 10.137.165.137 fail after select,cause:115,error Operation now in progress. Check whether remote machine exist and the network is normal or not.
    [5258: 2012-07-05 07:45:32]Connect to server([10.137.165.137 <- 10.18.92.173],port=8441) Failed!

The success log entry corresponds to the first three entries of the tcpdump, but the tcpdump contains no events that match the failure log entry.
Solution
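You have hit EINPROGRESS. From the connect man page:

    EINPROGRESS
        The socket is nonblocking and the connection cannot be completed immediately. It is possible to select(2) or poll(2) for completion by selecting the socket for writing. After select(2) indicates writability, use getsockopt(2) to read the SO_ERROR option at level SOL_SOCKET to determine whether connect() completed successfully (SO_ERROR is zero) or unsuccessfully (SO_ERROR is one of the usual error codes listed here, explaining the reason for the failure).

In other words, EINPROGRESS means the kernel could not complete the connection right away, even though local ports and route cache entries were available. This appears to happen when the socket has not yet transitioned to the ESTABLISHED state. Simply put the socket into select again, and afterwards call getsockopt to check whether the connection actually completed.

As for why: the socket transitions to the SYN_SENT state during connect, but the packet may still be sitting in the output queue and may not yet have reached the network device buffer.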
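The select-then-getsockopt sequence described above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the asker's actual code: connect_with_timeout and close_with_errno are hypothetical names, the timeout parameter is an illustrative choice, and the optional bind step from the question is left out. The key difference from the question's code is that a select timeout (return value 0) is reported as ETIMEDOUT instead of reusing the stale errno left over from connect, and the real outcome is read from SO_ERROR.

```c
#include <errno.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <unistd.h>

/* Hypothetical helper: close fd, then report err through errno. */
static int close_with_errno(int fd, int err)
{
    close(fd);
    errno = err;
    return -1;
}

/* Hypothetical helper: returns a connected non-blocking socket,
 * or -1 with errno describing why the connection failed. */
int connect_with_timeout(const struct sockaddr_in *dest, int timeout_sec)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;
    if (fd >= FD_SETSIZE)               /* select() cannot watch this fd */
        return close_with_errno(fd, EMFILE);

    int flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0)
        return close_with_errno(fd, errno);

    if (connect(fd, (const struct sockaddr *)dest, sizeof(*dest)) == 0)
        return fd;                      /* connected immediately */
    if (errno != EINPROGRESS)
        return close_with_errno(fd, errno);

    /* The handshake is still in flight: wait for writability. */
    struct timeval tv = { .tv_sec = timeout_sec, .tv_usec = 0 };
    fd_set wset;
    int n;
    do {
        FD_ZERO(&wset);
        FD_SET(fd, &wset);
        /* Linux updates tv to the remaining time; portable code
         * would recompute the deadline before retrying. */
        n = select(fd + 1, NULL, &wset, NULL, &tv);
    } while (n < 0 && errno == EINTR);

    if (n == 0)                         /* timeout: errno was NOT set by select */
        return close_with_errno(fd, ETIMEDOUT);
    if (n < 0)
        return close_with_errno(fd, errno);

    /* Writable does not mean connected: ask the kernel for the result. */
    int so_error = 0;
    socklen_t len = sizeof(so_error);
    if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &so_error, &len) < 0)
        return close_with_errno(fd, errno);
    if (so_error != 0)
        return close_with_errno(fd, so_error);

    return fd;
}
```

With a helper like this, a failure log would show the actual cause (for example ETIMEDOUT or ECONNREFUSED) rather than error 115. Whether a 2 second timeout is long enough for 80 peers under high load is a separate question that the log alone does not answer.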