记一次外部agent侵入导致的OOM排查过程

2021/04/26 1241点热度 0人点赞 0条评论

业务系统从公有云在迁移到专有云后,有几台服务器一直有问题,直接导致迁移进度无进展。通过

jstat -gcutil pid  5000

查看,发现应用程序的老年代已满,fullgc一直在增长,反而ygc很小。

图片

猜测:

  1. 对象生成没有经过年轻代,直接进了老年代,程序里有大对象;

  2. 年轻代里没有连续的空间,导致无法创建数组直接进入老年代;

然后

jmap -heap pid 

查看到应用的堆有2g,但是新生代只有167MB,新生代太小,不符合1:2的配置

Server compiler detected.
JVM version is 25.281-b09

using parallel threads in the new generation.
using thread-local object allocation.
Concurrent Mark-Sweep GC

Heap Configuration:
MinHeapFreeRatio = 40
MaxHeapFreeRatio = 70
MaxHeapSize = 2147483648 (2048.0MB)
NewSize = 174456832 (166.375MB)
MaxNewSize = 174456832 (166.375MB)
OldSize = 1973026816 (1881.625MB)
NewRatio = 2
SurvivorRatio = 8
MetaspaceSize = 536870912 (512.0MB)
CompressedClassSpaceSize = 134217728 (128.0MB)
MaxMetaspaceSize = 536870912 (512.0MB)
G1HeapRegionSize = 0 (0.0MB)

Heap Usage:
New Generation (Eden + 1 Survivor Space):
capacity = 157024256 (149.75MB)
used = 115393952 (110.04824829101562MB)
free = 41630304 (39.701751708984375MB)
73.48797882538607% used
Eden Space:
capacity = 139591680 (133.125MB)
used = 97961376 (93.42324829101562MB)
free = 41630304 (39.701751708984375MB)
70.17708791813381% used
From Space:
capacity = 17432576 (16.625MB)
used = 17432576 (16.625MB)
free = 0 (0.0MB)
100.0% used
To Space:
capacity = 17432576 (16.625MB)
used = 0 (0.0MB)
free = 17432576 (16.625MB)
0.0% used
concurrent mark-sweep generation:
capacity = 1973026816 (1881.625MB)
used = 320379248 (305.53746032714844MB)
free = 1652647568 (1576.0875396728516MB)
16.237957102352937% used

这么小的新生代,不合理。先调整jvm参数的新生代为768m,并且把gc日志的明细输出。

-Xmn768m
-XX:+PrintGC -XX:+DisableExplicitGC -XX:+PrintGCDetails -XX:+PrintHeapAtGC -XX:+PrintGCApplicationConcurrentTime -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -Xloggc:/app/logs/gc.log -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/app/logs/

事故再次发生后

2021-04-23T08:36:12.948+0800: 71534.555: [Full GC (Last ditch collection) 
2021-04-23T08:36:12.948+0800: 71534.555: [CMS: 1638219K->1659362K(1926784K), 2.6197429 secs] 1780257K->1659362K(2080128K), [Metaspace: 232942K->232942K(626688K)], 2.6200527 secs] [Times: user=2.57 sys=0.00, real=2.62 secs]
Heap after GC invocations=755 (full 647):
par new generation total 153344K, used 0K [0x0000000080000000, 0x000000008a660000, 0x000000008a660000)
eden space 136320K, 0% used [0x0000000080000000, 0x0000000080000000, 0x0000000088520000)
from space 17024K, 0% used [0x00000000895c0000, 0x00000000895c0000, 0x000000008a660000)
to space 17024K, 0% used [0x0000000088520000, 0x0000000088520000, 0x00000000895c0000)
concurrent mark-sweep generation total 1926784K, used 1659362K [0x000000008a660000, 0x0000000100000000, 0x0000000100000000)
Metaspace used 232905K, capacity 524136K, committed 524288K, reserved 626688K
class space used 18997K, capacity 28795K, committed 28800K, reserved 131072K
}
2021-04-23T08:36:15.570+0800: 71537.177: Application time: 0.0020953 seconds
{Heap before GC invocations=755 (full 647):
par new generation total 153344K, used 122K [0x0000000080000000, 0x000000008a660000, 0x000000008a660000)
eden space 136320K, 0% used [0x0000000080000000, 0x000000008001e968, 0x0000000088520000)
from space 17024K, 0% used [0x00000000895c0000, 0x00000000895c0000, 0x000000008a660000)
to space 17024K, 0% used [0x0000000088520000, 0x0000000088520000, 0x00000000895c0000)
concurrent mark-sweep generation total 1926784K, used 1659362K [0x000000008a660000, 0x0000000100000000, 0x0000000100000000)
Metaspace used 232914K, capacity 524144K, committed 524288K, reserved 626688K
class space used 18998K, capacity 28800K, committed 28800K, reserved 131072K
2021-04-23T08:36:15.576+0800: 71537.183: [Full GC (Metadata GC Threshold)
2021-04-23T08:36:15.576+0800: 71537.183: [CMS[YG occupancy: 122 K (153344 K)]
2021-04-23T08:36:16.706+0800: 71538.313: [weak refs processing, 0.0020164 secs]
2021-04-23T08:36:16.708+0800: 71538.315: [class unloading, 0.0453498 secs]
2021-04-23T08:36:16.753+0800: 71538.360: [scrub symbol table, 0.0375497 secs]
2021-04-23T08:36:16.791+0800: 71538.398: [scrub string table, 0.0051911 secs]: 1659362K->1659218K(1926784K), 1.5252660 secs] 1659484K->1659341K(2080128K), [Metaspace: 232914K->232914K(626688K)], 1.5254988 secs] [Times: user=1.53 sys=0.00, real=1.52 secs]
Heap after GC invocations=756 (full 648):
par new generation total 153344K, used 122K [0x0000000080000000, 0x000000008a660000, 0x000000008a660000)
eden space 136320K, 0% used [0x0000000080000000, 0x000000008001e968, 0x0000000088520000)
from space 17024K, 0% used [0x00000000895c0000, 0x00000000895c0000, 0x000000008a660000)
to space 17024K, 0% used [0x0000000088520000, 0x0000000088520000, 0x00000000895c0000)
concurrent mark-sweep generation total 1926784K, used 1659218K [0x000000008a660000, 0x0000000100000000, 0x0000000100000000)
Metaspace used 232914K, capacity 524144K, committed 524288K, reserved 626688K
class space used 18998K, capacity 28800K, committed 28800K, reserved 131072K
}
{Heap before GC invocations=756 (full 648):
par new generation total 153344K, used 122K [0x0000000080000000, 0x000000008a660000, 0x000000008a660000)
eden space 136320K, 0% used [0x0000000080000000, 0x000000008001e968, 0x0000000088520000)
from space 17024K, 0% used [0x00000000895c0000, 0x00000000895c0000, 0x000000008a660000)
to space 17024K, 0% used [0x0000000088520000, 0x0000000088520000, 0x00000000895c0000)
concurrent mark-sweep generation total 1926784K, used 1659218K [0x000000008a660000, 0x0000000100000000, 0x0000000100000000)
Metaspace used 232914K, capacity 524144K, committed 524288K, reserved 626688K
class space used 18998K, capacity 28800K, committed 28800K, reserved 131072K
2021-04-23T08:36:17.101+0800: 71538.708: [Full GC (Last ditch collection)
2021-04-23T08:36:17.101+0800: 71538.708: [CMS: 1659218K->1659176K(1926784K), 2.3683046 secs] 1659341K->1659176K(2080128K), [Metaspace: 232914K->232914K(626688K)], 2.3685772 secs] [Times: user=2.37 sys=0.00, real=2.37 secs]
Heap after GC invocations=757 (full 649):
par new generation total 153344K, used 0K [0x0000000080000000, 0x000000008a660000, 0x000000008a660000)
eden space 136320K, 0% used [0x0000000080000000, 0x0000000080000000, 0x0000000088520000)
from space 17024K, 0% used [0x00000000895c0000, 0x00000000895c0000, 0x000000008a660000)
to space 17024K, 0% used [0x0000000088520000, 0x0000000088520000, 0x00000000895c0000)
concurrent mark-sweep generation total 1926784K, used 1659176K [0x000000008a660000, 0x0000000100000000, 0x0000000100000000)
Metaspace used 232913K, capacity 524142K, committed 524288K, reserved 626688K
class space used 18998K, capacity 28799K, committed 28800K, reserved 131072K
}
2021-04-23T08:36:19.471+0800: 71541.078: Application time: 0.0011945 seconds
2021-04-23T08:36:19.472+0800: 71541.079: [GC (CMS Initial Mark) [1 CMS-initial-mark: 1659176K(1926784K)] 1659401K(2080128K), 0.0156439 secs] [Times: user=0.02 sys=0.00, real=0.02 secs]
2021-04-23T08:36:19.488+0800: 71541.095: [CMS-concurrent-mark-start]
2021-04-23T08:36:19.510+0800: 71541.117: Application time: 0.0219117 seconds
2021-04-23T08:36:37.221+0800: 71558.828: Application time: 0.0803242 seconds
{Heap before GC invocations=757 (full 650):
par new generation total 153344K, used 29816K [0x0000000080000000, 0x000000008a660000, 0x000000008a660000)
eden space 136320K, 21% used [0x0000000080000000, 0x0000000081d1e390, 0x0000000088520000)
from space 17024K, 0% used [0x00000000895c0000, 0x00000000895c0000, 0x000000008a660000)
to space 17024K, 0% used [0x0000000088520000, 0x0000000088520000, 0x00000000895c0000)
concurrent mark-sweep generation total 1926784K, used 1659176K [0x000000008a660000, 0x0000000100000000, 0x0000000100000000)
Metaspace used 232940K, capacity 524160K, committed 524288K, reserved 626688K
class space used 18999K, capacity 28800K, committed 28800K, reserved 131072K
2021-04-23T08:36:37.223+0800: 71558.830: [Full GC (Metadata GC Threshold)
2021-04-23T08:36:37.223+0800: 71558.830: [CMS2021-04-23T08:36:38.338+0800: 71559.945: [CMS-concurrent-mark: 1.165/18.850 secs] [Times: user=10.13 sys=1.41, real=18.85 secs]
(concurrent mode failure): 1659176K->1668506K(1926784K), 3.6348258 secs] 1688993K->1668506K(2080128K), [Metaspace: 232940K->232940K(626688K)], 3.6351573 secs] [Times: user=3.63 sys=0.02, real=3.64 secs]
.....

发现几个现象:

  1. 年轻代不往里写数据;

  2. 不停的fullgc;

  3. 每次fullgc后并没有释放太多的内存;

堆也打下来了,把堆拉下来,用jprofile打开,查看大对象 

图片

AgentClassLoader ??? 我们项目启动参数没有加agent啊。再往下点

com.qt.memchk.MemCheck  

让开发全局搜索了下,没有发现这个类。查看了下系统里所有java应用,发现了云盾相关的监控,难道是这个?先让运维把其中各一个节点的这两个监控kill掉(一个是两个服务A/B,一个服务两个节点A1/A2,B1/B2)kill掉的是A1和B1。

项目重启,两个小时没问题。触发了一次fullgc,老年代也没有问题。这个时候已经是4月25日了下班了。4月26日早上9点多,开发又说服务爆了。我查看这四台服务器。发现两台fullgc相对正常,两台fullgc都1w+了。当时没注意看,以为就是云盾影响的。还很高兴的给开发说,解决了,是阿里云盾的问题。

我把项目kill重启,这个时候发现不是停掉云盾的机器fullgc了。

想了下还是在机器里安装一个arthas吧,看看具体是什么玩意。查看了下classLoader,发现arthas的ClassLoader进来了, 还出现了AgentClassLoader。 

图片

这个时候引起我的注意了,arthas是通过attach的方式来监控的,那AgentClassLoader一定是一个Agent并且是通过attach进来的。

堆里一定有我没注意的东西。继续查看,选择对应的类,右键选择Use Selected Objects

图片

直接点ok。 

图片

这是什么鬼?系统里怎么会有这个服务?

找运维,运维说这个是青藤云的agent,主要是做主机安全的。找安全部门,并让安全部门赶紧拉群。

带着两个问题问对方
1,侵入了应用程序(这个是为什么?);
2,强引用,导致应用程序无法gc掉;

对方回答:
1. 这个是为了进行内存webshell检测,所以利用java attach机制在应用中运行了一个jar
2. 这个问题目前我们已经发现,正在进行修复,会尽快上线

无法理解,一个主机监控服务,尽然能侵入到业务应用中,而且我们的人还不知道这件事,多可怕。让运维把青藤云的服务都停了。重启,一直到现在也没再发现问题。

后续问题:

  1. 当-Xms -Xmx 都是2g的时候,没有设置Xmn ,启动后自由167mb。

  2. 90来次ygc就发生了2次fullgc

1 和我理解的不太一样,不设置xmn 系统应该走默认的1:2的分配机制,然并卵。我又查看了下初始化的参数

java -XX:+PrintFlagsInitial|grep NewSize

图片

发现eden区的大小和这个值很接近。

有空了把剩下的两个问题处理下。

yxkong

这个人很懒,什么都没留下

文章评论