likes
comments
collection
share

记一次java堆空间爆满导致宕机的问题

作者站长头像
站长
· 阅读数 4

1 问题现象与定位

团队有一个服务,一直运行的好好的,突然访问异常了,先是请求超时,然后直接无法访问。接到问题反馈后,首先想到的是查看日志,日志中没有任何程序逻辑处理错误的日志,但是接口请求处理的时间拉的特别的长,且访问外部接口都报超时了。

同一个接口处理时间对比如下:

2024-01-25 00:03:44.451 [http-nio-8843-exec-7] INFO  c.b.w.o.api.v1.ApiControllerV1 - 2410710a-17ec-49ce-a74e-23e410189b08 clientRealIp: 10.181.202.12, finish getStatus, duration time: 634 ms
2024-01-25 02:29:38.707 [http-nio-8843-exec-105] INFO  c.b.w.o.a.a.BasicApiExceptionHandler - 3f3b24ef-b1c0-4558-ad15-ec88b30aa0bd clientRealIp: 10.181.200.33, finish getStatus, duration time: 1217935 ms

eureka client周期性打印以下日志:

2024-01-25 01:00:59.823 [DiscoveryClient-1] WARN  c.n.discovery.TimedSupervisorTask -  task supervisor timed out
java.util.concurrent.TimeoutException: null
	at java.util.concurrent.FutureTask.get(FutureTask.java:205)
	at com.netflix.discovery.TimedSupervisorTask.run(TimedSupervisorTask.java:66)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

访问数据库的连接异常断开:

2024-01-25 01:12:21.965 [http-nio-8843-exec-13] WARN  c.zaxxer.hikari.pool.ProxyConnection - d0487d3e-89cf-4385-ae59-6d26ff1e277a HikariPool-1 - Connection com.mysql.jdbc.JDBC4Connection@2b165bf4 marked as broken because of SQLSTATE(08007), ErrorCode(0)
com.mysql.jdbc.exceptions.jdbc4.MySQLNonTransientConnectionException: Communications link failure during commit(). Transaction resolution unknown.
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
	at com.mysql.jdbc.Util.handleNewInstance(Util.java:377)
	at com.mysql.jdbc.Util.getInstance(Util.java:360)
	at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:956)
	at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:935)
	at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:924)
	at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:870)
	at com.mysql.jdbc.ConnectionImpl.commit(ConnectionImpl.java:1611)
	at com.zaxxer.hikari.pool.ProxyConnection.commit(ProxyConnection.java:361)
	at com.zaxxer.hikari.pool.HikariProxyConnection.commit(HikariProxyConnection.java)

调用外部接口报错:

2024-01-25 01:12:22.148 [http-nio-8843-exec-3] ERROR c.b.w.o.a.a.BasicApiExceptionHandler - 7c4a2b46-dc8a-4121-85ff-cf25ec16a17d org.springframework.web.util.NestedServletException: Handler dispatch failed; nested exception is java.lang.OutOfMemoryError: Java heap space
	at org.springframework.web.servlet.DispatcherServlet.doDispatch(DispatcherServlet.java:1055)
	at org.springframework.web.servlet.DispatcherServlet.doService(DispatcherServlet.java:943)
	at org.springframework.web.servlet.FrameworkServlet.processRequest(FrameworkServlet.java:1006)
	at org.springframework.web.servlet.FrameworkServlet.doGet(FrameworkServlet.java:898)
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:634)
	at org.springframework.web.servlet.FrameworkServlet.service(FrameworkServlet.java:883)
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:741)
	at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:231)
	at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
	at org.apache.tomcat.websocket.server.WsFilter.doFilter(WsFilter.java:53)
	at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193)
	at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
	at org.springframework.boot.actuate.web.trace.servlet.HttpTraceFilter.doFilterInternal(HttpTraceFilter.java:88)
	at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:119)
	at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193)
	at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
	at com.baidu.wuyou.operation.api.aspect.CachingRequestBodyFilter.doFilterInternal(CachingRequestBodyFilter.java:25)
	at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:119)
	at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193)
	at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
	at org.springframework.web.filter.RequestContextFilter.doFilterInternal(RequestContextFilter.java:100)
	at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:119)
	at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193)
	at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
	at org.springframework.web.filter.FormContentFilter.doFilterInternal(FormContentFilter.java:93)
	at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:119)
	at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193)
	at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
	at org.springframework.web.filter.HiddenHttpMethodFilter.doFilterInternal(HiddenHttpMethodFilter.java:94)
	at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:119)
	at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193)
	at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
	at org.springframework.boot.actuate.metrics.web.servlet.WebMvcMetricsFilter.filterAndRecordMetrics(WebMvcMetricsFilter.java:114)
	at org.springframework.boot.actuate.metrics.web.servlet.WebMvcMetricsFilter.doFilterInternal(WebMvcMetricsFilter.java:104)
	at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:119)
	at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193)
	at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
	at org.springframework.web.filter.CharacterEncodingFilter.doFilterInternal(CharacterEncodingFilter.java:201)
	at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:119)
	at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193)
	at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
	at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:202)
	at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:96)
	at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:541)
	at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:139)
	at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:92)
	at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:74)
	at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:343)
	at org.apache.coyote.http11.Http11Processor.service(Http11Processor.java:367)
	at org.apache.coyote.AbstractProcessorLight.process(AbstractProcessorLight.java:65)
	at org.apache.coyote.AbstractProtocol$ConnectionHandler.process(AbstractProtocol.java:860)
	at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1598)
	at org.apache.tomcat.util.net.SocketProcessorBase.run(SocketProcessorBase.java:49)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.OutOfMemoryError: Java heap space
	at com.fasterxml.jackson.core.util.BufferRecycler.calloc(BufferRecycler.java:191)
	at com.fasterxml.jackson.core.util.BufferRecycler.allocCharBuffer(BufferRecycler.java:161)
	at com.fasterxml.jackson.core.util.BufferRecycler.allocCharBuffer(BufferRecycler.java:151)
	at com.fasterxml.jackson.core.io.IOContext.allocConcatBuffer(IOContext.java:234)
	at com.fasterxml.jackson.core.json.UTF8JsonGenerator.<init>(UTF8JsonGenerator.java:139)
	at com.fasterxml.jackson.core.JsonFactory._createUTF8Generator(JsonFactory.java:1817)
	at com.fasterxml.jackson.core.JsonFactory.createGenerator(JsonFactory.java:1281)

上面这条日志很“友好”的说明出错的原因是java.lang.OutOfMemoryError: Java heap space,立即查看应用程序的堆监控,显示堆空间爆满而宕机了,监控图如下所示:

记一次java堆空间爆满导致宕机的问题

现在就是要确定具体是哪个地方导致内存占用飙升了,幸亏项目设置了如下配置,当程序异常宕机时会dump堆空间快照。

JVM_OPTS="$JVM_OPTS -XX:+HeapDumpOnOutOfMemoryError"
JVM_OPTS="$JVM_OPTS -XX:HeapDumpPath=${LOG_DIR}/${PROJ_NAME}-`date +%s`-pid$$.hprof"
JVM_OPTS="$JVM_OPTS -Xloggc:${LOG_DIR}/${PROJ_NAME}-gc.log"
JVM_OPTS="$JVM_OPTS -XX:ErrorFile=${LOG_DIR}/${PROJ_NAME}-`date +%s`-pid%p.log"

马上用MATMemoryAnalyzer)工具查看heap dump数据,终于找到了罪魁祸首,如下所示:

记一次java堆空间爆满导致宕机的问题

2 问题原因分析

导致内存爆满的地方的核心功能是从多个数据源获取数据,然后提供给grafana进行大屏展示(基于simplejson插件)。同时,grafana大屏数据会自动周期性刷新。当在一个panel内展示的曲线比较多,且展示的时间范围比较大,同时需要展示的panel数也较多时,导致该数据处理逻辑处理不过来,上一个刷新周期还没有处理完,下一个刷新周期的请求又到了,同时java heap空间设置的也不足够大,导致heap空间占用累积上升,jvm一直忙于STWStop The World)式垃圾收集,直到最终宕机

3 问题修复与优化方法

通过这次故障,发现这个模块看上去处理逻辑比较简单,但是仍然暴露了不少设计上的问题,分别采用了一些优化方法:

  1. 首先当然是增加heap空间大小,毕竟能多撑一会儿是一会儿

  2. 梳理处理逻辑时,发现性能仍然有提升空间,通过提升性能,减少数据处理时间,让内存尽快释放

  3. 请求外部数据接口和等待并发任务完成时,为避免挂死,都设置了超时时间,原先为了尽量能获取到数据(外部接口有时确实返回也比较慢),设置的超时时间比较长,降低超时时间(消除隐患)

  4. 限制单个pannel的最大曲线数,限制最大展示时间窗口大小

  5. 增量获取数据

  6. 将该功能独立为单独的微服务,不与其他逻辑放在一个服务中(作为独立的微服务,也不会增加调用链长度)

  7. 优化原先的数据结构。

    原先的数据结构为:

    public class Avail {
        private Long total;
        private Integer loss;
        private Double latency;
        private Double rate;
    }
    

    改为:

    public class Avail {
        private long total;
        private int loss;
        private double latency;
        private double rate;
    }
    

    这样一个数据点可以节约16字节(不压缩对象指针)或8字节(压缩对象指针),这样一个数据点可以节约内存约50%~67%