背景:项目上使用Nacos做服务注册与发现,在开发环境中我们使用的容器化的Naocs部署,并且使用的单机节点以及内置数据库derby,但是最近在部署Nacos的服务的使用发现启动失败,报错日志如下(日志很长,已经截取了一些):

2021-09-18 22:40:31,822 INFO Tomcat initialized with port(s): 8848 (http)

2021-09-18 22:40:32,148 INFO Root WebApplicationContext: initialization completed in 11864 ms

2021-09-18 22:40:44,722 ERROR Error starting Tomcat context. Exception: org.springframework.beans.factory.BeanCreationException. Message: Error creating bean with name 'authFilterRegistration' defined in class path resource [com/alibaba/nacos/core/auth/AuthConfigs.class]: Bean instantiation via factory method failed; nested exception is org.springframework.beans.BeanInstantiationException: Failed to instantiate [org.springframework.boot.web.servlet.FilterRegistrationBean]: Factory method 'authFilterRegistration' threw exception; nested exception is org.springframework.beans.factory.UnsatisfiedDependencyException: Error creating bean with name 'authFilter': Unsatisfied dependency expressed through field 'authManager'; nested exception is org.springframework.beans.factory.UnsatisfiedDependencyException: Error creating bean with name 'nacosAuthManager': Unsatisfied dependency expressed through field 'authenticationManager'; nested exception is org.springframework.beans.factory.UnsatisfiedDependencyException: Error creating bean with name 'nacosAuthConfig': Unsatisfied dependency expressed through field 'userDetailsService'; nested exception is org.springframework.beans.factory.UnsatisfiedDependencyException: Error creating bean with name 'nacosUserDetailsServiceImpl': Unsatisfied dependency expressed through field 'userPersistService'; nested exception is org.springframework.beans.factory.UnsatisfiedDependencyException: Error creating bean with name 'embeddedUserPersistServiceImpl': Unsatisfied dependency expressed through field 'databaseOperate'; nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'standaloneDatabaseOperateImpl': Invocation of init method failed; nested exception is java.lang.RuntimeException: com.alibaba.nacos.api.exception.runtime.NacosRuntimeException: errCode: 500, errMsg: load schema.sql error.

2021-09-18 22:40:44,827 WARN Exception encountered during context initialization - cancelling refresh attempt: org.springframework.context.ApplicationContextException: Unable to start web server; nested exception is org.springframework.boot.web.server.WebServerException: Unable to start embedded Tomcat

2021-09-18 22:40:44,837 INFO Nacos Log files: /home/nacos/logs

2021-09-18 22:40:44,849 INFO Nacos Log files: /home/nacos/conf

2021-09-18 22:40:44,853 INFO Nacos Log files: /home/nacos/data

2021-09-18 22:40:44,856 ERROR Startup errors : {}

org.springframework.context.ApplicationContextException: Unable to start web server; nested exception is org.springframework.boot.web.server.WebServerException: Unable to start embedded Tomcat
        at org.springframework.boot.web.servlet.context.ServletWebServerApplicationContext.onRefresh(ServletWebServerApplicationContext.java:157)
        at org.springframework.context.support.AbstractApplicationContext.refresh(AbstractApplicationContext.java:540)
...............
Caused by: com.alibaba.nacos.api.exception.runtime.NacosRuntimeException: errCode: 500, errMsg: load schema.sql error.
        at com.alibaba.nacos.config.server.service.datasource.LocalDataSourceServiceImpl.reload(LocalDataSourceServiceImpl.java:101)
        at com.alibaba.nacos.config.server.service.datasource.LocalDataSourceServiceImpl.initialize(LocalDataSourceServiceImpl.java:170)
        at com.alibaba.nacos.config.server.service.datasource.LocalDataSourceServiceImpl.init(LocalDataSourceServiceImpl.java:83)
        at com.alibaba.nacos.config.server.service.datasource.DynamicDataSource.getDataSource(DynamicDataSource.java:47)
        ... 166 common frames omitted
Caused by: java.sql.SQLTimeoutException: Login timeout exceeded.
        at org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
        at org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
        at org.apache.derby.impl.jdbc.Util.generateCsSQLException(Unknown Source)
        at org.apache.derby.impl.jdbc.Util.generateCsSQLException(Unknown Source)
        at org.apache.derby.jdbc.InternalDriver.timeLogin(Unknown Source)
        at org.apache.derby.jdbc.InternalDriver.connect(Unknown Source)
        at org.apache.derby.jdbc.InternalDriver.connect(Unknown Source)
        at org.apache.derby.jdbc.EmbeddedDriver.connect(Unknown Source)
        at com.zaxxer.hikari.util.DriverDataSource.getConnection(DriverDataSource.java:138)
        at com.zaxxer.hikari.pool.PoolBase.newConnection(PoolBase.java:354)
        at com.zaxxer.hikari.pool.PoolBase.newPoolEntry(PoolBase.java:202)
        at com.zaxxer.hikari.pool.HikariPool.createPoolEntry(HikariPool.java:473)
        at com.zaxxer.hikari.pool.HikariPool.checkFailFast(HikariPool.java:554)
        at com.zaxxer.hikari.pool.HikariPool.<init>(HikariPool.java:115)
        at com.zaxxer.hikari.HikariDataSource.getConnection(HikariDataSource.java:112)
        at com.alibaba.nacos.config.server.service.datasource.LocalDataSourceServiceImpl.reload(LocalDataSourceServiceImpl.java:96)
        ... 169 common frames omitted
Caused by: org.apache.derby.iapi.error.StandardException: Login timeout exceeded.
        at org.apache.derby.iapi.error.StandardException.newException(Unknown Source)
        at org.apache.derby.impl.jdbc.SQLExceptionFactory.wrapArgsForTransportAcrossDRDA(Unknown Source)
        ... 185 common frames omitted

根据错误日志分析: Nacos在加载schema.sql文件时,derby数据库报了一个Login Timeout exceeded的错误,看着像是超时,此时根据异常堆栈查看源码如下:

private EmbedConnection timeLogin(String var1, Properties var2, int var3) throws SQLException {
    try {
        InternalDriver.LoginCallable var4 = new InternalDriver.LoginCallable(this, var1, var2);
        Future var5 = _executorPool.submit(var4);
        long var6 = System.currentTimeMillis();
        long var8 = var6 + (long)var3 * 1000L;

        while(var6 < var8) {
            try {
                EmbedConnection var10 = (EmbedConnection)var5.get(var8 - var6, TimeUnit.MILLISECONDS);
                return var10;
            } catch (InterruptedException var16) {
                InterruptStatus.setInterrupted();
                var6 = System.currentTimeMillis();
            } catch (ExecutionException var17) {
                throw this.processException(var17);
            } catch (TimeoutException var18) {
                throw Util.generateCsSQLException("XBDA0.C.1", new Object[0]);
            }
        }

        throw Util.generateCsSQLException("XBDA0.C.1", new Object[0]);
    } finally {
        InterruptStatus.restoreIntrFlagIfSeen();
    }
}

代码逻辑很简单,封装了获取EmbedConnection这个数据库链接的过程为异步获取,通过Future来控制超时时长,那么此时可以判断出就是在获取EmbedConnection这个数据库链接的时候超时了。

但是获取EmbedConnection为什么超时,里面到底做了什么,不得而知(由于源码的class文件中缺乏了本地变量表,导致代码阅读困难),此时注意到:

2021-09-18 22:40:31,822 INFO Tomcat initialized with port(s): 8848 (http)

2021-09-18 22:40:44,722 ERROR。。。。

这两行日志为SpringBoot的启动日志,而这两行日志中,日志打印的时间差了10s左右,那么猜测可能就是这10s中获取数据库连接超时,并且一定会有一个线程一直阻塞中。

此时我们想到可以使用 jstack命令去查看java进程的线程堆栈日志,如下:

"derby.rawStoreDaemon" #38 daemon prio=5 os_prio=0 tid=0x00007ff090057800 nid=0x6337 in Object.wait() [0x00007ff097ffe000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0x00000000ee89fd70> (a org.apache.derby.impl.services.daemon.BasicDaemon)
        at org.apache.derby.impl.services.daemon.BasicDaemon.rest(Unknown Source)
        - locked <0x00000000ee89fd70> (a org.apache.derby.impl.services.daemon.BasicDaemon)
        at org.apache.derby.impl.services.daemon.BasicDaemon.run(Unknown Source)
        at java.lang.Thread.run(Thread.java:748)

"Thread-15" #37 daemon prio=5 os_prio=0 tid=0x00007ff12928b800 nid=0x6336 runnable [0x00007ff0ec1be000]
   java.lang.Thread.State: RUNNABLE
        at java.io.FileDescriptor.sync(Native Method)
        at org.apache.derby.impl.io.DirRandomAccessFile.sync(Unknown Source)
        at org.apache.derby.impl.store.raw.data.RAFContainer.writeRAFHeader(Unknown Source)
        at org.apache.derby.impl.store.raw.data.RAFContainer.clean(Unknown Source)
        - locked <0x00000000ef759b28> (a org.apache.derby.impl.store.raw.data.RAFContainer4)
        at org.apache.derby.impl.services.cache.ConcurrentCache.cleanAndUnkeepEntry(Unknown Source)
        at org.apache.derby.impl.services.cache.ConcurrentCache.cleanCache(Unknown Source)
        at org.apache.derby.impl.services.cache.ConcurrentCache.cleanAll(Unknown Source)
        at org.apache.derby.impl.store.raw.data.BaseDataFileFactory.checkpoint(Unknown Source)
        at org.apache.derby.impl.store.raw.data.BaseDataFileFactory.createFinished(Unknown Source)
        at org.apache.derby.impl.store.raw.RawStore.createFinished(Unknown Source)
        at org.apache.derby.impl.store.access.RAMAccessManager.createFinished(Unknown Source)
        at org.apache.derby.impl.db.BasicDatabase.createFinished(Unknown Source)
        at org.apache.derby.impl.db.BasicDatabase.boot(Unknown Source)
        at org.apache.derby.impl.services.monitor.BaseMonitor.boot(Unknown Source)
        at org.apache.derby.impl.services.monitor.TopService.bootModule(Unknown Source)
        at org.apache.derby.impl.services.monitor.BaseMonitor.bootService(Unknown Source)
        at org.apache.derby.impl.services.monitor.BaseMonitor.createPersistentService(Unknown Source)
        at org.apache.derby.impl.services.monitor.FileMonitor.createPersistentService(Unknown Source)
        at org.apache.derby.iapi.services.monitor.Monitor.createPersistentService(Unknown Source)
        at org.apache.derby.impl.jdbc.EmbedConnection$5.run(Unknown Source)
        at java.security.AccessController.doPrivileged(Native Method)
        at org.apache.derby.impl.jdbc.EmbedConnection.createPersistentService(Unknown Source)
        at org.apache.derby.impl.jdbc.EmbedConnection.createDatabase(Unknown Source)
        at org.apache.derby.impl.jdbc.EmbedConnection.<init>(Unknown Source)
        at org.apache.derby.jdbc.InternalDriver$1.run(Unknown Source)
        at org.apache.derby.jdbc.InternalDriver$1.run(Unknown Source)
        at java.security.AccessController.doPrivileged(Native Method)
        at org.apache.derby.jdbc.InternalDriver.getNewEmbedConnection(Unknown Source)
        at org.apache.derby.jdbc.InternalDriver$LoginCallable.call(Unknown Source)
        at org.apache.derby.jdbc.InternalDriver$LoginCallable.call(Unknown Source)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
.......

"main" #1 prio=5 os_prio=0 tid=0x00007ff12804c000 nid=0x61d1 waiting on condition [0x00007ff130ee5000]
   java.lang.Thread.State: TIMED_WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x00000000c01f6de8> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
        at java.util.concurrent.ThreadPoolExecutor.awaitTermination(ThreadPoolExecutor.java:1475)
        at com.alibaba.nacos.common.utils.ThreadUtils.shutdownThreadPool(ThreadUtils.java:121)
        at com.alibaba.nacos.common.utils.ThreadUtils.shutdownThreadPool(ThreadUtils.java:106)
        at com.alibaba.nacos.common.executor.ThreadPoolManager.destroy(ThreadPoolManager.java:156)
        - locked <0x00000000c03b65d8> (a java.lang.Object)
        at com.alibaba.nacos.common.executor.ThreadPoolManager.shutdown(ThreadPoolManager.java:197)
        at com.alibaba.nacos.core.code.StartingSpringApplicationRunListener.failed(StartingSpringApplicationRunListener.java:147)
        at org.springframework.boot.SpringApplicationRunListeners.callFailedListener(SpringApplicationRunListeners.java:91)
        at org.springframework.boot.SpringApplicationRunListeners.failed(SpringApplicationRunListeners.java:84)
        at org.springframework.boot.SpringApplication.handleRunFailure(SpringApplication.java:828)
        at org.springframework.boot.SpringApplication.run(SpringApplication.java:327)
        at org.springframework.boot.SpringApplication.run(SpringApplication.java:1260)
        at org.springframework.boot.SpringApplication.run(SpringApplication.java:1248)
        at com.alibaba.nacos.Nacos.main(Nacos.java:35)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.springframework.boot.loader.MainMethodRunner.run(MainMethodRunner.java:49)
        at org.springframework.boot.loader.Launcher.launch(Launcher.java:108)
        at org.springframework.boot.loader.Launcher.launch(Launcher.java:58)
        at org.springframework.boot.loader.PropertiesLauncher.main(PropertiesLauncher.java:467)

"VM Thread" os_prio=0 tid=0x00007ff12819d000 nid=0x61df runnable

"GC task thread#0 (ParallelGC)" os_prio=0 tid=0x00007ff12805e800 nid=0x61d8 runnable

"GC task thread#1 (ParallelGC)" os_prio=0 tid=0x00007ff128060800 nid=0x61d9 runnable

"GC task thread#2 (ParallelGC)" os_prio=0 tid=0x00007ff128062800 nid=0x61da runnable

"GC task thread#3 (ParallelGC)" os_prio=0 tid=0x00007ff128064000 nid=0x61db runnable

"VM Periodic Task Thread" os_prio=0 tid=0x00007ff1281f0000 nid=0x61e7 waiting on condition

JNI global references: 991

经过多次的jstack命令查看,发现Thread-15这个线程一直阻塞在

 java.io.FileDescriptor.sync(Native Method)

查看下jdk中对此方法的说明:

在涉及IO的操作中,linux系统为了平衡物理介质和内存的速度差异,引入了PageCache的概念,在我们调用outputstram的write的方法时,并不是将内容写入到物理介质中而是写入到PageCache中并且标记此块内存为dirty。linux的后台会启动一个线程定时将标记为dirty的内存块刷新到物理介质中,而如果此时虚机断电的话,那么没有刷新到物理介质中的内容就有可能会丢失,所以linxu会提供fsync的命令强制刷新,而不是等定时任务刷新。而对应jdk中封装就是FileDescriptor.sync的方法

这个问题我的第一反应就是会不会是PageCache过大,导致刷盘过慢通过free -g命令查看

free -g
              total        used        free      shared  buff/cache   available
Mem:              8           1           0           0           6           6
Swap:             9           0           9

发现总共内存为8g,而PageCache占用了高达6g,立马将PageCache强制刷盘,并重新启动Nacos,可惜问题依旧存在。

此时我的想法是会不会是在执行某一个fsync的命令的时候,执行时间过久呢,我希望能够知道每个fsync的执行耗时,修改启动命令为:

#nohup java  ${JAVA_OPT} >${BASE_DIR}/logs/start.out 2>&1 </dev/null
strace -T -ttt -ff -xx -yy -o strace.log java  ${JAVA_OPT} >${BASE_DIR}/logs/start.out 2>&1 </dev/null

增加了strace命令去追踪每个系统调用的耗时,并且以线程ID的形式分成不同的文件,strace.log后面跟着的数字就是线程ID。

[root@localhost nacos]# ls -l
total 77832
drwxr-xr-x 2 root root     4096 Sep 18 22:37 bin
drwxr-xr-x 2 root root     4096 Sep 18 01:52 conf
drwxr-xr-x 3 root root     4096 Sep 18 22:40 data
-rw-r--r-- 1 root root      696 Sep 18 22:40 derby.log
drwxr-xr-x 2 root root     4096 Sep 18 22:40 logs
-rw-r--r-- 1 root root    44298 Sep 18 22:40 strace.log.27777
-rw-r--r-- 1 root root 58640394 Sep 18 22:40 strace.log.27778
-rw-r--r-- 1 root root    16262 Sep 18 22:40 strace.log.27779
-rw-r--r-- 1 root root    16882 Sep 18 22:40 strace.log.25398
......
drwxr-xr-x 2 root root     4096 Sep 18 22:38 target
drwxr-xr-x 3 root root     4096 Sep 18 22:40 work

此时我们回到之前打印的线程堆栈日志,找到阻塞的线程

"Thread-15" #37 daemon prio=5 os_prio=0 tid=0x00007ff12928b800 nid=0x6336 runnable [0x00007ff0ec1be000]

线程Id: nid=0x6336,通过进制换算将16进制换算成10进制为25398,找到strace.log.25398,下载并通过notepad++ 打开,搜索fsync命令:

// 时间戳
1631906136.913442 fsync(44<\x2f\x68\x6f\x6d\x65\x2f\x6e\x61\x63\x6f\x73\x2f\x64\x61\x74\x61\x2f\x64\x65\x72\x62\x79\x2d\x64\x61\x74\x61\x2f\x64\x62\x2e\x6c\x63\x6b>) = 0 
// fsync命令耗时
<0.184688>

找到所有的fsync命令,通过打印的时间戳计算总耗时大约为11s左右。
此时只需要确认前面获取链接的超时时间,确认默认的超时时长为多少,查看Nacos的代码,找到初始化数据库连接的地方,通过代码跟踪找到:

private synchronized void initialize(String jdbcUrl) {
    HikariDataSource ds = new HikariDataSource();
    ds.setDriverClassName(jdbcDriverName);
    ds.setJdbcUrl(jdbcUrl);
    ds.setUsername(userName);
    ds.setPassword(password);
    ds.setIdleTimeout(30_000L);
    ds.setMaximumPoolSize(80);
    ds.setConnectionTimeout(10000L);
    DataSourceTransactionManager tm = new DataSourceTransactionManager();
    tm.setDataSource(ds);
    if (jdbcTemplateInit) {
        jt.setDataSource(ds);
        tjt.setTransactionManager(tm);
    } else {
        jt = new JdbcTemplate();
        jt.setMaxRows(50000);
        jt.setQueryTimeout(5000);
        jt.setDataSource(ds);
        tjt = new TransactionTemplate(tm);
        tjt.setTimeout(5000);
        jdbcTemplateInit = true;
    }
    reload();
}

ds.setConnectionTimeout(10000L);这里就是设置数据库连接超时的地方,代码中写死为10s,和strace.log中计算的时长大致能对的上,所以至此可以确认,是调用linux的fsync的方法耗时过久导致业务侧超时,但是为什么会耗时这么久?毕竟阿里给的默认10s应该是经过验证的。带着这个疑问我寻思应该去查看磁盘的性能了。

iostat -x 1 1000

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           3.49    0.00    2.00   22.94    0.00   71.57

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00    43.00    0.00   52.00     0.00   488.00    14.92     0.96   17.56    0.00   17.56  18.42  95.80
dm-0              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-2              0.00     0.00    0.00   87.00     0.00   488.00     8.92     0.99   10.92    0.00   10.92  11.01  95.80

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           3.77    0.00    1.26   23.37    0.00   71.61

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00    47.00    0.00   59.00     0.00   548.00    18.58     0.99   16.47    0.00   16.47  16.24  95.80
dm-0              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-2              0.00     0.00    0.00   98.00     0.00   552.00    11.27     1.03   10.40    0.00   10.40   9.79  95.90

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           3.99    0.00    1.75   22.94    0.00   71.32

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00    42.00    0.00   55.00     0.00   464.00    13.24     0.98   19.11    0.00   19.11  17.22  94.70
dm-0              0.00     0.00    0.00    2.00     0.00    36.00    36.00     0.03   15.50    0.00   15.50   8.00   1.60
dm-1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-2              0.00     0.00    0.00   86.00     0.00   424.00     7.53     0.98   12.30    0.00   12.30  11.01  94.70

观测磁盘使用率,发现在Nacos的启动过程中, %util指标(磁盘的使用率)一直在95%左右,磁盘在被高度占用中,磁盘的写入速率wkB/s平均在500kb/s左右。

至此可以确认磁盘的性能不足导致derby数据库初始化的时候,刷盘耗时过久,超过了默认配置的10s的超时时长,导致Nacos启动失败。

修改方法很简单,修改默认超时时长或者将默认的超时时长变更为配置项放在配置文件中。

GitHub 加速计划 / na / nacos
29.83 K
12.75 K
下载
Nacos是由阿里巴巴开源的服务治理中间件,集成了动态服务发现、配置管理和服务元数据管理功能,广泛应用于微服务架构中,简化服务治理过程。
最近提交(Master分支:3 个月前 )
4334cd16 * Support custom client configuration timeout.(#12748) * Add UT.(#12748) 19 天前
b04d2266 23 天前
Logo

旨在为数千万中国开发者提供一个无缝且高效的云端环境,以支持学习、使用和贡献开源项目。

更多推荐