events scheduler导致MGR节点退出详解及修复
问题复现
执行如下命令:
mysql> SET GLOBAL EVENT_SCHEDULER = OFF;
Query OK, 0 rows affected (0.01 sec)
mysql> CREATE EVENT e1 ON SCHEDULE EVERY 1 SECOND ENDS NOW() + INTERVAL 1 SECOND DO SELECT 1;
Query OK, 0 rows affected (0.01 sec)
mysql> SET GLOBAL EVENT_SCHEDULER = ON;
Query OK, 0 rows affected (0.04 sec)
mysql> show events;
Empty set (0.01 sec)
正常情况下,这个events会被自动删除。 如果在MGR集群上执行上面的命令,则根据编译版本不同表现不同。 Debug版本:直接ASSERT
#0 __GI_raise (sig=...) at raise.c:50
#1 0x00007fc873279535 in __GI_abort () at abort.c:79
#2 0x00007fc87327940f in __assert_fail_base (fmt=..., assertion=..., file=..., line=..., function=...) at assert.c:92
#3 0x00007fc873287102 in __GI___assert_fail (assertion=..., file=..., line=..., function=...) at assert.c:101
#4 0x000055e3f8f44bab in Disable_autocommit_guard::~Disable_autocommit_guard (this=..., __in_chrg=...) at thd_raii.h:79
#5 0x000055e3f985561f in Event_queue::recalculate_activation_times (this=..., thd=...) at event_queue.cc:390
#6 0x000055e3f985b4ab in Event_scheduler::run (this=..., thd=...) at event_scheduler.cc:568
#7 0x000055e3f985a372 in event_scheduler_thread (arg=...) at event_scheduler.cc:279
#8 0x000055e3faee89df in pfs_spawn_thread (arg=...) at pfs.cc:2899
#9 0x00007fc87341efa3 in start_thread (arg=...) at pthread_create.c:486
#10 0x00007fc87334feff in clone () at clone.S:95
Release版本:导致MGR集群状态异常,节点退出MGR集群。
2022-06-22T14:32:18.412045Z 15 [ERROR] [MY-011452] [Repl] Plugin group_replication reported: 'Fatal error during execution on the Applier process of Group Replication. The server will now leave the group.'
问题分析
调试Debug版本,很容易发现导致assert的原因在recalculate_activation_times函数中:
recalculate_activation_times {
ret = trans_commit_stmt(thd) || trans_commit(thd);
}
是由于trans_commit_stmt事务提交失败,导致Disable_autocommit_guard中的事务状态判断失败。 trans_commit_stmt失败的原因是group_replication_trans_before_commit失败。
#0 group_replication_trans_before_commit (param=...) at observer_trans.cc:587
#1 0x0000559358f2cb04 in Trans_delegate::before_commit (this=..., thd=..., all=..., trx_cache_log=..., stmt_cache_log=..., cache_log_max_size=..., is_atomic_ddl_arg=...) at rpl_handler.cc:610
#2 0x0000559358e8d33d in MYSQL_BIN_LOG::commit (this=..., thd=..., all=...) at binlog.cc:8454
#3 0x0000559357909667 in ha_commit_trans (thd=..., all=..., ignore_global_read_lock=...) at handler.cc:1801
#4 0x00005593582613c7 in trans_commit_stmt (thd=..., ignore_global_read_lock=...) at transaction.cc:532
#5 0x000055935840f56c in Event_queue::recalculate_activation_times (this=..., thd=...) at event_queue.cc:447
#6 0x00005593584154ab in Event_scheduler::run (this=..., thd=...) at event_scheduler.cc:568
熟悉MGR代码的知道,后台applier线程才是事件处理的真正线程,用户线程失败肯定是applier线程处理出现问题导致,applier线程出现问题的地方在set_transaction_ctx中,其调用堆栈为:
#0 0x000056386186e909 in set_transaction_ctx (transaction_termination_ctx=...) at rpl_transaction_ctx.cc:107
#1 0x00007f8e5220afad in Certification_handler::handle_transaction_id (this=..., pevent=..., cont=...) at certification_handler.cc:308
#2 0x00007f8e5220a349 in Certification_handler::handle_event (this=..., pevent=..., cont=...) at certification_handler.cc:127
#3 0x00007f8e5220982a in Event_handler::next (this=..., event=..., continuation=...) at pipeline_interfaces.h:716
#4 0x00007f8e5220f20e in Event_cataloger::handle_event (this=..., pevent=..., cont=...) at event_cataloger.cc:53
#5 0x00007f8e521b7621 in Applier_module::inject_event_into_pipeline (this=..., pevent=..., cont=...) at applier.cc:258
#6 0x00007f8e521b801b in Applier_module::apply_data_packet (this=..., data_packet=..., fde_evt=..., cont=...) at applier.cc:388
#7 0x00007f8e521b8fba in Applier_module::applier_thread_handle (this=...) at applier.cc:613
#8 0x00007f8e521b692e in launch_handler_thread (arg=...) at applier.cc:50
#9 0x00005638634d39df in pfs_spawn_thread (arg=...) at pfs.cc:2899
#10 0x00007f8e8d37efa3 in start_thread (arg=...) at pthread_create.c:486
set_transaction_ctx代码逻辑如下https://gitee.com/mirrors/mysql-server/blob/8.0/sql/rpl_transaction_ctx.cc#L90:
int set_transaction_ctx(
Transaction_termination_ctx transaction_termination_ctx) {
DBUG_TRACE;
DBUG_PRINT("enter", ("thread_id=%lu, rollback_transaction=%d, "
"generated_gtid=%d, sidno=%d, gno=%" PRId64,
transaction_termination_ctx.m_thread_id,
transaction_termination_ctx.m_rollback_transaction,
transaction_termination_ctx.m_generated_gtid,
transaction_termination_ctx.m_sidno,
transaction_termination_ctx.m_gno));
uint error = ER_NO_SUCH_THREAD;
Find_thd_with_id find_thd_with_id(transaction_termination_ctx.m_thread_id);
THD_ptr thd_ptr =
Gal_THD_manager::get_instance()->find_thd(&find_thd_with_id);
if (thd_ptr) {
error = thd_ptr->get_transaction()
->get_rpl_transaction_ctx()
->set_rpl_transaction_ctx(transaction_termination_ctx);
}
return error;
}lob
这里搜索的thd_ptr为nullptr,导致了返回了ER_NO_SUCH_THREAD错误。从而导致了认证失败。 为什么会导致无法搜索到对应的thd呢?因为提交这个事务的线程是event_scheduler_thread事件调度后台线程,而线程搜索算法是忽略这种后台线程。https://gitee.com/mirrors/mysql-server/blob/8.0/sql/mysqld_thd_manager.cc#L59
bool Find_thd_with_id::operator()(THD *thd) {
if (thd->get_command() == COM_DAEMON) return false;
return (thd->thread_id() == m_thread_id);
}
综上所述,原因可以总结如下:
1. event_scheduler_thread 后台事件调度线程触发了COMMIT操作
2. COMMIT操作经过MGR的applier模块
3. applier模块通过Global_THD_manager无法找到后台THD
4. applier模块异常 修复思路: 修改线程搜索算法,支持Global_THD_manager可以搜索daemon thread。
反馈社区
打开bugs.mysql.com
1. Report a bug
2. 填写Bug相关描述信息
3. 给出复现步骤
4. 给出修改建议
5. 附加可以提交测例文件以及修复patch
https://bugs.mysql.com/bug.php?id=107635
本文转载自公共号GreatSQL社区。