Hi,
Suddenly since 3th of January I noticed my servers suddenly had 100% CPU usage (all 3 servers of my Galera cluster).
I resolved the issue by restarting galera on my master and then connecting other members back and all seemed ok. But day later it crashed again and the control server (this is pure Mysql-galera server nothing more) is having trouble (won't even restart now, hosting support calls it bootloader problem).
Anyway what I wanted to ask is if the problems on that server could have brought down all others also? (The whole idea behind the cluster is kinda if one goes down the others keep working). I checked error logs on master server from the cluster and saw:
120104 14:02:44 [Note] WSREP: Full re-merge of primary e06f7ed7-36d8-11e1-0800-29e1da6c01bf found: 140174847639554 of 2.
120104 14:02:44 [Note] WSREP: Quorum results:
version = 1,
component = PRIMARY,
conf_id = 9,
members = 2/2 (joined/total),
act_id = 41472,
last_appl. = 41353,
protocols = 0/1/1 (gcs/repl/appl),
group UUID = 2bdcc0ef-14d6-11e1-0800-3c629414b34d
120104 14:02:44 [Note] WSREP: Flow-control interval: [12, 23]
120104 14:02:44 [Note] WSREP: Restored state OPEN -> SYNCED (41472)
120104 14:02:44 [Note] WSREP: New cluster view: group UUID: 2bdcc0ef-14d6-11e1-0800-3c629414b34d, conf# 10: Primary, number of nodes: 2, my index: 1, first seqno: 41473, protocol version 1
120104 14:02:44 [Warning] WSREP: wsrep_notify_cmd is not defined, skipping notification.
120104 14:02:44 [Note] WSREP: Assign initial position for certification: 41472, protocol version: 1
120104 14:02:44 [Note] WSREP: Synchronized with group, ready for connections
120104 14:02:44 [Warning] WSREP: wsrep_notify_cmd is not defined, skipping notification.
120104 14:02:49 [Note] WSREP: cleaning up cabce790-3666-11e1-0800-b43d9764b7ea (tcp://46.23.72.59:4567)
120104 14:11:11 [Note] WSREP: evs::msg{version=0,type=1,user_type=255,order=4,seq=0,seq_range=0,aru_seq=-1,flags=4,source=cabce790-3666-11e1-0800-b43d9764b7ea,source_view_id=view_id(REG,49cda355-3667-11e1-0800-4e84b7e71f5b,74),range_uuid=00000000-0000-0000-0000-000000000000,range=[-1,-1],fifo_seq=151019,node_list=()
} 168
120104 14:11:11 [ERROR] WSREP: exception caused by message: evs::msg{version=0,type=3,user_type=255,order=1,seq=0,seq_range=-1,aru_seq=0,flags=4,source=49cda355-3667-11e1-0800-4e84b7e71f5b,source_view_id=view_id(REG,49cda355-3667-11e1-0800-4e84b7e71f5b,74),range_uuid=00000000-0000-0000-0000-000000000000,range=[-1,-1],fifo_seq=153934,node_list=()
}
120104 14:11:11 [ERROR] WSREP: state after handling message: evs::proto(evs::proto(7c713619-3662-11e1-0800-a4a650863504, OPERATIONAL, view_id(REG,49cda355-3667-11e1-0800-4e84b7e71f5b,74)), OPERATIONAL) {
current_view=view(view_id(REG,49cda355-3667-11e1-0800-4e84b7e71f5b,74) memb {
49cda355-3667-11e1-0800-4e84b7e71f5b,
7c713619-3662-11e1-0800-a4a650863504,
cabce790-3666-11e1-0800-b43d9764b7ea,
} joined {
} left {
} partitioned {
}),
input_map=evs::input_map: {aru_seq=0,safe_seq=0,node_index=node: {idx=0,range=[1,0],safe_seq=0} node: {idx=1,range=[1,0],safe_seq=0} node: {idx=2,range=[1,0],safe_seq=0} ,msg_index= (2,0),evs::msg{version=0,type=1,user_type=255,order=4,seq=0,seq_range=0,aru_seq=-1,flags=4,source=cabce790-3666-11e1-0800-b43d9764b7ea,source_view_id=view_id(REG,49cda355-3667-11e1-0800-4e84b7e71f5b,74),range_uuid=00000000-0000-0000-0000-000000000000,range=[-1,-1],fifo_seq=151019,node_list=()
}
,recovery_index= (0,0),evs::msg{version=0,type=1,user_type=255,order=4,seq=0,seq_range=0,aru_seq=-1,flags=4,source=49cda355-3667-
terminate called after throwing an instance of 'gu::Exception'
what(): msg_state.get_to_seq() == max_to_seq: 7c713619-3662-11e1-0800-a4a650863504 node 49cda355-3667-11e1-0800-4e84b7e71f5b to seq not consistent with local state: max to seq 777 msg state to seq 3 (FATAL)
at gcomm/src/pc_proto.cpp:validate_state_msgs():463
120104 14:11:11 - mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help diagnose
the problem, but since we have already crashed, something is definitely wrong
and this may fail.
key_buffer_size=8384512
read_buffer_size=131072
max_used_connections=13
max_threads=1024
threads_connected=6
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 2247308 K
bytes of memory
Hope that's ok; if not, decrease some variables in the equation.
Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = (nil) thread_stack 0x40000
/opt/galera/mysql/libexec/mysqld(my_print_stacktrace+0x2e) [0x8abb5e]
/opt/galera/mysql/libexec/mysqld(handle_segfault+0x41c) [0x5e2aac]
/lib/libpthread.so.0(+0xef60) [0x7f7dfe646f60]
/lib/libc.so.6(gsignal+0x35) [0x7f7dfd2f3165]
/lib/libc.so.6(abort+0x180) [0x7f7dfd2f5f70]
/usr/lib/libstdc++.so.6(__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7f7dfdb86dc5]
/usr/lib/libstdc++.so.6(+0xcb166) [0x7f7dfdb85166]
/usr/lib/libstdc++.so.6(+0xcb193) [0x7f7dfdb85193]
/usr/lib/libstdc++.so.6(__cxa_rethrow+0x46) [0x7f7dfdb85216]
/opt/galera/galera/lib/libgalera_smm.so(gcomm::evs::Proto::handle_up(void const*, gu::Datagram const&, gcomm::ProtoUpMeta const&)+0x7a1) [0x7f7dd8b65371]
/opt/galera/galera/lib/libgalera_smm.so(gcomm::Protolay::send_up(gu::Datagram const&, gcomm::ProtoUpMeta const&)+0x49) [0x7f7dd8b791b9]
/opt/galera/galera/lib/libgalera_smm.so(gcomm::GMCast::handle_up(void const*, gu::Datagram const&, gcomm::ProtoUpMeta const&)+0x29f) [0x7f7dd8b7602f]
/opt/galera/galera/lib/libgalera_smm.so(gcomm::Protostack::dispatch(void const*, gu::Datagram const&, gcomm::ProtoUpMeta const&)+0x5c) [0x7f7dd8b9a22c]
/opt/galera/galera/lib/libgalera_smm.so(gcomm::AsioProtonet::dispatch(void const* const&, gu::Datagram const&, gcomm::ProtoUpMeta const&)+0x4b) [0x7f7dd8bc835b]
/opt/galera/galera/lib/libgalera_smm.so(gcomm::AsioTcpSocket::read_handler(asio::error_code const&, unsigned long)+0x770) [0x7f7dd8ba5b90]
/opt/galera/galera/lib/libgalera_smm.so(asio::detail::read_op<asio::basic_stream_socket<asio::ip::tcp, asio::stream_socket_service<asio::ip::tcp> >, boost::array<asio::mutable_buffer, 1ul>, boost::_bi::bind_t<unsigned long, boost::_mfi::mf2<unsigned long, gcomm::AsioTcpSocket, asio::error_code const&, unsigned long>, boost::_bi::list3<boost::_bi::value<boost::shared_ptr<gcomm::AsioTcpSocket> >, boost::arg<1> (*)(), boost::arg<2> (*)()> >, boost::_bi::bind_t<void, boost::_mfi::mf2<void, gcomm::AsioTcpSocket, asio::error_code const&, unsigned long>, boost::_bi::list3<boost::_bi::value<boost::shared_ptr<gcomm::AsioTcpSocket> >, boost::arg<1> (*)(), boost::arg<2> (*)()> > >::operator()(asio::error_code const&, unsigned long)+0xce) [0x7f7dd8bb35ae]
/opt/galera/galera/lib/libgalera_smm.so(asio::detail::reactive_socket_service<asio::ip::tcp>::receive_op<asio::detail::consuming_buffers<asio::mutable_buffer, boost::array<asio::mutable_buffer, 1ul> >, asio::detail::read_op<asio::basic_stream_socket<asio::ip::tcp, asio::stream_socket_service<asio::ip::tcp> >, boost::array<asio::mutable_buffer, 1ul>, boost::_bi::bind_t<unsigned long, boost::_mfi::mf2<unsigned long, gcomm::AsioTcpSocket, asio::error_code const&, unsigned long>, boost::_bi::list3<boost::_bi::value<boost::shared_ptr<gcomm::AsioTcpSocket> >, boost::arg<1> (*)(), boost::arg<2> (*)()> >, boost::_bi::bind_t<void, boost::_mfi::mf2<void, gcomm::AsioTcpSocket, asio::error_code const&, unsigned long>, boost::_bi::list3<boost::_bi::value<boost::shared_ptr<gcomm::AsioTcpSocket> >, boost::arg<1> (*)(), boost::arg<2> (*)()> > > >::do_complete(asio::detail::task_io_service<asio::detail::epoll_reactor>*, asio::detail::task_io_service_operation<asio::detail::epoll_reactor>*, asio::error_code, unsigned long)+0x308) [0x7f7dd8bb4238]
/opt/galera/galera/lib/libgalera_smm.so(asio::detail::task_io_service<asio::detail::epoll_reactor>::run(asio::error_code&)+0x790) [0x7f7dd8bd0d70]
/opt/galera/galera/lib/libgalera_smm.so(gcomm::AsioProtonet::event_loop(gu::datetime::Period const&)+0x196) [0x7f7dd8bc9ad6]
/opt/galera/galera/lib/libgalera_smm.so(GCommConn::run()+0x3a3) [0x7f7dd8be2b73]
/opt/galera/galera/lib/libgalera_smm.so(GCommConn::run_fn(void*)+0x9) [0x7f7dd8be6dc9]
/lib/libpthread.so.0(+0x68ba) [0x7f7dfe63e8ba]
/lib/libc.so.6(clone+0x6d) [0x7f7dfd39002d]
The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
information that should help you find out what is causing the crash.
Writing a core file
120104 15:53:36 [Warning] WSREP: gethostname failed, using 'localhost' as default wsrep_node_name
120104 15:53:36 [Warning] Ignoring user change to 'root' because the user was set to 'mysql' earlier on the command line
Which describes an error, but it is above my head to really understand what is going on here. The exact same error is also in the log of the '3' server (not the control server with the problems).
So basically what I'm asking is to make some sense of the above log, and perhaps point me in right direction to source and solution of this problem
Thank you in advance.
Best regards,
Pd tech info
all 3 VPS servers with debian 6
2 web servers: apache + mysql local instance + Mysql-Galera (mysql 5.1.58 + 21.0.8.3(r67) )
control server, can't check now but should be the same. This server only runs Mysql-Galera nothing more.