Mysql-Galera crashing - with 100% CPU usage

Asked by Steven

Hi,
Suddenly since 3th of January I noticed my servers suddenly had 100% CPU usage (all 3 servers of my Galera cluster).
I resolved the issue by restarting galera on my master and then connecting other members back and all seemed ok. But day later it crashed again and the control server (this is pure Mysql-galera server nothing more) is having trouble (won't even restart now, hosting support calls it bootloader problem).

Anyway what I wanted to ask is if the problems on that server could have brought down all others also? (The whole idea behind the cluster is kinda if one goes down the others keep working). I checked error logs on master server from the cluster and saw:

120104 14:02:44 [Note] WSREP: Full re-merge of primary e06f7ed7-36d8-11e1-0800-29e1da6c01bf found: 140174847639554 of 2.
120104 14:02:44 [Note] WSREP: Quorum results:
 version = 1,
 component = PRIMARY,
 conf_id = 9,
 members = 2/2 (joined/total),
 act_id = 41472,
 last_appl. = 41353,
 protocols = 0/1/1 (gcs/repl/appl),
 group UUID = 2bdcc0ef-14d6-11e1-0800-3c629414b34d
120104 14:02:44 [Note] WSREP: Flow-control interval: [12, 23]
120104 14:02:44 [Note] WSREP: Restored state OPEN -> SYNCED (41472)
120104 14:02:44 [Note] WSREP: New cluster view: group UUID: 2bdcc0ef-14d6-11e1-0800-3c629414b34d, conf# 10: Primary, number of nodes: 2, my index: 1, first seqno: 41473, protocol version 1
120104 14:02:44 [Warning] WSREP: wsrep_notify_cmd is not defined, skipping notification.
120104 14:02:44 [Note] WSREP: Assign initial position for certification: 41472, protocol version: 1
120104 14:02:44 [Note] WSREP: Synchronized with group, ready for connections
120104 14:02:44 [Warning] WSREP: wsrep_notify_cmd is not defined, skipping notification.
120104 14:02:49 [Note] WSREP: cleaning up cabce790-3666-11e1-0800-b43d9764b7ea (tcp://46.23.72.59:4567)
120104 14:11:11 [Note] WSREP: evs::msg{version=0,type=1,user_type=255,order=4,seq=0,seq_range=0,aru_seq=-1,flags=4,source=cabce790-3666-11e1-0800-b43d9764b7ea,source_view_id=view_id(REG,49cda355-3667-11e1-0800-4e84b7e71f5b,74),range_uuid=00000000-0000-0000-0000-000000000000,range=[-1,-1],fifo_seq=151019,node_list=()
} 168
120104 14:11:11 [ERROR] WSREP: exception caused by message: evs::msg{version=0,type=3,user_type=255,order=1,seq=0,seq_range=-1,aru_seq=0,flags=4,source=49cda355-3667-11e1-0800-4e84b7e71f5b,source_view_id=view_id(REG,49cda355-3667-11e1-0800-4e84b7e71f5b,74),range_uuid=00000000-0000-0000-0000-000000000000,range=[-1,-1],fifo_seq=153934,node_list=()
}
120104 14:11:11 [ERROR] WSREP: state after handling message: evs::proto(evs::proto(7c713619-3662-11e1-0800-a4a650863504, OPERATIONAL, view_id(REG,49cda355-3667-11e1-0800-4e84b7e71f5b,74)), OPERATIONAL) {
current_view=view(view_id(REG,49cda355-3667-11e1-0800-4e84b7e71f5b,74) memb {
 49cda355-3667-11e1-0800-4e84b7e71f5b,
 7c713619-3662-11e1-0800-a4a650863504,
 cabce790-3666-11e1-0800-b43d9764b7ea,
} joined {
} left {
} partitioned {
}),
input_map=evs::input_map: {aru_seq=0,safe_seq=0,node_index=node: {idx=0,range=[1,0],safe_seq=0} node: {idx=1,range=[1,0],safe_seq=0} node: {idx=2,range=[1,0],safe_seq=0} ,msg_index= (2,0),evs::msg{version=0,type=1,user_type=255,order=4,seq=0,seq_range=0,aru_seq=-1,flags=4,source=cabce790-3666-11e1-0800-b43d9764b7ea,source_view_id=view_id(REG,49cda355-3667-11e1-0800-4e84b7e71f5b,74),range_uuid=00000000-0000-0000-0000-000000000000,range=[-1,-1],fifo_seq=151019,node_list=()
}
,recovery_index= (0,0),evs::msg{version=0,type=1,user_type=255,order=4,seq=0,seq_range=0,aru_seq=-1,flags=4,source=49cda355-3667-
terminate called after throwing an instance of 'gu::Exception'
  what(): msg_state.get_to_seq() == max_to_seq: 7c713619-3662-11e1-0800-a4a650863504 node 49cda355-3667-11e1-0800-4e84b7e71f5b to seq not consistent with local state: max to seq 777 msg state to seq 3 (FATAL)
  at gcomm/src/pc_proto.cpp:validate_state_msgs():463
120104 14:11:11 - mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help diagnose
the problem, but since we have already crashed, something is definitely wrong
and this may fail.

key_buffer_size=8384512
read_buffer_size=131072
max_used_connections=13
max_threads=1024
threads_connected=6
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 2247308 K
bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = (nil) thread_stack 0x40000
/opt/galera/mysql/libexec/mysqld(my_print_stacktrace+0x2e) [0x8abb5e]
/opt/galera/mysql/libexec/mysqld(handle_segfault+0x41c) [0x5e2aac]
/lib/libpthread.so.0(+0xef60) [0x7f7dfe646f60]
/lib/libc.so.6(gsignal+0x35) [0x7f7dfd2f3165]
/lib/libc.so.6(abort+0x180) [0x7f7dfd2f5f70]
/usr/lib/libstdc++.so.6(__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7f7dfdb86dc5]
/usr/lib/libstdc++.so.6(+0xcb166) [0x7f7dfdb85166]
/usr/lib/libstdc++.so.6(+0xcb193) [0x7f7dfdb85193]
/usr/lib/libstdc++.so.6(__cxa_rethrow+0x46) [0x7f7dfdb85216]
/opt/galera/galera/lib/libgalera_smm.so(gcomm::evs::Proto::handle_up(void const*, gu::Datagram const&, gcomm::ProtoUpMeta const&)+0x7a1) [0x7f7dd8b65371]
/opt/galera/galera/lib/libgalera_smm.so(gcomm::Protolay::send_up(gu::Datagram const&, gcomm::ProtoUpMeta const&)+0x49) [0x7f7dd8b791b9]
/opt/galera/galera/lib/libgalera_smm.so(gcomm::GMCast::handle_up(void const*, gu::Datagram const&, gcomm::ProtoUpMeta const&)+0x29f) [0x7f7dd8b7602f]
/opt/galera/galera/lib/libgalera_smm.so(gcomm::Protostack::dispatch(void const*, gu::Datagram const&, gcomm::ProtoUpMeta const&)+0x5c) [0x7f7dd8b9a22c]
/opt/galera/galera/lib/libgalera_smm.so(gcomm::AsioProtonet::dispatch(void const* const&, gu::Datagram const&, gcomm::ProtoUpMeta const&)+0x4b) [0x7f7dd8bc835b]
/opt/galera/galera/lib/libgalera_smm.so(gcomm::AsioTcpSocket::read_handler(asio::error_code const&, unsigned long)+0x770) [0x7f7dd8ba5b90]
/opt/galera/galera/lib/libgalera_smm.so(asio::detail::read_op<asio::basic_stream_socket<asio::ip::tcp, asio::stream_socket_service<asio::ip::tcp> >, boost::array<asio::mutable_buffer, 1ul>, boost::_bi::bind_t<unsigned long, boost::_mfi::mf2<unsigned long, gcomm::AsioTcpSocket, asio::error_code const&, unsigned long>, boost::_bi::list3<boost::_bi::value<boost::shared_ptr<gcomm::AsioTcpSocket> >, boost::arg<1> (*)(), boost::arg<2> (*)()> >, boost::_bi::bind_t<void, boost::_mfi::mf2<void, gcomm::AsioTcpSocket, asio::error_code const&, unsigned long>, boost::_bi::list3<boost::_bi::value<boost::shared_ptr<gcomm::AsioTcpSocket> >, boost::arg<1> (*)(), boost::arg<2> (*)()> > >::operator()(asio::error_code const&, unsigned long)+0xce) [0x7f7dd8bb35ae]
/opt/galera/galera/lib/libgalera_smm.so(asio::detail::reactive_socket_service<asio::ip::tcp>::receive_op<asio::detail::consuming_buffers<asio::mutable_buffer, boost::array<asio::mutable_buffer, 1ul> >, asio::detail::read_op<asio::basic_stream_socket<asio::ip::tcp, asio::stream_socket_service<asio::ip::tcp> >, boost::array<asio::mutable_buffer, 1ul>, boost::_bi::bind_t<unsigned long, boost::_mfi::mf2<unsigned long, gcomm::AsioTcpSocket, asio::error_code const&, unsigned long>, boost::_bi::list3<boost::_bi::value<boost::shared_ptr<gcomm::AsioTcpSocket> >, boost::arg<1> (*)(), boost::arg<2> (*)()> >, boost::_bi::bind_t<void, boost::_mfi::mf2<void, gcomm::AsioTcpSocket, asio::error_code const&, unsigned long>, boost::_bi::list3<boost::_bi::value<boost::shared_ptr<gcomm::AsioTcpSocket> >, boost::arg<1> (*)(), boost::arg<2> (*)()> > > >::do_complete(asio::detail::task_io_service<asio::detail::epoll_reactor>*, asio::detail::task_io_service_operation<asio::detail::epoll_reactor>*, asio::error_code, unsigned long)+0x308) [0x7f7dd8bb4238]
/opt/galera/galera/lib/libgalera_smm.so(asio::detail::task_io_service<asio::detail::epoll_reactor>::run(asio::error_code&)+0x790) [0x7f7dd8bd0d70]
/opt/galera/galera/lib/libgalera_smm.so(gcomm::AsioProtonet::event_loop(gu::datetime::Period const&)+0x196) [0x7f7dd8bc9ad6]
/opt/galera/galera/lib/libgalera_smm.so(GCommConn::run()+0x3a3) [0x7f7dd8be2b73]
/opt/galera/galera/lib/libgalera_smm.so(GCommConn::run_fn(void*)+0x9) [0x7f7dd8be6dc9]
/lib/libpthread.so.0(+0x68ba) [0x7f7dfe63e8ba]
/lib/libc.so.6(clone+0x6d) [0x7f7dfd39002d]
The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
information that should help you find out what is causing the crash.
Writing a core file
120104 15:53:36 [Warning] WSREP: gethostname failed, using 'localhost' as default wsrep_node_name
120104 15:53:36 [Warning] Ignoring user change to 'root' because the user was set to 'mysql' earlier on the command line

Which describes an error, but it is above my head to really understand what is going on here. The exact same error is also in the log of the '3' server (not the control server with the problems).

So basically what I'm asking is to make some sense of the above log, and perhaps point me in right direction to source and solution of this problem

Thank you in advance.

Best regards,

Pd tech info
all 3 VPS servers with debian 6
2 web servers: apache + mysql local instance + Mysql-Galera (mysql 5.1.58 + 21.0.8.3(r67) )
control server, can't check now but should be the same. This server only runs Mysql-Galera nothing more.

Question information

Language:
English Edit question
Status:
Solved
For:
MySQL patches by Codership Edit question
Assignee:
No assignee Edit question
Solved by:
Steven
Solved:
Last query:
Last reply:
Revision history for this message
Teemu Ollakka (teemu-ollakka) said :
#1

Hi,

It seems that you suffered some kind of network partitioning or extremely high load on nodes which caused node group to split and then re-merge. For some reason the return of the node cabce790-3666-11e1-0800-b43d9764b7ea (probably the control server) didn't go too well and group communication needed to abort (throw exception) to make sure that there will be no inconsistency. The exact reason for the exception visible in log is unclear, and if possible, I'd like to see more logs starting from around the cluster restart to pinpoint the origin of the problem.

I'd also suggest to upgrade Galera library, the most recent version contains some improvements in low level networking which should make it more tolerant against network issues. Recommended versions are
- mysql-server-wsrep 5.1.59-22.2
- galera 22.1.1

Hopefully this helps,

- Teemu

Revision history for this message
Steven (steven-7tv) said :
#2

Hi Teemu,

Thank you for your reply. I also contacted hosting support because the control server didn't start up anymore and they fixed the problem. Apparently some changes were made in /boot/grub/menu.lst due to system update and this caused the server not to restart. I don't know if this can also cause 100% CPU but I don't think so. So I mentioned that I found this such a weird problem and then they told me they also had to deal with a DDOS attack in that datacentre. So that explains a lot concerning extremely high load as you mentioned.

As for the upgrade; yes this is a good idea.
I was just thinking to update to mysql 5.5 with galera 1.1, just all the latest, or would you suggest it is better to update to the version you mentioned?

All seems to be good now. However if you still have interest in seeing the rest of the error logs, let me know then I will post them here.

Thanks
Steven

Revision history for this message
Alex Yurchenko (ayurchen) said :
#3

Hi Steven,

If you have such an option it is certainly better to upgrade to 5.5. Most of the issues with it have been ironed out already and all new features and improvements will go there. 5.1 is pretty much deprecated.

However keep in mind that rolling upgrade from 5.1 to 5.5 is quite error-prone. Check this howto from FromDual: http://www.fromdual.com/rolling-upgrade-of-galera-10-to-11 - there is a chapter just for this case.

Regards,
Alex

Revision history for this message
Steven (steven-7tv) said :
#4

Hi Alex.

I actually found this url also searching for some answers :) but thanks anyway. I've saved the url for references when the updating will be done. However I'm thinking perhaps it is a better idea to take it all down and do a complete update instead of a rolling update and with that trying to keep services running. Just need to pinpoint a good moment when use of the website is as low as possible, and of course some people with good experience for this job.

I've already posted a freelance job on freelancer however if there are any freelancers here with the experience / knowledge to do this and interested don't be shy to contact me privately :)

Best regards,