A few months ago, after finally receiving the disks for our new Ceph cluster, we decided to benchmark them so we could tune the cluster for the best performance, but that’s a story for another time. After setting everything up and starting our first benchmark, we knew it would take about an hour and a half to complete, so in the meantime we moved on to other things. When we came back to it about two hours later, it still hadn’t finished; it was stuck somewhere near the beginning.
We had run into this issue while preparing the benchmark and had assumed the tool was simply misconfigured. However, after running a simple ceph status, we saw that some Ceph services (in this particular case, a monitor, a manager and two OSDs) were down. We had no joy SSHing into the node running those services (hereafter called node-1), as it was no longer responding. We had to use the machine’s iDRAC (a remote management interface embedded in the server) to reach its console, where we found a kernel panic, and then force-reboot the machine.
To better understand why the kernel panicked, we started to panic with it, thinking there might be a serious problem with our brand new servers. The call trace from the kernel didn’t tell us much, except that the problem could be memory-related. After a successful reboot, we went through the kernel logs from the previous boot to see what had happened. The first stack trace we were greeted with was this one (with some lines omitted to keep it short):