High CPU on one core, but unable to find process responsible

classic Classic list List threaded Threaded
9 messages Options
Reply | Threaded
Open this post in threaded view
|

High CPU on one core, but unable to find process responsible

David Rosenstrauch
My server's been exhibiting some very strange behavior lately.  Every
couple of days I run into a situation where one core (core #0) on the
quad core CPU starts continuously using around 34% of CPU, but I'm not
able to see (using htop) any process that's responsible for using all
that CPU.  Even when I tell htop to show me kernel threads too, I still
am not able to see the offending process.  Every process remains under
1% CPU usage (except for occasional, small, short-lived spikes up) yet
the CPU usage on that core remains permanently hovering at around 34%.  
The problem goes away when I reboot, but then comes back with a day or
so.

I'm rather stumped as to how to fix this.  The server is a bit old,
running an up-to-date installation of Arch on a Intel Core 2 Quad Q6600
CPU.  Any suggestions anyone might have as to either what might be going
on here, or how to go about debugging it would be greatly appreciated.

Thanks!

DR
Reply | Threaded
Open this post in threaded view
|

Re: High CPU on one core, but unable to find process responsible

arch general mailing list-2
On 3/12/18, David Rosenstrauch <[hidden email]> wrote:

> My server's been exhibiting some very strange behavior lately.  Every
> couple of days I run into a situation where one core (core #0) on the
> quad core CPU starts continuously using around 34% of CPU, but I'm not
> able to see (using htop) any process that's responsible for using all
> that CPU.  Even when I tell htop to show me kernel threads too, I still
> am not able to see the offending process.  Every process remains under
> 1% CPU usage (except for occasional, small, short-lived spikes up) yet
> the CPU usage on that core remains permanently hovering at around 34%.
> The problem goes away when I reboot, but then comes back with a day or
> so.

My gut feeling is that one of the kernel worker threads hangs.
So that would be 25% overall and 100% of the affected core.
But you say there's no load to be found in the kernel threads,
which is odd.

Or if the server is accessible from the Internet, is it possible
it's rooted and someone's running a hidden process? To confirm
this isn't the case, cut off Internet access and let it run for
two days.

I don't think there are any official hidden processes that do not
show up in htop or top since that would make them seem like rootkits.
That means if the guilty process is really invisible, then it's
definitely unusual.

It's scary to consider a rootkit, but if that's the case, then
it's best to be aware as soon as possible. I hope this is not
case for you, wouldn't wish it on your worst enemy.

Another idea. Can you limit the cores to 1 or maybe two and see
if it becomes easier to pinpoint?

This might work in the booted system:
echo 0 > /sys/devices/system/cpu/cpu1/online
echo 0 > /sys/devices/system/cpu/cpu2/online
echo 0 > /sys/devices/system/cpu/cpu3/online

But on the kernel command line maxcpus=1 should work.
Reply | Threaded
Open this post in threaded view
|

Re: High CPU on one core, but unable to find process responsible

arch general mailing list-2
In reply to this post by David Rosenstrauch


On 2018年03月12日 11:19, David Rosenstrauch wrote:

> My server's been exhibiting some very strange behavior lately.  Every
> couple of days I run into a situation where one core (core #0) on the
> quad core CPU starts continuously using around 34% of CPU, but I'm not
> able to see (using htop) any process that's responsible for using all
> that CPU.  Even when I tell htop to show me kernel threads too, I
> still am not able to see the offending process.  Every process remains
> under 1% CPU usage (except for occasional, small, short-lived spikes
> up) yet the CPU usage on that core remains permanently hovering at
> around 34%.  The problem goes away when I reboot, but then comes back
> with a day or so.
>
> I'm rather stumped as to how to fix this.  The server is a bit old,
> running an up-to-date installation of Arch on a Intel Core 2 Quad
> Q6600 CPU.  Any suggestions anyone might have as to either what might
> be going on here, or how to go about debugging it would be greatly
> appreciated.
>
> Thanks!
>
> DR
Can you check whether you have enabled "Detailed CPU time" option in
htop's setup (F2 -> Display options -> "Detailed CPU time")?
From my experience and understanging, htop's CPU meter is accounting
IO-wait/IRQ-response time by default but not showing them differently
unless you enabled the "Detailed CPU time" option.
And these waiting time is not accounted on each process or kernel
thread. Enabling that said option will revail more detailed CPU usage info.
High IO-wait or IRQ time is itself an indication of some misbehaving
hardware, but at least you can be sure that it is not by more
"dangerous" malwares or attacks.


signature.asc (499 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: High CPU on one core, but unable to find process responsible

David C. Rankin
In reply to this post by David Rosenstrauch
On 03/11/2018 09:19 PM, David Rosenstrauch wrote:

> My server's been exhibiting some very strange behavior lately.  Every couple
> of days I run into a situation where one core (core #0) on the quad core CPU
> starts continuously using around 34% of CPU, but I'm not able to see (using
> htop) any process that's responsible for using all that CPU.  Even when I tell
> htop to show me kernel threads too, I still am not able to see the offending
> process.  Every process remains under 1% CPU usage (except for occasional,
> small, short-lived spikes up) yet the CPU usage on that core remains
> permanently hovering at around 34%.  The problem goes away when I reboot, but
> then comes back with a day or so.
>
> I'm rather stumped as to how to fix this.  The server is a bit old, running an
> up-to-date installation of Arch on a Intel Core 2 Quad Q6600 CPU.  Any
> suggestions anyone might have as to either what might be going on here, or how
> to go about debugging it would be greatly appreciated.
>
> Thanks!
>
> DR

DR,

  You are on to something. In the past 24hr. I've experienced two hardlocks on
two separate multi-cpu Arch servers. Both AMD based boxes. One the
dual-quad-core opteron box that was the subject of the 4.15.8 update failure
that left the box unbootable. Then today on a quad-quad-core opteron box that
hardlocked doing a simple rsync from another server on the LAN. Other than the
rsync process, the box is nearly idle, e.g.

top - 19:49:45 up 8 min,  2 users,  load average: 2.93, 2.23, 1.11
Tasks: 271 total,   1 running, 158 sleeping,   0 stopped,   0 zombie
%Cpu0  :   0.0/0.0     0[                                            ]
%Cpu1  :   0.0/0.0     0[                                            ]
%Cpu2  :   0.0/0.0     0[                                            ]
%Cpu3  :   0.0/0.0     0[                                            ]
%Cpu4  :   0.0/0.0     0[                                            ]
%Cpu5  :   0.0/0.0     0[                                            ]
%Cpu6  :   0.0/0.0     0[                                            ]
%Cpu7  :   0.0/0.0     0[                                            ]
%Cpu8  :   0.0/0.0     0[                                            ]
%Cpu9  :   0.0/0.0     0[                                            ]
%Cpu10 :   0.0/0.0     0[                                            ]
%Cpu11 :   0.0/0.7     1[|                                           ]
%Cpu12 :   0.0/0.0     0[                                            ]
%Cpu13 :   0.0/0.0     0[                                            ]
%Cpu14 :   0.0/0.0     0[                                            ]
%Cpu15 :   0.0/0.0     0[                                            ]
GiB Mem :  1.7/62.915   [                                            ]
GiB Swap:  0.0/1.998    [                                            ]


  I could not get a top during the lockup, because when the lockup occurred,
while I could connect to the server from another box via ssh, I could not log
in. E.g.,

19:34 wizard:~/dev/src-c/tmp/debug> svh
Last login: Mon Mar 12 19:34:44 2018 from 192.168.6.104
^CConnection to valhalla closed.

  Both boxes have exhibited strange behavior regarding the linux-raid array
(all disks are fine), but I receive spurious errors like (Out of IOMMU space),
Huh?

Mar 12 19:45:20 valhalla su[869]: pam_unix(su:session): session opened for
user root by david(uid=1000)
Mar 12 19:45:57 valhalla kernel: sata_nv 0000:00:05.0: PCI-DMA: Out of IOMMU
space for 65536 bytes
Mar 12 19:45:57 valhalla kernel: ata3: EH in SWNCQ mode,QC:qc_active 0x4
sactive 0x4
Mar 12 19:45:57 valhalla kernel: ata3: SWNCQ:qc_active 0x0 defer_bits 0x0
last_issue_tag 0x1
                                   dhfis 0x0 dmafis 0x0 sdbfis 0x0
Mar 12 19:45:57 valhalla kernel: ata3: ATA_REG 0x40 ERR_REG 0x0
Mar 12 19:45:57 valhalla kernel: ata3: tag : dhfis dmafis sdbfis sactive
Mar 12 19:45:57 valhalla kernel: ata3.00: exception Emask 0x0 SAct 0x4 SErr
0x0 action 0x6
Mar 12 19:45:57 valhalla kernel: ata3.00: failed command: WRITE FPDMA QUEUED
Mar 12 19:45:57 valhalla kernel: ata3.00: cmd
61/00:10:00:d0:e4/0a:00:0f:00:00/40 tag 2 ncq dma 1310720 ou
                                          res
40/00:20:00:ea:e3/00:00:0f:00:00/40 Emask 0x40 (internal error)
Mar 12 19:45:57 valhalla kernel: ata3.00: status: { DRDY }
Mar 12 19:45:57 valhalla kernel: ata3: hard resetting link
Mar 12 19:45:57 valhalla kernel: ata3: nv: skipping hardreset on occupied port
Mar 12 19:45:58 valhalla kernel: ata3: SATA link up 1.5 Gbps (SStatus 113
SControl 300)
Mar 12 19:45:58 valhalla kernel: ata3.00: configured for UDMA/133
Mar 12 19:45:58 valhalla kernel: ata3: EH complete
Mar 12 19:46:09 valhalla kernel: sata_nv 0000:00:05.1: PCI-DMA: Out of IOMMU
space for 65536 bytes
Mar 12 19:46:09 valhalla kernel: ata5: EH in SWNCQ mode,QC:qc_active 0x4
sactive 0x4
Mar 12 19:46:09 valhalla kernel: ata5: SWNCQ:qc_active 0x0 defer_bits 0x0
last_issue_tag 0x1
                                   dhfis 0x0 dmafis 0x0 sdbfis 0x0
Mar 12 19:46:09 valhalla kernel: ata5: ATA_REG 0x40 ERR_REG 0x0
Mar 12 19:46:09 valhalla kernel: ata5: tag : dhfis dmafis sdbfis sactive
Mar 12 19:46:09 valhalla kernel: ata5.00: exception Emask 0x0 SAct 0x4 SErr
0x0 action 0x6
Mar 12 19:46:09 valhalla kernel: ata5.00: failed command: WRITE FPDMA QUEUED
Mar 12 19:46:09 valhalla kernel: ata5.00: cmd
61/00:10:00:c0:f8/0a:00:0f:00:00/40 tag 2 ncq dma 1310720 ou
                                          res
40/00:20:00:da:f7/00:00:0f:00:00/40 Emask 0x40 (internal error)
Mar 12 19:46:09 valhalla kernel: ata5.00: status: { DRDY }
Mar 12 19:46:09 valhalla kernel: ata5: hard resetting link
Mar 12 19:46:09 valhalla kernel: ata5: nv: skipping hardreset on occupied port
Mar 12 19:46:10 valhalla kernel: ata5: SATA link up 1.5 Gbps (SStatus 113
SControl 300)
Mar 12 19:46:10 valhalla kernel: ata5.00: configured for UDMA/133
Mar 12 19:46:10 valhalla kernel: ata5: EH complete

  Today's lockup occurred after update (where 29 packages were updated), e.g.

Packages (29)

calc-2.12.6.7-1  cryptsetup-2.0.2-1  device-mapper-2.02.177-5
e2fsprogs-1.44.0-1  efivar-34-1 filesystem-2018.1-2  gawk-4.2.1-1
git-2.16.2-2  graphite-1:1.3.11-1  hdparm-9.54-1 imagemagick-7.0.7.26-1
libevdev-1.5.9-1  libmagick-7.0.7.26-1  libmagick6-6.9.9.38-1
libsystemd-238.0-3  libzip-1.4.0-1  linux-lts-4.14.25-1
linux-lts-headers-4.14.25-1 lvm2-2.02.177-5  nano-2.9.4-1  qt5-base-5.10.1-5
systemd-238.0-3  systemd-sysvcompat-238.0-3 tcl-8.6.8-2
vulkan-icd-loader-1.1.70.0-3  wxgtk-common-3.0.4-1  wxgtk2-3.0.4-1
wxgtk3-3.0.4-1 xfsprogs-4.15.1-1

  Following update, I reloaded systemd (systemctl daemon-reload), and then
proceeded with the rsync transfer -- which locked. I note we have
vulkan-icd-loader and qt5-base, both of which also had updates yesterday at
the time of the locked system update that caused so much problem.

  I don't know where to begin. I've been monitoring with top since the last
lockup, but have not caught anything. Both of these boxes have run Arch since
2016 and have never experienced any problems before now. Both are SuperMicro
boards and both have either 2 or 4 quad-core opterons. Both are raid1 linux
raid based boxes and have either 32 or 64G of RAM.

  Something has gone haywire in the past week that has a adverse impact on
these boxes. I have to admit -- I have no idea where to start tracking this down.


--
David C. Rankin, J.D.,P.E.
Reply | Threaded
Open this post in threaded view
|

Re: High CPU on one core, but unable to find process responsible

David Rosenstrauch
In reply to this post by arch general mailing list-2


On 03/12/2018 05:13 AM, Jiachen Yang via arch-general wrote:
>
>
> On 2018年03月12日 11:19, David Rosenstrauch wrote:
>> My server's been exhibiting some very strange behavior lately.  Every
>> couple of days I run into a situation where one core (core #0) on the
>> quad core CPU starts continuously using around 34% of CPU, but I'm not
>> able to see (using htop) any process that's responsible for using all
>> that CPU.

> Can you check whether you have enabled "Detailed CPU time" option in
> htop's setup (F2 -> Display options -> "Detailed CPU time")?
>  From my experience and understanging, htop's CPU meter is accounting
> IO-wait/IRQ-response time by default but not showing them differently
> unless you enabled the "Detailed CPU time" option.
> And these waiting time is not accounted on each process or kernel
> thread. Enabling that said option will revail more detailed CPU usage info.
> High IO-wait or IRQ time is itself an indication of some misbehaving
> hardware, but at least you can be sure that it is not by more
> "dangerous" malwares or attacks.

Thanks for the suggestion.  So this issue happened again tonight, and I
switched to "Detailed CPU time" to try to research it further.
According to htop, the cpu usage is from "irq" (orange color).  I guess
this would explain why I'm not seeing any process responsible too.

And it also might be related that I'm seeing these messages in my dmesg:

[  871.317377] perf: interrupt took too long (2506 > 2500), lowering
kernel.perf_event_max_sample_rate to 79000
[ 1732.773491] perf: interrupt took too long (3140 > 3132), lowering
kernel.perf_event_max_sample_rate to 63000
[ 3375.392292] perf: interrupt took too long (3950 > 3925), lowering
kernel.perf_event_max_sample_rate to 50000


So if this issue is irq-based, I guess that means some piece of hardware
is faulty or failing.  Any idea how I might go about pinning down which
one?  Would there be info in the kernel log about this?  Or something
that I can look at in /proc?

Thanks,

DR
Reply | Threaded
Open this post in threaded view
|

Re: High CPU on one core, but unable to find process responsible

arch general mailing list-2
On 3/13/18, David Rosenstrauch <[hidden email]> wrote:

> So if this issue is irq-based, I guess that means some piece of hardware
> is faulty or failing.  Any idea how I might go about pinning down which
> one?  Would there be info in the kernel log about this?  Or something
> that I can look at in /proc?

Any BIOS updates or kernel updates recently (4.15.8)?

Try with 3.16 or 4.9 or another old lts kernel from archive.archlinux.org
just for testing (not production).

It's more likely that the kernel regressed rather than IRQ issues popping
up suddenly. It's possible but less likely.
Reply | Threaded
Open this post in threaded view
|

Re: High CPU on one core, but unable to find process responsible

David Rosenstrauch


On 03/12/2018 09:56 PM, Carsten Mattner via arch-general wrote:
> Any BIOS updates or kernel updates recently (4.15.8)?
>
> Try with 3.16 or 4.9 or another old lts kernel from archive.archlinux.org
> just for testing (not production).
>
> It's more likely that the kernel regressed rather than IRQ issues popping
> up suddenly. It's possible but less likely.

No BIOS updates.  As far as kernel updates, I do those all the time, so
not sure that would be the cause.


After doing some digging, though, I did "cat /proc/interrupts", and this
line stood out, for having an astronomically high number:

            CPU0       CPU1       CPU2       CPU3
...
  16: 2424156658          0          0          0   IO-APIC  16-fasteoi
  uhci_hcd:usb5, parport1


I have an old PCI card in the machine that's powers an old parallel port
printer I used to use with it.  Perhaps that's failing.  I don't need
the card anymore, so I might as well try taking it out and see if that
makes things better.  If not, I'll be back.  :-)

Thanks,

DR
Reply | Threaded
Open this post in threaded view
|

Re: High CPU on one core, but unable to find process responsible

arch general mailing list-2
On 3/13/18, David Rosenstrauch <[hidden email]> wrote:

>
>
> On 03/12/2018 09:56 PM, Carsten Mattner via arch-general wrote:
>> Any BIOS updates or kernel updates recently (4.15.8)?
>>
>> Try with 3.16 or 4.9 or another old lts kernel from archive.archlinux.org
>> just for testing (not production).
>>
>> It's more likely that the kernel regressed rather than IRQ issues popping
>> up suddenly. It's possible but less likely.
>
> No BIOS updates.  As far as kernel updates, I do those all the time, so
> not sure that would be the cause.
>
> After doing some digging, though, I did "cat /proc/interrupts", and this
> line stood out, for having an astronomically high number:
>
>             CPU0       CPU1       CPU2       CPU3
> ...
>   16: 2424156658          0          0          0   IO-APIC  16-fasteoi
>   uhci_hcd:usb5, parport1
>
>
> I have an old PCI card in the machine that's powers an old parallel port
> printer I used to use with it.  Perhaps that's failing.  I don't need
> the card anymore, so I might as well try taking it out and see if that
> makes things better.  If not, I'll be back.  :-)

Let us know in either case. Please tell me it's a huge plotter
you use to make construction plans for your rocket ships.
Reply | Threaded
Open this post in threaded view
|

Re: High CPU on one core, but unable to find process responsible

David Rosenstrauch


On 03/13/2018 03:34 AM, Carsten Mattner via arch-general wrote:
> On 3/13/18, David Rosenstrauch <[hidden email]> wrote >> I have an old PCI card in the machine that's powers an old parallel port
>> printer I used to use with it.  Perhaps that's failing.  I don't need
>> the card anymore, so I might as well try taking it out and see if that
>> makes things better.  If not, I'll be back.  :-)
>
> Let us know in either case. Please tell me it's a huge plotter
> you use to make construction plans for your rocket ships.


Just an old printer that's been good to me that I've been reluctant to
let go of:

https://www.cnet.com/products/hp-laserjet-6l/specs/

DR