Opened 2 years ago

Closed 2 years ago

#75 closed defect (fixed)

Load spike and small downtime

Reported by: chris
Owned by: chris
Priority: major
Milestone: Maintenance
Component: crin2
Version:
Keywords:
Cc:
Estimated Number of Hours: 0
Add Hours to Ticket: 0
Billable?: yes
Total Hours: 1.2

Description

Since about 8:10am this morning there has been a noticeable increase in bandwidth usage and load on Crin2, and at 18:33 the Webarchitects monitoring server was unable to connect for around 20 seconds.

Attachments (17)

memcached_rates-day.png (45.1 KB) - added by chris 2 years ago.
memcached_bytes-day.png (32.2 KB) - added by chris 2 years ago.
fw_conntrack-day.png (41.7 KB) - added by chris 2 years ago.
if_eth0-day.png (18.6 KB) - added by chris 2 years ago.
fw_packets-day.png (25.5 KB) - added by chris 2 years ago.
nginx_request-day.png (22.0 KB) - added by chris 2 years ago.
phpfpm_connections-day.png (31.0 KB) - added by chris 2 years ago.
phpfpm_average-day.png (26.2 KB) - added by chris 2 years ago.
phpfpm_status-day.png (33.6 KB) - added by chris 2 years ago.
cpu-day.png (29.3 KB) - added by chris 2 years ago.
2016-06-04_load-day.png (26.3 KB) - added by chris 2 years ago.
2016-06-04_cpu-day.png (33.6 KB) - added by chris 2 years ago.
2016-06-04_phpfpm_connections-day.png (32.2 KB) - added by chris 2 years ago.
2016-06-04_fw_packets-day.png (27.9 KB) - added by chris 2 years ago.
2016-06-04_if_eth0-day.png (20.8 KB) - added by chris 2 years ago.
2016-06-04_fw_conntrack-day.png (40.9 KB) - added by chris 2 years ago.
2016-06-04_memcached_rates-day.png (61.1 KB) - added by chris 2 years ago.


Change History (21)

comment:1 Changed 2 years ago by chris

  • Add Hours to Ticket changed from 0 to 0.25
  • Total Hours set to 0.25

comment:2 Changed 2 years ago by chris

  • Add Hours to Ticket changed from 0 to 0.25
  • Total Hours changed from 0.25 to 0.5

The down email:

Subject: PROBLEM - CRIN2.webarch.net - http is CRITICAL
Date: Sun, 03 Jul 2016 18:33:56 +0100

***** Icinga  *****

Notification Type: PROBLEM

Service: http
Host: CRIN2.webarch.net
Address: 93.95.228.180
State: CRITICAL

Date/Time: 2016-07-03 18:33:56 +0100

Additional Info: CRITICAL - Socket timeout after 10 seconds

And the up email:

Subject: RECOVERY - CRIN2.webarch.net - http is OK
Date: Sun, 03 Jul 2016 18:34:17 +0100

***** Icinga  *****

Notification Type: RECOVERY

Service: http
Host: CRIN2.webarch.net
Address: 93.95.228.180
State: OK

Date/Time: 2016-07-03 18:34:17 +0100

Additional Info: HTTP OK: HTTP/1.1 200 OK - 76033 bytes in 0.516 second response time
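
The exact check command is not included in the notifications; a manual equivalent, assuming the standard monitoring-plugins check_http with the same 10 second timeout, would be roughly:

/usr/lib/nagios/plugins/check_http -I 93.95.228.180 -H CRIN2.webarch.net -t 10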

Some of the Munin graphs illustrating what is going on:

(Munin graph images attached: memcached rates and bytes, firewall conntrack and packets, eth0 traffic, Nginx requests, PHP-FPM connections, average and status, and CPU - see the attachments above)
I have restarted nginx, memcached and php5-fpm and tried running strace on one of the php5-fpm processes which was using 100% of one CPU (see also ticket:73):

date ; strace -c -p 11524 ; date
Sun Jul  3 18:23:38 GMT 2016
Process 11524 attached
^CProcess 11524 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 97.88    0.012495          46       272           brk
  1.60    0.000204           0      1014           sendto
  0.52    0.000066           0       861        37 access
  0.00    0.000000           0        72           read
  0.00    0.000000           0        13           write
  0.00    0.000000           0        27           open
  0.00    0.000000           0        45           close
  0.00    0.000000           0       235        16 stat
  0.00    0.000000           0        26           fstat
  0.00    0.000000           0        30         4 lstat
  0.00    0.000000           0       665           poll
  0.00    0.000000           0        28           lseek
  0.00    0.000000           0        23           mmap
  0.00    0.000000           0        23           munmap
  0.00    0.000000           0         4           rt_sigaction
  0.00    0.000000           0         4           rt_sigprocmask
  0.00    0.000000           0       671           ioctl
  0.00    0.000000           0       685           select
  0.00    0.000000           0        18           setitimer
  0.00    0.000000           0         8           socket
  0.00    0.000000           0         8         8 connect
  0.00    0.000000           0         4           accept
  0.00    0.000000           0      1716           recvfrom
  0.00    0.000000           0        10           shutdown
  0.00    0.000000           0        20           setsockopt
  0.00    0.000000           0         8           getsockopt
  0.00    0.000000           0        51           fcntl
  0.00    0.000000           0         4           getdents
  0.00    0.000000           0        52           getcwd
  0.00    0.000000           0         9           chdir
  0.00    0.000000           0       591           gettimeofday
  0.00    0.000000           0         9           times
  0.00    0.000000           0        27           clock_gettime
  0.00    0.000000           0         2           openat
------ ----------- ----------- --------- --------- ----------------
100.00    0.012765                  7235        65 total
Sun Jul  3 18:24:40 GMT 2016
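
The traced PID (11524) was one of the php5-fpm workers pinned at 100% CPU; the way it was picked out is not recorded here, but a minimal sketch of one way to find the busiest worker would be:

ps -eo pid,pcpu,comm --sort=-pcpu | grep php5-fpm | head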

I don't know whether the cause of this is external or not (the firewall graphs indicate that it might be).

comment:3 Changed 2 years ago by chris

  • Add Hours to Ticket changed from 0 to 0.45
  • Total Hours changed from 0.5 to 0.95

The high load and increased bandwidth usage are ongoing, and looking at the access logs one IP address stands out:

grep "149.202.198.122" crin.org.access.log | wc -l
36769
grep "149.202.198.122" crin.org.access.log.1 | wc -l
83084
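
(For reference, a quick way to surface the top requesting IPs, assuming the client IP is the first field of the log format, which is not shown here:)

awk '{print $1}' crin.org.access.log | sort | uniq -c | sort -rn | head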

This looks like a dynamic IP address:

dig -x 149.202.198.122 +short
ns3025239.ip-149-202-198.eu.

The total number of requests: 119,853

It is having an adverse impact on other users, and after watching the logs for a while and pondering the options I have blocked the IP address using iptables, as it is basically abuse.
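
The rule itself is not recorded in this ticket; a typical iptables block for a single source address would look something like:

iptables -I INPUT -s 149.202.198.122 -j DROP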

comment:4 Changed 2 years ago by chris

  • Add Hours to Ticket changed from 0 to 0.25
  • Resolution set to fixed
  • Status changed from new to closed
  • Total Hours changed from 0.95 to 1.2

Blocking that one abusive IP address appears to have solved the issue; see these graphs:

(Munin graph images attached: load, CPU, PHP-FPM connections, firewall packets and conntrack, eth0 traffic and memcached rates - see the 2016-06-04_* attachments above)
Closing this ticket.
