#81 new defect

crin3 backup issue

Reported by:	chris	Owned by:	chris
Priority:	major	Milestone:	Maintenance
Component:	backups	Version:
Keywords:		Cc:
Estimated Number of Hours:	0	Add Hours to Ticket:	0
Billable?:	yes	Total Hours:	0.85

Description

I'm getting this Munin email alert every 5 minutes:

Date: Fri, 19 Aug 2016 09:30:16 +0000
From: munin application user <munin@crin1.crin.org>
Subject: crin3.crin.org Munin Alert

crin.org :: crin3.crin.org :: Disk usage in percent
        WARNINGs: / is 97.35 (outside range [:92]).
        OKs: /run is 10.40, /dev/shm is 0.00, /boot is 87.07, /run/lock is 0.00, /sys/fs/cgroup is 0.00, /run/user/1000 is 0.00.

Change History (4)

comment:1 Changed 2 years ago by chris

Add Hours to Ticket changed from 0 to 0.25
Total Hours set to 0.25

It looks like this is because something has gone wrong with a backup job. Two are running but only one s3ql filesystem is mounted:

df -h
Filesystem                             Size  Used Avail Use% Mounted on
udev                                   489M     0  489M   0% /dev
tmpfs                                  100M   11M   90M  11% /run
/dev/mapper/CRIN3--vg-root              15G   14G  378M  98% /
tmpfs                                  499M     0  499M   0% /dev/shm
tmpfs                                  5.0M     0  5.0M   0% /run/lock
tmpfs                                  499M     0  499M   0% /sys/fs/cgroup
/dev/sda1                              236M  195M   29M  88% /boot
crin1:/                                121G   27G   88G  24% /media/sshfs/crin1
s3c://s.qstack.advania.com:443/crin1/  1.0T  252G  773G  25% /media/s3ql/crin1
crin2:/                                121G   24G   91G  21% /media/sshfs/crin2
tmpfs                                  100M     0  100M   0% /run/user/1000

ps -lA | grep rsync
0 S     0  2290  1743  0  80   0 -  5960 -      ?        00:00:23 rsync
1 S     0  2291  2290  0  80   0 -  5869 -      ?        00:00:00 rsync
1 S     0  2292  2291  0  80   0 -  5917 -      ?        00:00:24 rsync

So:

killall rsync
ps -lA | grep rsync
  1 S     0  2292     1  0  80   0 -  5917 -      ?        00:00:24 rsync
kill 2292
ps -lA | grep rsync
  1 D     0  2292     1  0  80   0 -  5917 -      ?        00:00:24 rsync

This seems to be an unkillable process, so I'm rebooting the server and will check how it looks later.
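
For reference, state D in the ps output above means uninterruptible sleep: the process is blocked inside a kernel call and signals are not acted on until that call returns. A minimal sketch for spotting such processes, assuming the procps version of ps; the function names filter_d_state and list_d_state are made up for this example:

```shell
# Keep only rows whose STAT column starts with D (uninterruptible sleep).
# Expects ps-style input with the columns PID STAT WCHAN COMMAND and a header row.
filter_d_state() {
    awk 'NR > 1 && $2 ~ /^D/ { print $1, $4 }'
}

# Print the PID and command name of every process currently in state D.
# The wchan column (not printed here) names the kernel function the process sleeps in.
list_d_state() {
    ps -eo pid,stat,wchan:32,comm | filter_d_state
}
```

Running list_d_state as root before rebooting would show whether anything other than rsync is stuck in the same way.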

comment:2 Changed 2 years ago by chris

Add Hours to Ticket changed from 0 to 0.25
Total Hours changed from 0.25 to 0.5

Checking where space usage is:

cd /
du -h --max-depth=1
8.1M    ./bin
4.0K    ./opt
1.4G    ./lib
7.0M    ./sbin
193M    ./boot
24K     ./tmp
1.5G    ./usr
5.4M    ./run
0       ./sys
5.2M    ./etc
40K     ./media
9.3G    ./root
0       ./dev
746M    ./var
4.0K    ./mnt
4.0K    ./srv
4.0K    ./lib64
104K    ./home
16K     ./lost+found
du: cannot access './proc/24896/task/24896/fd/4': No such file or directory
du: cannot access './proc/24896/task/24896/fdinfo/4': No such file or directory
du: cannot access './proc/24896/fd/3': No such file or directory
du: cannot access './proc/24896/fdinfo/3': No such file or directory
0       ./proc
14G     .

cd /root/
du -h --max-depth=1
16K     ./.aptitude
516K    ./.Changelog
4.0K    ./Mail
20K     ./.ssh
9.3G    ./.s3ql
4.0K    ./.nano
9.3G    .


ls -lah
total 4.1G
drwx------ 3 root root 4.0K Aug 19 10:12 .
drwx------ 8 root root 4.0K Aug 19 10:03 ..
-rw------- 1 root root  456 Jul 25  2015 authinfo2
-rw------- 1 root root  218 Jun  2  2015 authinfo2.db1
-rw------- 1 root root  226 Jun  2  2015 authinfo2.greenqloud.db1
-rw------- 1 root root  218 Jun  2  2015 authinfo2.web1
-rw------- 1 root root  218 May 18  2015 authinfo2.web2
-rw------- 1 root root  218 May 19  2015 authinfo2.wiki
-rw-r--r-- 1 root root 8.2M Aug 19 10:00 fsck.log
-rw-r--r-- 1 root root 1.0M Feb 27 00:34 fsck.log.1
-rw-r--r-- 1 root root  46K Feb 17  2016 fsck.log.2
-rw-r--r-- 1 root root 1.0M Feb 17  2016 fsck.log.3
-rw-r--r-- 1 root root 1.0M Dec 18  2015 fsck.log.4
-rw-r--r-- 1 root root 1.0M Oct  1  2015 fsck.log.5
-rw-r--r-- 1 root root 2.5M Aug 19 10:16 mount.log
-rw-r--r-- 1 root root 1.0M Dec 27  2015 mount.log.1
-rw-r--r-- 1 root root    0 Jul 23  2015 mount.s3ql_crit.log
drwxr-xr-x 2 root root  64K Aug 17 00:09 s3c:=2F=2Fs.qstack.advania.com:443=2Fcrin1=2F-cache
-rw------- 1 root root 1.1G Aug 19 00:03 s3c:=2F=2Fs.qstack.advania.com:443=2Fcrin1=2F.db
-rw-r--r-- 1 root root  200 Aug 17 00:04 s3c:=2F=2Fs.qstack.advania.com:443=2Fcrin1=2F.params
-rw------- 1 root root 2.5G Aug 19 09:52 s3c:=2F=2Fs.qstack.advania.com:443=2Fcrin2=2F.db
-rw-r--r-- 1 root root  201 Aug 19 09:52 s3c:=2F=2Fs.qstack.advania.com:443=2Fcrin2=2F.params
-rw------- 1 root root 517M Aug 19 10:16 s3c:=2F=2Fs.qstack.advania.com:443=2Fcrin4=2F.db
-rw-r--r-- 1 root root  200 Aug 19 10:15 s3c:=2F=2Fs.qstack.advania.com:443=2Fcrin4=2F.params

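So almost all of the space is in /root/.s3ql: the s3ql metadata databases account for about 4.1G (the ls total) and the rest of the 9.3G is presumably in the cache directory. I'm not sure what can be done to free up more space here...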
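
A sketch of a slightly tidier way to repeat the check above, assuming GNU du as used in this ticket. The -x option keeps du on one filesystem, so /proc and the sshfs and s3ql mounts are skipped, and errors are discarded. The function name largest_dirs is made up for this example:

```shell
# Print the largest first-level directories under a path, in KiB, biggest first.
# Usage: largest_dirs [directory] [number of lines]
largest_dirs() {
    dir=${1:-/}
    count=${2:-10}
    # -x: stay on one filesystem; -k: sizes in KiB so sort -n works.
    du -x -k --max-depth=1 "$dir" 2>/dev/null | sort -n -r | head -n "$count"
}
```

The first line of output is the total for the directory itself, followed by its largest subdirectories, for example largest_dirs /root 5.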

comment:3 Changed 2 years ago by chris

Add Hours to Ticket changed from 0 to 0.25
Total Hours changed from 0.5 to 0.75

I have stopped the Munin alerts for now by raising the warning threshold from 92% to 94% in /etc/munin/plugin-conf.d/munin-node:

[df*]
env.warning 94
env.critical 98

Restart munin-node:

/etc/init.d/munin-node restart
[ ok ] Restarting munin-node (via systemctl): munin-node.service.
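
To illustrate how the new thresholds classify a reading, here is a rough sketch. It is not how Munin itself is implemented; it only mirrors the "outside range [:N]" wording of the alerts, where a value above N is out of range. The function name usage_level is made up for this example, and the defaults are the values set above:

```shell
# Classify a disk usage percentage against warning and critical thresholds.
# Usage: usage_level PERCENT [WARNING] [CRITICAL]
usage_level() {
    pct=$1
    warn=${2:-94}
    crit=${3:-98}
    awk -v p="$pct" -v w="$warn" -v c="$crit" 'BEGIN {
        if (p > c)      print "CRITICAL"
        else if (p > w) print "WARNING"
        else            print "OK"
    }'
}
```

With GNU df, the current figure for / could be fed in with something like usage_level "$(df --output=pcent / | tail -n 1 | tr -dc '0-9')".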

I have also deleted some of the backups of the Changelog:

cd /root/.Changelog
rm -f Changelog.2015*
rm -f Changelog.2016-01*
rm -f Changelog.2016-02*
rm -f Changelog.2016-03*
rm -f Changelog.2016-04*
rm -f Changelog.2016-05*
rm -f Changelog.2016-06*
rm -f Changelog.2016-07*
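
The same clean-up can be written as one command using a character range in the glob. A sketch only: the function name prune_changelogs is made up for this example, and it assumes the files are named so that the patterns above match them:

```shell
# Delete Changelog backups from 2015 and from January to July 2016
# in the given directory, leaving later files in place.
prune_changelogs() {
    dir=$1
    # The subshell keeps the cd from affecting the caller.
    # Patterns that match nothing are passed literally and ignored by rm -f.
    ( cd "$dir" && rm -f -- Changelog.2015* Changelog.2016-0[1-7]* )
}
```

Running ls with the same two patterns first gives a preview of what would be removed.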

And cleaned up the apt archive:

apt-get clean

And that has bought some time. The next thing to try is deleting old backups, to see if that frees up space by reducing the amount of metadata stored:

df -h
Filesystem                  Size  Used Avail Use% Mounted on
udev                        489M     0  489M   0% /dev
tmpfs                       100M  5.4M   95M   6% /run
/dev/mapper/CRIN3--vg-root   15G   13G  1.6G  89% /
tmpfs                       499M     0  499M   0% /dev/shm
tmpfs                       5.0M     0  5.0M   0% /run/lock
tmpfs                       499M     0  499M   0% /sys/fs/cgroup
/dev/sda1                   236M  195M   29M  88% /boot
tmpfs                       100M     0  100M   0% /run/user/1000

comment:4 Changed 2 years ago by chris

Add Hours to Ticket changed from 0 to 0.1
Total Hours changed from 0.75 to 0.85

This came up again:

crin.org :: crin3.crin.org :: Disk usage in percent
        WARNINGs: / is 97.81 (outside range [:94]).
        OKs: /run is 10.34, /run/lock is 0.00, /boot is 87.07, /sys/fs/cgroup is 0.00, /dev/shm is 0.00.

But it has dropped back down now:

df -h
Filesystem                  Size  Used Avail Use% Mounted on
udev                        489M     0  489M   0% /dev
tmpfs                       100M   11M   90M  11% /run
/dev/mapper/CRIN3--vg-root   15G   12G  2.4G  84% /
tmpfs                       499M     0  499M   0% /dev/shm
tmpfs                       5.0M     0  5.0M   0% /run/lock
tmpfs                       499M     0  499M   0% /sys/fs/cgroup
/dev/sda1                   236M  195M   29M  88% /boot
crin1:/                     121G   27G   88G  24% /media/sshfs/crin1
tmpfs                       100M     0  100M   0% /run/user/1000

I guess it is caused by temporary files or something similar. At some point we will probably have to make this server bigger.
