Opened 2 years ago
Last modified 2 years ago
#81 new defect
crin3 backup issue
Reported by: | chris | Owned by: | chris |
---|---|---|---|
Priority: | major | Milestone: | Maintenance |
Component: | backups | Version: | |
Keywords: | Cc: | ||
Estimated Number of Hours: | 0 | Add Hours to Ticket: | 0 |
Billable?: | yes | Total Hours: | 0.85 |
Description
Munin email alert I'm getting every 5 mins:
    Date: Fri, 19 Aug 2016 09:30:16 +0000
    From: munin application user <munin@crin1.crin.org>
    Subject: crin3.crin.org Munin Alert

    crin.org :: crin3.crin.org :: Disk usage in percent
        WARNINGs: / is 97.35 (outside range [:92]).
        OKs: /run is 10.40, /dev/shm is 0.00, /boot is 87.07, /run/lock is 0.00, /sys/fs/cgroup is 0.00, /run/user/1000 is 0.00.
Change History (4)
comment:1 Changed 2 years ago by chris
- Add Hours to Ticket changed from 0 to 0.25
- Total Hours set to 0.25
comment:2 Changed 2 years ago by chris
- Add Hours to Ticket changed from 0 to 0.25
- Total Hours changed from 0.25 to 0.5
Checking where space usage is:
    cd /
    du -h --max-depth=1
    8.1M    ./bin
    4.0K    ./opt
    1.4G    ./lib
    7.0M    ./sbin
    193M    ./boot
    24K     ./tmp
    1.5G    ./usr
    5.4M    ./run
    0       ./sys
    5.2M    ./etc
    40K     ./media
    9.3G    ./root
    0       ./dev
    746M    ./var
    4.0K    ./mnt
    4.0K    ./srv
    4.0K    ./lib64
    104K    ./home
    16K     ./lost+found
    du: cannot access './proc/24896/task/24896/fd/4': No such file or directory
    du: cannot access './proc/24896/task/24896/fdinfo/4': No such file or directory
    du: cannot access './proc/24896/fd/3': No such file or directory
    du: cannot access './proc/24896/fdinfo/3': No such file or directory
    0       ./proc
    14G     .

    cd /root/
    du -h --max-depth=1
    16K     ./.aptitude
    516K    ./.Changelog
    4.0K    ./Mail
    20K     ./.ssh
    9.3G    ./.s3ql
    4.0K    ./.nano
    9.3G    .

    ls -lah
    total 4.1G
    drwx------ 3 root root 4.0K Aug 19 10:12 .
    drwx------ 8 root root 4.0K Aug 19 10:03 ..
    -rw------- 1 root root  456 Jul 25  2015 authinfo2
    -rw------- 1 root root  218 Jun  2  2015 authinfo2.db1
    -rw------- 1 root root  226 Jun  2  2015 authinfo2.greenqloud.db1
    -rw------- 1 root root  218 Jun  2  2015 authinfo2.web1
    -rw------- 1 root root  218 May 18  2015 authinfo2.web2
    -rw------- 1 root root  218 May 19  2015 authinfo2.wiki
    -rw-r--r-- 1 root root 8.2M Aug 19 10:00 fsck.log
    -rw-r--r-- 1 root root 1.0M Feb 27 00:34 fsck.log.1
    -rw-r--r-- 1 root root  46K Feb 17  2016 fsck.log.2
    -rw-r--r-- 1 root root 1.0M Feb 17  2016 fsck.log.3
    -rw-r--r-- 1 root root 1.0M Dec 18  2015 fsck.log.4
    -rw-r--r-- 1 root root 1.0M Oct  1  2015 fsck.log.5
    -rw-r--r-- 1 root root 2.5M Aug 19 10:16 mount.log
    -rw-r--r-- 1 root root 1.0M Dec 27  2015 mount.log.1
    -rw-r--r-- 1 root root    0 Jul 23  2015 mount.s3ql_crit.log
    drwxr-xr-x 2 root root  64K Aug 17 00:09 s3c:=2F=2Fs.qstack.advania.com:443=2Fcrin1=2F-cache
    -rw------- 1 root root 1.1G Aug 19 00:03 s3c:=2F=2Fs.qstack.advania.com:443=2Fcrin1=2F.db
    -rw-r--r-- 1 root root  200 Aug 17 00:04 s3c:=2F=2Fs.qstack.advania.com:443=2Fcrin1=2F.params
    -rw------- 1 root root 2.5G Aug 19 09:52 s3c:=2F=2Fs.qstack.advania.com:443=2Fcrin2=2F.db
    -rw-r--r-- 1 root root  201 Aug 19 09:52 s3c:=2F=2Fs.qstack.advania.com:443=2Fcrin2=2F.params
    -rw------- 1 root root 517M Aug 19 10:16 s3c:=2F=2Fs.qstack.advania.com:443=2Fcrin4=2F.db
    -rw-r--r-- 1 root root  200 Aug 19 10:15 s3c:=2F=2Fs.qstack.advania.com:443=2Fcrin4=2F.params
So, I'm not sure what can be done to create more space here...
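For future checks like the one above, it may be easier to keep `du` on a single filesystem and sort the result, so that `/proc` errors and anything mounted under `/media` stay out of the picture. A minimal sketch, assuming GNU `du` and GNU `sort`; the function name `top_dirs` is made up for illustration:

```shell
#!/bin/sh
# top_dirs DIR [COUNT]: print the COUNT largest first-level entries under DIR.
# -x keeps du on DIR's own filesystem, so /proc, /sys and other mounts are skipped.
# Hypothetical helper; relies on GNU du (--max-depth) and GNU sort (-h).
top_dirs() {
    dir=${1:-/}
    count=${2:-10}
    du -xh --max-depth=1 "$dir" 2>/dev/null | sort -rh | head -n "$count"
}
```

Something like `top_dirs / 5` followed by `top_dirs /root 5` would then repeat the drill-down done by hand in this comment. Errors are discarded here, so unreadable directories are silently left out.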
comment:3 Changed 2 years ago by chris
- Add Hours to Ticket changed from 0 to 0.25
- Total Hours changed from 0.5 to 0.75
I have stopped the Munin alerts for now by raising the warning level from 92% to 94% in /etc/munin/plugin-conf.d/munin-node:
    [df*]
    env.warning 94
    env.critical 98
Restart munin-node:
    /etc/init.d/munin-node restart
    [ ok ] Restarting munin-node (via systemctl): munin-node.service.
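To see how close each filesystem is to the new warning level without waiting for the next Munin run, the `df` output can be filtered against a threshold. This is only a sketch: the function name `over_threshold` is hypothetical, it reads `df -P` (POSIX format) output on stdin, and it assumes mount points contain no spaces. Munin's own percentage may differ slightly from what `df` reports.

```shell
#!/bin/sh
# over_threshold LIMIT: read `df -P` output on stdin and print
# "mountpoint percent" for every filesystem whose usage is above LIMIT.
# Hypothetical helper; assumes mount points without spaces.
over_threshold() {
    awk -v limit="$1" '
        NR > 1 {                      # skip the df header line
            use = $5
            sub(/%/, "", use)         # "89%" -> "89"
            if (use + 0 > limit) print $6, use
        }'
}
```

Usage would be along the lines of `df -P | over_threshold 94`, which prints nothing when every filesystem is at or below 94%.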
I have also deleted some of the backups of the Changelog:
    cd /root/.Changelog
    rm -f Changelog.2015*
    rm -f Changelog.2016-01*
    rm -f Changelog.2016-02*
    rm -f Changelog.2016-03*
    rm -f Changelog.2016-04*
    rm -f Changelog.2016-05*
    rm -f Changelog.2016-06*
    rm -f Changelog.2016-07*
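The month-by-month `rm` commands above select files by name. An alternative, not what was done here, is to select by age with `find`. The sketch below assumes GNU `find`, the `Changelog.*` naming seen in the commands above, and an arbitrary retention period passed in as days; the function name `old_changelogs` is made up. It only lists files, so the result can be reviewed before anything is removed.

```shell
#!/bin/sh
# old_changelogs DIR DAYS: list Changelog.* files directly inside DIR that
# were last modified more than DAYS days ago. Nothing is deleted here.
# Hypothetical helper; selects on modification time, not on the date in the name.
old_changelogs() {
    find "$1" -maxdepth 1 -type f -name 'Changelog.*' -mtime +"$2" -print
}

# After reviewing the list, the same expression can delete instead of print:
#   find /root/.Changelog -maxdepth 1 -type f -name 'Changelog.*' -mtime +30 -delete
```

For example, `old_changelogs /root/.Changelog 30` would show the backups older than roughly a month; the 30 is an illustrative value, not a figure from this ticket.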
And cleaned up the apt archive:
apt-get clean
And that has bought some time. The next thing to try is deleting old backups, to see if that frees up space by reducing the amount of metadata stored:
    df -h
    Filesystem                  Size  Used Avail Use% Mounted on
    udev                        489M     0  489M   0% /dev
    tmpfs                       100M  5.4M   95M   6% /run
    /dev/mapper/CRIN3--vg-root   15G   13G  1.6G  89% /
    tmpfs                       499M     0  499M   0% /dev/shm
    tmpfs                       5.0M     0  5.0M   0% /run/lock
    tmpfs                       499M     0  499M   0% /sys/fs/cgroup
    /dev/sda1                   236M  195M   29M  88% /boot
    tmpfs                       100M     0  100M   0% /run/user/1000
comment:4 Changed 2 years ago by chris
- Add Hours to Ticket changed from 0 to 0.1
- Total Hours changed from 0.75 to 0.85
This came up again:
    crin.org :: crin3.crin.org :: Disk usage in percent
        WARNINGs: / is 97.81 (outside range [:94]).
        OKs: /run is 10.34, /run/lock is 0.00, /boot is 87.07, /sys/fs/cgroup is 0.00, /dev/shm is 0.00.
But it has dropped back down now:
    df -h
    Filesystem                  Size  Used Avail Use% Mounted on
    udev                        489M     0  489M   0% /dev
    tmpfs                       100M   11M   90M  11% /run
    /dev/mapper/CRIN3--vg-root   15G   12G  2.4G  84% /
    tmpfs                       499M     0  499M   0% /dev/shm
    tmpfs                       5.0M     0  5.0M   0% /run/lock
    tmpfs                       499M     0  499M   0% /sys/fs/cgroup
    /dev/sda1                   236M  195M   29M  88% /boot
    crin1:/                     121G   27G   88G  24% /media/sshfs/crin1
    tmpfs                       100M     0  100M   0% /run/user/1000
I guess it is caused by tmp files or something. At some point we will probably have to make this server bigger.
It looks like it is because something has gone wrong with a backup job, two are running but only one s3ql filesystem is connected:
So:
This seems to be an unkillable process, so I'm rebooting the server and will check how it looks later.
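For the follow-up check after the reboot, two quick filters might help: one for processes stuck in uninterruptible sleep (state `D`), which is one common reason a process ignores `kill -9`, and one for the s3ql filesystems that are actually mounted. This is a sketch under assumptions: the function names are hypothetical, the `ps` output format is the procps one, and the mount type `fuse.s3ql` is assumed rather than taken from this ticket.

```shell
#!/bin/sh
# d_state: read `ps -eo pid,stat,args` output on stdin and print the
# processes whose state starts with D (uninterruptible sleep). Such a
# process usually cannot be killed until the I/O it waits on returns.
d_state() {
    awk 'NR > 1 && $2 ~ /^D/'
}

# s3ql_mounts: read /proc/mounts-style lines on stdin and print the mount
# points whose filesystem type is fuse.s3ql (assumed type name).
s3ql_mounts() {
    awk '$3 == "fuse.s3ql" { print $2 }'
}

# Usage:
#   ps -eo pid,stat,args | d_state
#   s3ql_mounts < /proc/mounts
```

If `d_state` lists a backup process while `s3ql_mounts` shows fewer filesystems than there are backup jobs running, that would be consistent with the situation described above.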