Opened 3 years ago

Closed 3 years ago

#39 closed defect (fixed)

S3QL backup problems

Reported by: chris Owned by: chris
Priority: critical Milestone: Maintenance
Component: backups Version:
Keywords: Cc:
Estimated Number of Hours: 0 Add Hours to Ticket: 0
Billable?: yes Total Hours: 17.12

Description (last modified by chris)

Often backups fail during the rsync with "Transport endpoint is not connected".

The S3QL Crin3 backup script mounts on /tmp but the other scripts don't, so when there is an issue the unmount doesn't work because the S3QL filesystem is mounted in an unknown location, see ticket:35. [SOLVED]

There are also some unneeded scripts in /usr/local/bin on Crin3 which need removing to make things less confusing. [SOLVED]

Change History (15)

comment:1 Changed 3 years ago by chris

  • Add Hours to Ticket changed from 0 to 0.25
  • Priority changed from minor to critical
  • Total Hours set to 0.25

This really needs fixing. At the moment, when there is an error, what needs doing is finding the not-properly-unmounted file system with df -h; it shows up as something like:

df: ‘/tmp/s3ql_backup_9276’: Transport endpoint is not connected

Then unmount it with fusermount and manually mount it again, e.g.:

fusermount -u /tmp/s3ql_backup_9276
mnt-s3ql crin1
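
A minimal sketch of automating that cleanup (assuming the /tmp/s3ql_backup_* naming the old script used; this is illustrative, not the final fix):

for mp in /tmp/s3ql_backup_*; do
    # df fails on a dead FUSE mount with "Transport endpoint is not connected"
    df "$mp" >/dev/null 2>&1 || fusermount -u "$mp"
done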

comment:2 Changed 3 years ago by chris

  • Add Hours to Ticket changed from 0 to 2
  • Total Hours changed from 0.25 to 2.25

I have updated the scripts for mounting, unmounting and backing up s3ql filesystems, and updated the wiki pages, wiki:S3QLMnt and wiki:S3QLBackup.

I have raised an issue with the Crin1 backup on the s3ql email list.

I have updated the crontab on Crin3:

01 01 * * * /usr/local/bin/umnt-s3ql crin1 ; /usr/local/bin/mnt-s3ql crin1 ; /usr/local/bin/umnt-sshfs crin1 ; /usr/local/bin/mnt-sshfs crin1 ; /usr/local/bin/s3ql_backup crin1
01 03 * * * /usr/local/bin/umnt-s3ql crin2 ; /usr/local/bin/mnt-s3ql crin2 ; /usr/local/bin/umnt-sshfs crin2 ; /usr/local/bin/mnt-sshfs crin2 ; /usr/local/bin/s3ql_backup crin2
01 05 * * * /usr/local/bin/umnt-s3ql crin4 ; /usr/local/bin/mnt-s3ql crin4 ; /usr/local/bin/umnt-sshfs crin4 ; /usr/local/bin/mnt-sshfs crin4 ; /usr/local/bin/s3ql_backup crin4

But this could do with further improvement, to do things like checking whether a forced filesystem check is needed and only unmounting and re-mounting when necessary, as sketched below.
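
For example, a check along these lines would only cycle the mount when the endpoint is actually dead (a sketch; the mountpoint path is an assumption):

MP=/media/s3ql/crin1
if mountpoint -q "$MP" && df "$MP" >/dev/null 2>&1; then
    : # mounted and responding, leave it alone
else
    fusermount -u -z "$MP" 2>/dev/null   # lazy unmount in case it is wedged
    mnt-s3ql crin1
fi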

comment:3 Changed 3 years ago by chris

  • Add Hours to Ticket changed from 0 to 0.37
  • Total Hours changed from 2.25 to 2.62

I have raised the issue with the Crin1 backup as a ticket.

comment:4 Changed 3 years ago by chris

  • Add Hours to Ticket changed from 0 to 0.5
  • Total Hours changed from 2.62 to 3.12

I have followed up the ticket with more details as requested on the email list.

comment:5 Changed 3 years ago by chris

  • Add Hours to Ticket changed from 0 to 0.5
  • Total Hours changed from 3.12 to 3.62

Posted a response on the ticket last night, and I have updated the backup crontab on Crin3 so that each step only runs if the previous one succeeded (&& rather than ;):

01 01 * * * /usr/local/bin/mnt-s3ql crin1 && /usr/local/bin/mnt-sshfs crin1 && /usr/local/bin/s3ql_backup crin1 && /usr/local/bin/umnt-s3ql crin1 && /usr/local/bin/umnt-sshfs crin1
02 03 * * * /usr/local/bin/mnt-s3ql crin2 && /usr/local/bin/mnt-sshfs crin2 && /usr/local/bin/s3ql_backup crin2 && /usr/local/bin/umnt-s3ql crin2 && /usr/local/bin/umnt-sshfs crin2
03 05 * * * /usr/local/bin/mnt-s3ql crin4 && /usr/local/bin/mnt-sshfs crin4 && /usr/local/bin/s3ql_backup crin4 && /usr/local/bin/umnt-s3ql crin4 && /usr/local/bin/umnt-sshfs crin4

comment:6 Changed 3 years ago by chris

  • Add Hours to Ticket changed from 0 to 4
  • Total Hours changed from 3.62 to 7.62

I have added a lot of error checking to the scripts and sorted out the exit codes (see wiki:SshfsMnt and wiki:S3QLMnt), and I have changed the wiki:S3QLBackup script so it calls the others; the crontab has been updated to:

# m h  dom mon dow   command
01 01 * * * /usr/local/bin/s3ql_backup crin1
02 03 * * * /usr/local/bin/s3ql_backup crin2
03 05 * * * /usr/local/bin/s3ql_backup crin4
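
The overall shape of that wrapper is roughly as follows (a sketch only; the real script with its full error checking is documented at wiki:S3QLBackup, and the rsync arguments are taken from the log further down):

#!/bin/bash
# s3ql_backup <name>: mount, back up, unmount; any failing step aborts
# and hands a non-zero exit code back to cron.
set -e
NAME="$1"
/usr/local/bin/mnt-s3ql "$NAME"
/usr/local/bin/mnt-sshfs "$NAME"
DEST="/media/s3ql/$NAME/$(date '+%Y-%m-%d_%H:%M:%S')"
mkdir -p "$DEST/var/backups"
rsync -aHAXx --delete-during --delete-excluded --partial \
      "/media/sshfs/$NAME/var/backups/" "$DEST/var/backups/"
/usr/local/bin/umnt-sshfs "$NAME"
/usr/local/bin/umnt-s3ql "$NAME"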

s3ql and lots of other packages have been upgraded; there is this issue:

The following NEW packages will be installed:
  cpp-5{a} distro-info-data{a} gcc-5{a} libapt-inst1.7{a} libapt-pkg4.16{a} libasan2{a}
  libb-hooks-endofscope-perl{a} libb-hooks-op-check-perl{a} libbareword-filehandles-perl{a}
  libboost-iostreams1.58.0{a} libcc1-0{a} libclass-method-modifiers-perl{a} libclass-xsaccessor-perl{a}
  libcpan-changes-perl{a} libcwidget3v5{a} libdata-perl-perl{a} libdevel-caller-perl{a}
  libdevel-globaldestruction-perl{a} libdevel-lexalias-perl{a} libfile-slurp-perl{a} libgcc-5-dev{a}
  libgetopt-long-descriptive-perl{a} libicu55{a} libimport-into-perl{a} libindirect-perl{a} libio-stringy-perl{a}
  liblexical-sealrequirehints-perl{a} liblvm2app2.2{a} libmodule-implementation-perl{a} libmodule-runtime-perl{a}
  libmoo-perl{a} libmoox-handlesvia-perl{a} libmpx0{a} libmultidimensional-perl{a} libnamespace-autoclean-perl{a}
  libnamespace-clean-perl{a} libnghttp2-14{a} libpackage-stash-perl{a} libpackage-stash-xs-perl{a}
  libpadwalker-perl{a} libparams-classify-perl{a} libparams-validate-perl{a} libpath-tiny-perl{a}
  libpod-markdown-perl{a} librole-tiny-perl{a} libsigc++-2.0-0v5{ab} libstrictures-perl{a}
  libsub-exporter-progressive-perl{a} libsub-identify-perl{a} libsys-meminfo-perl{a} libtry-tiny-perl{a}
  libtype-tiny-perl{a} libtype-tiny-xs-perl{a} libunicode-utf8-perl{a} libvariable-magic-perl{a} libxapian22v5{a}
  linux-image-4.2.0-1-amd64{a} python-httplib2{a} python-pycurl{a} python-pysimplesoap{a}
The following packages will be REMOVED:
  libcwidget3{u} libregexp-common-perl{u} libxapian22{u} python-defusedxml{u} python-soappy{u} python-wstools{u}
The following packages have unmet dependencies:
 libsigc++-2.0-0v5 : Conflicts: libsigc++-2.0-0c2a but 2.4.1-1 is installed.
The following actions will resolve these dependencies:

     Remove the following packages:
1)     aptitude

     Keep the following packages at their current version:
2)     libcwidget3v5 [Not Installed]
3)     libsigc++-2.0-0v5 [Not Installed]

     Leave the following dependencies unresolved:
4)     aptitude-common recommends aptitude


Accept this solution? [Y/n/q/?] q

But I think that can wait; s3ql has been updated:

s3ql (2.14+dfsg-1) unstable; urgency=medium

  The internal file system revision has changed. File systems created
  with S3QL 2.14 or newer are not compatible with prior S3QL
  versions. To update an existing file system to the newest revision,
  use the 's3qladm upgrade' command.

 -- Nikolaus Rath <Nikolaus@rath.org>  Wed, 12 Aug 2015 14:57:22 -0700

So these needed to be run:

s3qladm upgrade s3c://s.qstack.advania.com:443/crin4/
s3qladm upgrade s3c://s.qstack.advania.com:443/crin2/

The bug has been fixed; I need to check when a new version of s3ql with the fix will be added to Debian.

comment:7 Changed 3 years ago by chris

  • Add Hours to Ticket changed from 0 to 0.25
  • Total Hours changed from 7.62 to 7.87

The backups failed last night as I forgot that /usr/local/bin isn't in the $PATH for cron, so I have amended the script to fix this.
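
For cron, either of these forms works (a sketch; the exact change made to the script isn't quoted in the ticket):

# At the top of the crontab, so every job sees it:
PATH=/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin

# Or near the top of the script itself:
export PATH="/usr/local/bin:$PATH"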

I have also asked when the fix will be available in Debian unstable.

comment:8 Changed 3 years ago by chris

  • Add Hours to Ticket changed from 0 to 0.5
  • Total Hours changed from 7.87 to 8.37

Nikolaus has replied regarding when a new version of s3ql might be available:

That might be a while, it's really unlikely to happen so I don't want to make a release just because of this. However, you should be able to apply the above patch to your S3QL 2.13 installation (only the changes to fsck.py are important, you don't need to patch the t3_fsck.py or Changes.rst file).

The diff is here and the file is here, so on Crin3:

locate fsck.py 
/usr/lib/s3ql/s3ql/fsck.py
/usr/lib/s3ql/s3ql/fsck.py.orig

So I must have manually patched this file in the past (it'll be documented on a ticket somewhere), but the last Debian update would have clobbered that change.

cd
wget https://bitbucket.org/nikratio/s3ql/raw/79a95415c11e0bd199225c0e4a585a5ac183aef2/src/s3ql/fsck.py
diff fsck.py /usr/lib/s3ql/s3ql/fsck.py

This is the diff:

2c2
< fsck.py - this file is part of S3QL.
---
> fsck.py - this file is part of S3QL (http://s3ql.googlecode.com)
6c6
< This work can be distributed under the terms of the GNU GPLv3.
---
> This program can be distributed under the terms of the GNU GPLv3.
649,650c649,650
<         for (block_id, obj_id) in list(self.conn.query('SELECT id, obj_id FROM blocks '
<                                                        'WHERE hash IS NULL')):
---
>         for (block_id, obj_id) in self.conn.query('SELECT id, obj_id FROM blocks '
>                                                   'WHERE hash IS NULL'):
652,658c652
< 
<             # This should only happen when there was an error during upload,
<             # so the object must not have been stored correctly. We cannot
<             # just recalculate the hash for the block, because then we may
<             # be modifying the contents of the inode that refers to this
<             # block.
<             self.log_error("No checksum for block %d, removing from table...", block_id)
---
>             self.log_error("No cached checksum for block %d, recomputing...", block_id)
664,665c658,669
<             self.conn.execute('DELETE FROM blocks WHERE id=?', (block_id,))
<             self.conn.execute('DELETE FROM objects WHERE id=?', (obj_id,))
---
>             def do_read(fh):
>                 sha = hashlib.sha256()
>                 while True:
>                     buf = fh.read(BUFSIZE)
>                     if not buf:
>                         break
>                     sha.update(buf)
>                 return sha.digest()
>             hash_ = self.backend.perform_read(do_read, "s3ql_data_%d" % obj_id)
> 
>             self.conn.execute('UPDATE blocks SET hash=? WHERE id=?',
>                               (hash_, block_id))

Moving the old file and replacing it with the new one (to save patching):

mv /usr/lib/s3ql/s3ql/fsck.py /usr/lib/s3ql/s3ql/fsck.py.2015-10-16
mv fsck.py /usr/lib/s3ql/s3ql/

Test:

mnt-s3ql crin1

Starting fsck of s3c://s.qstack.advania.com:443/crin1/
Using cached metadata.
Remote metadata is outdated.
Checking DB integrity...
Creating temporary extra indices...
Checking lost+found...
Checking cached objects...
Checking names (refcounts)...
Checking contents (names)...
Checking contents (inodes)...
Checking contents (parent inodes)...
Checking objects (reference counts)...
Checking objects (backend)...
..processed 112240 objects so far..
Checking objects (sizes)...
Checking blocks (referenced objects)...
Checking blocks (refcounts)...
Checking blocks (checksums)...
No checksum for block 113265, removing from table...
Checking inode-block mapping (blocks)...
Inode-block mapping for inode 2112665 refers to non-existing block 113265
File may lack data, moved to /lost+found: /lost+found/block-113265
Checking inode-block mapping (inodes)...
Checking inodes (refcounts)...
Checking inodes (sizes)...
Checking extended attributes (names)...
Checking extended attributes (inodes)...
Checking symlinks (inodes)...
Checking directory reachability...
Checking unix conventions...
Checking referential integrity...
Dropping temporary indices...
Dumping metadata...
..objects..
..blocks..
..inodes..
..inode_blocks..
..symlink_targets..
..names..
..contents..
..ext_attributes..
Compressing and uploading metadata...
Wrote 20.0 MiB of compressed metadata.
Cycling metadata backups...
Backing up old metadata...
Cleaning up local metadata...
Completed fsck of s3c://s.qstack.advania.com:443/crin1/
Filesystem check of s3c://s.qstack.advania.com:443/crin1 failed

So running it again:

mnt-s3ql crin1
Starting fsck of s3c://s.qstack.advania.com:443/crin1/
Using cached metadata.
File system is marked as clean. Use --force to force checking.
Filesystem check of s3c://s.qstack.advania.com:443/crin1 success
Using 2 upload threads (memory limited).
Autodetected 65498 file descriptors available for cache entries
Using cached metadata.
Setting cache size to 9031 MB
Mounting filesystem...
Mounting s3c://s.qstack.advania.com:443/crin1 on /media/s3ql/crin1 success

So I have set off a manual backup.

TODO: When an fsck fails, the backup script should run it again with --force, and I think some checks for the condition mentioned in ticket:39#comment:1 are also needed (see the sketch below).
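
The fsck part of that TODO could look something like this inside mnt-s3ql (a sketch, assuming the wrapper calls fsck.s3ql directly):

URL="s3c://s.qstack.advania.com:443/crin1/"
# If the first check fails (as above), retry once with --force before giving up.
if ! fsck.s3ql "$URL"; then
    fsck.s3ql --force "$URL" || exit 1
fi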

comment:9 Changed 3 years ago by chris

  • Add Hours to Ticket changed from 0 to 4
  • Total Hours changed from 8.37 to 12.37

Over the last 12 days I have been tweaking the backup and mounting scripts and they are more or less working properly, but they need a little more error checking, and then I'll update the documentation on the wiki for them.

There are still problems with "Transport endpoint is not connected" errors halfway through backups.

comment:10 Changed 3 years ago by chris

  • Description modified (diff)
  • Summary changed from S3QL backup script mount location to S3QL backup problems

comment:11 Changed 3 years ago by chris

  • Add Hours to Ticket changed from 0 to 2.65
  • Total Hours changed from 12.37 to 15.02

The Crin1 MySQL backup often fails, as it did last night; this happens during the rsync:

mysql/databases/owncloud/oc_users.sql
mysql/databases/owncloud/oc_vcategory.sql
mysql/databases/owncloud/oc_vcategory_to_object.sql
rsync: write failed on "/media/s3ql/crin1/2015-11-03_00:22:26/var/backups/mysql/databases/dev/field_revision_body.sql": Transport endpoint is not connected (107)
rsync: stat "/media/s3ql/crin1/2015-11-03_00:22:26/var/backups/mysql/databases/dev/.field_revision_body.sql.dueMx3" failed: Transport endpoint is not connected (107)
rsync: rename "/media/s3ql/crin1/2015-11-03_00:22:26/var/backups/mysql/databases/dev/.field_revision_body.sql.dueMx3" -> "mysql/databases/dev/field_revision_body.sql": Transport endpoint is not connected (107)
rsync error: error in file IO (code 11) at receiver.c(393) [receiver=3.1.1]
Problem with rsync -aHAXx --delete-during --delete-excluded --partial -v /media/sshfs/crin1/var/backups/ /media/sshfs/crin1/2015-11-03_00:22:26/var/backups/
df: ‘/media/s3ql/crin1’: Transport endpoint is not connected

So I have written a separate script to do this check, wiki:S3QLBackup#s3ql_endpoint_check, done some more work on the other scripts, and these changes are now being tested...
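
A minimal sketch of that kind of check (the real script is on the wiki page; the mountpoint path is an assumption):

#!/bin/bash
# s3ql_endpoint_check <name>: stat on a disconnected FUSE mount fails with
# "Transport endpoint is not connected", so use that as the probe.
MP="/media/s3ql/$1"
if ! stat "$MP" >/dev/null 2>&1; then
    echo "$MP: transport endpoint not connected" >&2
    fusermount -u -z "$MP"   # lazy unmount so it can be remounted cleanly
    exit 1
fi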

comment:12 Changed 3 years ago by chris

  • Add Hours to Ticket changed from 0 to 0.5
  • Total Hours changed from 15.02 to 15.52

The Crin1 backup failed again last night:

mysql/databases/stage/webform_submitted_data.sql
rsync: write failed on "/media/s3ql/crin1/2015-11-05_00:18:27/var/backups/mysql/databases/newprod/field_revision_body.sql": Transport endpoint is not connected (107)
rsync: stat "/media/s3ql/crin1/2015-11-05_00:18:27/var/backups/mysql/databases/newprod/.field_revision_body.sql.UGxX5J" failed: Transport endpoint is not connected (107)
rsync: rename "/media/s3ql/crin1/2015-11-05_00:18:27/var/backups/mysql/databases/newprod/.field_revision_body.sql.UGxX5J" -> "mysql/databases/newprod/field_revision_body.sql": Transport endpoint is not connected (107)
rsync error: error in file IO (code 11) at receiver.c(393) [receiver=3.1.1]
Problem with rsync, try running this script again: /usr/local/bin/s3ql_backup crin1

The error happens long after the rsync for the file has completed, and the file in question is 240M. I raised this on the s3ql list and then found this bug report, so I have set the number of upload threads to 1 in the script (wiki:S3QLMnt) and posted a follow-up email to the list.
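
mount.s3ql takes a --threads option, so the change in wiki:S3QLMnt amounts to something like this (the exact invocation in the script is assumed):

mount.s3ql --threads 1 s3c://s.qstack.advania.com:443/crin1/ /media/s3ql/crin1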

comment:13 Changed 3 years ago by chris

  • Add Hours to Ticket changed from 0 to 1
  • Total Hours changed from 15.52 to 16.52

I have updated /usr/lib/s3ql/s3ql/backends/s3c.py from here, as per this suggestion. The time recorded also covers this email.

comment:14 Changed 3 years ago by chris

  • Add Hours to Ticket changed from 0 to 0.5
  • Total Hours changed from 16.52 to 17.02

I checked the backups each day over the weekend and the backup has now run three times without an issue, so I have reported that to the developer; hopefully, if it runs OK for the rest of the week, this ticket will be good to close.

comment:15 Changed 3 years ago by chris

  • Add Hours to Ticket changed from 0 to 0.1
  • Resolution set to fixed
  • Status changed from new to closed
  • Total Hours changed from 17.02 to 17.12

This is still working without errors; closing.
