Container stats are not uploaded because one container db file cannot be opened

Asked by Shengjie Min

This is an issue I am running into with STATS.
Basically, I added a stats user named "logger" and set up all the STATS components. All the logs are processed and uploaded under this user correctly, except the container stats.

Under the logger account, I only have three containers:

root@ubuntu:/etc/swift# st -A http://127.0.0.1:8080/auth/v1.0 -U logger:logger -K passw0rd list
account_stats
log_data
log_processing_data

According to my log-processor.conf below, I should also have a container called "container_stats", created along with the others, to hold all the container stats data.
=============================================
[log-processor]
swift_account = AUTH_101e472e-136e-4233-97e1-252cf299f1a4
user = root

[log-processor-access]
swift_account = AUTH_101e472e-136e-4233-97e1-252cf299f1a4
container_name = log_data
log_dir = /var/log/swift/hourly/
source_filename_pattern = ^
    (?P<year>[0-9]{4})
    (?P<month>[0-1][0-9])
    (?P<day>[0-3][0-9])
    (?P<hour>[0-2][0-9])
    .*$
class_path = swift.stats.access_processor.AccessLogProcessor
user = root

[log-processor-stats]
swift_account = AUTH_101e472e-136e-4233-97e1-252cf299f1a4
container_name = account_stats
log_dir = /var/log/swift/stats/
class_path = swift.stats.stats_processor.StatsLogProcessor
devices = /srv/1/node
mount_check = false
user = root

[log-processor-container-stats]
swift_account = AUTH_101e472e-136e-4233-97e1-252cf299f1a4
container_name = container_stats
log_dir = /var/log/swift/stats/
class_path = swift.stats.stats_processor.StatsLogProcessor
processable = false
devices = /srv/1/node
mount_check = false
user = root

===========================================
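As a side note on the `source_filename_pattern` setting above, here is a minimal sketch of how such a multi-line pattern can be matched. It assumes the pattern is compiled in verbose mode, so the line breaks and indentation are ignored. The file name `2011062706.log.gz` is a made-up example, not one taken from this system.

```python
import re

# Same pattern as in [log-processor-access], compiled with re.VERBOSE so
# that the whitespace between the groups is ignored (an assumption).
FILENAME_RE = re.compile(r"""
    ^
    (?P<year>[0-9]{4})
    (?P<month>[0-1][0-9])
    (?P<day>[0-3][0-9])
    (?P<hour>[0-2][0-9])
    .*$
""", re.VERBOSE)


def parse_log_name(filename):
    """Return the year/month/day/hour fields, or None if there is no match."""
    match = FILENAME_RE.match(filename)
    return match.groupdict() if match else None


if __name__ == '__main__':
    # Hypothetical hourly log file name.
    print(parse_log_name('2011062706.log.gz'))
```

A file name that does not start with ten digits in that layout is simply not matched.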
In my syslog, I found the exception below repeated all the time. It is raised because the connection times out when the collector tries to open one of the container dbs. I checked all the container dbs, and 'c8e24cf84557279b018386f2c5d9e959.db' is the only one that cannot be accessed. When I tried opening it with sqliteman, I got the error "Unable to open or create file c8e24cf84557279b018386f2c5d9e959.db. it is probably not a database". So this container db seems to be corrupted.

Jun 27 06:05:02 ubuntu container-stats UNCAUGHT EXCEPTION#012Traceback (most recent call last):#012 File "/usr/local/bin/swift-container-stats-logger", line 7, in <module>#012 execfile(__file__)#012 File "/home/dellswift/swift/trunk/bin/swift-container-stats-logger", line 27, in <module>#012 log_name="container-stats", **options)#012 File "/home/dellswift/swift/trunk/swift/common/daemon.py", line 87, in run_daemon#012 klass(conf).run(once=once, **kwargs)#012 File "/home/dellswift/swift/trunk/swift/common/daemon.py", line 52, in run#012 self.run_once(**kwargs)#012 File "/home/dellswift/swift/trunk/swift/stats/db_stats_collector.py", line 56, in run_once#012 self.find_and_process()#012 File "/home/dellswift/swift/trunk/swift/stats/db_stats_collector.py", line 88, in find_and_process#012 line_data = self.get_data(db_path)#012 File "/home/dellswift/swift/trunk/swift/stats/db_stats_collector.py", line 144, in get_data#012 if not broker.is_deleted():#012 File "/home/dellswift/swift/trunk/swift/common/db.py", line 869, in is_deleted#012 with self.get() as conn:#012 File "/usr/lib/python2.6/contextlib.py", line 16, in __enter__#012 return self.gen.next()#012 File "/home/dellswift/swift/trunk/swift/common/db.py", line 264, in get#012 self.conn = get_db_connection(self.db_file, self.timeout)#012 File "/home/dellswift/swift/trunk/swift/common/db.py", line 147, in get_db_connection#012 timeout=timeout)#012DatabaseConnectionError: DB connection error (/srv/1/node/sdb1/containers/205705/959/c8e24cf84557279b018386f2c5d9e959/c8e24cf84557279b018386f2c5d9e959.db, 25):#012Traceback (most recent call last):#012 File "/home/dellswift/swift/trunk/swift/common/db.py", line 139, in get_db_connection#012 conn.execute('PRAGMA synchronous = NORMAL')#012 File "/home/dellswift/swift/trunk/swift/common/db.py", line 81, in execute#012 return self._timeout(lambda: sqlite3.Connection.execute(#012 File "/home/dellswift/swift/trunk/swift/common/db.py", line 74, in _timeout#012 return call()#012 File 
"/home/de
===========================================
root@ubuntu:/srv/1/node/sdb1/containers# find -name '*.db'
./73763/1ce/4808deba36e11d131efa810a2ae9b1ce/4808deba36e11d131efa810a2ae9b1ce.db
./194312/6b3/bdc22bd9da35c1bc76750394b130d6b3/bdc22bd9da35c1bc76750394b130d6b3.db
./1548/9e7/018330c4ff4b486bacdb72eabd1d39e7/018330c4ff4b486bacdb72eabd1d39e7.db
./250637/1de/f4c346f5ccf98e450d216b4d17e871de/f4c346f5ccf98e450d216b4d17e871de.db
./43112/746/2a1a0f64eacb7d7448360e24cba17746/2a1a0f64eacb7d7448360e24cba17746.db
./201921/3f0/c5305e0713754da16dfd60e844f9b3f0/c5305e0713754da16dfd60e844f9b3f0.db
./135585/ae2/84685cee1c371d0a29dae0c2d3f88ae2/84685cee1c371d0a29dae0c2d3f88ae2.db
./190003/a20/b98ce25e04db013c068f616f52127a20/b98ce25e04db013c068f616f52127a20.db
./4279/995/042ddf0d27c2f94c7c78517335160995/042ddf0d27c2f94c7c78517335160995.db
./203103/004/c657e54ffcf2f371ccbbead5dc228004/c657e54ffcf2f371ccbbead5dc228004.db
./71353/ef4/45ae622b7f732fbe59098e0898233ef4/45ae622b7f732fbe59098e0898233ef4.db
./48889/cae/2fbe712598dbbe48471bb75f67e1fcae/2fbe712598dbbe48471bb75f67e1fcae.db
./72573/dff/46df74c3b1befc2faeaa4fa1ff7a5dff/46df74c3b1befc2faeaa4fa1ff7a5dff.db
./5565/9a4/056f45cce6fd7b772f2e472b5196a9a4/056f45cce6fd7b772f2e472b5196a9a4.db
./243507/fb2/edccf31cc84e6812c66ed05b6c8e5fb2/edccf31cc84e6812c66ed05b6c8e5fb2.db
./156021/e1d/985d404503e1f344c8bc4ace1e710e1d/985d404503e1f344c8bc4ace1e710e1d.db
./216376/0e7/d34e0a1a4ee2d1e29c5c2b60d24cd0e7/d34e0a1a4ee2d1e29c5c2b60d24cd0e7.db
./156014/80f/985ba052ef299ecb1f84303db155f80f/985ba052ef299ecb1f84303db155f80f.db
./109572/78d/6b013f833790cec7972c5daf6e46e78d/6b013f833790cec7972c5daf6e46e78d.db
./205705/959/c8e24cf84557279b018386f2c5d9e959/c8e24cf84557279b018386f2c5d9e959.db
./168019/5cc/a414f21fd642903cabddc3c20ab875cc/a414f21fd642903cabddc3c20ab875cc.db
./10007/d90/09c5c9ac5bddcfbaf5d96953e3a4cd90/09c5c9ac5bddcfbaf5d96953e3a4cd90.db
./65930/532/4062af2e0e26baad0b9ed88acaa67532/4062af2e0e26baad0b9ed88acaa67532.db
./67389/4e8/41cf629bdcd8b2bf314ba913ccc264e8/41cf629bdcd8b2bf314ba913ccc264e8.db
./185922/7d3/b5909b3c69186fbcdab95c4b2e22b7d3/b5909b3c69186fbcdab95c4b2e22b7d3.db
./216445/c62/d35f4b671a9d02f04be1b105759b7c62/d35f4b671a9d02f04be1b105759b7c62.db
./118615/584/73d5d6d06d7c12e14d8939f37b3f5584/73d5d6d06d7c12e14d8939f37b3f5584.db
./14473/f9a/0e226ed54973dce7ae42ac1b5740af9a/0e226ed54973dce7ae42ac1b5740af9a.db
./157874/02a/9a2c8e8d56ecdb1b2e9010f4fdbf602a/9a2c8e8d56ecdb1b2e9010f4fdbf602a.db
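The manual check of each db file described above could be scripted. The following is a rough sketch, not part of Swift: it walks a directory tree and runs SQLite's `PRAGMA quick_check` on every `.db` file. The function names `check_db` and `find_bad_dbs` are made up for this illustration.

```python
import os
import sqlite3


def check_db(path):
    """Return None if the file passes a quick check, else an error string."""
    try:
        conn = sqlite3.connect(path, timeout=5)
        try:
            row = conn.execute('PRAGMA quick_check').fetchone()
        finally:
            conn.close()
    except sqlite3.DatabaseError as err:
        # Raised, for example, when the file is not a SQLite database.
        return str(err)
    return None if row and row[0] == 'ok' else str(row)


def find_bad_dbs(root):
    """Map each failing .db path under root to its error message."""
    bad = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            if name.endswith('.db'):
                path = os.path.join(dirpath, name)
                problem = check_db(path)
                if problem is not None:
                    bad[path] = problem
    return bad


if __name__ == '__main__':
    # Path taken from the listing above; adjust for your own node.
    for path, problem in find_bad_dbs('/srv/1/node/sdb1/containers').items():
        print(path, problem)
```

Opening the files while the container server is running takes SQLite locks, so a check like this is probably best run on a quiet or offline node.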

So my questions are:
1. What could cause container or account dbs to become corrupted or invalid?
2. If one of the dbs is inaccessible or corrupted, is it intended that the whole stats process fails?
3. What is the recommendation if this happens in a live production environment? Delete the invalid db and let the process recover by itself?

Question information

Language:
English
Status:
Solved
For:
OpenStack Object Storage (swift)
Assignee:
No assignee
Solved by:
Shengjie Min
David Goetz (david-goetz) said :
#1
Shengjie Min (shengjie-min) said :
#2

Hi David, thanks. I had a look at the fix and it looks good. So now we catch the db error/exception and let the db_stats_collector job carry on.
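To illustrate the behaviour described here, this is a minimal sketch of a collection loop that logs and skips a db it cannot read instead of aborting the whole run. It is not the actual Swift fix. The names `read_container_stat` and `collect` are hypothetical, and the `container_stat` table with `object_count` and `bytes_used` columns is an assumption about the schema.

```python
import logging
import sqlite3

logger = logging.getLogger('container-stats')


def read_container_stat(db_path):
    """Hypothetical reader: return (object_count, bytes_used) for one db."""
    conn = sqlite3.connect(db_path, timeout=5)
    try:
        # Table and column names are assumptions for this sketch.
        return conn.execute(
            'SELECT object_count, bytes_used FROM container_stat').fetchone()
    finally:
        conn.close()


def collect(db_paths, reader=read_container_stat):
    """Read every db, skipping (and logging) the ones that fail."""
    rows, skipped = [], []
    for path in db_paths:
        try:
            rows.append((path, reader(path)))
        except sqlite3.DatabaseError as err:
            # One bad file should not stop the rest of the run.
            logger.warning('skipping unreadable db %s: %s', path, err)
            skipped.append(path)
    return rows, skipped
```

Returning the skipped paths alongside the rows makes it easy to report how many dbs were left out of a given run.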

A good follow-up question would be: what is the impact of missing a few db files during this process? Does that mean the STATS report at the end will be a bit inaccurate?

Also, another question from the original post:
1. What could cause container or account dbs to become corrupted or invalid?

If the account/container dbs are replicated by the replication daemon, is the replication supposed to do some checksum check across the replicas, or is that handled by rsync? I ask because I have this db on other nodes, and those copies are all accessible.

David Goetz (david-goetz) said :
#3

One db being corrupted shouldn't affect the stats. Stats are collected from all 3 copies of the db, so as long as one of them is readable, everything should be fine when you combine the results.
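As a rough illustration of why one missing copy need not change the result, here is a sketch that merges per-node rows and keeps one entry per container. This is only an assumed way of combining the results, not how the Swift stats processor is known to do it. The row layout and the function name `combine_replica_rows` are made up for the example.

```python
def combine_replica_rows(per_node_rows):
    """Merge rows collected on several nodes into one entry per container.

    per_node_rows is an iterable of lists of
    (account, container, object_count, bytes_used) tuples, one list per node.
    """
    combined = {}
    for rows in per_node_rows:
        for account, container, object_count, bytes_used in rows:
            # Keep the first copy seen; replicas are assumed to agree.
            combined.setdefault((account, container),
                                (object_count, bytes_used))
    return combined


if __name__ == '__main__':
    node1 = [('AUTH_test', 'c1', 10, 2048), ('AUTH_test', 'c2', 5, 512)]
    node2 = [('AUTH_test', 'c1', 10, 2048)]  # c2 unreadable on this node
    node3 = [('AUTH_test', 'c1', 10, 2048), ('AUTH_test', 'c2', 5, 512)]
    print(combine_replica_rows([node1, node2, node3]))
```

In this sketch, the container whose db is unreadable on one node still shows up in the combined result because another node reported it.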

We don't have a lot of dbs getting corrupted. When it does happen, it's the result of a file system error or a system crash. If you have the replicator running, it will catch corrupted dbs and quarantine them. After a db is quarantined, the db replicator takes care of rebuilding it. The best way to see the details on how that works is to look through the code.
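For readers who want a picture of what quarantining means before reading the code, here is a simplified sketch that moves a db's hash directory out of the way so it is no longer served. It is not the replicator's actual implementation. The `quarantined/<server_type>` target layout and the function name `quarantine_db` are assumptions made for this example.

```python
import os
import shutil
import uuid


def quarantine_db(db_path, device_root, server_type='containers'):
    """Move the directory holding db_path under device_root/quarantined/.

    Returns the new location of the directory.
    """
    db_dir = os.path.dirname(db_path)
    target_parent = os.path.join(device_root, 'quarantined', server_type)
    os.makedirs(target_parent, exist_ok=True)
    target = os.path.join(target_parent, os.path.basename(db_dir))
    if os.path.exists(target):
        # Avoid clobbering an earlier quarantine of the same hash.
        target = '%s-%s' % (target, uuid.uuid4().hex)
    shutil.move(db_dir, target)
    return target
```

Once the bad copy is moved aside, a healthy replica on another node can be copied back by replication, which matches the rebuild step described above.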

I've been thinking about revamping the container auditor to do a more thorough job. Here's the bug I just created to keep track of it:

https://bugs.launchpad.net/swift/+bug/807052

Thanks

Shengjie Min (shengjie-min) said :
#4

Thanks for the explanation, David. That helps. I will take a close look at the replication code.