OVHcloud Public Cloud Status

FS#20636 — VPS CLOUD 2016 - GRA CEPH
Incident Report for Public Cloud
Resolved
There has been an issue on the Ceph cluster in GRA for VPS Cloud. The Ceph team is working on it.

Update(s):

Date: 2016-10-03 16:00:29 UTC
If your VPS is still down, please reboot it.

Date: 2016-10-03 16:00:07 UTC
UP!


Date: 2016-10-03 15:59:43 UTC

2) We have a lot of read/write on the Ceph cluster right now because
5 PGs are UP. The last one is hard to bring back UP.

Date: 2016-10-03 15:59:27 UTC
2)

In Ceph, there is a command to forget one object. It worked
on 10 objects, but it does not work on the other 7 objects.
We've found out that if we run the command to forget the
object in a "while true" loop and, at the same time, run
the command to restart Ceph in a "while true" loop too,
after a while Ceph starts (see the sketch below).

5 of the 6 PGs are UP now. We still have the last object
failed and 1 PG failed.

!!!
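
A minimal sketch of the workaround described above, not the exact commands
the team ran: one loop keeps asking Ceph to forget the unfound objects of a
stuck PG while a second loop keeps restarting its OSDs, until the command is
finally accepted. It assumes the "forget" command is
"ceph pg <pgid> mark_unfound_lost revert" and that the OSDs are restarted
through the classic service script; the PG and OSD IDs are placeholders.

    # Sketch of the "while true" workaround; PG and OSD IDs are placeholders.
    import subprocess
    import threading
    import time

    PGID = "3.1a"                           # placeholder: one of the stuck PGs
    OSDS = ["osd.12", "osd.57", "osd.101"]  # placeholders: OSDs acting for that PG
    done = threading.Event()

    def forget_loop():
        # Retry the "forget the unfound objects" command until Ceph accepts it.
        while not done.is_set():
            r = subprocess.run(["ceph", "pg", PGID, "mark_unfound_lost", "revert"])
            if r.returncode == 0:
                done.set()
            time.sleep(2)

    def restart_loop():
        # In parallel, keep restarting the OSDs of the PG. The exact restart
        # command depends on the init system; this is the classic sysvinit
        # form used by Ceph 0.94-era packages.
        while not done.is_set():
            for osd in OSDS:
                subprocess.run(["service", "ceph", "restart", osd])
            time.sleep(30)

    t = threading.Thread(target=restart_loop)
    t.start()
    forget_loop()
    t.join()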

Date: 2016-10-03 15:59:10 UTC
2) We have only 1 object left that prevents Ceph from restarting.

We lose 17 objects of 4-8 MB each, so roughly 100 MB out of 120 TB, i.e. less than 0.0001% of the data.
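
As a back-of-the-envelope check of those figures (using only the numbers
quoted above):

    # 17 objects of 4-8 MB each (~6 MB average) out of 120 TB of data
    lost_bytes = 17 * 6e6        # ~100 MB
    total_bytes = 120e12         # 120 TB
    print(f"{lost_bytes / total_bytes:.7%}")   # ~0.0000850% of the data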

Date: 2016-10-03 15:58:51 UTC
Solution 2) is beginning to work.

4) We are trying to identify which VPS are down so that we can restore
their data.

Date: 2016-10-03 15:58:26 UTC
From War Room.

Good evening,
We recommend that you do not reboot your VPS.

The Ceph cluster is based on 24 servers, each with 12 disks. We have
the issue on 6 servers, not 24. That is why not all 5000 VPS are down:
some of them are down, and the others continue to work with the
remaining 18 servers. We have an issue on 6 Placement Groups (PGs)
out of the 10533 PGs in this cluster. A small part of the data has
failed, but it can impact a lot of VPS. We are trying to estimate how
many VPS are really impacted, but it depends on whether 1) the VPS is
using the 6 PGs and 2) the VPS wants to read/write data that sits in
those 6 PGs.
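
To illustrate that last point, a hedged sketch of how one could check, for a
single RADOS object, whether it lands in one of the broken PGs, using the
standard "ceph osd map" command. The pool name, object name and PG IDs below
are placeholders, not the real values from this cluster, and a real impact
check would have to walk every object of every VPS image.

    import subprocess

    BAD_PGS = {"3.1a", "3.2f"}                  # placeholders for the 6 down PGs
    pool = "vps-rbd"                            # placeholder pool name
    obj = "rbd_data.123456.0000000000000000"    # placeholder RBD data object

    # "ceph osd map" prints the PG an object maps to, e.g. "... -> pg ... (3.1a) -> ..."
    out = subprocess.run(["ceph", "osd", "map", pool, obj],
                         capture_output=True, text=True, check=True).stdout
    print(out.strip())
    if any(f"({pg})" in out for pg in BAD_PGS):
        print("object sits on an affected PG")
    else:
        print("object is not on an affected PG")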

The main issue is the version of the 17 objects. The objects are at
version 696135'83055746, while the version in Ceph's metadata is
696135'83055747, so Ceph doesn't want to start. We've tried to force
Ceph to forget the bad files, but it doesn't work: Ceph freezes.

We think that Ceph was trying to write the data on the failed disk.
Ceph did write it on the failed disk and updated the metadata, but it
hasn't done it on the other disks. This is probably a bug in the
version we run, 0.94.4. The latest version is 0.94.9.

4 action plans:

1)
We are patching the tool that can import/export the objects to force
the version of the objects to 696135'83055747. One team is working on
this.

2)
We are looking at how to restart Ceph without the 17 objects. Another
team is working on that.

3)
In case nothing works, we will start Ceph without the 6 PGs, but that
means losing some data, and we don't know which data would be lost.
That is why we are launching a local backup of the data, in case we
have to work on it and restore the lost data in the future (see the
sketch after this message). It is a "last resort" strategy. One team
is working on that.

4)
We are preparing the recovery of the data, but we are talking about
120 TB. It will be slow, and the backup is 24 hours old. One team is
working on that.

Regards,
Octave
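
A minimal sketch of the kind of local backup mentioned in plan 3), assuming
the import/export tool referred to in plan 1) is ceph-objectstore-tool (which
ships with Ceph): export the affected PGs from a stopped OSD to files before
attempting anything destructive. The OSD ID, PG IDs and paths are
placeholders; the version-forcing patch itself is not shown.

    import subprocess

    OSD_ID = 42                  # placeholder: an OSD holding a copy of the bad PGs
    PGS = ["3.1a", "3.2f"]       # placeholders: the incomplete PGs
    DATA_PATH = f"/var/lib/ceph/osd/ceph-{OSD_ID}"
    JOURNAL_PATH = f"{DATA_PATH}/journal"

    def run(cmd):
        print("+", " ".join(cmd))
        subprocess.run(cmd, check=True)

    # The OSD daemon must be stopped before ceph-objectstore-tool touches its
    # store (the stop/start command depends on the init system in use).
    run(["service", "ceph", "stop", f"osd.{OSD_ID}"])
    for pg in PGS:
        run(["ceph-objectstore-tool",
             "--data-path", DATA_PATH,
             "--journal-path", JOURNAL_PATH,
             "--op", "export", "--pgid", pg,
             "--file", f"/backup/pg-{pg}.export"])
    run(["service", "ceph", "start", f"osd.{OSD_ID}"])
    # A later "--op import" of these files would restore the exported PG data.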

Date: 2016-10-03 15:57:59 UTC
Good evening,
We have a technical issue on 1 Ceph cluster. We have about 200
hard disks in this cluster, each of 2 TB. We keep 3 copies of each
piece of data, on 3 different disks. This cluster manages 120 TB of data.

One of the disks broke and we removed it. For some reason,
Ceph stopped working: 17 objects are missing. It should not happen.

The teams have been working on the issue since 6:30am. We've tried
a lot of things to bring it back up. Some members of the team have
been up for 11 hours, which is why we've decided to stop working for
1 hour. We will resume at 7pm with a meeting between the teams in
Wroclaw (Poland), Roubaix (France) and Brest (France). The goal
is to come up with an action plan to resolve the issue.

We know the impact is important: about 5000 VPS are using this
Ceph cluster. The deal is simple: it has to keep working even if we
lose 66.66% of the hosts. Here we lost only 1 hard disk, and yet it
broke. Once the data are back UP, we will write the post-mortem and
see if we can find another technology for the block storage. Right
now, the goal is to restore the data, and we will.

We are very sorry about this long downtime. Be sure that we are
working on this issue and that we will resolve it. I can't give you
an ETA right now.

Regards,
Octave

Date: 2016-10-03 15:57:33 UTC
Summarizing: we had 1 failing HDD. After removing its OSD, 17 objects could not be found in Ceph after recovery. 7 of them are in 6 PGs which we cannot query or tell to lose the objects. We have tried to force the operations manually but have not succeeded. We are still looking for a solution to unblock those PGs.
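
For context, a hedged sketch of the sort of diagnosis this describes: list
the unhealthy PGs and try to query each one with a timeout, since a query
against a blocked PG simply hangs. The PG IDs are placeholders and this is
not the exact procedure the team used.

    import subprocess

    print(subprocess.run(["ceph", "health", "detail"],
                         capture_output=True, text=True).stdout)

    suspect_pgs = ["3.1a", "3.2f"]   # placeholders; normally parsed from health detail
    for pg in suspect_pgs:
        try:
            out = subprocess.run(["ceph", "pg", pg, "query"],
                                 capture_output=True, text=True, timeout=30).stdout
            print(f"pg {pg}: query answered ({len(out)} bytes of JSON)")
        except subprocess.TimeoutExpired:
            print(f"pg {pg}: query hangs -> blocked PG")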

Date: 2016-10-03 15:55:49 UTC
We are still trying to debug the Ceph cluster. Some hexdumps have been done, without success for the moment.
In parallel, we are trying to restore some backups.

Date: 2016-10-03 15:51:43 UTC
Our team is still working on it; the cluster is still locked and we are trying to unlock it.

Date: 2016-10-03 15:51:14 UTC
We have lost one OSD, and for an unknown reason the whole cluster has slowed down. The Ceph team is still working on it.

No ETA for the moment.

Date: 2016-10-03 15:47:10 UTC
Due to a problem with a storage node, we are experiencing degraded performance.
Posted Oct 03, 2016 - 15:46 UTC