Feature #62715


mgr/volumes: switch to storing subvolume metadata in libcephsqlite

Added by Venky Shankar 8 months ago. Updated 6 months ago.

Status: New
Priority: Normal
Category: Administration/Usability
Target version:
% Done: 0%
Source:
Tags:
Backport:
Reviewed:
Affected Versions:
Component(FS): mgr/volumes
Labels (FS):
Pull request ID:

Description

A bit of history: the subvolume thing started out as a directory structure in the file system (and that is still the case), but the initial versions did not have any notion of subvolume metadata. The next version(s), which is what we have as of this writing, added a metadata store. The (subvolume) metadata is an ini-style file that lives in the top-level subvolume directory and records various pieces of subvolume information - state, path, in-progress clones, etc.
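For context, a minimal sketch of how such an ini-style .meta file might be read with Python's configparser; the path, section, and key names below are illustrative, not necessarily the exact ones the plugin uses:

    import configparser

    # Illustrative path and keys; the real file lives at the top of the
    # subvolume directory and its exact sections/keys may differ.
    META_PATH = "/volumes/<group>/<subvolume>/.meta"

    meta = configparser.ConfigParser()
    meta.read(META_PATH)
    state = meta.get("GLOBAL", "state", fallback=None)
    path = meta.get("GLOBAL", "path", fallback=None)
    print(state, path)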

The metadata store has worked pretty well since it was introduced barring a few issues, e.g.:

- corrupted metadata file due to ENOSPC (this also prompted us to work on a tool to regenerate metadata, but we hit a roadblock with the implementation)
- security issue with a custom .meta

which prompted a variety of code changes to work around the problem. This all carries a sizeable amount of baggage in fixes/workarounds which I think can be done away with if the metadata is stored directly in RADOS. One may argue, "If it ain't broken, don't fix it", which is a perfectly valid argument; however, IMO, doing away with the workarounds is a big win here (not to mention that the metadata store would be unified for legacy and v1/v2 subvolumes). Thoughts?


Related issues: 1 (1 open, 0 closed)

Related to mgr - Feature #62884: audit: create audit module which persists in RADOS important operations performed on the cluster (New)

Actions #1

Updated by Dhairya Parmar 8 months ago

Venky Shankar wrote:

which prompted a variety of code changes to work around the problem. This all carries a sizeable amount of baggage in fixes/workarounds which I think can be done away with if the metadata is stored directly in RADOS.

Why didn't we go with storing directly in RADOS back when we implemented subvolumes? Is there any catch here?

Actions #2

Updated by Venky Shankar 8 months ago

Dhairya Parmar wrote:

Venky Shankar wrote:

which prompted a variety of code changes to work around the problem. This all carries a sizeable amount of baggage in fixes/workarounds which I think can be done away with if the metadata is stored directly in RADOS.

Why didn't we go with storing directly in RADOS back when we implemented subvolumes? Is there any catch here?

Honestly, I don't recall exactly why, but one of the factors that affected the decision was surely timelines :)

Actions #3

Updated by Patrick Donnelly 8 months ago

Dhairya Parmar wrote:

Venky Shankar wrote:

which prompted a variety of code changes to work around the problem. This all carries a sizeable amount of baggage in fixes/workarounds which I think can be done away with if the metadata is stored directly in RADOS.

Why didn't we go with storing directly in RADOS back when we implemented subvolumes? Is there any catch here?

The reason is that the legacy subvolume interface was handled by a Python library (used by OpenStack Manila). That library only had libcephfs available and acted as a regular client.

If we are going to move the metadata out of CephFS, I think it should go in cephsqlite. There is already a database available to the volumes plugin for this potential purpose.
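For illustration, a rough sketch of what opening such a database through libcephsqlite's ceph VFS might look like from plain Python. The pool/namespace/database names are hypothetical, the process needs Ceph client credentials configured in the usual way, and inside the mgr the volumes plugin would presumably reuse whatever database plumbing the mgr already provides:

    import sqlite3

    # Load the ceph VFS provided by libcephsqlite (registration is
    # process-wide), then open a database stored as RADOS objects.
    bootstrap = sqlite3.connect(":memory:")
    bootstrap.enable_load_extension(True)
    bootstrap.load_extension("libcephsqlite.so")
    bootstrap.enable_load_extension(False)

    # Hypothetical pool:namespace/db-name.
    db = sqlite3.connect("file:///cephfs_metadata:volumes/subvolumes.db?vfs=ceph", uri=True)
    db.execute("CREATE TABLE IF NOT EXISTS subvolumes (name TEXT PRIMARY KEY, state TEXT)")
    db.commit()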

Actions #4

Updated by Greg Farnum 8 months ago

Patrick Donnelly wrote:

If we are going to move the metadata out of CephFS, I think it should go in cephsqlite. There is already a database available to the volumes plugin for this potential purpose.

I've been pretty aggressively NAKing this due to the situations we've seen with Rook users losing data. If we lose the object containing the sqlite database, we can no longer map any subvolumes for Ceph-CSI, and there's no recovery option at all (unless we maintain a backup elsewhere, in which case...why bother with the database?).

Venky, is there a design for storing the metadata in omap? It's not clear to me how adding "yet another version" will clean up the issues we've hit, most of which actually stem from trying to correctly identify which version of the metadata we're working with...

Actions #5

Updated by Patrick Donnelly 8 months ago

Greg Farnum wrote:

Patrick Donnelly wrote:

If we are going to move the metadata out of CephFS, I think it should go in cephsqlite. There is already a database available to the volumes plugin for this potential purpose.

I've been pretty aggressively NAKing this due to the situations we've seen with Rook users losing data. If we lose the object containing the sqlite database, we can no longer map any subvolumes for Ceph-CSI, and there's no recovery option at all (unless we maintain a backup elsewhere, in which case...why bother with the database?).

Two items:

- Usability: sqlite is superior to manipulating the omap. It's too easy to hit scaling issues with the omap. Programmability is just as unfriendly and tricky. Recovering a corrupt (probably partially written) .meta file is not fun and complicates the volumes plugin with recovery steps. There will be similar issues with omap when the metadata spans multiple objects (which it must for clones). Sqlite allows us to neatly avoid this problem.

- Recoverability: There are two approaches to this: (a) distributing the metadata as is currently done, so that a partial loss doesn't mean complete loss. So if one subvolume is corrupt, not all are impacted. This is what we currently have and we're talking about changing it. (b) centralizing the metadata for ease of inspection/modification. Centralizing the store doesn't necessarily mean we stop duplicating the data in the subvolume tree structure. It's just not authoritative. We could recover the database in the event of a catastrophe (probably with out-of-the-mgr tools). Of course, we can also use sqlite3's existing backup facilities to put periodic backups in CephFS itself to roll back to.
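A minimal sketch of what that periodic-backup idea could look like with Python's sqlite3 online backup API; the destination path is hypothetical, and pointing it at a CephFS mount would place the copy in the file system itself:

    import sqlite3

    def backup_db(src: sqlite3.Connection, dest_path: str) -> None:
        # sqlite3's online backup API copies a consistent snapshot of the
        # live database into a separate file that can later be rolled back to.
        dest = sqlite3.connect(dest_path)
        try:
            src.backup(dest)
        finally:
            dest.close()

    # e.g. backup_db(db, "/mnt/cephfs/volumes/.backups/subvolumes-snapshot.db")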

Venky, is there a design for storing the metadata in omap? It's not clear to me how adding "yet another version" will clean up the issues we've hit, most of which actually stem from trying to correctly identify which version of the metadata we're working with...

The directory layout of the subvolumes has always complicated how we can store metadata. "Legacy" subvolumes from OpenStack Manila (via the infamous ceph_volume_plugin.py library) would have no metadata and simple directory structures (/volumes/<group>/<root of subvolume>/). This made doing snapshots harder: if you snapshot the subvolume then you can no longer "delete" it until its snapshots are gone (creating an incompatibility with CSI). This and other issues prompted the v2 rework.

It doesn't sound to me like Venky is proposing a new hierarchy. Just moving the metadata to solve a specific problem of reliability. SQLite can help solve that.

Actions #6

Updated by Venky Shankar 8 months ago

Greg Farnum wrote:

Patrick Donnelly wrote:

If we are going to move the metadata out of CephFS, I think it should go in cephsqlite. There is already a database available to the volumes plugin for this potential purpose.

I've been pretty aggressively NAKing this due to the situations we've seen with Rook users losing data. If we lose the object containing the sqlite database, we can no longer map any subvolumes for Ceph-CSI, and there's no recovery option at all (unless we maintain a backup elsewhere, in which case...why bother with the database?).

Venky, is there a design for storing the metadata in omap? It's not clear to me how adding "yet another version" will clean up the issues we've hit, most of which actually stem from trying to correctly identify which version of the metadata we're working with...

The feature is more about being able to do away with a bunch of stuff that works around cases such as a half-baked .meta due to ENOSPC, and another issue which involved a failed metadata update when removing subvolumes, since that involved changing state (and persisting the state in .meta).

Actions #7

Updated by Venky Shankar 8 months ago

Patrick Donnelly wrote:

Greg Farnum wrote:

Patrick Donnelly wrote:

If we are going to move the metadata out of CephFS, I think it should go in cephsqlite. There is already a database available to the volumes plugin for this potential purpose.

I've been pretty aggressively NAKing this due to the situations we've seen with Rook users losing data. If we lose the object containing the sqlite database, we can no longer map any subvolumes for Ceph-CSI, and there's no recovery option at all (unless we maintain a backup elsewhere, in which case...why bother with the database?).

Two items:

- Usability: sqlite is superior to manipulating the omap. It's too easy to hit scaling issues with the omap.

To avoid scalability issues with omap, having one object per subvolume instead of stashing lots of omap entries in one object would be preferred.
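A rough sketch of that layout using the librados Python bindings, assuming one metadata object per subvolume; the pool, object, and key names are hypothetical, and the exact binding calls should be double-checked against the rados module documentation:

    import rados

    # One RADOS object per subvolume; its omap holds that subvolume's metadata.
    cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
    cluster.connect()
    ioctx = cluster.open_ioctx("cephfs_metadata")

    with rados.WriteOpCtx() as op:
        # omap values are stored as blobs; keys/values here are illustrative.
        ioctx.set_omap(op, ("state", "path"), (b"complete", b"/volumes/g0/sv0"))
        ioctx.operate_write_op(op, "subvolume_meta.sv0")

    ioctx.close()
    cluster.shutdown()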

Programmability is just as unfriendly and tricky.

Perhaps my previous experience of working with the RADOS OMAP APIs is contributing to my bias towards using omap. I'm not sure the RADOS OMAP APIs are all that unfriendly, so I'm curious about what trickery is in there (not the implementation, of course).

Recovering a corrupt (probably partially written) .meta file is not fun and complicates the volumes plugin with recovery steps. There will be similar issues with omap when the metadata spans multiple objects (which it must for clones). Sqlite allows us to neatly avoid this problem.

Fair enough. However, in the current implementation, one of the places where a bit more attention is needed during metadata updates is when a subvolume clone finishes and the plugin needs to update metadata on the source subvolume (removing some meta) followed by metadata on the clone (changing state). The order in which we update is important (update the clone state and then the source subvolume), and I do agree that libcephsqlite can be useful in this case.
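For instance, with libcephsqlite both updates could sit in a single transaction, so the ordering/partial-update concern largely goes away; the table and column names below are hypothetical:

    import sqlite3

    def finish_clone(db: sqlite3.Connection, clone: str, source: str) -> None:
        # Both statements commit together or not at all, so there is no window
        # where the clone is marked complete but the source still tracks it.
        with db:
            db.execute("UPDATE subvolumes SET state = 'complete' WHERE name = ?", (clone,))
            db.execute("DELETE FROM pending_clones WHERE source = ? AND clone = ?", (source, clone))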

The partial metadata that I mention in the tracker description is the half-baked metadata file for a subvolume. This is worked around by writing to a temp file and then renaming it over the real (existing) one.
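That workaround is essentially the classic write-then-rename pattern; a local-filesystem analogue is sketched below (the real code goes through libcephfs, and the paths here are illustrative):

    import os
    import tempfile

    def write_meta_atomically(subvol_dir: str, contents: str) -> None:
        # Write the new metadata to a temp file in the same directory, then
        # atomically replace .meta so readers never see a partial file.
        fd, tmp = tempfile.mkstemp(dir=subvol_dir, prefix=".meta.")
        try:
            with os.fdopen(fd, "w") as f:
                f.write(contents)
                f.flush()
                os.fsync(f.fileno())
            os.replace(tmp, os.path.join(subvol_dir, ".meta"))
        except Exception:
            os.unlink(tmp)
            raise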

- Recoverability: There are two approaches to this: (a) distributing the metadata as is currently done, so that a partial loss doesn't mean complete loss. So if one subvolume is corrupt, not all are impacted. This is what we currently have and we're talking about changing it. (b) centralizing the metadata for ease of inspection/modification. Centralizing the store doesn't necessarily mean we stop duplicating the data in the subvolume tree structure. It's just not authoritative. We could recover the database in the event of a catastrophe (probably with out-of-the-mgr tools). Of course, we can also use sqlite3's existing backup facilities to put periodic backups in CephFS itself to roll back to.

Venky, is there a design for storing the metadata in omap? It's not clear to me how adding "yet another version" will clean up the issues we've hit, most of which actually stem from trying to correctly identify which version of the metadata we're working with...

The directory layout of the subvolumes has always complicated how we can store metadata. "Legacy" subvolumes from OpenStack Manila (via the infamous ceph_volume_plugin.py library) would have no metadata and simple directory structures (/volumes/<group>/<root of subvolume>/). This made doing snapshots harder: if you snapshot the subvolume then you can no longer "delete" it until its snapshots are gone (creating an incompatibility with CSI). This and other issues prompted the v2 rework.

It doesn't sound to me like Venky is proposing a new hierarchy. Just moving the metadata to solve a specific problem of reliability. SQLite can help solve that.

That's correct. Another thing that I think this would bring us is avoiding hand-editing of the metadata, which has been a source of abuse lately.

Actions #8

Updated by Patrick Donnelly 8 months ago

Venky Shankar wrote:

Patrick Donnelly wrote:

Greg Farnum wrote:

Patrick Donnelly wrote:

If we are going to move the metadata out of CephFS, I think it should go in cephsqlite. There is already a database available to the volumes plugin for this potential purpose.

I've been pretty aggressively NAKing this due to the situations we've seen with Rook users losing data. If we lose the object containing the sqlite database, we can no longer map any subvolumes for Ceph-CSI, and there's no recovery option at all (unless we maintain a backup elsewhere, in which case...why bother with the database?).

Two items:

- Usability: sqlite is superior to manipulating the omap. It's too easy to hit scaling issues with the omap.

To avoid scalability issues with omap, having one object per subvolume instead of stashing lots of omap entries in one object would be preferred.

Right, but then come the consistency issues.

Programmability is just as unfriendly and tricky.

Perhaps my previous experience of working with the RADOS OMAP APIs is contributing to my bias towards using omap. I'm not sure the RADOS OMAP APIs are all that unfriendly, so I'm curious about what trickery is in there (not the implementation, of course).

omap may be straightforward when you have a static key-value format to persist, but the moment it becomes hierarchical or tabular (and it will!), you are wading into painful territory. That is one of my main concerns. It's also expensive to answer questions concerning multiple subvolumes, as you must do omap queries spanning all subvolumes, e.g.: list subvolumes by quota; list all cloned subvolumes; etc.
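For example, those kinds of questions become single queries against a central table; the schema below is hypothetical:

    import sqlite3

    def report(db: sqlite3.Connection):
        # Largest subvolumes by quota, and all subvolumes cloned from another.
        by_quota = db.execute(
            "SELECT name, quota_bytes FROM subvolumes ORDER BY quota_bytes DESC LIMIT 10"
        ).fetchall()
        cloned = db.execute(
            "SELECT name, clone_source FROM subvolumes WHERE clone_source IS NOT NULL"
        ).fetchall()
        return by_quota, cloned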

Actions #9

Updated by Venky Shankar 7 months ago

  • Assignee set to Neeraj Pratap Singh

Patrick Donnelly wrote:

Venky Shankar wrote:

Patrick Donnelly wrote:

Greg Farnum wrote:

Patrick Donnelly wrote:

If we are going to move the metadata out of CephFS, I think it should go in cephsqlite. There is already a database available to the volumes plugin for this potential purpose.

I've been pretty aggressively NAKing this due to the situations we've seen with Rook users losing data. If we lose the object containing the sqlite database, we can no longer map any subvolumes for Ceph-CSI, and there's no recovery option at all (unless we maintain a backup elsewhere, in which case...why bother with the database?).

Two items:

- Usability: sqlite is superior to manipulating the omap. It's too easy to hit scaling issues with the omap.

To avoid scalability issues with omap, having one object per subvolume instead of stashing lots of omap entries in one object would be preferred.

Right, but then come the consistency issues.

Programmability is just as unfriendly and tricky.

Perhaps my previous experience of working with the RADOS OMAP APIs is contributing to my bias towards using omap. I'm not sure the RADOS OMAP APIs are all that unfriendly, so I'm curious about what trickery is in there (not the implementation, of course).

omap may be straightforward when you have a static key-value format to persist, but the moment it becomes hierarchical or tabular (and it will!), you are wading into painful territory. That is one of my main concerns. It's also expensive to answer questions concerning multiple subvolumes, as you must do omap queries spanning all subvolumes, e.g.: list subvolumes by quota; list all cloned subvolumes; etc.

Fair enough.

So, let's get this going.

Actions #10

Updated by Patrick Donnelly 7 months ago

  • Related to Feature #62884: audit: create audit module which persists in RADOS important operations performed on the cluster added
Actions #11

Updated by Venky Shankar 6 months ago

  • Subject changed from mgr/volumes: switch to storing subvolume metadata in omap to mgr/volumes: switch to storing subvolume metadata in libcephsqlite
