The public part of the GPG key used to sign the packages is available at RPM-GPG-KEY-Ceph-Community <https://linuxsoft.cern.ch/repos/RPM-GPG-KEY-Ceph-Community>
Also note that Ceph now builds against OpenSSL 3.5.0, which may affect EL package users who are on distros that still reference an older version.
This release fixes a critical BlueStore regression in version 17.2.8 (#63122). Users running that release are advised to upgrade at their earliest convenience.
This is the seventh backport (hotfix) release in the Reef series. We recommend that all users update to this release.
This release fixes a critical BlueStore regression in versions 18.2.5 and 18.2.6 (https://github.com/ceph/ceph/pull/61653). Users running either of those releases are advised to upgrade at their earliest convenience.
This release also includes several other important BlueStore fixes:
[reef] os/bluestore: fix _extend_log seq advance (pr#61653, Pere Diaz Bou)
blk/kerneldevice: notify_all only required when discard_drain wait for condition (pr#62152, Yite Gu)
os/bluestore: Fix ExtentDecoderPartial::_consume_new_blob (pr#62054, Adam Kupczyk)
os/bluestore: Fix race in BlueFS truncate / remove (pr#62840, Adam Kupczyk)
This article explains how to set up a test Ceph cluster that runs on a single-node Minikube cluster.
Docker has been chosen as the driver of the Minikube cluster on Mac M1 due to its reliability and simplicity. By choosing Docker, we avoid the complexities of virtualization, the difficulties of firewall configuration (bootpd), and the cost of x86 emulation.
Docker runs ARM-native containers directly. This improves performance and compatibility and lowers cost, which is important in resource-intensive systems such as Rook and Ceph.
```bash
brew install docker
brew install colima
colima start
```
```bash
brew install minikube
minikube start --disk-size=20g --driver docker
```
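Before continuing, it can help to sanity-check that the single-node cluster came up. A minimal sketch using standard Minikube commands (the aarch64 check is only an expectation on Apple Silicon hosts):

```bash
# Confirm the Minikube node and its core components are running
minikube status

# Optionally confirm the node is ARM-native (typically prints aarch64 on an M1 host)
minikube ssh -- uname -m
```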
```bash
curl -LO "https://dl.k8s.io/release/v1.26.1/bin/darwin/arm64/kubectl"
chmod +x kubectl
sudo mv kubectl /usr/local/bin/
```
```bash
minikube ssh
sudo mkdir /mnt/disks

# Create an empty file of size 10GB to mount disk as ceph osd
sudo dd if=/dev/zero of=/mnt/disks/mydisk.img bs=1M count=10240

sudo apt update
sudo apt upgrade
sudo apt-get install qemu-utils

# List the nbd devices
lsblk | grep nbd

# If you are unable to see the nbd device, load the NBD (Network Block Device) kernel module.
sudo modprobe nbd max_part=8

# To bind nbd device to the file
# Note: Please check there is no necessary data in /dev/nbdx, otherwise back up that data.
sudo qemu-nbd --format raw -c /dev/nbd0 /mnt/disks/mydisk.img
```
```bash
lsblk | grep nbd0
```
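When you later want to rebuild the backing disk or tear the setup down, the NBD binding can be released again. A small sketch, assuming the same /dev/nbd0 device as above:

```bash
# Detach the backing file from the NBD device when it is no longer needed
sudo qemu-nbd --disconnect /dev/nbd0
```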
```bash
git clone https://github.com/rook/rook.git
cd rook/deploy/examples/
kubectl create -f crds.yaml -f common.yaml -f operator.yaml
kubectl get pods -n rook-ceph
```
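Optionally, instead of polling `kubectl get pods`, you can block until the operator rollout completes. A minimal sketch, assuming the default `rook-ceph-operator` deployment name from operator.yaml:

```bash
# Wait until the Rook operator deployment is fully rolled out
kubectl -n rook-ceph rollout status deploy/rook-ceph-operator
```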
Edit the storage section of cluster-test.yaml so that it references the nbd device created earlier:

```yaml
storage:
  useAllNodes: false
  useAllDevices: false
  nodes:
    - name: minikube # node name of minikube node
      devices:
        - name: /dev/nbd0 # device name being used
  allowDeviceClassUpdate: true
  allowOsdCrushWeightUpdate: false
```
```bash
kubectl create -f cluster-test.yaml
kubectl -n rook-ceph get pod
```
If the rook-ceph-mon, rook-ceph-mgr, or rook-ceph-osd pods are not created, refer to Ceph common issues for more information.
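If a pod stays in Pending or CrashLoopBackOff, a couple of generic kubectl checks usually point to the cause. A sketch, where `<pod-name>` is a placeholder for the stuck pod:

```bash
# Watch pod state while the operator creates the mon, mgr, and OSD pods
kubectl -n rook-ceph get pods -w

# Inspect the events and status of a stuck pod
kubectl -n rook-ceph describe pod <pod-name>
```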
To verify that the cluster is in a healthy state, connect to the Rook Toolbox.
```bash
kubectl create -f toolbox.yaml
kubectl -n rook-ceph rollout status deploy/rook-ceph-tools
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- bash
```
Run the `ceph status` command and ensure the following:
```
bash-5.1$ ceph -s
  cluster:
    id:     f89dd5e5-e2bb-44e8-8969-659f0fc9dc55
    health: HEALTH_OK

  services:
    mon: 1 daemons, quorum a (age 7m)
    mgr: a(active, since 5m)
    osd: 1 osds: 1 up (since 6m), 1 in (since 6m)

  data:
    pools:   1 pools, 1 pgs
    objects: 2 objects, 449 KiB
    usage:   27 MiB used, 10 GiB / 10 GiB avail
    pgs:     1 active+clean
```
If the cluster is not healthy, refer to the Ceph common issues for potential solutions.
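A few read-only commands from the toolbox help narrow down what is unhealthy; a short sketch using standard Ceph CLI commands:

```bash
# Show the specific health warnings or errors
ceph health detail

# Confirm the OSD backed by /dev/nbd0 is up and in
ceph osd tree

# Check raw and per-pool capacity
ceph df
```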
Footnote:
Thanks to Yuval Lifshitz for providing all the support and guidance to write this article.
References:
https://rook.io/docs/rook/latest/Getting-Started/quickstart/
This is the sixth backport (hotfix) release in the Reef series. We recommend that all users update to this release.
ceph-volume: A bug related to cryptsetup version handling has been fixed.
Related tracker: https://tracker.ceph.com/issues/66393
RADOS: A bug related to IPv6 support is now fixed.
Related tracker: https://tracker.ceph.com/issues/67517
The Crimson project continues to progress, with the Squid release marking the first technical preview available for Crimson. The Tentacle release introduces a host of improvements and new functionalities that enhance the robustness, performance, and usability of both Crimson-OSD and the Seastore object store. Below, we highlight some of the recent work included in the latest release, moving us closer to fully replacing the existing Classical OSD in the future. If you're new to the Crimson project, please visit the project page for more information and resources.
Over 100 pull requests have been merged since the Squid code freeze. There is a dedicated and ongoing effort to stabilize and strengthen recovery scenarios and critical paths.
For more details on recent PRs, visit the Crimson GitHub project page.
A cron job has been added to run the full Crimson-RADOS test suite twice a week. Frequent suite runs help us spot any regressions early, as changes to Crimson can be delicate. Additionally, Backfill and PGLog-based recovery test cases have been added.
See the latest test runs.
OSD scheduler: Integrate a recovery operation throttler to improve Quality of Service (QoS) when handling both client and recovery/backfill operations. This is the initial step towards fully supporting the mClock scheduler used by the Classical OSD; a short example of the corresponding Classical-OSD setting follows this list. For more details, see the pull request.
Allow for Per-object Processing: Rework the client I/O pipeline to enable concurrent processing of one request per object. Writes still need to serialize at submission time. For random reads with high concurrency, this results in increased throughput on a single OSD with Seastore. For more details, see the pull request.
PG Splitting and Merging: Splitting existing placement groups (PGs) allows the cluster to scale over time as storage requirements increase. Ongoing work to support the PG auto-scaler feature is targeted for inclusion in the T release. For more details, see the pull request.
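As referenced in the OSD scheduler item above, the Classical OSD already exposes this QoS model through the standard osd_mclock_profile option. The sketch below shows that existing knob (a Classical-OSD setting, not a Crimson-specific command), which the Crimson work is an initial step toward honoring:

```bash
# On the Classical OSD, bias the mClock scheduler toward recovery/backfill work
ceph config set osd osd_mclock_profile high_recovery_ops

# Inspect the profile currently in effect for a specific OSD
ceph config get osd.0 osd_mclock_profile
```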
From Seastar's documentation:
The simplest way to write efficient asynchronous code with Seastar is to use coroutines. Coroutines don’t share most of the pitfalls of traditional continuations, and so are the preferred way to write new code.
With the introduction of coroutine support in C++20, new code added to Crimson is strongly encouraged to be implemented with coroutines. This approach makes the code more readable and helps avoid pitfalls such as managing variable lifetimes across continuations.
For an example of a rewritten code section, see the commit.
The common components used by both the Classical OSD and Crimson might need slight adjustments to work with both architectures. For example, Crimson does not require mutexes due to its lockless shared-nothing design, so ceph::mutex is overridden to use a "dummy_mutex" under the hood. To differentiate between the two types when compiling the OSD, we've introduced the WITH_SEASTAR macro.
Crimson supports both native (Seastar-based) object stores, such as SeaStore, and non-native object stores, like BlueStore. To accommodate this, further adjustments to how our common components are used were necessary. For this reason, the WITH_ALIENSTORE macro was used in combination with the WITH_SEASTAR macro.
To prevent technical debt from accumulating and to make the code more readable for other developers, the two macros above were replaced with a single WITH_CRIMSON macro that supports all the nuances mentioned.
For more details, see the pull request.
Crimson's architecture is based on the Seastar framework. Our usage of the framework requires some Crimson-specific modifications, which is why we use our own fork of Seastar (located at ceph/seastar). To keep up with recent fixes and updates to the upstream framework, our submodule is updated regularly.
Seastar provides reactor configuration options that can affect the behavior of the OSD. We've exposed some of these options as Ceph configurables, supported via ceph.conf. For more details, see the pull request.
From the kernel documentation:
aio-nr is the running total of the number of events specified on the io_setup system call for all currently active aio contexts.
Due to Crimson's architecture, the default value may sometimes not be sufficient when deploying multiple Crimson-OSDs on the same host. Therefore, we've updated both package-based and Cephadm deployments to increase this value.
For more details, see the pull request.
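As a concrete illustration (the exact values used by the deployments are not reproduced here), the relevant counters live under /proc/sys/fs and can be inspected or raised with sysctl:

```bash
# Current number of AIO events reserved across all active contexts
cat /proc/sys/fs/aio-nr

# System-wide ceiling that io_setup() calls are checked against
cat /proc/sys/fs/aio-max-nr

# Raise the ceiling (example value only; size it for the number of Crimson OSDs per host)
sudo sysctl -w fs.aio-max-nr=1048576
```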
Lookup Optimizations: SeaStore is responsible for managing RADOS objects, including their data and metadata. As such, optimizing lookup operations is a critical aspect of performance improvements. These optimizations primarily focus on the B+ tree implementations, balancing performance and complexity. For more details, see: Encapsulate lba pointer PR or OMap and pg log optimization PR.
Data Overwrite Optimizations: It is common for write sizes to differ from the arrangement of low-level blocks in SeaStore. Various optimization strategies can be applied, and it remains worthwhile to explore better approaches. For more details, see the RBM inplace rewrite PR or the Control memory copy PR.
General Performance Enhancements: Performance issues can arise in various areas. The best approach is to conduct extensive testing, gain a deeper understanding of system behaviors, and prioritize addressing the most impactful problems. For example, see:
Periodic Status Reports: To monitor and understand internal operations, periodic reports can be enabled in the logs to summarize the latest status from various aspects. These reports are primarily for development and optimization purposes and are continuously evolving. For more details, see some of the recent stats enhancements: Reactor Utilization, Disk-IO, Transaction, or LRU Cache.
Enhanced Checksum Support: Integrity checks are crucial due to the unpredictable reliability of disks. Recent enhancements include Read Checksum support.
Random Block Manager: SeaStore aims to support various storage mediums, particularly those capable of random writes without significantly sacrificing bandwidth. This allows SeaStore to avoid sequential writes, making certain tasks more efficient. For more details, see RBM's inplace-rewrite or the Checksum offload.
Code Cleanup and Bug Fixes: To ensure the project's sustainability during the development of important features and optimizations, maintaining understandable code is crucial. Despite the inevitable increase in complexity, we continuously fix unexpected bugs, review logic, and take action whenever parts of the code become unclear. This sometimes leads to major refactors, as developing and reviewing based on inappropriate or overly complicated structures can be more challenging. These efforts typically account for nearly half of the total work, and sometimes even more.
SeaStore CI: Tests have been implemented to ensure confidence in promoting SeaStore as the default option.
This is the second backport release in the Squid series. We recommend all users update to this release.
This is the fifth backport release in the Reef series. We recommend that all users update to this release.
RBD: The try-netlink mapping option for rbd-nbd has become the default and is now deprecated. If the NBD netlink interface is not supported by the kernel, then the mapping is retried using the legacy ioctl interface.
RADOS: A new command, ceph osd rm-pg-upmap-primary-all, has been added that allows users to clear all pg-upmap-primary mappings in the osdmap when desired.
Related trackers:
(reintroduce) test/librados: fix LibRadosIoECPP.CrcZeroWrite (pr#61395, Samuel Just, Nitzan Mordechai)
.github: sync the list of paths for rbd label, expand tests label to qa/* (pr#57727, Ilya Dryomov)
fix formatter buffer out-of-bounds (pr#61105, liubingrun)
[reef] os/bluestore: introduce allocator state histogram (pr#61318, Igor Fedotov)
[reef] qa/multisite: stabilize multisite testing (pr#60402, Shilpa Jagannath, Casey Bodley)
[reef] qa/rgw: the rgw/verify suite runs java tests last (pr#60849, Casey Bodley)
[RGW] Fix the handling of HEAD requests that do not comply with RFC standards (pr#59122, liubingrun)
a series of optimizations for kerneldevice discard (pr#59048, Adam Kupczyk, Joshua Baergen, Gabriel BenHanokh, Matt Vandermeulen)
Add Containerfile and build.sh to build it (pr#60228, Dan Mick)
add RBD Mirror monitoring alerts (pr#56552, Arun Kumar Mohan)
AsyncMessenger.cc : improve error messages (pr#61402, Anthony D'Atri)
AsyncMessenger: Don't decrease l_msgr_active_connections if it is negative (pr#60445, Mohit Agrawal)
blk/aio: fix long batch (64+K entries) submission (pr#58675, Yingxin Cheng, Igor Fedotov, Adam Kupczyk, Robin Geuze)
blk/KernelDevice: using join() to wait thread end is more safe (pr#60615, Yite Gu)
bluestore/bluestore_types: avoid heap-buffer-overflow in another way to keep code uniformity (pr#58817, Rongqi Sun)
BlueStore: Improve fragmentation score metric (pr#59263, Adam Kupczyk)
build-with-container fixes exec bit, dnf cache dir option (pr#61913, John Mulligan)
build-with-container: fixes and enhancements (pr#62162, John Mulligan)
build: Make boost_url a list (pr#58315, Adam Emerson)
ceph-mixin: Update mixin to include alerts for the nvmeof gateway(s) (pr#56948, Adam King, Paul Cuzner)
ceph-volume: allow zapping partitions on multipath devices (pr#62178, Guillaume Abrioux)
ceph-volume: create LVs when using partitions (pr#58220, Guillaume Abrioux)
ceph-volume: do source devices zapping if they're detached (pr#58996, Igor Fedotov)
ceph-volume: fix set_dmcrypt_no_workqueue() (pr#58997, Guillaume Abrioux)
ceph-volume: Fix unbound var in disk.get_devices() (pr#59262, Zack Cerza)
ceph-volume: fix unit tests errors (pr#59956, Guillaume Abrioux)
ceph-volume: update functional testing (pr#56857, Guillaume Abrioux)
ceph-volume: use importlib from stdlib on Python 3.8 and up (pr#58005, Guillaume Abrioux, Kefu Chai)
ceph-volume: use os.makedirs for mkdir_p (pr#57472, Chen Yuanrun)
ceph.spec.in: remove command-with-macro line (pr#57357, John Mulligan)
ceph.spec.in: we need jsonnet for all distroes for make check (pr#60076, Kyr Shatskyy)
ceph_mon: Fix MonitorDBStore usage (pr#54150, Matan Breizman)
ceph_test_rados_api_misc: adjust LibRadosMiscConnectFailure.ConnectTimeout timeout (pr#58137, Lucian Petrut)
cephadm/services/ingress: configure security user in keepalived template (pr#61151, Bernard Landon)
cephadm: add idmap.conf to nfs sample file (pr#59453, Adam King)
cephadm: added check for --skip-firewalld to section on adding explicit Ports to firewalld (pr#57519, Michaela Lang)
cephadm: CephExporter doesn't bind to IPv6 in dual stack (pr#59461, Mouratidis Theofilos)
cephadm: change loki/promtail default image tags (pr#57475, Guillaume Abrioux)
cephadm: disable ms_bind_ipv4 if we will enable ms_bind_ipv6 (pr#61714, Dan van der Ster, Joshua Blanch)
cephadm: emit warning if daemon's image is not to be used (pr#61721, Matthew Vernon)
cephadm: fix cephadm shell --name <daemon-name> for stopped/failed daemon (pr#56490, Adam King)
cephadm: fix apparmor profiles with spaces in the names (pr#61712, John Mulligan)
cephadm: fix host-maintenance command always exiting with a failure (pr#59454, John Mulligan)
cephadm: have agent check for errors before json loading mgr response (pr#59455, Adam King)
cephadm: make bootstrap default to "global" section for public_network setting (pr#61918, Adam King)
cephadm: pin pyfakefs version for tox tests (pr#56762, Adam King)
cephadm: pull container images from quay.io (pr#60474, Guillaume Abrioux)
cephadm: rgw: allow specifying the ssl_certificate by filepath (pr#61922, Alexander Hussein-Kershaw)
cephadm: Support Docker Live Restore (pr#61916, Michal Nasiadka)
cephadm: turn off cgroups_split setting when bootstrapping with --no-cgroups-split (pr#61716, Adam King)
cephadm: use importlib.metadata for querying ceph_iscsi's version (pr#58323, Zac Dover)
CephContext: acquire _fork_watchers_lock in notify_post_fork() (issue#63494, pr#59266, Venky Shankar)
cephfs-journal-tool: Add preventive measures to avoid fs corruption (pr#57761, Jos Collin)
cephfs-mirror: use monotonic clock (pr#56701, Jos Collin)
cephfs-shell: excute cmd 'rmdir_helper' reported error (pr#58812, teng jie)
cephfs-shell: fixing cephfs-shell test failures (pr#60410, Neeraj Pratap Singh)
cephfs-shell: prints warning, hangs and aborts when launched (pr#58088, Rishabh Dave)
cephfs-top: fix exceptions on small/large sized windows (pr#59898, Jos Collin)
cephfs: add command "ceph fs swap" (pr#54942, Rishabh Dave)
cephfs: Fixed a bug in the readdir_cache_cb function that may have us… (pr#58805, Tod Chen)
cephfs_mirror, qa: fix mirror daemon doesn't restart when blocklisted or failed (pr#58632, Jos Collin)
cephfs_mirror, qa: fix test failure test_cephfs_mirror_cancel_mirroring_and_readd (pr#60182, Jos Collin)
cephfs_mirror: 'ceph fs snapshot mirror ls' command (pr#60178, Jos Collin)
cephfs_mirror: fix crash in update_fs_mirrors() (pr#57451, Jos Collin)
cephfs_mirror: increment sync_failures when sync_perms() and sync_snaps() fails (pr#57437, Jos Collin)
cephfs_mirror: provide metrics for last successful snapshot sync (pr#59071, Jos Collin)
client: check mds down status before getting mds_gid_t from mdsmap (pr#58492, Yite Gu, Dhairya Parmar)
client: clear resend_mds only after sending request (pr#57174, Patrick Donnelly)
client: disallow unprivileged users to escalate root privileges (pr#61379, Xiubo Li, Venky Shankar)
client: do not proceed with I/O if filehandle is invalid (pr#58397, Venky Shankar, Dhairya Parmar)
client: Fix leading / issue with mds_check_access (pr#58982, Kotresh HR, Rishabh Dave)
client: Fix opening and reading of symlinks (pr#60373, Anoop C S)
client: flush the caps release in filesystem sync (pr#59397, Xiubo Li)
client: log debug message when requesting unmount (pr#56955, Patrick Donnelly)
client: Prevent race condition when printing Inode in ll_sync_inode (pr#59620, Chengen Du)
client: set LIBMOUNT_FORCE_MOUNT2=always (pr#58529, Jakob Haufe)
cls/cas/cls_cas_internal: Initialize 'hash' value before decoding (pr#59237, Nitzan Mordechai)
cls/user: reset stats only returns marker when truncated (pr#60165, Casey Bodley)
cmake/arrow: don't treat warnings as errors (pr#57375, Casey Bodley)
cmake: use ExternalProjects to build isa-l and isa-l_crypto libraries (pr#60108, Casey Bodley)
common,osd: Use last valid OSD IOPS value if measured IOPS is unrealistic (pr#60659, Sridhar Seshasayee)
common/admin_socket: add a command to raise a signal (pr#54357, Leonid Usov)
common/dout: fix FTBFS on GCC 14 (pr#59056, Radoslaw Zarzynski)
common/Formatter: dump inf/nan as null (pr#60061, Md Mahamudur Rahaman Sajib)
common/options: Change HDD OSD shard configuration defaults for mClock (pr#59972, Sridhar Seshasayee)
common/pick_address: check if address in subnet all public address (pr#57590, Nitzan Mordechai)
common/StackStringStream: update pointer to newly allocated memory in overflow() (pr#57362, Rongqi Sun)
common/TrackedOp: do not count the ops marked as nowarn (pr#58744, Xiubo Li)
common/TrackedOp: rename and raise prio of slow op perfcounter (pr#59280, Yite Gu)
common: fix md_config_cacher_t (pr#61403, Ronen Friedman)
common: use close_range on Linux (pr#61625, edef)
container/build.sh: don't require repo creds on NO_PUSH (pr#61582, Dan Mick)
container/build.sh: fix up org vs. repo naming (pr#61581, Dan Mick)
container/build.sh: remove local container images (pr#62065, Dan Mick)
container/Containerfile: replace CEPH_VERSION label for backward compat (pr#61580, Dan Mick)
container: add label ceph=True back (pr#61612, John Mulligan)
containerized build tools [V2] (pr#61683, John Mulligan, Ernesto Puerta)
debian pkg: record python3-packaging dependency for ceph-volume (pr#59201, Kefu Chai, Thomas Lamprecht)
debian: add ceph-exporter package (pr#56541, Shinya Hayashi)
debian: add missing bcrypt to ceph-mgr .requires to fix resulting package dependencies (pr#54662, Thomas Lamprecht)
debian: recursively adjust permissions of /var/lib/ceph/crash (pr#58458, Max Carrara)
doc,mailmap: update my email / association to ibm (pr#60339, Patrick Donnelly)
doc/ceph-volume: add spillover fix procedure (pr#59541, Zac Dover)
doc/cephadm/services: Re-improve osd.rst (pr#61953, Anthony D'Atri)
doc/cephadm/upgrade: ceph-ci containers are hosted by quay.ceph.io (pr#58681, Casey Bodley)
doc/cephadm: add default monitor images (pr#57209, Zac Dover)
doc/cephadm: add malformed-JSON removal instructions (pr#59664, Zac Dover)
doc/cephadm: Clarify "Deploying a new Cluster" (pr#60810, Zac Dover)
doc/cephadm: clean "Adv. OSD Service Specs" (pr#60680, Zac Dover)
doc/cephadm: correct note (pr#61529, Zac Dover)
doc/cephadm: edit "Using Custom Images" (pr#58941, Zac Dover)
doc/cephadm: how to get exact size_spec from device (pr#59431, Zac Dover)
doc/cephadm: improve "Activate Existing OSDs" (pr#61748, Zac Dover)
doc/cephadm: improve "Activate Existing OSDs" (pr#61726, Zac Dover)
doc/cephadm: link to "host pattern" matching sect (pr#60645, Zac Dover)
doc/cephadm: Reef default images procedure (pr#57236, Zac Dover)
doc/cephadm: remove downgrade reference from upgrade docs (pr#57086, Adam King)
doc/cephadm: simplify confusing math proposition (pr#61575, Zac Dover)
doc/cephadm: Update operations.rst (pr#60638, rhkelson)
doc/cephfs: add cache pressure information (pr#59149, Zac Dover)
doc/cephfs: add doc for disabling mgr/volumes plugin (pr#60497, Rishabh Dave)
doc/cephfs: add metrics to left pane (pr#57736, Zac Dover)
doc/cephfs: disambiguate "Reporting Free Space" (pr#56872, Zac Dover)
doc/cephfs: disambiguate two sentences (pr#57704, Zac Dover)
doc/cephfs: disaster-recovery-experts cleanup (pr#61447, Zac Dover)
doc/cephfs: document purge queue and its perf counters (pr#61194, Dhairya Parmar)
doc/cephfs: edit "Cloning Snapshots" in fs-volumes.rst (pr#57666, Zac Dover)
doc/cephfs: edit "Disabling Volumes Plugin" (pr#60468, Rishabh Dave)
doc/cephfs: edit "Dynamic Subtree Partitioning" (pr#58910, Zac Dover)
doc/cephfs: edit "is mount helper present" (pr#58579, Zac Dover)
doc/cephfs: edit "Layout Fields" text (pr#59022, Zac Dover)
doc/cephfs: edit "Pinning Subvolumes..." (pr#57663, Zac Dover)
doc/cephfs: edit 2nd 3rd of mount-using-kernel-driver (pr#61059, Zac Dover)
doc/cephfs: edit 3rd 3rd of mount-using-kernel-driver (pr#61081, Zac Dover)
doc/cephfs: edit disaster-recovery-experts (pr#61424, Zac Dover)
doc/cephfs: edit disaster-recovery-experts (2 of x) (pr#61444, Zac Dover)
doc/cephfs: edit disaster-recovery-experts (3 of x) (pr#61454, Zac Dover)
doc/cephfs: edit disaster-recovery-experts (4 of x) (pr#61480, Zac Dover)
doc/cephfs: edit disaster-recovery-experts (5 of x) (pr#61500, Zac Dover)
doc/cephfs: edit disaster-recovery-experts (6 of x) (pr#61522, Zac Dover)
doc/cephfs: edit first 3rd of mount-using-kernel-driver (pr#61042, Zac Dover)
doc/cephfs: edit front matter in client-auth.rst (pr#57122, Zac Dover)
doc/cephfs: edit front matter in mantle.rst (pr#57792, Zac Dover)
doc/cephfs: edit fs-volumes.rst (1 of x) (pr#57418, Zac Dover)
doc/cephfs: edit fs-volumes.rst (1 of x) followup (pr#57427, Zac Dover)
doc/cephfs: edit fs-volumes.rst (2 of x) (pr#57543, Zac Dover)
doc/cephfs: edit grammar in snapshots.rst (pr#61460, Zac Dover)
doc/cephfs: edit vstart warning text (pr#57815, Zac Dover)
doc/cephfs: fix "file layouts" link (pr#58876, Zac Dover)
doc/cephfs: fix "OSD capabilities" link (pr#58893, Zac Dover)
doc/cephfs: fix typo (pr#58469, spdfnet)
doc/cephfs: improve "layout fields" text (pr#59251, Zac Dover)
doc/cephfs: improve cache-configuration.rst (pr#59215, Zac Dover)
doc/cephfs: improve ceph-fuse command (pr#56968, Zac Dover)
doc/cephfs: rearrange subvolume group information (pr#60436, Indira Sawant)
doc/cephfs: refine client-auth (1 of 3) (pr#56780, Zac Dover)
doc/cephfs: refine client-auth (2 of 3) (pr#56842, Zac Dover)
doc/cephfs: refine client-auth (3 of 3) (pr#56851, Zac Dover)
doc/cephfs: s/mountpoint/mount point/ (pr#59295, Zac Dover)
doc/cephfs: s/mountpoint/mount point/ (pr#59287, Zac Dover)
doc/cephfs: s/subvolumegroups/subvolume groups (pr#57743, Zac Dover)
doc/cephfs: separate commands into sections (pr#57669, Zac Dover)
doc/cephfs: streamline a paragraph (pr#58775, Zac Dover)
doc/cephfs: take Anthony's suggestion (pr#58360, Zac Dover)
doc/cephfs: update cephfs-shell link (pr#58371, Zac Dover)
doc/cephfs: use 'p' flag to set layouts or quotas (pr#60483, TruongSinh Tran-Nguyen)
doc/dev/developer_guide/essentials: update mailing lists (pr#62376, Laimis Juzeliunas)
doc/dev/peering: Change acting set num (pr#59063, qn2060)
doc/dev/release-process.rst: New container build/release process (pr#60972, Dan Mick)
doc/dev/release-process.rst: note new 'project' arguments (pr#57644, Dan Mick)
doc/dev: add "activate latest release" RTD step (pr#59655, Zac Dover)
doc/dev: add formatting to basic workflow (pr#58738, Zac Dover)
doc/dev: add note about intro of perf counters (pr#57758, Zac Dover)
doc/dev: add target links to perf_counters.rst (pr#57734, Zac Dover)
doc/dev: edit "Principles for format change" (pr#58576, Zac Dover)
doc/dev: Fix typos in encoding.rst (pr#58305, N Balachandran)
doc/dev: improve basic-workflow.rst (pr#58938, Zac Dover)
doc/dev: instruct devs to backport (pr#61064, Zac Dover)
doc/dev: link to ceph.io leads list (pr#58106, Zac Dover)
doc/dev: origin of Labeled Perf Counters (pr#57914, Zac Dover)
doc/dev: remove "Stable Releases and Backports" (pr#60273, Zac Dover)
doc/dev: repair broken image (pr#57008, Zac Dover)
doc/dev: s/to asses/to assess/ (pr#57423, Zac Dover)
doc/dev_guide: add needs-upgrade-testing label info (pr#58730, Zac Dover)
doc/developer_guide: update doc about installing teuthology (pr#57750, Rishabh Dave)
doc/foundation.rst: update Intel point of contact (pr#61032, Neha Ojha)
doc/glossary.rst: add "Dashboard Plugin" (pr#60897, Zac Dover)
doc/glossary.rst: add "OpenStack Swift" and "Swift" (pr#57942, Zac Dover)
doc/glossary: add "ceph-ansible" (pr#59008, Zac Dover)
doc/glossary: add "ceph-fuse" entry (pr#58944, Zac Dover)
doc/glossary: add "DC" (Data Center) to glossary (pr#60876, Zac Dover)
doc/glossary: add "flapping OSD" (pr#60865, Zac Dover)
doc/glossary: add "object storage" (pr#59425, Zac Dover)
doc/glossary: add "PLP" to glossary (pr#60504, Zac Dover)
doc/glossary: add "Prometheus" (pr#58978, Zac Dover)
doc/glossary: Add "S3" (pr#57983, Zac Dover)
doc/governance: add exec council responsibilites (pr#60140, Zac Dover)
doc/governance: add Zac Dover's updated email (pr#60135, Zac Dover)
doc/install: fix typos in openEuler-installation doc (pr#56413, Rongqi Sun)
doc/install: Keep the name field of the created user consistent with … (pr#59757, hejindong)
doc/man/8/radosgw-admin: add get lifecycle command (pr#57160, rkhudov)
doc/man: add missing long option switches (pr#57707, Patrick Donnelly)
doc/man: edit ceph-bluestore-tool.rst (pr#59683, Zac Dover)
doc/man: supplant "wsync" with "nowsync" as the default (pr#60200, Zac Dover)
doc/mds: improve wording (pr#59586, Piotr Parczewski)
doc/mgr/dashboard: fix TLS typo (pr#59032, Mindy Preston)
doc/mgr: Add root CA cert instructions to rgw.rst (pr#61885, Anuradha Gadge, Zac Dover)
doc/mgr: edit \\"Overview\\" in dashboard.rst (pr#57336, Zac Dover)
doc/mgr: edit "Resolve IP address to hostname before redirect" (pr#57296, Zac Dover)
doc/mgr: explain error message - dashboard.rst (pr#57109, Zac Dover)
doc/mgr: remove Zabbix 1 information (pr#56798, Zac Dover)
doc/monitoring: Improve index.rst (pr#62266, Anthony D'Atri)
doc/rados/operations: Clarify stretch mode vs device class (pr#62078, Anthony D'Atri)
doc/rados/operations: improve crush-map-edits.rst (pr#62318, Anthony D'Atri)
doc/rados/operations: Improve health-checks.rst (pr#59583, Anthony D'Atri)
doc/rados/operations: Improve pools.rst (pr#61729, Anthony D'Atri)
doc/rados/operations: remove vanity cluster name reference from crush… (pr#58948, Anthony D'Atri)
doc/rados/operations: rephrase OSDs peering (pr#57157, Piotr Parczewski)
doc/rados/troubleshooting: Improve log-and-debug.rst (pr#60825, Anthony D'Atri)
doc/rados/troubleshooting: Improve troubleshooting-pg.rst (pr#62321, Anthony D'Atri)
doc/rados: add "pgs not deep scrubbed in time" info (pr#59734, Zac Dover)
doc/rados: add blaum_roth coding guidance (pr#60538, Zac Dover)
doc/rados: add bucket rename command (pr#57027, Zac Dover)
doc/rados: add confval directives to health-checks (pr#59872, Zac Dover)
doc/rados: add link to messenger v2 info in mon-lookup-dns.rst (pr#59795, Zac Dover)
doc/rados: add options to network config ref (pr#57916, Zac Dover)
doc/rados: add osd_deep_scrub_interval setting operation (pr#59803, Zac Dover)
doc/rados: add pg-states and pg-concepts to tree (pr#58050, Zac Dover)
doc/rados: add stop monitor command (pr#57851, Zac Dover)
doc/rados: add stretch_rule workaround (pr#58182, Zac Dover)
doc/rados: correct "full ratio" note (pr#60738, Zac Dover)
doc/rados: credit Prashant for a procedure (pr#58258, Zac Dover)
doc/rados: document manually passing search domain (pr#58432, Zac Dover)
doc/rados: document unfound object cache-tiering scenario (pr#59381, Zac Dover)
doc/rados: edit "Placement Groups Never Get Clean" (pr#60047, Zac Dover)
doc/rados: edit troubleshooting-osd.rst (pr#58272, Zac Dover)
doc/rados: explain replaceable parts of command (pr#58060, Zac Dover)
doc/rados: fix outdated value for ms_bind_port_max (pr#57048, Pierre Riteau)
doc/rados: fix sentences in health-checks (2 of x) (pr#60932, Zac Dover)
doc/rados: fix sentences in health-checks (3 of x) (pr#60950, Zac Dover)
doc/rados: followup to PR#58057 (pr#58162, Zac Dover)
doc/rados: improve leader/peon monitor explanation (pr#57959, Zac Dover)
doc/rados: improve pg_num/pgp_num info (pr#62057, Zac Dover)
doc/rados: make sentences agree in health-checks.rst (pr#60921, Zac Dover)
doc/rados: pool and namespace are independent osdcap restrictions (pr#61524, Ilya Dryomov)
doc/rados: PR#57022 unfinished business (pr#57265, Zac Dover)
doc/rados: remove dual-stack docs (pr#57073, Zac Dover)
doc/rados: remove redundant pg repair commands (pr#57040, Zac Dover)
doc/rados: s/cepgsqlite/cephsqlite/ (pr#57247, Zac Dover)
doc/rados: standardize markup of "clean" (pr#60501, Zac Dover)
doc/rados: update how to install c++ header files (pr#58308, Pere Diaz Bou)
doc/radosgw/config-ref: fix lc worker thread tuning (pr#61438, Laimis Juzeliunas)
doc/radosgw/multisite: fix Configuring Secondary Zones -> Updating the Period (pr#60333, Casey Bodley)
doc/radosgw/s3: correct eTag op match tables (pr#61309, Anthony D'Atri)
doc/radosgw: disambiguate version-added remarks (pr#57141, Zac Dover)
doc/radosgw: Improve archive-sync-module.rst (pr#60853, Anthony D'Atri)
doc/radosgw: Improve archive-sync-module.rst more (pr#60868, Anthony D'Atri)
doc/radosgw: s/zonegroup/pools/ (pr#61557, Zac Dover)
doc/radosgw: update Reef S3 action list (pr#57365, Zac Dover)
doc/radosgw: update rgw_dns_name doc (pr#60886, Zac Dover)
doc/radosgw: use 'confval' directive for reshard config options (pr#57024, Casey Bodley)
doc/rbd/rbd-exclusive-locks: mention incompatibility with advisory locks (pr#58864, Ilya Dryomov)
doc/rbd: add namespace information for mirror commands (pr#60270, N Balachandran)
doc/rbd: fix typos in NVMe-oF docs (pr#58188, N Balachandran)
doc/rbd: use https links in live import examples (pr#61604, Ilya Dryomov)
doc/README.md - add ordered list (pr#59799, Zac Dover)
doc/README.md: create selectable commands (pr#59835, Zac Dover)
doc/README.md: edit "Build Prerequisites" (pr#59638, Zac Dover)
doc/README.md: improve formatting (pr#59786, Zac Dover)
doc/README.md: improve formatting (pr#59701, Zac Dover)
doc/releases: add actual_eol for quincy (pr#61360, Zac Dover)
doc/releases: Add ordering comment to releases.yml (pr#62193, Anthony D'Atri)
doc/rgw/d3n: pass cache dir volume to extra_container_args (pr#59768, Mark Kogan)
doc/rgw/notification: persistent notification queue full behavior (pr#59234, Yuval Lifshitz)
doc/rgw/notifications: specify which event types are enabled by default (pr#54500, Yuval Lifshitz)
doc/security: remove old GPG information (pr#56914, Zac Dover)
doc/security: update CVE list (pr#57018, Zac Dover)
doc/src: add inline literals (``) to variables (pr#57937, Zac Dover)
doc/src: invadvisable is not a word (pr#58190, Doug Whitfield)
doc/start/os-recommendations: remove 16.2.z support for CentOS 7 (pr#58721, gukaifeng)
doc/start: Add Beginner\'s Guide (pr#57822, Zac Dover)
doc/start: add links to Beginner\'s Guide (pr#58203, Zac Dover)
doc/start: add tested container host oses (pr#58713, Zac Dover)
doc/start: add vstart install guide (pr#60462, Zac Dover)
doc/start: Edit Beginner\'s Guide (pr#57845, Zac Dover)
doc/start: fix "are are" typo (pr#60709, Zac Dover)
doc/start: fix wording & syntax (pr#58364, Piotr Parczewski)
doc/start: Mention RGW in Intro to Ceph (pr#61927, Anthony D'Atri)
doc/start: remove "intro.rst" (pr#57949, Zac Dover)
doc/start: remove mention of Centos 8 support (pr#58390, Zac Dover)
doc/start: s/http/https/ in links (pr#57871, Zac Dover)
doc/start: s/intro.rst/index.rst/ (pr#57903, Zac Dover)
doc/start: separate package and container support tables (pr#60789, Zac Dover)
doc/start: separate package chart from container chart (pr#60699, Zac Dover)
doc/start: update mailing list links (pr#58684, Zac Dover)
doc: add snapshots in docs under Cephfs concepts (pr#61247, Neeraj Pratap Singh)
doc: Amend dev mailing list subscribe instructions (pr#58697, Paulo E. Castro)
doc: clarify availability vs integrity (pr#58131, Gregory O'Neill)
doc: clarify superuser note for ceph-fuse (pr#58615, Patrick Donnelly)
doc: Clarify that there are no tertiary OSDs (pr#61731, Anthony D'Atri)
doc: clarify use of location: in host spec (pr#57647, Matthew Vernon)
doc: Correct link to \\"Device management\\" (pr#58489, Matthew Vernon)
doc: Correct link to Prometheus docs (pr#59560, Matthew Vernon)
doc: correct typo (pr#57884, Matthew Vernon)
doc: document metrics exported by CephFS (pr#57724, Jos Collin)
doc: Document the Windows CI job (pr#60034, Lucian Petrut)
doc: Document which options are disabled by mClock (pr#60672, Niklas Hambüchen)
doc: documenting the feature that scrub clear the entries from damage… (pr#59079, Neeraj Pratap Singh)
doc: explain the consequence of enabling mirroring through monitor co… (pr#60526, Jos Collin)
doc: fix email (pr#60234, Ernesto Puerta)
doc: fix incorrect radosgw-admin subcommand (pr#62005, Toshikuni Fukaya)
doc: fix typo (pr#59992, N Balachandran)
doc: Fixes a typo in controllers section of hardware recommendations (pr#61179, Kevin Niederwanger)
doc: fixup #58689 - document SSE-C iam condition key (pr#62298, dawg)
doc: Improve doc/radosgw/placement.rst (pr#58974, Anthony D'Atri)
doc: improve tests-integration-testing-teuthology-workflow.rst (pr#61343, Vallari Agrawal)
doc: s/Whereas,/Although/ (pr#60594, Zac Dover)
doc: SubmittingPatches-backports - remove backports team (pr#60298, Zac Dover)
doc: Update "Getting Started" to link to start not install (pr#59908, Matthew Vernon)
doc: update Key Idea in cephfs-mirroring.rst (pr#60344, Jos Collin)
doc: update nfs doc for Kerberos setup of ganesha in Ceph (pr#59940, Avan Thakkar)
doc: update tests-integration-testing-teuthology-workflow.rst (pr#59549, Vallari Agrawal)
doc: Upgrade and unpin some python versions (pr#61932, David Galloway)
doc:update e-mail addresses governance (pr#60085, Tobias Fischer)
docs/rados/operations/stretch-mode: warn device class is not supported (pr#59100, Kamoltat Sirivadhna)
docs: removed centos 8 and added squid to the build matrix (pr#58902, Yuri Weinstein)
exporter: fix regex for rgw sync metrics (pr#57658, Avan Thakkar)
exporter: handle exceptions gracefully (pr#57371, Divyansh Kamboj)
fix issue with bucket notification test (pr#61881, Yuval Lifshitz)
global: Call getnam_r with a 64KiB buffer on the heap (pr#60126, Adam Emerson)
install-deps.sh, do_cmake.sh: almalinux is another el flavour (pr#58522, Dan van der Ster)
install-deps: save and restore user's XDG_CACHE_HOME (pr#56993, luo rixin)
kv/RocksDBStore: Configure compact-on-deletion for all CFs (pr#57402, Joshua Baergen)
librados: use CEPH_OSD_FLAG_FULL_FORCE for IoCtxImpl::remove (pr#59282, Chen Yuanrun)
librbd/crypto/LoadRequest: clone format for migration source image (pr#60170, Ilya Dryomov)
librbd/crypto: fix issue when live-migrating from encrypted export (pr#59151, Ilya Dryomov)
librbd/migration/HttpClient: avoid reusing ssl_stream after shut down (pr#61094, Ilya Dryomov)
librbd/migration: prune snapshot extents in RawFormat::list_snaps() (pr#59660, Ilya Dryomov)
librbd: add rbd_diff_iterate3() API to take source snapshot by ID (pr#62129, Ilya Dryomov, Vinay Bhaskar Varada)
librbd: avoid data corruption on flatten when object map is inconsistent (pr#61167, Ilya Dryomov)
librbd: clear ctx before initiating close in Image::{aio_,}close() (pr#61526, Ilya Dryomov)
librbd: create rbd_trash object during pool initialization and namespace creation (pr#57603, Ramana Raja)
librbd: diff-iterate shouldn't crash on an empty byte range (pr#58211, Ilya Dryomov)
librbd: disallow group snap rollback if memberships don't match (pr#58207, Ilya Dryomov)
librbd: don't crash on a zero-length read if buffer is NULL (pr#57570, Ilya Dryomov)
librbd: fix a crash in get_rollback_snap_id (pr#62045, Ilya Dryomov, N Balachandran)
librbd: fix a deadlock on image_lock caused by Mirror::image_disable() (pr#62127, Ilya Dryomov)
librbd: fix mirror image status summary in a namespace (pr#61831, Ilya Dryomov)
librbd: make diff-iterate in fast-diff mode aware of encryption (pr#58345, Ilya Dryomov)
librbd: make group and group snapshot IDs more random (pr#57091, Ilya Dryomov)
librbd: stop filtering async request error codes (pr#61644, Ilya Dryomov)
Links to Jenkins jobs in PR comment commands / Remove deprecated commands (pr#62037, David Galloway)
log: save/fetch thread name infra (pr#60728, Milind Changire, Patrick Donnelly)
Make mon addrs consistent with mon info (pr#60750, shenjiatong)
mds/client: return -ENODATA when xattr doesn\'t exist for removexattr (pr#58770, Xiubo Li)
mds/purgequeue: add l_pq_executed_ops counter (pr#58328, shimin)
mds: Add fragment to scrub (pr#56895, Christopher Hoffman)
mds: batch backtrace updates by pool-id when expiring a log segment (issue#63259, pr#60689, Venky Shankar)
mds: cephx path restriction incorrectly rejects snapshots of deleted directory (pr#59519, Patrick Donnelly)
mds: check relevant caps for fs include root_squash (pr#57343, Patrick Donnelly)
mds: CInode::item_caps used in two different lists (pr#56886, Dhairya Parmar)
mds: defer trim() until after the last cache_rejoin ack being received (pr#56747, Xiubo Li)
mds: do remove the cap when seqs equal or larger than last issue (pr#58295, Xiubo Li)
mds: don't add counters in warning for standby-replay MDS (pr#57834, Rishabh Dave)
mds: don't stall the asok thread for flush commands (pr#57560, Leonid Usov)
mds: fix session/client evict command (issue#68132, pr#58726, Venky Shankar, Neeraj Pratap Singh)
mds: fix the description for inotable testing only options (pr#57115, Xiubo Li)
mds: getattr just waits the xlock to be released by the previous client (pr#60692, Xiubo Li)
mds: Implement remove for ceph vxattrs (pr#58350, Christopher Hoffman)
mds: inode_t flags may not be protected by the policylock during set_vxattr (pr#57177, Patrick Donnelly)
mds: log at a lower level when stopping (pr#57227, Kotresh HR)
mds: misc fixes for MDSAuthCaps code (pr#60207, Xiubo Li)
mds: prevent scrubbing for standby-replay MDS (pr#58493, Neeraj Pratap Singh)
mds: relax divergent backtrace scrub failures for replicated ancestor inodes (issue#64730, pr#58502, Venky Shankar)
mds: set the correct WRLOCK flag always in wrlock_force() (pr#58497, Xiubo Li)
mds: set the proper extra bl for the create request (pr#58528, Xiubo Li)
mds: some request errors come from errno.h rather than fs_types.h (pr#56664, Patrick Donnelly)
mds: try to choose a new batch head in request_clientup() (pr#58842, Xiubo Li)
mds: use regular dispatch for processing beacons (pr#57683, Patrick Donnelly)
mds: use regular dispatch for processing metrics (pr#57681, Patrick Donnelly)
mgr/BaseMgrModule: Optimize CPython Call in Finish Function (pr#55110, Nitzan Mordechai)
mgr/cephadm: add "original_weight" parameter to OSD class (pr#59411, Adam King)
mgr/cephadm: add command to expose systemd units of all daemons (pr#61915, Adam King)
mgr/cephadm: Allows enabling NFS Ganesha NLM (pr#56909, Teoman ONAY)
mgr/cephadm: ceph orch host drain command to return error for invalid hostname (pr#61919, Shweta Bhosale)
mgr/cephadm: cleanup iscsi and nvmeof keyrings upon daemon removal (pr#59459, Adam King)
mgr/cephadm: create OSD daemon deploy specs through make_daemon_spec (pr#61923, Adam King)
mgr/cephadm: fix flake8 test failures (pr#58076, Nizamudeen A)
mgr/cephadm: fix typo with vrrp_interfaces in keepalive setup (pr#61904, Adam King)
mgr/cephadm: make client-keyring deploying ceph.conf optional (pr#59451, Adam King)
mgr/cephadm: make setting --cgroups=split configurable for adopted daemons (pr#59460, Gilad Sid)
mgr/cephadm: make SMB and NVMEoF upgrade last in staggered upgrade (pr#59462, Adam King)
mgr/cephadm: mgr orchestrator module raise exception if there is trailing tab in yaml file (pr#61921, Shweta Bhosale)
mgr/cephadm: set OSD cap for NVMEoF daemon to "profile rbd" (pr#57234, Adam King)
mgr/cephadm: Update multi-site configs before deploying daemons on rgw service create (pr#60350, Aashish Sharma)
mgr/cephadm: use double quotes for NFSv4 RecoveryBackend in ganesha conf (pr#61924, Adam King)
mgr/cephadm: use host address while updating rgw zone endpoints (pr#59947, Aashish Sharma)
mgr/dashboard: add a custom warning message when enabling feature (pr#61038, Nizamudeen A)
mgr/dashboard: add absolute path validation for pseudo path of nfs export (pr#57637, avanthakkar)
mgr/dashboard: add cephfs rename REST API (pr#60729, Yite Gu)
mgr/dashboard: add dueTime to rgw bucket validator (pr#58247, Nizamudeen A)
mgr/dashboard: add NFS export button for subvolume/ grp (pr#58657, Avan Thakkar)
mgr/dashboard: add prometheus federation config for mullti-cluster monitoring (pr#57255, Aashish Sharma)
mgr/dashboard: Administration > Configuration > Some of the config options are not updatable at runtime (pr#61182, Naman Munet)
mgr/dashboard: bump follow-redirects from 1.15.3 to 1.15.6 in /src/pybind/mgr/dashboard/frontend (pr#56877, dependabot[bot])
mgr/dashboard: Changes for Sign out text to Login out (pr#58989, Prachi Goel)
mgr/dashboard: Cloning subvolume not listing _nogroup if no subvolume (pr#59952, Dnyaneshwari talwekar)
mgr/dashboard: critical confirmation modal changes (pr#61980, Naman Munet)
mgr/dashboard: disable deleting bucket with objects (pr#61973, Naman Munet)
mgr/dashboard: exclude cloned-deleted RBD snaps (pr#57219, Ernesto Puerta)
mgr/dashboard: fix clone async validators with different groups (pr#58338, Nizamudeen A)
mgr/dashboard: fix dashboard not visible on disabled anonymous access (pr#56965, Nizamudeen A)
mgr/dashboard: fix doc links in rgw-multisite (pr#60155, Pedro Gonzalez Gomez)
mgr/dashboard: fix duplicate grafana panels when on mgr failover (pr#56929, Avan Thakkar)
mgr/dashboard: fix edit bucket failing in other selected gateways (pr#58245, Nizamudeen A)
mgr/dashboard: fix handling NaN values in dashboard charts (pr#59962, Aashish Sharma)
mgr/dashboard: Fix Latency chart data units in rgw overview page (pr#61237, Aashish Sharma)
mgr/dashboard: fix readonly landingpage (pr#57752, Pedro Gonzalez Gomez)
mgr/dashboard: fix setting compression type while editing rgw zone (pr#59971, Aashish Sharma)
mgr/dashboard: fix snap schedule delete retention (pr#56862, Ivo Almeida)
mgr/dashboard: fix total objects/Avg object size in RGW Overview Page (pr#61458, Aashish Sharma)
mgr/dashboard: Fix variable capitalization in embedded rbd-details panel (pr#62209, Juan Ferrer Toribio)
mgr/dashboard: Forbid snapshot name "." and any containing "/" (pr#59994, Dnyaneshwari Talwekar)
mgr/dashboard: handle infinite values for pools (pr#61097, Afreen)
mgr/dashboard: introduce server side pagination for osds (pr#60295, Nizamudeen A)
mgr/dashboard: Move features to advanced section and expand by default rbd config section (pr#56921, Afreen)
mgr/dashboard: nfs export enhancement for CEPHFS (pr#58475, Avan Thakkar)
mgr/dashboard: pin lxml to fix run-dashboard-tox-make-check failure (pr#62256, Nizamudeen A)
mgr/dashboard: remove cherrypy_backports.py (pr#60633, Nizamudeen A)
mgr/dashboard: remove minutely from retention (pr#56917, Ivo Almeida)
mgr/dashboard: remove orch required decorator from host UI router (list) (pr#59852, Naman Munet)
mgr/dashboard: service form hosts selection only show up to 10 entries (pr#59761, Naman Munet)
mgr/dashboard: snapshot schedule repeat frequency validation (pr#56880, Ivo Almeida)
mgr/dashboard: Update and correct zonegroup delete notification (pr#61236, Aashish Sharma)
mgr/dashboard: update period after migrating to multi-site (pr#59963, Aashish Sharma)
mgr/dashboard: update translations for reef (pr#60358, Nizamudeen A)
mgr/dashboard: When configuring the RGW Multisite endpoints from the UI allow FQDN(Not only IP) (pr#62354, Aashish Sharma)
mgr/dashboard: Wrong(half) uid is observed in dashboard (pr#59876, Dnyaneshwari Talwekar)
mgr/dashboard: Zone details showing incorrect data for data pool values and compression info for Storage Classes (pr#59877, Aashish Sharma)
mgr/diskprediction_local: avoid more mypy errors (pr#62369, John Mulligan)
mgr/diskprediction_local: avoid mypy error (pr#61292, John Mulligan)
mgr/k8sevents: update V1Events to CoreV1Events (pr#57994, Nizamudeen A)
mgr/Mgr.cc: clear daemon health metrics instead of removing down/out osd from daemon state (pr#58513, Cory Snyder)
mgr/nfs: Don't crash ceph-mgr if NFS clusters are unavailable (pr#58283, Anoop C S, Ponnuvel Palaniyappan)
mgr/nfs: scrape nfs monitoring endpoint (pr#61719, avanthakkar)
mgr/orchestrator: fix encrypted flag handling in orch daemon add osd (pr#61720, Yonatan Zaken)
mgr/pybind/object_format: fix json-pretty being marked invalid (pr#59458, Adam King)
mgr/rest: Trim requests array and limit size (pr#59371, Nitzan Mordechai)
mgr/rgw: Adding a retry config while calling zone_create() (pr#61717, Kritik Sachdeva)
mgr/rgw: fix error handling in rgw zone create (pr#61713, Adam King)
mgr/rgw: fix setting rgw realm token in secondary site rgw spec (pr#61715, Adam King)
mgr/snap_schedule: correctly fetch mds_max_snaps_per_dir from mds (pr#59648, Milind Changire)
mgr/snap_schedule: restore yearly spec to lowercase y (pr#57446, Milind Changire)
mgr/stats: initialize mx_last_updated in FSPerfStats (pr#57441, Jos Collin)
mgr/status: Fix 'fs status' json output (pr#60188, Kotresh HR)
mgr/vol : shortening the name of helper method (pr#60369, Neeraj Pratap Singh)
mgr/vol: handle case where clone index entry goes missing (pr#58556, Rishabh Dave)
mgr: fix subuser creation via dashboard (pr#62087, Hannes Baum)
mgr: remove out&down osd from mgr daemons (pr#54533, shimin)
Modify container/ software to support release containers and the promotion of prerelease containers (pr#60961, Dan Mick)
mon, osd, *: expose upmap-primary in OSDMap::get_features() (pr#57794, Radoslaw Zarzynski)
mon, osd: add command to remove invalid pg-upmap-primary entries (pr#62191, Laura Flores)
mon, qa: suites override ec profiles with --yes_i_really_mean_it; monitors accept that (pr#59274, Radoslaw Zarzynski, Radosław Zarzyński)
mon,cephfs: require confirmation flag to bring down unhealthy MDS (pr#57837, Rishabh Dave)
mon/ElectionLogic: tie-breaker mon ignore proposal from marked down mon (pr#58687, Kamoltat)
mon/LogMonitor: Use generic cluster log level config (pr#57495, Prashant D)
mon/MDSMonitor: fix assert crash in fs swap (pr#57373, Patrick Donnelly)
mon/MonClient: handle ms_handle_fast_authentication return (pr#59307, Patrick Donnelly)
mon/MonmapMonitor: do not propose on error in prepare_update (pr#56400, Patrick Donnelly)
mon/OSDMonitor: Add force-remove-snap mon command (pr#59404, Matan Breizman)
mon/OSDMonitor: fix rmsnap command (pr#56431, Matan Breizman)
mon/OSDMonitor: relax cap enforcement for unmanaged snapshots (pr#61602, Ilya Dryomov)
mon/scrub: log error details of store access failures (pr#61345, Yite Gu)
mon: add created_at and ceph_version_when_created meta (pr#56681, Ryotaro Banno)
mon: do not log MON_DOWN if monitor uptime is less than threshold (pr#56408, Patrick Donnelly)
mon: fix fs set down to adjust max_mds only when cluster is not down (pr#59705, chungfengz)
mon: Remove any pg_upmap_primary mapping during remove a pool (pr#59270, Mohit Agrawal)
mon: stuck peering since warning is misleading (pr#57408, shreyanshjain7174)
mon: validate also mons and osds on {rm-,}pg-upmap-primary (pr#59275, Radosław Zarzyński)
msg/async: Encode message once features are set (pr#59286, Aishwarya Mathuria)
msg/AsyncMessenger: re-evaluate the stop condition when woken up in 'wait()' (pr#53717, Leonid Usov)
msg: always generate random nonce; don't try to reuse PID (pr#53269, Radoslaw Zarzynski)
msg: insert PriorityDispatchers in sorted position (pr#61507, Casey Bodley)
node-proxy: make the daemon discover endpoints (pr#58483, Guillaume Abrioux)
nofail option in fstab not supported (pr#52985, Leonid Usov)
orch: refactor boolean handling in drive group spec (pr#61914, Guillaume Abrioux)
os/bluestore: add perfcount for bluestore/bluefs allocator (pr#59103, Yite Gu)
os/bluestore: add some slow count for bluestore (pr#59104, Yite Gu)
os/bluestore: allow use BtreeAllocator (pr#59499, tan changzhi)
os/bluestore: enable async manual compactions (pr#58741, Igor Fedotov)
os/bluestore: expand BlueFS log if available space is insufficient (pr#57241, Pere Diaz Bou)
os/bluestore: Fix BlueRocksEnv attempts to use POSIX (pr#61112, Adam Kupczyk)
os/bluestore: fix btree allocator (pr#59264, Igor Fedotov)
os/bluestore: fix crash caused by dividing by 0 (pr#57197, Jrchyang Yu)
os/bluestore: fix the problem of l_bluefs_log_compactions double recording (pr#57194, Wang Linke)
os/bluestore: fix the problem that _estimate_log_size_N calculates the log size incorrectly (pr#61892, Wang Linke)
os/bluestore: Improve documentation introduced by #57722 (pr#60894, Anthony D'Atri)
os/bluestore: Make truncate() drop unused allocations (pr#60237, Adam Kupczyk, Igor Fedotov)
os/bluestore: set rocksdb iterator bounds for Bluestore::_collection_list() (pr#57625, Cory Snyder)
os/bluestore: Warning added for slow operations and stalled read (pr#59466, Md Mahamudur Rahaman Sajib)
os/store_test: Retune tests to current code (pr#56139, Adam Kupczyk)
os: introduce ObjectStore::refresh_perf_counters() method (pr#55136, Igor Fedotov)
os: remove unused btrfs_ioctl.h and tests (pr#60612, Casey Bodley)
osd/OSDMonitor: check svc is writeable before changing pending (pr#57067, Patrick Donnelly)
osd/PeeringState: introduce osd_skip_check_past_interval_bounds (pr#60284, Matan Breizman)
osd/perf_counters: raise prio of before queue op perfcounter (pr#59105, Yite Gu)
osd/scheduler: add mclock queue length perfcounter (pr#59034, zhangjianwei2)
osd/scrub: Change scrub cost to average object size (pr#59629, Aishwarya Mathuria)
osd/scrub: decrease default deep scrub chunk size (pr#59792, Ronen Friedman)
osd/scrub: reduce osd_requested_scrub_priority default value (pr#59886, Ronen Friedman)
osd/SnapMapper: fix _lookup_purged_snap (pr#56813, Matan Breizman)
osd/TrackedOp: Fix TrackedOp event order (pr#59108, YiteGu)
osd: Add memstore to unsupported objstores for QoS (pr#59285, Aishwarya Mathuria)
osd: adding \'reef\' to pending_require_osd_release (pr#60981, Philipp Hufangl)
osd: always send returnvec-on-errors for client\'s retry (pr#59273, Radoslaw Zarzynski)
osd: avoid watcher remains after "rados watch" is interrupted (pr#58846, weixinwei)
osd: bump versions of decoders for upmap-primary (pr#58802, Radoslaw Zarzynski)
osd: CEPH_OSD_OP_FLAG_BYPASS_CLEAN_CACHE flag is passed from ECBackend (pr#57621, Md Mahamudur Rahaman Sajib)
osd: Change PG Deletion cost for mClock (pr#56475, Aishwarya Mathuria)
osd: do not assert on fast shutdown timeout (pr#55135, Igor Fedotov)
osd: ensure async recovery does not drop a pg below min_size (pr#54550, Samuel Just)
osd: fix for segmentation fault on OSD fast shutdown (pr#57615, Md Mahamudur Rahaman Sajib)
osd: full-object read CRC mismatch due to 'truncate' modifying oi.size w/o clearing 'data_digest' (pr#57588, Samuel Just, Matan Breizman, Nitzan Mordechai, jiawd)
osd: make _set_cache_sizes ratio aware of cache_kv_onode_ratio (pr#55220, Raimund Sacherer)
osd: optimize extent comparison in PrimaryLogPG (pr#61336, Dongdong Tao)
osd: Report health error if OSD public address is not within subnet (pr#55697, Prashant D)
pybind/ceph_argparse: Fix error message for ceph tell command (pr#59197, Neeraj Pratap Singh)
pybind/mgr/mirroring: Fix KeyError: 'directory_count' in daemon status (pr#57763, Jos Collin)
pybind/mgr: disable sqlite3/python autocommit (pr#57190, Patrick Donnelly)
pybind/rados: fix missed changes for PEP484 style type annotations (pr#54358, Igor Fedotov)
pybind/rbd: expose CLONE_FORMAT and FLATTEN image options (pr#57309, Ilya Dryomov)
python-common: fix valid_addr on python 3.11 (pr#61947, John Mulligan)
python-common: handle "anonymous_access: false" in to_json of Grafana spec (pr#59457, Adam King)
qa/cephadm: use reef image as default for test_cephadm workunit (pr#56714, Adam King)
qa/cephadm: wait a bit before checking rgw daemons upgraded w/ ceph versions (pr#61917, Adam King)
qa/cephfs: a bug fix and few missing backport for caps_helper.py (pr#58340, Rishabh Dave)
qa/cephfs: add mgr debugging (pr#56415, Patrick Donnelly)
qa/cephfs: add more ignorelist entries (issue#64746, pr#56022, Venky Shankar)
qa/cephfs: add probabilistic ignorelist for pg_health (pr#56666, Patrick Donnelly)
qa/cephfs: CephFSTestCase.create_client() must keyring (pr#56836, Rishabh Dave)
qa/cephfs: fix test_single_path_authorize_on_nonalphanumeric_fsname (pr#58560, Rishabh Dave)
qa/cephfs: fix TestRenameCommand and unmount the clinet before failin… (pr#59399, Xiubo Li)
qa/cephfs: ignore variant of MDS_UP_LESS_THAN_MAX (pr#58789, Patrick Donnelly)
qa/cephfs: ignore when specific OSD is reported down during upgrade (pr#60390, Rishabh Dave)
qa/cephfs: ignorelist clog of MDS_UP_LESS_THAN_MAX (pr#56403, Patrick Donnelly)
qa/cephfs: improvements for "mds fail" and "fs fail" (pr#58563, Rishabh Dave)
qa/cephfs: remove dependency on centos8/rhel8 entirely (pr#59054, Venky Shankar)
qa/cephfs: switch to ubuntu 22.04 for stock kernel testing (pr#62492, Venky Shankar)
qa/cephfs: use different config options to generate MDS_TRIM (pr#59375, Rishabh Dave)
qa/distros: reinstall nvme-cli on centos 9 nodes (pr#59463, Adam King)
qa/distros: remove centos 8 from supported distros (pr#57932, Guillaume Abrioux, Casey Bodley, Adam King, Laura Flores)
qa/fsx: use a specified sha1 to build the xfstest-dev (pr#57557, Xiubo Li)
qa/mgr/dashboard: fix test race condition (pr#59697, Nizamudeen A, Ernesto Puerta)
qa/multisite: add boto3.client to the library (pr#60850, Shilpa Jagannath)
qa/rgw/crypt: disable failing kmip testing (pr#60701, Casey Bodley)
qa/rgw/sts: keycloak task installs java manually (pr#60418, Casey Bodley)
qa/rgw: avoid 'user rm' of keystone users (pr#62104, Casey Bodley)
qa/rgw: barbican uses branch stable/2023.1 (pr#56819, Casey Bodley)
qa/rgw: bump keystone/barbican from 2023.1 to 2024.1 (pr#61022, Casey Bodley)
qa/rgw: fix s3 java tests by forcing gradle to run on Java 8 (pr#61054, J. Eric Ivancich)
qa/rgw: force Hadoop to run under Java 1.8 (pr#61121, J. Eric Ivancich)
qa/rgw: pull Apache artifacts from mirror instead of archive.apache.org (pr#61102, J. Eric Ivancich)
qa/standalone/mon/mon_cluster_log.sh: retry check for log line (pr#60780, Shraddha Agrawal, Naveen Naidu)
qa/standalone/scrub: increase status updates frequency (pr#59975, Ronen Friedman)
qa/suites/krbd: drop pre-single-major and move "layering only" coverage (pr#57464, Ilya Dryomov)
qa/suites/krbd: stress test for recovering from watch errors for -o exclusive (pr#58856, Ilya Dryomov)
qa/suites/rados/singleton: add POOL_APP_NOT_ENABLED to ignorelist (pr#57487, Laura Flores)
qa/suites/rados/thrash-old-clients: update supported releases and distro (pr#57999, Laura Flores)
qa/suites/rados/thrash/workloads: remove cache tiering workload (pr#58413, Laura Flores)
qa/suites/rados/verify/validater/valgrind: increase op thread timeout (pr#54527, Matan Breizman)
qa/suites/rados/verify/validater: increase heartbeat grace timeout (pr#58786, Sridhar Seshasayee)
qa/suites/rados: Cancel injectfull to allow cleanup (pr#59157, Brad Hubbard)
qa/suites/rbd/iscsi: enable all supported container hosts (pr#60088, Ilya Dryomov)
qa/suites/rbd: override extra_system_packages directly on install task (pr#57765, Ilya Dryomov)
qa/suites/upgrade/reef-p2p/reef-p2p-parallel: increment upgrade to 18.2.2 (pr#58411, Laura Flores)
qa/suites: add "mon down" log variations to ignorelist (pr#61711, Laura Flores)
qa/suites: drop --show-reachable=yes from fs:valgrind tests (pr#59069, Jos Collin)
qa/tasks/ceph_manager.py: Rewrite test_pool_min_size (pr#59268, Kamoltat)
qa/tasks/cephadm: enable mon_cluster_log_to_file (pr#55431, Dan van der Ster)
qa/tasks/nvme_loop: update task to work with new nvme list format (pr#61027, Adam King)
qa/tasks/qemu: Fix OS version comparison (pr#58170, Zack Cerza)
qa/tasks: Include stderr on tasks badness check (pr#61434, Christopher Hoffman, Ilya Dryomov)
qa/tasks: watchdog should terminate thrasher (pr#59193, Nitzan Mordechai)
qa/tests: added client-upgrade-reef-squid tests (pr#58447, Yuri Weinstein)
qa/upgrade: fix checks to make sure upgrade is still in progress (pr#61718, Adam King)
qa/workunits/rbd: avoid caching effects in luks-encryption.sh (pr#58853, Ilya Dryomov)
qa/workunits/rbd: wait for resize to be applied in rbd-nbd (pr#62218, Ilya Dryomov)
qa: account for rbd_trash object in krbd_data_pool.sh + related ceph{,adm} task fixes (pr#58540, Ilya Dryomov)
qa: add a YAML to ignore MGR_DOWN warning (pr#57565, Dhairya Parmar)
qa: Add multifs root_squash testcase (pr#56690, Rishabh Dave, Kotresh HR)
qa: add support/qa for cephfs-shell on CentOS 9 / RHEL9 (pr#57162, Patrick Donnelly)
qa: adjust expected io_opt in krbd_discard_granularity.t (pr#59231, Ilya Dryomov)
qa: barbican: restrict python packages with upper-constraints (pr#59326, Tobias Urdin)
qa: cleanup snapshots before subvolume delete (pr#58332, Milind Changire)
qa: disable mon_warn_on_pool_no_app in fs suite (pr#57920, Patrick Donnelly)
qa: do the set/get attribute on the remote filesystem (pr#59828, Jos Collin)
qa: enable debug logs for fs:cephadm:multivolume subsuite (issue#66029, pr#58157, Venky Shankar)
qa: enhance per-client labelled perf counters test (pr#58251, Jos Collin, Rishabh Dave)
qa: failfast mount for better performance and unblock fs volume ls (pr#59920, Milind Changire)
qa: fix error reporting string in assert_cluster_log (pr#55391, Dhairya Parmar)
qa: fix krbd_msgr_segments and krbd_rxbounce failing on 8.stream (pr#57030, Ilya Dryomov)
qa: fix log errors for cephadm tests (pr#58421, Guillaume Abrioux)
qa: fixing tests in test_cephfs_shell.TestShellOpts (pr#58111, Neeraj Pratap Singh)
qa: ignore cluster warnings generated from forward-scrub task (issue#48562, pr#57611, Venky Shankar)
qa: ignore container checkpoint/restore related selinux denials for centos9 (issue#64616, pr#56019, Venky Shankar)
qa: ignore container checkpoint/restore related selinux denials for c… (issue#67118, issue#66640, pr#58809, Venky Shankar)
qa: ignore human-friendly POOL_APP_NOT_ENABLED in clog (pr#56951, Patrick Donnelly)
qa: ignore PG health warnings in CephFS QA (pr#58172, Patrick Donnelly)
qa: ignore variation of PG_DEGRADED health warning (pr#58231, Patrick Donnelly)
qa: ignore warnings variations (pr#59618, Patrick Donnelly)
qa: increase debugging for snap_schedule (pr#57172, Patrick Donnelly)
qa: increase the http postBuffer size and disable sslVerify (pr#53628, Xiubo Li)
qa: load all dirfrags before testing altname recovery (pr#59522, Patrick Donnelly)
qa: relocate subvol creation overrides and test (pr#59923, Milind Changire)
qa: suppress __trans_list_add valgrind warning (pr#58791, Patrick Donnelly)
qa: suppress Leak_StillReachable mon leak in centos 9 jobs (pr#58692, Laura Flores)
qa: switch to use the merge fragment for fscrypt (pr#55857, Xiubo Li)
qa: test test_kill_mdstable for all mount types (pr#56953, Patrick Donnelly)
qa: unmount clients before damaging the fs (pr#57524, Patrick Donnelly)
qa: use centos9 for fs:upgrade (pr#58113, Venky Shankar, Dhairya Parmar)
qa: wait for file creation before changing mode (issue#67408, pr#59686, Venky Shankar)
rbd-mirror: clean up stale pool replayers and callouts better (pr#57306, Ilya Dryomov)
rbd-mirror: fix possible recursive lock of ImageReplayer::m_lock (pr#62043, N Balachandran)
rbd-mirror: use correct ioctx for namespace (pr#59772, N Balachandran)
rbd-nbd: use netlink interface by default (pr#62175, Ilya Dryomov, Ramana Raja)
rbd: "rbd bench" always writes the same byte (pr#59501, Ilya Dryomov)
rbd: amend "rbd {group,} rename" and "rbd mirror pool" command descriptions (pr#59601, Ilya Dryomov)
rbd: handle --{group,image}-namespace in "rbd group image {add,rm}" (pr#61171, Ilya Dryomov)
rbd: open images in read-only mode for "rbd mirror pool status --verbose" (pr#61169, Ilya Dryomov)
Revert "reef: rgw/amqp: lock erase and create connection before emplace" (pr#59016, Rongqi Sun)
Revert "rgw/auth: Fix the return code returned by AuthStrategy," (pr#61405, Casey Bodley, Pritha Srivastava)
rgw/abortmp: Race condition on AbortMultipartUpload (pr#61133, Casey Bodley, Artem Vasilev)
rgw/admin/notification: add command to dump notifications (pr#58070, Yuval Lifshitz)
rgw/amqp: lock erase and create connection before emplace (pr#59018, Rongqi Sun)
rgw/amqp: lock erase and create connection before emplace (pr#58715, Rongqi Sun)
rgw/archive: avoid duplicating objects when syncing from multiple zones (pr#59341, Shilpa Jagannath)
rgw/auth: ignoring signatures for HTTP OPTIONS calls (pr#60455, Tobias Urdin)
rgw/beast: fix crash observed in SSL stream.async_shutdown() (pr#57425, Mark Kogan)
rgw/http/client-side: disable curl path normalization (pr#59258, Oguzhan Ozmen)
rgw/http: finish_request() after logging errors (pr#59440, Casey Bodley)
rgw/iam: fix role deletion replication (pr#59126, Alex Wojno)
rgw/kafka: refactor topic creation to avoid rd_kafka_topic_name() (pr#59764, Yuval Lifshitz)
rgw/kafka: set message timeout to 5 seconds (pr#56158, Yuval Lifshitz)
rgw/lc: make lc worker thread name shorter (pr#61485, lightmelodies)
rgw/lua: add lib64 to the package search path (pr#59343, Yuval Lifshitz)
rgw/lua: add more info on package install errors (pr#59127, Yuval Lifshitz)
rgw/multisite: allow PutACL replication (pr#58546, Shilpa Jagannath)
rgw/multisite: avoid writing multipart parts to the bucket index log (pr#57127, Juan Zhu)
rgw/multisite: don't retain RGW_ATTR_OBJ_REPLICATION_TRACE attr on copy_object (pr#58764, Shilpa Jagannath)
rgw/multisite: Fix use-after-move in retry logic in logbacking (pr#61329, Adam Emerson)
rgw/multisite: metadata polling event based on unmodified mdlog_marker (pr#60793, Shilpa Jagannath)
rgw/notifications/test: fix rabbitmq and kafka issues in centos9 (pr#58312, Yuval Lifshitz)
rgw/notifications: cleanup all coroutines after sending the notification (pr#59354, Yuval Lifshitz)
rgw/rados: don't rely on IoCtx::get_last_version() for async ops (pr#60097, Casey Bodley)
rgw/rgw_rados: fix server side-copy orphans tail-objects (pr#61367, Adam Kupczyk, Gabriel BenHanokh, Daniel Gryniewicz)
rgw/s3select: s3select response handler refactor (pr#57229, Seena Fallah, Gal Salomon)
rgw/sts: changing identity to boost::none, when role policy (pr#59346, Pritha Srivastava)
rgw/sts: fix to disallow unsupported JWT algorithms (pr#62046, Pritha Srivastava)
rgw/swift: preserve dashes/underscores in swift user metadata names (pr#56615, Juan Zhu, Ali Maredia)
rgw/test/kafka: let consumer read events from the beginning (pr#61595, Yuval Lifshitz)
rgw: add versioning status during radosgw-admin bucket stats (pr#59261, J. Eric Ivancich)
rgw: append query string to redirect URL if present (pr#61160, Seena Fallah)
rgw: compatibility issues on BucketPublicAccessBlock (pr#59125, Seena Fallah)
rgw: cumulatively fix 6 AWS SigV4 request failure cases (pr#58435, Zac Dover, Casey Bodley, Ali Maredia, Matt Benjamin)
rgw: decrement qlen/qactive perf counters on error (pr#59669, Mark Kogan)
rgw: Delete stale entries in bucket indexes while deleting obj (pr#61061, Shasha Lu)
rgw: do not assert on thread name setting failures (pr#58058, Yuval Lifshitz)
rgw: fix bucket link operation (pr#61052, Yehuda Sadeh)
RGW: fix cloud-sync not being able to sync folders (pr#56554, Gabriel Adrian Samfira)
rgw: fix CompleteMultipart error handling regression (pr#57301, Casey Bodley)
rgw: fix data corruption when rados op return ETIMEDOUT (pr#61093, Shasha Lu)
rgw: Fix LC process stuck issue (pr#61531, Soumya Koduri, Tongliang Deng)
rgw: fix the Content-Length in response header of static website (pr#60741, xiangrui meng)
rgw: fix user.rgw.user-policy attr remove by modify user (pr#59134, ivan)
rgw: increase log level on abort_early (pr#59124, Seena Fallah)
rgw: invalidate and retry keystone admin token (pr#59075, Tobias Urdin)
rgw: keep the tails when copying object to itself (pr#62656, Jane Zhu)
rgw: link only radosgw with ALLOC_LIBS (pr#60733, Matt Benjamin)
rgw: load copy source bucket attrs in putobj (pr#59415, Seena Fallah)
rgw: modify string match_wildcards with fnmatch (pr#57901, zhipeng li, Adam Emerson)
rgw: optimize gc chain size calculation (pr#58168, Wei Wang)
rgw: S3 Delete Bucket Policy should return 204 on success (pr#61432, Simon Jürgensmeyer)
rgw: swift: tempurl fixes for ceph (pr#59356, Casey Bodley, Marcus Watts)
rgw: update options yaml file so LDAP uri isn't an invalid example (pr#56721, J. Eric Ivancich)
rgw: when there are a large number of multiparts, the unorder list result may miss objects (pr#60745, J. Eric Ivancich)
rgwfile: fix lock_guard decl (pr#59351, Matt Benjamin)
run-make-check: use get_processors in run-make-check script (pr#58872, John Mulligan)
src/ceph-volume/ceph_volume/devices/lvm/listing.py : lvm list filters with vg name (pr#58998, Pierre Lemay)
src/exporter: improve usage message (pr#61332, Anthony D'Atri)
src/mon/ConnectionTracker.cc: Fix dump function (pr#60004, Kamoltat)
src/pybind/mgr/pg_autoscaler/module.py: fix 'pg_autoscale_mode' output (pr#59444, Kamoltat)
suites: test should ignore osd_down warnings (pr#59146, Nitzan Mordechai)
test/cls_lock: expired lock before unlock and start check (pr#59271, Nitzan Mordechai)
test/lazy-omap-stats: Convert to boost::regex (pr#57456, Brad Hubbard)
test/librbd/fsx: switch to netlink interface for rbd-nbd (pr#61259, Ilya Dryomov)
test/librbd/test_notify.py: conditionally ignore some errors (pr#62688, Ilya Dryomov)
test/librbd: clean up unused TEST_COOKIE variable (pr#58549, Rongqi Sun)
test/rbd_mirror: clear Namespace::s_instance at the end of a test (pr#61959, Ilya Dryomov)
test/rbd_mirror: flush watch/notify callbacks in TestImageReplayer (pr#61957, Ilya Dryomov)
test/rgw/multisite: add meta checkpoint after bucket creation (pr#60977, Casey Bodley)
test/rgw/notification: use real ip address instead of localhost (pr#59304, Yuval Lifshitz)
test/rgw: address potential race condition in reshard testing (pr#58793, J. Eric Ivancich)
test/store_test: fix deferred writing test cases (pr#55778, Igor Fedotov)
test/store_test: fix DeferredWrite test when prefer_deferred_size=0 (pr#56199, Igor Fedotov)
test/store_test: get rid off assert_death (pr#55774, Igor Fedotov)
test/store_test: refactor spillover tests (pr#55200, Igor Fedotov)
test: ceph daemon command with asok path (pr#61481, Nitzan Mordechai)
test: Create ParallelPGMapper object before start threadpool (pr#58920, Mohit Agrawal)
Test: osd-recovery-space.sh extends the wait time for "recovery toofull" (pr#59043, Nitzan Mordechai)
teuthology/bluestore: Fix running of compressed tests (pr#57094, Adam Kupczyk)
tool/ceph-bluestore-tool: fix wrong keyword for 'free-fragmentation' … (pr#62124, Igor Fedotov)
tools/ceph_objectstore_tool: Support get/set/superblock (pr#55015, Matan Breizman)
tools/cephfs: recover alternate_name of dentries from journal (pr#58232, Patrick Donnelly)
tools/objectstore: check for wrong coll open_collection (pr#58734, Pere Diaz Bou)
valgrind: update suppression for SyscallParam under call_init (pr#52611, Casey Bodley)
win32_deps_build.sh: pin zlib tag (pr#61630, Lucian Petrut)
workunit/dencoder: dencoder test forward incompat fix (pr#61750, NitzanMordhai, Nitzan Mordechai)
In Part One we introduced the concepts behind Ceph’s replication strategies, emphasizing the benefits of a stretch cluster for achieving zero data loss (RPO=0). In Part Two we will focus on the practical steps for deploying a two-site stretch cluster plus a tie-breaker Monitor using cephadm.
In a stretch architecture, the network plays a crucial role in maintaining the overall health and performance of the cluster.
Ceph supports Layer 3 routed networks, enabling communication among Ceph servers and components across subnets and CIDRs at each data center / site.
Ceph standalone or stretch clusters can be configured with two distinct networks:
The single public network must be accessible across all three sites, including the tie-breaker site, since all Ceph services rely on it.
The cluster network is only needed across the two sites that house OSDs and should not be configured at the tie-breaker site.
Unstable networking between the OSD sites will cause availability and performance issues in the cluster.
The network must not only be accessible 100% of the time but also provide consistent latency (low jitter).
Frequent spikes in latency can lead to unstable clusters, affecting client performance with issues including OSD flapping, loss of Monitor quorum, and slow (blocked) requests.
A maximum 10ms RTT (network packet Round Trip Time) is tolerated between the data sites where OSDs are located.
Up to 100ms RTT is acceptable for the tie-breaker site, which can be deployed as a VM or at a cloud provider if security policies allow.
If the tie-breaker node is in the cloud or on a remote network across a WAN, it is recommended to:
Set up a VPN among the data sites and the tie-breaker site for the public network.
Enable encryption in transit using Ceph messenger v2 encryption, which secures communication among Monitors and other Ceph components.
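A quick way to sanity-check these recommendations is to measure the inter-site RTT and, if the tie-breaker is remote, to require secure (encrypted) messenger v2 connections. A minimal sketch, assuming hypothetical host names from this post's lab and that all daemons already speak msgr v2; the exact modes you choose should follow your security policy:
# Measure RTT from a DC1 node to a DC2 node and to the tie-breaker site
ping -c 10 ceph-node-03.cephlab.com
ping -c 10 ceph-node-06.cephlab.com

# Require encrypted msgr v2 connections for cluster, service, and client traffic
ceph config set global ms_cluster_mode secure
ceph config set global ms_service_mode secure
ceph config set global ms_client_mode secure

# Confirm the active setting
ceph config get mon ms_cluster_mode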
Every write operation in Ceph practices strong consistency. Written data must be persisted to all configured OSDs in the relevant placement group's acting set before success can be acknowledged to the client.
This adds, at a minimum, the network's RTT (Round Trip Time) between sites to the latency of every client write operation. Note that these replication writes (sub-ops) from the primary OSD to secondary OSDs happen in parallel.
For example, if the RTT between sites is 6 ms, every write operation will have at least 6 ms of additional latency due to replication between sites.
The inter-site bandwidth (throughput) also constrains recovery. When a node fails, roughly 67% of recovery traffic will be remote: two thirds of the data is read from OSDs at the other site, consuming the shared inter-site bandwidth alongside client I/O.
Ceph designates a primary OSD for each placement group (PG). All client writes go through this primary OSD, which may reside in a different data center than the client or RGW instance.
By default, all reads go through the primary OSD, which can increase cross-site latency.
The read_from_local_replica feature allows RGW and RBD clients to read from a replica at the same (local) site instead of always reading from the primary OSD, which has a 50% chance of being at the other site.
This minimizes cross-site latency, reduces inter-site bandwidth usage, and improves performance for read-heavy workloads.
Available since Squid for both block (RBD) and object (RGW) storage. Local reads are not yet implemented for CephFS clients.
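For RBD clients, localized reads are enabled through the librbd replica read policy plus a client CRUSH location. A sketch under the assumption that the clients in question run in DC1; adapt the location to your own CRUSH hierarchy:
# Prefer the closest replica for reads instead of always going to the primary OSD
ceph config set client rbd_read_from_replica_policy localize

# In each client's ceph.conf, describe where the client sits in the CRUSH map,
# for example on a host located in DC1:
#   [client]
#   crush_location = datacenter=DC1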
The hardware requirements and recommendations for stretch clusters are identical to those for traditional (standalone, non-stretch) deployments, with a few exceptions that will be discussed below.
Ceph in stretch mode recommends all-flash (SSD) configurations. HDD media are not recommended for any stretch Ceph cluster role. You have been warned.
Ceph in stretch mode requires replication with size=4 as the data replication policy. Erasure coding or replication with fewer copies is not supported. Plan accordingly for the raw and usable storage capacities that you must provision.
Clusters with multiple device classes are not supported. A CRUSH rule containing type replicated class hdd will not work. If any CRUSH rule specifies a device class (typically ssd but potentially nvme), all CRUSH rules must specify that device class.
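As an illustration, if all OSDs in the cluster carry the ssd device class, a stretch CRUSH rule that pins that class could look like the sketch below (the id value is arbitrary but must be unique; the full procedure for editing and injecting the CRUSH map appears later in this post):
rule stretch_rule_ssd {
    id 2
    type replicated
    step take default class ssd
    step choose firstn 0 type datacenter
    step chooseleaf firstn 2 type host
    step emit
}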
Local-only non-stretch pools are not supported. That is, neither site may provision a pool that does not extend to the other site.
Ceph services, including Monitors, OSDs, and RGWs, must be placed to eliminate single points of failure and ensure that the cluster can withstand the loss of an entire site without impacting client access to data.
Monitors: At least five Monitors are required, two per data site and one at the tie-breaker site. This strategy maintains quorum by ensuring that more than 50% of the Monitors are available even when an entire site is offline.
Managers: Configure two Managers per data site, four in total. Four Managers are recommended to provide high availability, with an active/passive pair available at the surviving site in case of a data site failure.
OSDs: Distributed equally across data sites. Custom CRUSH rules must be created when configuring stretch mode, placing two copies at each site, four total for a two-site stretch cluster.
RGWs: Four RGW instances, two per data site, are recommended at minimum to ensure high availability for object storage from the remaining site in case of a site failure.
MDS: The minimum recommended number of CephFS Metadata Server instances is four, two per data site. In the case of a site failure, we will still have two MDS services at the remaining site, one active and the other acting as a standby.
NFS: Four NFS server instances, two per data site, are recommended at minimum to ensure high availability for the shared filesystem when a site goes offline.
During the cluster bootstrap process with the cephadm deployment tool, we can utilize a service definition YAML file to handle most cluster configuration in a single step.
The stretched.yml file below provides an example template for deploying a Ceph cluster configured in stretch mode. This is just an example and must be customized to fit your specific deployment's details and needs.
service_type: host
addr: ceph-node-00.cephlab.com
hostname: ceph-node-00
labels:
  - mon
  - osd
  - rgw
  - mds
location:
  root: default
  datacenter: DC1
---
service_type: host
addr: ceph-node-01.cephlab.com
hostname: ceph-node-01
labels:
  - mon
  - mgr
  - osd
  - mds
location:
  root: default
  datacenter: DC1
---
service_type: host
addr: ceph-node-02.cephlab.com
hostname: ceph-node-02
labels:
  - osd
  - rgw
location:
  root: default
  datacenter: DC1
---
service_type: host
addr: ceph-node-03.cephlab.com
hostname: ceph-node-03
labels:
  - mon
  - osd
location:
  root: default
  datacenter: DC2
---
service_type: host
addr: ceph-node-04.cephlab.com
hostname: ceph-node-04
labels:
  - mon
  - mgr
  - osd
  - mds
location:
  root: default
  datacenter: DC2
---
service_type: host
addr: ceph-node-05.cephlab.com
hostname: ceph-node-05
labels:
  - osd
  - rgw
  - mds
location:
  root: default
  datacenter: DC2
---
service_type: host
addr: ceph-node-06.cephlab.com
hostname: ceph-node-06
labels:
  - mon
---
service_type: mon
service_name: mon
placement:
  label: mon
spec:
  crush_locations:
    ceph-node-00:
      - datacenter=DC1
    ceph-node-01:
      - datacenter=DC1
    ceph-node-03:
      - datacenter=DC2
    ceph-node-04:
      - datacenter=DC2
    ceph-node-06:
      - datacenter=DC3
---
service_type: mgr
service_name: mgr
placement:
  label: mgr
---
service_type: mds
service_id: cephfs
placement:
  label: "mds"
---
service_type: osd
service_id: all-available-devices
service_name: osd.all-available-devices
spec:
  data_devices:
    all: true
placement:
  label: "osd"
With the specification file customized for your deployment, run the cephadm bootstrap command. Note that we pass the YAML specification file with --apply-spec stretched.yml so that all services are deployed and configured in one step.
# cephadm bootstrap --registry-json login.json --dashboard-password-noupdate --mon-ip 192.168.122.12 --apply-spec stretched.yml --allow-fqdn-hostname
Once complete, verify that the cluster recognizes all hosts and their appropriate labels:
# ceph orch host ls
HOST          ADDR             LABELS                  STATUS
ceph-node-00  192.168.122.12   _admin,mon,osd,rgw,mds
ceph-node-01  192.168.122.179  mon,mgr,osd
ceph-node-02  192.168.122.94   osd,rgw,mds
ceph-node-03  192.168.122.180  mon,osd,mds
ceph-node-04  192.168.122.138  mon,mgr,osd
ceph-node-05  192.168.122.175  osd,rgw,mds
ceph-node-06  192.168.122.214  mon
Add the _admin label to at least one node in each datacenter so that you can run Ceph CLI commands. This way, even if you lose an entire datacenter, you can execute Ceph admin commands from a surviving host. It is not uncommon to assign the _admin label to all cluster nodes.
# ceph orch host label add ceph-node-03 _admin
Added label _admin to host ceph-node-03
# ceph orch host label add ceph-node-06 _admin
Added label _admin to host ceph-node-06
# ssh ceph-node-03 ls /etc/ceph
ceph.client.admin.keyring
ceph.conf
Ceph, when configured in stretch mode, requires all pools to use the replication data protection strategy with size=4. This means two copies of data at each site, ensuring availability when an entire site goes down.
Ceph uses the CRUSH map to determine where to place data replicas. The CRUSH map logically represents the physical hardware layout, organized in a hierarchy of bucket types that include datacenters, rooms, and most often racks and hosts. To configure a stretch mode CRUSH map, we define two datacenters under the default CRUSH root, then place the host buckets within the appropriate datacenter CRUSH bucket.
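If you are not using a spec file that records host locations, the same hierarchy can be built by hand with the standard CRUSH bucket commands. A sketch using this post's datacenter and host names:
# Create the datacenter buckets and attach them to the default root
ceph osd crush add-bucket DC1 datacenter
ceph osd crush add-bucket DC2 datacenter
ceph osd crush move DC1 root=default
ceph osd crush move DC2 root=default

# Move each OSD host under its datacenter (repeat for every host)
ceph osd crush move ceph-node-00 datacenter=DC1
ceph osd crush move ceph-node-03 datacenter=DC2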
The following example shows a stretch mode CRUSH map featuring two datacenters, DC1 and DC2, each with three Ceph OSD hosts. We get this topology right out of the box, thanks to the spec file we used during bootstrap, where we specify the location of each host in the CRUSH map.
# ceph osd tree
ID  CLASS  WEIGHT   TYPE NAME                STATUS  REWEIGHT  PRI-AFF
-1         0.58557  root default
-3         0.29279      datacenter DC1
-2         0.09760          host ceph-node-00
 0    hdd  0.04880              osd.0            up   1.00000  1.00000
 1    hdd  0.04880              osd.1            up   1.00000  1.00000
-4         0.09760          host ceph-node-01
 3    hdd  0.04880              osd.3            up   1.00000  1.00000
 7    hdd  0.04880              osd.7            up   1.00000  1.00000
-5         0.09760          host ceph-node-02
 2    hdd  0.04880              osd.2            up   1.00000  1.00000
 5    hdd  0.04880              osd.5            up   1.00000  1.00000
-7         0.29279      datacenter DC2
-6         0.09760          host ceph-node-03
 4    hdd  0.04880              osd.4            up   1.00000  1.00000
 6    hdd  0.04880              osd.6            up   1.00000  1.00000
-8         0.09760          host ceph-node-04
10    hdd  0.04880              osd.10           up   1.00000  1.00000
11    hdd  0.04880              osd.11           up   1.00000  1.00000
-9         0.09760          host ceph-node-05
 8    hdd  0.04880              osd.8            up   1.00000  1.00000
 9    hdd  0.04880              osd.9            up   1.00000  1.00000
Here, we have two datacenters, DC1 and DC2. A third datacenter, DC3, houses the tie-breaker monitor on ceph-node-06 but does not host OSDs.
To achieve our goal of having two copies per site, we define a stretched CRUSH rule to assign to our Ceph RADOS pools.
Install the ceph-base package to get the crushtool binary, here demonstrated on a RHEL system:
# dnf -y install ceph-base
# ceph osd getcrushmap > crush.map.bin
# crushtool -d crush.map.bin -o crush.map.txt
Edit the crush.map.txt file to add a new rule at the end of the file, taking care that the numeric rule id attribute must be unique:
rule stretch_rule {
    id 1
    type replicated
    step take default
    step choose firstn 0 type datacenter
    step chooseleaf firstn 2 type host
    step emit
}
# crushtool -c crush.map.txt -o crush2.map.bin
# ceph osd setcrushmap -i crush2.map.bin
# ceph osd crush rule ls
replicated_rule
stretch_rule
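If you need to point a specific pool at the new rule manually (for example, a pool created before stretch mode is enabled), it can be assigned explicitly. A sketch, using the rbdpool pool that appears later in this post; as we will see, entering stretch mode also adjusts existing pools to size=4 with the stretch rule on its own:
# Assign the stretch rule and a four-copy policy to an existing pool
ceph osd pool set rbdpool crush_rule stretch_rule
ceph osd pool set rbdpool size 4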
Thanks to our bootstrap spec file, the Monitors are labeled according to the data center to which they belong. This labeling ensures Ceph can maintain quorum even if one data center experiences an outage. In such cases, the tie-breaker Monitor in DC3 acts in concert with the Monitors at the surviving data site to maintain the cluster's Monitor quorum.
# ceph mon dump | grep location
0: [v2:192.168.122.12:3300/0,v1:192.168.122.12:6789/0] mon.ceph-node-00; crush_location {datacenter=DC1}
1: [v2:192.168.122.214:3300/0,v1:192.168.122.214:6789/0] mon.ceph-node-06; crush_location {datacenter=DC3}
2: [v2:192.168.122.138:3300/0,v1:192.168.122.138:6789/0] mon.ceph-node-04; crush_location {datacenter=DC2}
3: [v2:192.168.122.180:3300/0,v1:192.168.122.180:6789/0] mon.ceph-node-03; crush_location {datacenter=DC2}
4: [v2:192.168.122.179:3300/0,v1:192.168.122.179:6789/0] mon.ceph-node-01; crush_location {datacenter=DC1}
When running a stretch cluster across three sites, an asymmetric network failure may affect communication between only one pair of sites. This can result in an unresolvable Monitor election storm, in which no Monitor can be elected as the leader.
To avoid this problem, we will change our election strategy from the classic approach to a connectivity-based one. The connectivity mode assesses the connection scores each Monitor provides for its peers and elects the Monitor with the highest score. This mode is specifically designed to handle network partitioning (also known as a netsplit), which may occur when your cluster is spread across multiple data centers and all links connecting one site to another are lost.
# ceph mon dump | grep election
election_strategy: 1
# ceph mon set election_strategy connectivity
# ceph mon dump | grep election
election_strategy: 3
You can check monitor scores with a command of the following form:
# ceph daemon mon.{name} connection scores dump
To learn more about the Monitor connectivity election strategy, check out this excellent video from Greg Farnum. Further information is also available here.
To enter stretch mode, run the following command:
# ceph mon enable_stretch_mode ceph-node-06 stretch_rule datacenter
Where:
ceph-node-06 is the tiebreaker (arbiter) monitor in DC3.
stretch_rule is the CRUSH rule that enforces two copies in each data center.
datacenter is our failure domain.
Check the updated MON configuration:
# ceph mon dump
epoch 20
fsid 90441880-e868-11ef-b468-52540016bbfa
last_changed 2025-02-11T14:44:10.163933+0000
created 2025-02-11T11:08:51.178952+0000
min_mon_release 19 (squid)
election_strategy: 3
stretch_mode_enabled 1
tiebreaker_mon ceph-node-06
disallowed_leaders ceph-node-06
0: [v2:192.168.122.12:3300/0,v1:192.168.122.12:6789/0] mon.ceph-node-00; crush_location {datacenter=DC1}
1: [v2:192.168.122.214:3300/0,v1:192.168.122.214:6789/0] mon.ceph-node-06; crush_location {datacenter=DC3}
2: [v2:192.168.122.138:3300/0,v1:192.168.122.138:6789/0] mon.ceph-node-04; crush_location {datacenter=DC2}
3: [v2:192.168.122.180:3300/0,v1:192.168.122.180:6789/0] mon.ceph-node-03; crush_location {datacenter=DC2}
4: [v2:192.168.122.179:3300/0,v1:192.168.122.179:6789/0] mon.ceph-node-01; crush_location {datacenter=DC1}
Ceph specifically disallows the tie-breaker monitor from ever assuming the leader role. The tie-breaker’s sole purpose is to provide an additional vote to maintain quorum when one primary site fails, preventing a split-brain scenario. By design, it resides in a separate, often smaller environment (perhaps a cloud VM) and may have higher network latency and fewer resources. Allowing it to become the leader could undermine performance and consistency. Therefore, Ceph marks the tie-breaker monitor as a disallowed leader (note the disallowed_leaders field in the output above), ensuring that the data sites retain primary control of the cluster while benefiting from the tie-breaker quorum vote.
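Should the tie-breaker host itself ever need to be replaced, recent Ceph releases let you promote another Monitor that has been given a location outside the two data sites. A sketch with a hypothetical replacement Monitor on ceph-node-07; depending on the release, the last command may expect the mon.<name> form:
# Give the replacement Monitor a CRUSH location outside DC1 and DC2
ceph mon set_location ceph-node-07 datacenter=DC3

# Designate it as the new tie-breaker
ceph mon set_new_tiebreaker ceph-node-07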
When stretch mode is enabled, Object Storage Daemons (OSDs) will only activate Placement Groups (PGs) when they peer across data centers, provided both are available. The following constraints apply:
The number of replicas (each pool's size attribute) will increase from the default of 3 to 4, with the expectation of two copies at each site.
OSDs are permitted to connect only to monitors within the same datacenter.
New monitors cannot join the cluster unless their location is specified.
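Because of the last point above, any Monitor added after stretch mode is enabled must be told where it lives before it can join. A sketch with a hypothetical additional Monitor host ceph-node-07 placed in DC1; with cephadm, the same result is achieved by extending crush_locations in the mon service spec used at bootstrap:
# Declare the new Monitor's CRUSH location so it is allowed to join the stretch cluster
ceph mon set_location ceph-node-07 datacenter=DC1
The pool listing below also confirms the first constraint: existing pools now report size 4 with the stretch CRUSH rule.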
# ceph osd pool ls detail
pool 1 '.mgr' replicated size 4 min_size 2 crush_rule 1 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 199 lfor 199/199/199 flags hashpspool stripe_width 0 pg_num_max 32 pg_num_min 1 application mgr read_balance_score 12.12
pool 2 'rbdpool' replicated size 4 min_size 2 crush_rule 1 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 199 lfor 199/199/199 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd read_balance_score 3.38
Inspect the placement groups (PGs) for a specific pool ID and confirm which OSDs are in the acting set:
# ceph pg dump pgs_brief | grep 2.c
dumped pgs_brief
2.c     active+clean  [2,3,6,9]  2  [2,3,6,9]  2
In this example, PG 2.c has OSDs 2 and 3 from DC1, and OSDs 6 and 9 from DC2.
You can confirm the location of those OSDs with the ceph osd tree command:
# ceph osd tree | grep -Ev '(osd.1|osd.7|osd.5|osd.4|osd.0|osd.8)'
ID  CLASS  WEIGHT   TYPE NAME                STATUS  REWEIGHT  PRI-AFF
-1         0.58557  root default
-3         0.29279      datacenter DC1
-2         0.09760          host ceph-node-00
-4         0.09760          host ceph-node-01
 3    hdd  0.04880              osd.3            up   1.00000  1.00000
-5         0.09760          host ceph-node-02
 2    hdd  0.04880              osd.2            up   1.00000  1.00000
-7         0.29279      datacenter DC2
-6         0.09760          host ceph-node-03
 6    hdd  0.04880              osd.6            up   1.00000  1.00000
-8         0.09760          host ceph-node-04
-9         0.09760          host ceph-node-05
 9    hdd  0.04880              osd.9            up   1.00000  1.00000
Here each PG has two replicas in DC1 and two in DC2, which is a core concept of stretch mode.
By deploying a two-site stretch cluster with a third-site tie-breaker Monitor, you ensure that data remains highly available even during the outage of an entire data center. Leveraging a single specification file allows for automatic and consistent service placement across both sites, covering Monitors, OSDs, and other Ceph components. The connectivity election strategy also helps maintain a stable quorum by prioritizing well-connected Monitors. Combining these elements (careful CRUSH configuration, correct labeling, and an appropriate data protection strategy) results in a resilient storage architecture that handles inter-site failures without compromising data integrity or service continuity.
In the final part of our series we will test the stretch cluster under real-world failure conditions. We will explore how Ceph automatically shifts into a degraded state when a complete site goes offline, the impact on client I/O during the outage, and the recovery process once the site is restored, ensuring zero data loss.
The authors would like to thank IBM for supporting the community with our time to create these posts.
When considering replication, disaster recovery, and backup + restore, we choose from multiple strategies with varying SLAs for data and application recovery. Key factors include the Recovery Time Objective (RTO) and Recovery Point Objective (RPO). Synchronous replication provides the lowest RPO, which means zero data loss. Ceph can implement synchronous replication among sites by stretching the Ceph cluster across multiple data centers.
Asynchronous replication inherently implies a non-zero RPO. With Ceph, async multisite replication involves replicating data to another Ceph cluster. Each Ceph storage access method (object, block, and file) has its own asynchronous replication method implemented at the service level.
Asynchronous Replication: Replication occurs at the service level (RBD, CephFS, or RGW), typically across fully independent Ceph clusters.
Synchronous Replication (“Stretch Cluster”): Replication is performed at the RADOS (cluster) layer, so writes must be completed in every site before an acknowledgment is sent to clients.
Both methods have distinct advantages and disadvantages, as well as different performance profiles and recovery considerations. Before discussing Ceph stretch clusters in detail, here is an overview of these replication modes.
Asynchronous replication is driven at the service layer. Each site provisions a complete, standalone Ceph cluster and maintains independent copies of the data.
RGW Multisite: Each site deploys one or more independent RGW zones. Changes are propagated asynchronously between sites using the RGW multisite replication framework. This replication is not journal-based. Instead, it relies on log-based replication, where each RGW tracks changes through a log of operations (sync logs), and these logs are replayed at peer sites to replicate data.
RBD Mirroring: Block data is mirrored either using a journal-based approach (as with OpenStack) or a snapshot-based approach (as with ODF/OCP), depending on your requirements for performance, crash consistency, and scheduling.
CephFS Snapshot Mirroring (in active development): Uses snapshots to replicate file data at configurable intervals.
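As a brief illustration of how service-level this replication is, snapshot-based RBD mirroring is enabled per pool and per image with the rbd CLI. A minimal sketch, assuming a hypothetical pool rbdpool and image vm-disk-1, with the peer cluster already bootstrapped:
# Enable mirroring on the pool in image mode (each image opts in individually)
rbd mirror pool enable rbdpool image

# Enable snapshot-based mirroring for a single image
rbd mirror image enable rbdpool/vm-disk-1 snapshot

# Create mirror snapshots on a schedule, e.g. every 30 minutes
rbd mirror snapshot schedule add --pool rbdpool --image vm-disk-1 30m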
Asynchronous replication is well-suited for architectures with significant network latency between locations. This approach allows applications to continue operating without waiting for remote writes to complete. However, it is important to note that this strategy inherently implies a non-zero Recovery Point Objective (RPO), meaning there will be some delay before remote sites are consistent with the primary. As a result, a site failure could lead to loss of recently written data that is still in flight.
To explore Ceph's asynchronous replication, please check out our prior blog posts: Object storage Multisite Replication.
A stretch cluster is a single Ceph cluster deployed across multiple data centers or availability zones. Write operations return to clients only once persisted at all sites, or enough sites to meet each logical pool's replication schema requirement. This provides:
RPO = 0: No data loss if one site fails since every client write is synchronously replicated and will be replayed when a failed site comes back online.
Single cluster management: No special client-side replication configuration is needed: regular Ceph tools and workflows are applied.
A stretch cluster has strict networking requirements: a maximum 10ms RTT between sites. Because writes to OSDs must travel between sites before an acknowledgment is returned to the client, latency is critical. Network instability, insufficient bandwidth, and latency spikes can degrade performance and risk data integrity.
Ceph Stretch Clusters provide benefits that make them a good option for critical applications that require maximum uptime and resilience:
Fault Tolerance: a stretch cluster will handle the failure of an entire site transparently without impacting client operations. It can sustain a double site failure without data loss.
Strong Consistency: In a three-site setup, uploaded data immediately becomes visible and accessible to all AZs/sites. Strong consistency enables clients at each site to always see the latest data.
Simple setup and day two operations: One of the best features of stretch clusters is straightforward operation. They are like any standard, single-site cluster in most ways. Also, no manual intervention is required to recover from a site failure, making them easy to manage and deploy.
Stretch clusters can be complemented with multisite asynchronous replication for cross-region data replication.
It is, however, essential to consider the caveats of Ceph stretch clusters:
Networking is crucial: Inter-site networking shortcomings, including flapping, latency spikes, and insufficient bandwidth, impact performance and data integrity.
Performance: Write operation latency is increased by the RTT of the two most distant sites. When deploying across three sites, the pool data protection strategy should be configured for replication with a size value of 6, which means a write amplification of six OSD operations per client write. We must set workload expectations accordingly. For example, a high-IOPS, low-latency OLTP database workload will likely struggle if storing data in a stretch cluster.
Replica 6 (or Replica 4 for a two-site stretch) is recommended for reliability: We keep six (or four) copies of data. Erasure coding is not currently an option due to performance impact, inter-site network demands, and the nuances of ensuring simultaneous strong consistency and high availability. This in turn means that the total usable capacity available for a given amount of raw underlying storage must be carefully considered relative to a conventional single-site cluster.
Single cluster across all sites: If data is damaged by a software or user issue, including accidental deletion, on the single stretch cluster, the data seen by all sites will be affected.
A stretch cluster depends on robust networking to operate optimally. A suboptimal network configuration will impact performance and data integrity.
Equal Latency Across Sites: The sites are connected through a highly available L2 or L3 network infrastructure, where the latency among the data availability zones/sites is similar. The RTT is ideally less than 10ms. Inconsistent network latency (jitter) will degrade cluster performance.
Reliable L2/L3 network with minimal latency spikes: Provide inter-site path diversity and redundancy (full mesh or redundant transit).
Sufficient Bandwidth: The network should have adequate bandwidth to handle replication, client, and recovery traffic. Network bandwidth must scale with cluster growth: as we add nodes, we must also increase inter-site network throughput to maintain performance.
Networking QoS is beneficial: Without QoS, a noisy neighbor sending or receiving substantial inter-site traffic can degrade cluster stability.
Global Load Balancer: Object storage that uses S3 RESTful endpoints needs a GLB to redirect client requests in case of a site failure.
Performance: Each client write will experience at least the latency of the highest RTT between sites. For example, in a three-site stretch cluster with a 1.5 ms RTT between sites and the client and primary OSD at different sites, every operation incurs that inter-site round trip.
Each data center (or availability zone) houses a share of the OSDs in a three-site stretch cluster. Two data replicas are stored in each zone, so the CRUSH pool's size parameter is 6. This allows the cluster to serve client operations with zero data unavailability or loss when an entire site goes offline. Some highlights are below:
No Tiebreaker: Because there are three full data sites (OSDs in all sites), the Monitors can form quorum with any two sites able to reach each other.
Enhanced Resilience: Survives a complete site failure plus one additional OSD or node failure at surviving sites.
Network Requirements: L3 routing is recommended, and at most 10ms RTT is required among the three sites.
To delve deeply into Ceph 3-site stretch configurations, check out this excellent Cephalocon video from Kamoltat Sirivadhna.
For deployments where only two data centers have low-latency connectivity, place OSDs in those two data centers with the third site elsewhere hosting a tie-breaker Monitor. This may even be a VM at a cloud provider. This ensures that the cluster maintains a quorum when a single site fails.
Two low latency main sites: each hosting half of the total OSD capacity.
One tie-breaker site: hosts the tie-breaker Monitor.
Replicas: Pool data protection strategy of replication with size=4, which means two replicas per data center.
Latency: At most 10 ms RTT between the main, OSD-containing data centers. The tie-breaker site can tolerate much higher latency (e.g., 100 ms RTT).
Improved netsplit handling: Prevents a split-brain scenario.
SSD OSDs required: HDD OSDs are not supported.
Ceph supports both asynchronous and synchronous replication strategies, each with specific trade-offs among recovery objectives, operational complexity, and networking demands. Asynchronous replication (RBD Mirroring, RGW Multisite, and CephFS Snapshot Mirroring) provides flexibility and easy geo-deployment but carries a non-zero RPO. In contrast, a stretch cluster delivers RPO=0 by synchronously writing to multiple data centers, ensuring no data loss but requiring robust, low-latency inter-site connectivity and increased replication overhead including higher operation latency.
Whether you choose to deploy a three-site or two-site with a tie-breaker design, a stretch cluster can seamlessly handle the loss of an entire data center with minimal operational intervention. However, it is crucial to consider the stringent networking requirements (both latency and bandwidth) and the higher capacity overhead of replication with size=4. For critical applications where continuous availability and zero RPO are top priorities, the additional planning and resources for a stretch cluster may be well worth the investment. If a modest but nonzero RPO is acceptable, say if one data center is intended only for archival or as a reduced-performance disaster recovery site, asynchronous replication may be appealing in that capacity-efficient erasure coding may be used at both sites.
In our next post (part 2 of this series), we will explore two-site stretch clusters with a tie-breaker. We’ll provide practical steps for setting up Ceph across multiple data centers, discussing essential network and hardware considerations. Additionally, we will conduct a hands-on deployment, demonstrating how to automate the bootstrap of the cluster using a spec file. We will also cover how to configure CRUSH rules and enable stretch mode.
The authors would like to thank IBM for supporting the community with our time to create these posts.
In Part 2 we explored the hands-on deployment of a two-site Ceph cluster with a tie-breaker site and Monitor using a custom service definition file, CRUSH rules, and service placements.
In this final installment, we’ll test that configuration by examining what happens when an entire data center fails.
A key objective of any two-site stretch cluster design is to ensure that applications remain fully operational even if one data center goes offline. With synchronous replication, the cluster can handle client requests transparently, maintaining a Recovery Point Objective (RPO) of zero and preventing data loss, even during a complete site failure.
Our third and final post in this series will explore how Ceph automatically detects and isolates a failing data center. The cluster transitions into stretch degraded mode, with the tie-breaker Monitor ensuring quorum. During this time, replication constraints are temporarily adjusted to keep services available at the surviving site.
Once the offline data center is restored, we will demonstrate how the cluster seamlessly regains its complete stretch configuration, restoring full redundancy and synchronization operations without manual intervention. End users and storage administrators experience minimal disruption and zero data loss throughout this process.
The cluster is working as expected, our monitors are in quorum, and the acting set for our PGs includes four OSDs, two from each site. Our pools are configured with the replication rule, size=4, and min_size=2.
# ceph -s
  cluster:
    id:     90441880-e868-11ef-b468-52540016bbfa
    health: HEALTH_OK

  services:
    mon: 5 daemons, quorum ceph-node-00,ceph-node-06,ceph-node-04,ceph-node-03,ceph-node-01 (age 43h)
    mgr: ceph-node-01.osdxwj(active, since 10d), standbys: ceph-node-04.vtmzkz
    osd: 12 osds: 12 up (since 10d), 12 in (since 2w)

  data:
    pools:   2 pools, 33 pgs
    objects: 23 objects, 42 MiB
    usage:   1.4 GiB used, 599 GiB / 600 GiB avail
    pgs:     33 active+clean

# ceph quorum_status --format json-pretty | jq .quorum_names
[
  "ceph-node-00",
  "ceph-node-06",
  "ceph-node-04",
  "ceph-node-03",
  "ceph-node-01"
]

# ceph pg map 2.1
osdmap e264 pg 2.1 (2.1) -> up [1,3,9,11] acting [1,3,9,11]

# ceph osd pool ls detail | tail -2
pool 2 'rbdpool' replicated size 4 min_size 2 crush_rule 1 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 199 lfor 199/199/199 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd read_balance_score 3.38
We will display a diagram for each phase to describe the various stages during a failure.
At this point, something unexpected happens, and we lose access to all nodes in DC1:
Here is an excerpt from the Monitor logs on one of the remaining sites: Monitors in DC1 are considered down and are removed from the quorum:
2025-02-18T14:14:22.206+0000 7f05459fc640 0 log_channel(cluster) log [WRN] : [WRN] MON_DOWN: 2/5 mons down, quorum ceph-node-06,ceph-node-04,ceph-node-03
2025-02-18T14:14:22.206+0000 7f05459fc640 0 log_channel(cluster) log [WRN] : mon.ceph-node-00 (rank 0) addr [v2:192.168.122.12:3300/0,v1:192.168.122.12:6789/0] is down (out of quorum)
2025-02-18T14:14:22.206+0000 7f05459fc640 0 log_channel(cluster) log [WRN] : mon.ceph-node-01 (rank 4) addr [v2:192.168.122.179:3300/0,v1:192.168.122.179:6789/0] is down (out of quorum)
The Monitor running on ceph-node-03 in DC2 calls for a monitor election, proposes itself, and is accepted as the new leader:
2025-02-18T14:14:33.087+0000 7f0548201640 0 log_channel(cluster) log [INF] : mon.ceph-node-03 calling monitor election
2025-02-18T14:14:33.087+0000 7f0548201640 1 paxos.3).electionLogic(141) init, last seen epoch 141, mid-election, bumping
2025-02-18T14:14:38.098+0000 7f054aa06640 0 log_channel(cluster) log [INF] : mon.ceph-node-03 is new leader, mons ceph-node-06,ceph-node-04,ceph-node-03 in quorum (ranks 1,2,3)
Each Ceph OSD heartbeats other OSDs at random intervals of less than six seconds. If a peer OSD does not send a heartbeat within a 20-second grace period, the checking OSD considers the peer OSD to be down and reports this to a Monitor, which will then update the cluster map.
By default, two OSDs from different hosts must report to the Monitors that another OSD is down before the Monitors acknowledge the failure. This helps prevent false alarms, flapping, and cascading issues. However, all reporting OSDs may happen to be hosted in a rack with a malfunctioning switch that affects connectivity with other OSDs. To avoid false alarms, we regard the reporting peers as a potential subcluster experiencing issues.
The Monitors' OSD reporter subtree level groups reporting peers into subclusters based on their common ancestor type in the CRUSH map. By default, two reports from different subtrees are needed to declare an OSD down.
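The timers and reporter requirements described above are ordinary configuration options and can be inspected (and, with care, tuned). A reference sketch; option names are those found in current Ceph releases, and the commented defaults are the stock values:
# OSD-to-OSD heartbeat interval and the grace period before a peer is reported down
ceph config get osd osd_heartbeat_interval       # default: 6 seconds
ceph config get osd osd_heartbeat_grace          # default: 20 seconds

# How many reporters, and from which CRUSH subtree level, are needed
# before the Monitors mark an OSD down
ceph config get mon mon_osd_min_down_reporters       # default: 2
ceph config get mon mon_osd_reporter_subtree_level   # default: host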
2025-02-18T14:14:29.233+0000 7f0548201640 1 mon.ceph-node-03@3(leader).osd e264 prepare_failure osd.0 [v2:192.168.122.12:6804/636515504,v1:192.168.122.12:6805/636515504] from osd.10 is reporting failure:1
2025-02-18T14:14:29.235+0000 7f0548201640 0 log_channel(cluster) log [DBG] : osd.0 reported failed by osd.10
2025-02-18T14:14:31.792+0000 7f0548201640 1 mon.ceph-node-03@3(leader).osd e264 we have enough reporters to mark osd.0 down
2025-02-18T14:14:31.844+0000 7f054aa06640 0 log_channel(cluster) log [WRN] : Health check failed: 2 osds down (OSD_DOWN)
2025-02-18T14:14:31.844+0000 7f054aa06640 0 log_channel(cluster) log [WRN] : Health check failed: 1 host (2 osds) down (OSD_HOST_DOWN)
In the output of the ceph status command, we can see that quorum is maintained by ceph-node-06, ceph-node-04, and ceph-node-03:
# ceph -s | grep mon
    2/5 mons down, quorum ceph-node-06,ceph-node-04,ceph-node-03
mon: 5 daemons, quorum ceph-node-06,ceph-node-04,ceph-node-03 (age 10s), out of quorum: ceph-node-00, ceph-node-01
We see via the ceph osd tree command that the OSDs in DC1 are marked down:
# ceph osd tree
ID  CLASS  WEIGHT   TYPE NAME                STATUS  REWEIGHT  PRI-AFF
-1         0.58557  root default
-3         0.29279      datacenter DC1
-2         0.09760          host ceph-node-00
 0    hdd  0.04880              osd.0          down   1.00000  1.00000
 1    hdd  0.04880              osd.1          down   1.00000  1.00000
-4         0.09760          host ceph-node-01
 3    hdd  0.04880              osd.3          down   1.00000  1.00000
 7    hdd  0.04880              osd.7          down   1.00000  1.00000
-5         0.09760          host ceph-node-02
 2    hdd  0.04880              osd.2          down   1.00000  1.00000
 5    hdd  0.04880              osd.5          down   1.00000  1.00000
-7         0.29279      datacenter DC2
-6         0.09760          host ceph-node-03
 4    hdd  0.04880              osd.4            up   1.00000  1.00000
 6    hdd  0.04880              osd.6            up   1.00000  1.00000
-8         0.09760          host ceph-node-04
10    hdd  0.04880              osd.10           up   1.00000  1.00000
11    hdd  0.04880              osd.11           up   1.00000  1.00000
-9         0.09760          host ceph-node-05
 8    hdd  0.04880              osd.8            up   1.00000  1.00000
 9    hdd  0.04880              osd.9            up   1.00000  1.00000
Ceph raises the OSD_DATACENTER_DOWN health warning when an entire site fails. This indicates that one CRUSH datacenter is unavailable due to a network outage, power loss, or other issue. From the Monitor logs:
2025-02-18T14:14:32.910+0000 7f054aa06640 0 log_channel(cluster) log [WRN] : Health check failed: 1 datacenter (6 osds) down (OSD_DATACENTER_DOWN)
We can see the same from the ceph status command.
# ceph -s
  cluster:
    id:     90441880-e868-11ef-b468-52540016bbfa
    health: HEALTH_WARN
            3 hosts fail cephadm check
            We are missing stretch mode buckets, only requiring 1 of 2 buckets to peer
            2/5 mons down, quorum ceph-node-06,ceph-node-04,ceph-node-03
            1 datacenter (6 osds) down
            6 osds down
            3 hosts (6 osds) down
            Degraded data redundancy: 46/92 objects degraded (50.000%), 18 pgs degraded, 33 pgs undersized
When an entire data center fails in a two-site stretch scenario, Ceph enters stretch degraded mode. You’ll see a Monitor log entry like this:
2025-02-18T14:14:32.992+0000 7f05459fc640 0 log_channel(cluster) log [WRN] : Health check failed: We are missing stretch mode buckets, only requiring 1 of 2 buckets to peer (DEGRADED_STRETCH_MODE)
Stretch degraded mode is self-managing. It kicks in when the Monitors confirm that an entire CRUSH datacenter is unreachable. Administrators do not need to promote or demote any site or DC manually: Ceph updates the OSD map and PG states automatically. Once the cluster enters degraded stretch mode, the following actions unfold automatically.
Stretch degraded mode means that Ceph no longer requires an acknowledgment from offline OSDs in the failed data center to complete writes or to bring placement groups (PGs) to an active state.
In stretch mode, Ceph implements a specific stretch peering rule that mandates the participation of at least one OSD from each site in the acting set before a placement group (PG) can transition from peering to active+clean. This rule ensures that new write operations are not acknowledged if one site is completely offline, thereby preventing split-brain scenarios and ensuring consistent site replication.
Once in degraded mode, Ceph temporarily modifies the CRUSH rule so that only the surviving site is needed to activate PGs, allowing client operations to continue seamlessly.
# ceph pg dump pgs_brief | grep 2.11
dumped pgs_brief
2.11     active+undersized+degraded  [8,11]  8  [8,11]  8
When one site goes offline, Ceph automatically lowers the pool’s min_size attribute from 2 to 1, allowing each placement group (PG) to remain active and clean with only one available replica. Were min_size to remain at 2, the surviving site could not maintain active PGs after losing half of its local replicas, leading to a freeze in client I/O. By temporarily dropping min_size to 1, Ceph ensures that the cluster can tolerate an OSD failure at the remaining site and continue to serve reads and writes until the offline site returns.
It’s essential to note that running temporarily with min_size=1 means that only one copy of data needs to be available until the offline site recovers. While this keeps the service operational, it also increases the risk of data loss if the surviving site experiences additional failures. A Ceph cluster with SSD media ensures fast recovery and minimizes the risk of data unavailability or loss when an additional component fails during stretch degraded operation.
# ceph osd pool ls detail
pool 1 '.mgr' replicated size 4 min_size 1 crush_rule 1 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 302 lfor 302/302/302 flags hashpspool stripe_width 0 pg_num_max 32 pg_num_min 1 application mgr read_balance_score 11.76
pool 2 'rbdpool' replicated size 4 min_size 1 crush_rule 1 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 302 lfor 302/302/302 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd read_balance_score 2.62
All PGs for which the primary OSD fails will experience a short blip in client operations until the affected OSDs are declared down and the acting set is modified per stretch mode.
Clients continue reading and writing data from the surviving site's two copies, ensuring service availability and RPO=0 for all writes.
When the offline data center returns to service, its OSDs rejoin the cluster, and Ceph automatically moves back from degraded stretch mode to full stretch mode. The process involves recovery and backfill to restore each placement group (PG) to the correct replica count of 4.
When an OSD has valid PG logs (and is only briefly down), Ceph performs incremental recovery by copying only the new updates from other replicas. When OSDs are down for a long time and the PG logs don’t contain a full set of deltas, Ceph initiates an OSD backfill operation to copy the entire PG. This systematically scans all RADOS objects in the authoritative replicas and updates the returning OSDs with changes that occurred while they were unavailable.
Recovery and backfill entail additional I/O as data is transferred between sites to restore full redundancy. This is why including the recovery throughput in your network calculations and planning is essential. Ceph is designed to throttle these operations via configurable mClock recovery/backfill settings so that it does not overwhelm client I/O. We want to return to HEALTH_OK as soon as possible to ensure data availability and durability, so adequate inter-site bandwidth is crucial. This means not only bandwidth for daily reads and writes, but also for peaks when components fail or the cluster is expanded.
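With the mClock scheduler, the balance between client and recovery traffic is chosen via a profile rather than individual sleep settings. A sketch of temporarily favoring recovery after a failed site returns, and switching back once the cluster is healthy again; the profile names are the standard mClock profiles:
# Check the active mClock profile on the OSDs
ceph config get osd osd_mclock_profile

# Temporarily prioritize recovery/backfill over client I/O
ceph config set osd osd_mclock_profile high_recovery_ops

# Return to the default balance once recovery completes
ceph config set osd osd_mclock_profile balanced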
Once all affected PGs have finished recovery or backfill, they return to active+clean with the required two copies per site up to date and available. Ceph then reverts the temporary changes made during degraded mode (e.g. min_size=1 back to the standard min_size=2). The cluster’s degraded stretch mode warning disappears once complete, signaling that full redundancy has been restored.
In this quick demo we run an application that constantly reads from and writes to an RBD block volume. The blue and green dots are the application's reads and writes, along with their latency. On the left and right of the dashboard we have the status of the DCs, and individual servers are shown as down when they become inaccessible. In the demo we can see how we lose an entire site and our application reports only 27 seconds of delayed I/O: the time it takes to detect and confirm that the OSDs are down. Once the site is recovered, we can see that PGs are recovered using the replicas at the remaining site.
In this final installment, we’ve seen how a two-site stretch cluster reacts to a data center outage. It automatically transitions into a degraded state to keep services online and seamlessly recovers when the failed site returns. With automatic down marking, relaxed peering rules, lowered min_size values, and synchronization of modified data once connectivity returns, Ceph handles these events with minimal manual intervention and no data loss.
The authors would like to thank IBM for supporting the community with our time to create these posts.
Businesses integrate datasets from multiple sources to derive valuable business insights. Conventional analytics infrastructures, often reliant on specialized hardware, can lead to data silos, lack scalability, and result in escalating costs over time.
The rise of modern analytics architectures in public cloud-based SaaS environments has helped overcome many limitations, allowing for efficient operations and the ability to adapt dynamically to changing workload demands without compromising performance.
However, despite these advancements, not all organizations can realistically shift entirely to a cloud-based environment. Several crucial reasons exist for retaining data on-premises, such as regulatory compliance, security concerns, latency, and cost considerations.
Consequently, many organizations are exploring the benefits of hybrid cloud architectures, making their datasets from on-premises object-based data lake environments available to Cloud SaaS data platforms including Snowflake.
Snowflake is a cloud-based data platform that enhances data-driven insights by allowing governed access to vast amounts of data for collaboration and analysis. Thanks to its native support for the S3 API, it can unify diverse data sources and integrate seamlessly with on-premises solutions including Ceph. This integration enables Snowflake to leverage Ceph\'s robust and scalable storage capabilities, effectively bringing cloud data warehouse functionalities into the on-premises environment while ensuring comprehensive data control and security.
Ceph is open-source, software-defined, runs on industry-standard hardware, and has best-in-class coverage of the lingua franca of object storage: the AWS S3 API. Ceph was designed from the ground up as an object store, contrasting with approaches that bolt S3 API servers onto a distributed file system. With Ceph, data placement is by algorithm instead of by lookup. This allows Ceph to scale well into the billions of objects, even on modestly sized clusters. Data stored in Ceph is protected with efficient erasure coding, with in-flight and at-rest checksums, encryption, and robust access control that thoughtfully integrates with enterprise identity systems. Ceph is the perfect complement to Snowflake for establishing a security-first hybrid cloud data lake environment.
Ceph is a supported S3 compatible storage solution for Snowflake. Using Ceph\'s S3-compatible APIs, enterprises can configure Snowflake to access data stored on Ceph through external S3 stages or external S3 tables, enabling efficient queries without requiring data migration to and from the cloud.
Ceph Object Storage is the perfect platform for creating data lakes or lakehouses with key advantages:
Cost-effectiveness: Ceph utilizes commodity hardware and open-source software to reduce upfront infrastructure costs and enable incremental and evolutionary upgrades and expansion over time without forklifts or downtime.
High scalability: Ceph allows horizontal scaling to accommodate large volumes of growing data in a data lake or lakehouse.
High flexibility: Ceph can handle various data types, including structured, semi-structured, and unstructured data, including text, images, video, and sensor data, making it versatile and appropriate for data lakes.
High availability: Ceph is designed to provide durability and reliability for information stored in a data lake or lakehouse. Data is always accessible despite hardware failures or disruptions in the network. Ceph offers data replication across multiple geographic locations, providing redundancy and fault tolerance to prevent data loss.
High performance: Ceph enables parallel data access and processing through integration with data analytics frameworks to enable high throughput and low latency for data ingestion and processing within a data lake or lakehouse. Ceph Object also provides a cache data accelerator (D3N) and query pushdown with S3 Select.
Data governance: Ceph provides efficient management of metadata to enforce data governance policies, track data lineage, monitor data usage, and provide valuable information about the data stored in the data lake, including format and data source.
Security: Ceph has a broad security feature set: encryption at rest and over the wire, external identity integration, Secure Token Service, IAM roles/policies, per-object granular authorization, Object Lock, versioning, and MFA delete.
The most common way of accessing external S3 object storage from Snowflake is to create an External Stage and then use the Stage to copy the data into Snowflake or access it directly using an External Table.
Next, we will provide two simple examples for reference using an on-prem Ceph cluster:
Our Ceph cluster has an S3 Object Gateway configured at s3.cephlabs.blue, and we have a bucket named ecommtrans containing a CSV-formatted file named transactions.
$ aws s3 ls s3://ecommtrans/transactions/ \\n2024-06-04 11:33:54 13096729 transaction_data_20240604112945.csv\\n
The CSV file has the following format:
client_id,transaction_id,item_id,transaction_date,country,customer_type,item_description,category,quantity,total_amount,marketing_campaign,returned\\n799315,f47b56a5-2392-4d7c-a3fe-fad18c8b0901,a06210e5-217f-4c3d-8ab9-06e1d8f605e2,2024-03-17 20:35:26,DK,Returning,Smartwatch,Electronics,3,1790.2,,False\\n858067,9351638c-9d23-4d32-9218-69bbba6b258d,858aa970-9a95-4c99-8b64-d783129dd5cb,2024-02-13 16:18:42,ES,New,Dress,Clothing,4,196.96,,False\\n528665,7cc494c8-a19d-4771-9686-989d7dfa4c96,0bb7529b-59e8-4d15-adb8-c224b7d7d5b9,2024-03-04 \\n
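The SQL in the next sections references an external stage named CEPH_INGEST_STAGE that points at the Ceph endpoint. Below is a minimal sketch of creating such a stage; the credentials are placeholders, and the exact parameters should be verified against the Snowflake documentation for S3-compatible storage:
CREATE OR REPLACE STAGE CEPH_INGEST_STAGE\\n URL = \'s3compat://ecommtrans/\'\\n ENDPOINT = \'s3.cephlabs.blue\'\\n CREDENTIALS = (AWS_KEY_ID = \'<access-key>\' AWS_SECRET_KEY = \'<secret-key>\');\\n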
For the COPY INTO approach, we open a new SQL worksheet in the Snowflake UI and run the following SQL code:
CREATE OR REPLACE TABLE onprem_database_ingest.raw.transactions\\n(\\n client_id VARCHAR(16777216),\\n transaction_id VARCHAR(16777216),\\n item_id VARCHAR(16777216),\\n transaction_date TIMESTAMP_NTZ(9),\\n country VARCHAR(16777216),\\n customer_type VARCHAR(16777216),\\n item_description VARCHAR(16777216),\\n category VARCHAR(16777216),\\n quantity NUMBER(38,0),\\n total_amount NUMBER(38,0),\\n marketing_campaign VARCHAR(16777216),\\n returned BOOLEAN\\n);\\n\\nLIST @CEPH_INGEST_STAGE/transactions/;\\n\\n---> copy the transactions file into the transactions table\\n\\nCOPY INTO onprem_database_ingest.raw.transactions\\nFROM @CEPH_INGEST_STAGE/transactions/;\\n\\n-- Sample query to verify the setup\\n\\nSELECT * FROM onprem_database_ingest.raw.transactions\\nLIMIT 10;\\n
For the External Table approach, we open a new SQL worksheet in the Snowflake UI and run the following SQL code:
transactions/;\\n\\n-- Create the External Table with defining expressions for each column\\n\\nCREATE OR REPLACE EXTERNAL TABLE onprem_database_ingest_trans.raw.trans_external\\n(\\n client_id STRING AS (VALUE:\\"c1\\"::STRING),\\n transaction_id STRING AS (VALUE:\\"c2\\"::STRING),\\n item_id STRING AS (VALUE:\\"c3\\"::STRING),\\n transaction_date TIMESTAMP AS (VALUE:\\"c4\\"::TIMESTAMP),\\n country STRING AS (VALUE:\\"c5\\"::STRING),\\n customer_type STRING AS (VALUE:\\"c6\\"::STRING),\\n item_description STRING AS (VALUE:\\"c7\\"::STRING),\\n category STRING AS (VALUE:\\"c8\\"::STRING),\\n quantity NUMBER AS (VALUE:\\"c9\\"::NUMBER),\\n total_amount NUMBER AS (VALUE:\\"c10\\"::NUMBER),\\n marketing_campaign STRING AS (VALUE:\\"c11\\"::STRING),\\n returned BOOLEAN AS (VALUE:\\"c12\\"::BOOLEAN)\\n)\\n\\nLOCATION = @CEPH_INGEST_STAGE_TRANS/transactions/\\nFILE_FORMAT = (TYPE = \'CSV\' FIELD_OPTIONALLY_ENCLOSED_BY = \'\\"\' SKIP_HEADER = 1 FIELD_DELIMITER = \',\' NULL_IF = (\'\'))\\nREFRESH_ON_CREATE = FALSE\\nAUTO_REFRESH = FALSE\\nPATTERN = \'.*.csv\';\\n\\n-- Refresh the metadata for the external table\\n\\nALTER EXTERNAL TABLE onprem_database_ingest_trans.raw.trans_external REFRESH;\\n\\n-- Sample query to verify the setup\\n\\nSELECT * FROM onprem_database_ingest_trans.raw.trans_external\\n\\nLIMIT 10;~\\n
Hybrid cloud architectures are increasingly popular, incorporating on-premises solutions including Ceph and cloud-based SaaS platforms including Snowflake. Ceph, which is now supported by Snowflake as an S3-compatible store, makes it possible to access on-premises data lake datasets, enhancing Snowflake\'s data warehousing capabilities. This integration establishes a secure, scalable, cost-effective hybrid data lake environment.
The authors would like to thank IBM for supporting the community with our time to create these posts.
The new S3 bucket logging feature introduced as a Technology Preview in Squid 19.2.2 makes tracking, monitoring, and securing bucket operations more straightforward than ever. It aligns with the S3 self-service use case, enabling end users to configure and manage their application storage access logging through the familiar S3 API. This capability empowers users to monitor access patterns, detect unauthorized activities, and analyze usage trends without needing direct intervention from administrators.
By leveraging Ceph\'s logging features, users gain actionable insights through logs stored in dedicated buckets, offering flexibility and granularity in tracking operations.
It’s important to note that this feature is not designed to provide real-time performance metrics and monitoring: we have the observability stack provided by Ceph for those needs.
In AWS, the equivalent to Ceph\'s Bucket Logging S3 API is referenced as S3 Server Access Logging.
In this blog, we will build an example interactive Superset dashboard for our application with the log data generated when enabling S3 bucket logging.
Application Compliance and Auditing
Application teams in regulated industries (finance, healthcare, insurance, etc.) must maintain detailed access logs to meet compliance requirements and ensure traceability for data operations.
Security and Intrusion Detection
Monitor bucket access patterns to identify unauthorized activities, detect anomalies, and respond to potential security breaches.
Per Application Usage Analytics
Generate detailed insights into buckets, including which objects are frequently accessed, peak traffic times, and operation patterns.
End User Cost Optimizations
Track resource usage, such as the number of GET, PUT, and DELETE requests, to optimize storage and operational costs.
Self-Service Monitoring for End Users
In a self-service S3-as-a-service setup, end users can configure logging to have a historical view of their activity, helping them manage their data and detect issues independently of the administrators.
Change Tracking for Incremental Backups (Journal Mode Specific)
With journal mode enabled, all changes in a bucket are logged before the operations are complete, creating a reliable change log. Backup applications can use this log to inventory changes to perform efficient incremental backups. Here is an example Rclone PR from Yuval Lifshitz that uses the bucket logging feature to allow more efficient incremental copying from S3.
Standard mode: log records are written to the log bucket after the operation is completed. If the logging operation fails, it does so silently without notifying the client.
Journal mode: log records are written to the log bucket before the operation is completed. If logging fails, the operation is halted, and an error is returned to the client. An exception is for multi-delete and delete operations, where the operation may succeed even if logging fails. Note that logs may reflect successful writes even if the operation fails.
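The logging mode is selected per source bucket in the same put-bucket-logging call used later in this post. A hedged sketch of a journal-mode configuration is shown below; the LoggingType attribute is part of the Ceph extension to the S3 API, so verify the attribute names against the RGW bucket logging documentation for your release:
{\\n \\"LoggingEnabled\\": {\\n \\"TargetBucket\\": \\"shooterlogs\\",\\n \\"TargetPrefix\\": \\"shooterapp1\\",\\n \\"LoggingType\\": \\"Journal\\"\\n }\\n}\\n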
As context, I have a Ceph Object Gateway (RGW) service running on a Squid cluster.
# ceph version\\nceph version 19.2.0-53.el9cp (677d8728b1c91c14d54eedf276ac61de636606f8) squid (stable)\\n# ceph orch ls rgw\\nNAME PORTS RUNNING REFRESHED AGE PLACEMENT \\nrgw.default ?:8000 4/4 6m ago 8M ceph04;ceph03;ceph02;ceph06;count:4\\n
I have an IAM account named analytic_ap
, and a root user for the account named rootana
: the S3 user profile for rootana
has already been configured via the AWS CLI.
Through the RGW endpoint using the IAM API (no RGW admin intervention required) I will create a new user with an attached managed policy of AmazonS3FullAccess
so that this user can access all buckets in the analytic_ap
account.
# aws --profile rootana iam create-user --user-name app_admin_shooters\\n{\\n \\"User\\": {\\n\\"Path\\": \\"/\\",\\n\\"UserName\\": \\"app_admin_shooters\\",\\n\\"UserId\\": \\"d915f592-6cbc-4c4c-adf2-900c499e8a4a\\",\\n \\"Arn\\": \\"arn:aws:iam::RGW46950437120753278:user/app_admin_shooters\\",\\n\\"CreateDate\\": \\"2025-01-23T08:26:44.086883+00:00\\"\\n }\\n}\\n\\n# aws --profile rootana iam create-access-key --user-name app_admin_shooters\\n{\\n \\"AccessKey\\": {\\n\\"UserName\\": \\"app_admin_shooters\\",\\n\\"AccessKeyId\\": \\"YI80WC6HTMHMY958G3EO\\",\\n\\"Status\\": \\"Active\\",\\n \\"SecretAccessKey\\": \\"67Vp071aBf92fJiEe8pBtV6RYqtWBhSceneeZVLH\\",\\n\\"CreateDate\\": \\"2025-01-23T08:27:03.268781+00:00\\"\\n }\\n}\\n\\n# aws --profile rootana iam attach-user-policy --user-name app_admin_shooters --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess\\n
I configured a new profile via the AWS CLI with the credentials of the S3 end user we just created named app_admin_shooters
, which with the managed policy attached to the user gives me access to the S3 resources available in the account:
# aws --profile app_admin_shooters s3 ls\\n#\\n
Ok, with everything set, let’s create three source buckets, one for each of three different shooter games/applications, plus one logging destination bucket named shooterlogs:
# aws --profile app_admin_shooters s3 mb s3://shooterlogs\\nmake_bucket: shooterlogs\\n# aws --profile app_admin_shooters s3 mb s3://shooterapp1\\nmake_bucket: shooterapp1\\n# aws --profile app_admin_shooters s3 mb s3://shooterapp2\\nmake_bucket: shooterapp2\\n# aws --profile app_admin_shooters s3 mb s3://shooterapp3\\nmake_bucket: shooterapp3\\n
Now let’s enable bucket logging for each of my shooter app buckets. I will use the target bucket shooterlogs, and to organize the logs for each shooterapp bucket, I will use a TargetPrefix with the name of the source bucket.
# cat << EOF > enable_logging.json\\n{\\n \\"LoggingEnabled\\": {\\n \\"TargetBucket\\": \\"shooterlogs\\",\\n \\"TargetPrefix\\": \\"shooterapp1\\"\\n }\\n}\\nEOF\\n
Once the JSON file is ready we can apply it to our buckets, using the sed command in each iteration to change the TargetPrefix.
# aws --profile app_admin_shooters s3api put-bucket-logging --bucket shooterapp1 --bucket-logging-status file://enable_logging.json\\n# sed -i \'s/shooterapp1/shooterapp2/\' enable_logging.json\\n# aws --profile app_admin_shooters s3api put-bucket-logging --bucket shooterapp2 --bucket-logging-status file://enable_logging.json\\n# sed -i \'s/shooterapp2/shooterapp3/\' enable_logging.json\\n# aws --profile app_admin_shooters s3api put-bucket-logging --bucket shooterapp3 --bucket-logging-status file://enable_logging.json\\n
We can list the logging configuration for a bucket with the following s3api command.
# aws --profile app_admin_shooters s3api get-bucket-logging --bucket shooterapp1\\n{\\n \\"LoggingEnabled\\": {\\n \\"TargetBucket\\": \\"shooterlogs\\",\\n \\"TargetPrefix\\": \\"shooterapp1\\",\\n \\"TargetObjectKeyFormat\\": {\\n \\"SimplePrefix\\": {}\\n }\\n }\\n}\\n
We will PUT some objects into our first bucket shooterapp1, and also delete a set of objects so that we can create logs in our log bucket.
# for i in {1..20} ; do aws --profile app_admin_shooters s3 cp /etc/hosts s3://shooterapp1/file${i} ; done\\nupload: ../etc/hosts to s3://shooterapp1/file1 \\nupload: ../etc/hosts to s3://shooterapp1/file2 \\n… \\n# for i in {1..5} ; do aws --profile app_admin_shooters s3 rm s3://shooterapp1/file${i} ; done\\ndelete: s3://shooterapp1/file1\\ndelete: s3://shooterapp1/file2\\n…\\n
When we check our configured log bucket shooterlogs, it’s empty. Why!?
# aws --profile app_admin_shooters s3 ls s3://shooterlogs/\\n# \\n
Note from the docs: For performance reasons, even though the log records are written to persistent storage, the log object will appear in the log bucket only after some configurable amount of time (or if the maximum object size of 128MB is reached). This time (in seconds) could be set per source bucket via a Ceph extension to the REST API, or globally via the rgw_bucket_logging_obj_roll_time
configuration option. If not set, the default time is 5 minutes. Adding a log object to the log bucket is done \\"lazily\\", meaning that if no more records are written to the object, it may remain outside the log bucket even after the configured time has passed.
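If waiting five minutes between log objects is too coarse for your use case, the global default can be lowered with the option named above; the value below is only an example:
# Roll log objects every 60 seconds instead of the default 300\\nceph config set client.rgw rgw_bucket_logging_obj_roll_time 60\\n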
If we don’t want to wait for the object roll time (defaults to 5 minutes), we can force a flush of the log buffer with the radosgw-admin
command:
# radosgw-admin bucket logging flush --bucket shooterapp1\\n
When we check again, the object with the logs for bucket shooterapp1
is there as expected:
# aws --profile app_admin_shooters s3 ls s3://shooterlogs/\\n2025-01-23 08:28:16 8058 shooterapp12025-01-23-13-21-00-A54CQC9GIO7O4F9D\\n# aws --profile app_admin_shooters s3 cp s3://shooterlogs/shooterapp12025-01-23-13-21-00-A54CQC9GIO7O4F9D - | cat\\nRGW46950437120753278 shooterapp1 [23/Jan/2025:13:21:00 +0000] - d915f592-6cbc-4c4c-adf2-900c499e8a4a fcabdf4a-86f2-452f-a13f-e0902685c655.323278.12172315742054314872 REST.GET.get_bucket_logging - \\"GET /shooterapp1?logging HTTP/1.1\\" 200 - - - - 14ms - - - - - - - s3.cephlabs.com.s3.cephlabs.com - -\\nRGW46950437120753278 shooterapp1 [23/Jan/2025:13:23:33 +0000] - d915f592-6cbc-4c4c-adf2-900c499e8a4a fcabdf4a-86f2-452f-a13f-e0902685c655.323242.15617167555539888584 REST.PUT.put_obj file1 \\"PUT /shooterapp1/file1 HTTP/1.1\\" 200 - 333 333 - 19ms - - - - - - - s3.cephlabs.com.s3.cephlabs.com - -\\n…\\nRGW46950437120753278 shooterapp1 [23/Jan/2025:13:24:01 +0000] - d915f592-6cbc-4c4c-adf2-900c499e8a4a fcabdf4a-86f2-452f-a13f-e0902685c655.323242.18353336346755391699 REST.DELETE.delete_obj file1 \\"DELETE /shooterapp1/file1 HTTP/1.1\\" 204 NoContent - 333 - 11ms - - - - - - - s3.cephlabs.com.s3.cephlabs.com - -\\nRGW46950437120753278 shooterapp1 [23/Jan/2025:13:24:02 +0000] - d915f592-6cbc-4c4c-adf2-900c499e8a4a fcabdf4a-86f2-452f-a13f-e0902685c655.311105.12134465030800156375 REST.DELETE.delete_obj file2 \\"DELETE /shooterapp1/file2 HTTP/1.1\\" 204 NoContent - 333 - 11ms - - - - - - - s3.cephlabs.com.s3.cephlabs.com - -\\nRGW46950437120753278 shooterapp1 [23/Jan/2025:13:24:03 +0000] - d915f592-6cbc-4c4c-adf2-900c499e8a4a fcabdf4a-86f2-452f-a13f-e0902685c655.323260.3289411001891924009 REST.DELETE.delete_obj file3 \\"DELETE /shooterapp1/file3 HTTP/1.1\\" 204 NoContent - 333 - 9ms - - - - - - - s3.cephlabs.com.s3.cephlabs.com - \\n
NOTE: To explore the output fields available with the current implementation of the bucket logging feature, check out the documentation.
Let\'s see if bucket logging works for our other bucket, shooterapp2
.
# for i in {1..3} ; do aws --profile app_admin_shooters s3 cp /etc/hosts s3://shooterapp2/file${i} ; done\\nupload: ../etc/hosts to s3://shooterapp2/file1\\nupload: ../etc/hosts to s3://shooterapp2/file2\\nupload: ../etc/hosts to s3://shooterapp2/file3\\n# for i in {1..3} ; do aws --profile app_admin_shooters s3 cp s3://shooterapp2/file${i} - ; done\\n# radosgw-admin bucket logging flush --bucket shooterapp2\\nflushed pending logging object \'shooterapp22025-01-23-10-01-57-TJNTA3FU60TS21MK\' to target bucket \'shooterlogs\'\\n
Checking our configured log bucket, we can see that we now have two objects in the bucket with the prefix of the source bucket name.
# aws --profile app_admin_shooters s3 ls s3://shooterlogs/\\n2025-01-23 08:28:16 8058 shooterapp12025-01-23-13-21-00-A54CQC9GIO7O4F9D\\n2025-01-23 10:01:57 2628 shooterapp22025-01-23-15-00-48-FIE6B8NNMANFTTFI\\n# aws --profile app_admin_shooters s3 cp s3://shooterlogs/shooterapp22025-01-23-15-00-48-FIE6B8NNMANFTTFI - | cat\\nRGW46950437120753278 shooterapp2 [23/Jan/2025:15:00:48 +0000] - d915f592-6cbc-4c4c-adf2-900c499e8a4a fcabdf4a-86f2-452f-a13f-e0902685c655.323242.10550516265852869740 REST.PUT.put_obj file1 \\"PUT /shooterapp2/file1 HTTP/1.1\\" 200 - 333 333 - 22ms - - - - - - - s3.cephlabs.com.s3.cephlabs.com - -\\nRGW46950437120753278 shooterapp2 [23/Jan/2025:15:00:49 +0000] - d915f592-6cbc-4c4c-adf2-900c499e8a4a \\n…\\nRGW46950437120753278 shooterapp2 [23/Jan/2025:15:01:36 +0000] - d915f592-6cbc-4c4c-adf2-900c499e8a4a fcabdf4a-86f2-452f-a13f-e0902685c655.323278.16364063589570559207 REST.HEAD.get_obj file1 \\"HEAD /shooterapp2/file1 HTTP/1.1\\" 200 - - 333 - 4ms - - - - - - - s3.cephlabs.com.s3.cephlabs.com - -\\nRGW46950437120753278 shooterapp2 [23/Jan/2025:15:01:36 +0000] - d915f592-6cbc-4c4c-adf2-900c499e8a4a fcabdf4a-86f2-452f-a13f-e0902685c655.323242.2016501269767674837 REST.GET.get_obj file1 \\"GET /shooterapp2/file1 HTTP/1.1\\" 200 - - 333 - 3ms - - - - - - - s3.cephlabs.com.s3.cephlabs.com - -\\n
In this guide, we\'ll walk you through setting up Trino to query your application logs stored in S3-compatible storage and explore some powerful SQL queries you can use to analyze your data.
We already have Trino up and running, with the Hive connector configured to use S3A to access our S3 endpoint; see the Trino setup instructions for details.
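For reference, here is a minimal sketch of what such a Hive catalog definition (for example etc/catalog/hive.properties) could look like. The metastore address, endpoint, and credentials are placeholders, and the property names should be checked against the Trino Hive connector documentation for your version:
connector.name=hive\\nhive.metastore.uri=thrift://metastore-host:9083\\nhive.s3.endpoint=https://s3.cephlabs.com\\nhive.s3.path-style-access=true\\nhive.s3.aws-access-key=<access-key>\\nhive.s3.aws-secret-key=<secret-key>\\n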
The first step is configuring an external table pointing to your log data stored in an S3-compatible bucket: in our example, shooterlogs
. Below is an example of how the table is created:
trino> SHOW CREATE TABLE hive.default.raw_logs;\\n Create Table \\n----------------------------------------------\\n CREATE TABLE hive.default.raw_logs ( \\n line varchar \\n ) \\n WITH ( \\n external_location = \'s3a://shooterlogs/\', \\n format = \'TEXTFILE\' \\n )\\n
To make the data more usable you can parse each log line into meaningful fields including account ID, bucket name, operation type, HTTP response code, etc. To simplify querying and reuse, I will create a view that encapsulates the log parsing logic:
trino> CREATE VIEW hive.default.log_summary AS\\n -> SELECT\\n -> split(line, \' \')[1] AS account_id, -- Account ID\\n -> split(line, \' \')[2] AS bucket_name, -- Bucket Name\\n -> split(line, \' \')[3] AS timestamp, -- Timestamp\\n -> split(line, \' \')[6] AS user_id, -- User ID\\n -> split(line, \' \')[8] AS operation_type, -- Operation Type\\n -> split(line, \' \')[9] AS object_key, -- Object Key\\n -> regexp_extract(line, \'\\"([^\\"]+)\\"\', 1) AS raw_http_request, -- Raw HTTP Request\\n -> CAST(regexp_extract(line, \'\\"[^\\"]+\\" ([0-9]+) \', 1) AS INT) AS http_status, -- HTTP Status\\n -> CAST(CASE WHEN split(line, \' \')[14] = \'-\' THEN NULL ELSE split(line, \' \')[14] END AS BIGINT) AS object_size, -- Object Size\\n -> CASE WHEN split(line, \' \')[17] = \'-\' THEN NULL ELSE split(line, \' \')[17] END AS request_duration, -- Request Duration\\n -> regexp_extract(line, \'[0-9]+ms\', 0) AS request_time, -- Request Time (e.g., 22ms)\\n -> regexp_extract(line, \' ([^ ]+) [^ ]+ [^ ]+$\', 1) AS hostname -- Hostname (third-to-last field)\\n -> FROM hive.default.raw_logs;\\n -> \\nCREATE VIEW\\n
With the view in place, you can quickly write queries to summarize log data. For example:
trino> SELECT operation_type, COUNT(*) AS operation_count\\n -> FROM hive.default.log_summary\\n -> GROUP BY operation_type;\\n operation_type | operation_count \\n-----------------------------+-----------------\\n REST.DELETE.delete_obj | 5 \\n REST.HEAD.get_obj | 3 \\n REST.GET.get_bucket_logging | 1 \\n REST.PUT.put_obj | 23 \\n REST.GET.list_bucket | 1 \\n REST.GET.get_obj | 3 \\n(6 rows)\\n
With the view we created in Trino, we can perform historical monitoring and analyze S3 bucket activity effectively. Below is an example list of potential visualizations you can create to provide actionable insights into bucket usage and access patterns.
I will use Superset in this example, but another visualization tool could achieve the same outcome. I have a running instance of Superset, and I have configured Trino as a source database for Superset.
Here is the query used and the resulting graph in a Superset dashboard. We present per-bucket operation type counts and average latency.
SELECT operation_type AS operation_type, bucket_name AS bucket_name, sum(\\"Operations Count\\") AS \\"SUM(Operations Count)\\" \\nFROM (SELECT \\n bucket_name, \\n operation_type, \\n COUNT(*) AS \\"Operations Count\\", \\n AVG(CAST(regexp_extract(request_time, \'[0-9]+\', 0) AS DOUBLE)) AS \\"Average Latency\\"\\nFROM hive.default.log_summary\\nGROUP BY bucket_name, operation_type\\nORDER BY \\"Operations Count\\" DESC\\n) AS virtual_table GROUP BY operation_type, bucket_name ORDER BY \\"SUM(Operations Count)\\" DESC\\nLIMIT 1000;\\n
Another example related to HTTP requests is presenting the distribution of HTTP request codes in a pie chart.
Here we show the top users making requests to the shooter
application buckets.
These are just some basic graph examples you can create. Far more advanced graphs can be built with the features available in Superset.
The introduction of the S3 bucket logging feature is a game-changer for storage access management. This feature provides transparency and control by empowering end users to configure and manage their application access logs. With the ability to log operations at the bucket level, users can monitor activity, troubleshoot issues, and enhance their security posture—all tailored to their specific requirements without admin intervention.
To showcase its potential, we explored how tools like Trino and Superset can analyze and visualize the log data generated by S3 Bucket Logging. These are just examples of the many possibilities the bucket logging feature enables.
Ceph developers are working hard to improve bucket logging for future releases, including bug fixes, enhancements, and better AWS S3 compatibility. Stay Tuned!
The authors would like to thank IBM for supporting the community via our time to create these posts.
In this article we analyze the results of performance benchmarks conducted on Trino with the Ceph Object S3 Select feature enabled, using TPC-DS benchmark queries at 1TB and 3TB scale. We demonstrate that, on average, queries run 2.5x faster. In some cases we achieved a 9x improvement, and across all queries the amount of data processed over the network was reduced by 144TB compared to running Trino without S3 Select enabled. Combining IBM Storage Ceph\'s S3 Select with Trino/Presto can enhance data lake performance, reduce costs, and simplify data access for organizations.
We would like to thank Gal Salomon and Tim Wilkinson for conducting the TPC-DS benchmarking and providing us with these results.
Trino is a distributed SQL query engine that allows users to query data from multiple sources using a single SQL statement. It provides data warehouse-like capabilities directly on a data lake.
You may have encountered references to Trino, Presto, and PrestoDB, all of which originated from the same project. Presto was the initial project from Facebook, which was open-sourced in 2013. PrestoSQL became a community-based open-source project in 2018 and was rebranded to Trino in 2020.
Presto is an essential tool for data engineers who require a fast query engine for their higher-level Business Intelligence (BI) tools.
Ceph provides the S3 API S3 Select feature. S3 Select significantly improves the efficiency of SQL queries of data stored in S3-compatible object storage. By pushing the query to the Ceph cluster, S3 Select can dramatically enhance performance, processing queries faster and minimizing network and CPU resource costs. S3 Select and Trino are horizontally scalable, handling increasing data volumes and user queries without sacrificing performance. Trino\'s support for SQL and S3 Select\'s ability to query data in place enables users to access and analyze data without complex data movement or transformation tasks.
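To get a feel for what pushdown looks like at the API level, here is a standalone request of the kind a query engine issues on our behalf, sent with the AWS CLI against an RGW endpoint. The endpoint, bucket, and object names are placeholders:
aws --endpoint-url https://s3.example.com s3api select-object-content --bucket tpcds --key store_sales.csv --expression \\"SELECT COUNT(*) FROM S3Object s\\" --expression-type SQL --input-serialization \'{\\"CSV\\": {\\"FileHeaderInfo\\": \\"USE\\"}}\' --output-serialization \'{\\"CSV\\": {}}\' /dev/stdout\\n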
Ceph\'s Object Datacenter-Data-Delivery Network (D3N) feature uses high-speed storage such as NVMe SSDs or DRAM to cache datasets on the access side. D3N improves the performance of big-data jobs running in analysis clusters by accelerating recurring reads from the data lake or lakehouse.
We executed the following 72 TPC-DS queries at three different scale factors, 1TB, 2TB and 3TB, to characterize performance and resource consumption. The datasets were in uncompressed CSV format. We executed each query numerous times with and without S3 Select and ensured consistent results by monitoring the standard deviations for each run.
If you’re interested in exploring this topic further, please check out Gal Salomon\'s GitHub repository, where you will find instructions on how to set up a testing environment with Trino and Ceph. Instructions are also provided for the TPC-DS benchmarking tools used for this benchmark.
The hardware used for the benchmark was the following:
These S3 Select settings were adjusted:
The Trino engine processes complex queries by dividing the original query into multiple parallel S3 Select requests. These requests split the requested table (an S3 object) into equal ranges that are then distributed across our Ceph cluster\'s RGW service. The load balancer efficiently channels requests among Ceph Object Gateways, ensuring optimal performance and scalability for our data processing needs.
This next section provides an overview of TPC-DS benchmark results. These results help us understand how the Ceph Object S3 Select feature yields substantial benefits when working with CSV datasets. The benefits include improved query times and reduced data processing. We have included a diagram below that shows the total network traffic reduction achieved by using S3 Select. We can save 144TB of network traffic by utilizing this feature.
The following graph shows the per-query speedup achieved using S3 Select for the 3TB scale dataset. The X axis value is the query number from the above repository and the Y axis value is the speed improvement for each query. During testing we observed that enabling S3 Select improved all 72 queries. The query achieving the most speedup was 9 times faster, and the overall average improvement was around 2.5x.
When S3 Select is enabled we offload computational work to the Ceph Object Gateways, so as expected they saw increased CPU usage when executing the queries with S3 Select enabled. However, the CPU utilization remained at an acceptable level. The increase in memory demand with S3 Select enabled was barely noticeable, with an average increase of 2.50%. Pushdown can process objects of any size since it does so in chunks without preloading the entire object.
Query number 9 was able to reduce the network data processing by 18TB. The total reduction in processed data across all 72 queries was 144 TB when enabling S3 Select.
In this post, we shared the results of our benchmark testing, where we ran 72 TPC-DS queries at 1TB and 3TB scale. We have found that utilizing Ceph Object S3 Select pushdown performance optimizations enables queries to complete more quickly than before with significantly lower resource demands. With Trino and S3 Select, you can push the computational work of projection and predicate operations to Ceph, achieving up to 9x performance improvement in query runtime, with an average of 2.5x. This significantly reduces data transfer across the network, saving 144TB of network traffic for the 72 executed queries. Organizations can enhance data lake performance, reduce costs, and simplify data access by combining Ceph S3 Select with Trino and Presto.
The authors would like to thank IBM for supporting the community with our time to create these posts.
The Ceph Foundation and Ambassadors are busy building a full calendar of Ceph events for 2025!
As many of you know, Ceph has always thrived on its amazing community. Events like Ceph Days and Cephalocon are key opportunities for all of us to learn, connect, and share experiences.
Following the success of Cephalocon at CERN and Ceph Days in India, we’ve announced Ceph Days London & Silicon Valley -- check out https://ceph.io/en/community/events/ to get involved. And watch that space -- Ceph Days in Seattle, New York, and Berlin will be announced soon!
Looking forward, we need your help to help shape our future events... and to plan our Cephalocon! If you have a moment, please share your thoughts in our Ceph Events survey: https://forms.gle/Rm41d547Rb59S8xf9
Looking forward to seeing you at an event soon!
In a world where data must be quickly accessed and protected across multiple geographical regions, multi-site replication is a critical feature of object storage. Whether running global applications or maintaining a robust disaster recovery plan, replicating objects between different regions is essential for redundancy and business continuity.
Ceph Object Storage multi-site replication is asynchronous and log-based. The nature of async replication can make it challenging to validate where your data currently resides or to confirm that it has fully replicated to all remote zones. This is not acceptable for certain applications and use cases that require near-strong consistency on write, with all objects replicated and available at all sites before they are made accessible to the application or user.
As a side note, complete consistency on write can be provided by a Ceph stretch cluster, which replicates synchronously, but this has its limitations for a geo-dispersed deployment because latency is a key factor for stretch clusters. If you need geo-replication, you will thus often implement multi-site async replication for object storage.
Application owners often need to know if an object is already available in the destination zone before triggering subsequent actions (e.g., analytics jobs, further data processing, or regulatory archiving). Storage operations teams may want clear insight into how long replication takes, enabling them to alert and diagnose slow or faulty network links. Automation and data pipelines might require a programmatic way to track replication status before proceeding with the next step in a workflow.
To address this visibility gap, Ceph Squid introduces new HTTP response headers that expose exactly where each object is in the replication process:
x-amz-replication-status
A quick way to determine if the object is pending replication, in progress, or already fully replicated. The status might show PENDING, COMPLETED, or REPLICA depending on configuration.
x-rgw-replicated-from
Shows the source zone from which the object was initially replicated.
x-rgw-replicated-at
Provides a timestamp indicating when the object was successfully replicated. By comparing this to the object’s Last-Modified header, you get an instant measure of replication latency that is valuable for real-time monitoring or performance tuning.
These headers enable a programmatic and deterministic way to know whether data has propagated to the target zone. It’s vital to note that the primary use case for these new HTTP replication headers is to query the status of objects during ingest to help the application make decisions based on the replication status of the objects. This is not intended for infra teams to check the replication status of all objects by scanning through billions of objects.
Developers can integrate these headers into their application logic. After uploading an object, the application can poll the x-amz-replication-status
header to ensure the object is fully available in the destination zone before triggering subsequent actions.
An automated job (sometimes called a synthetic test or canary test) can periodically upload and delete an object, checking how long replication takes. If latency breaches a certain threshold, the operations team can be alerted to investigate potential network or configuration issues.
While polling headers is often the most straightforward approach, you may leverage Ceph S3 bucket notifications for certain replication-related events. Integrating these with a message broker like Kafka can help orchestrate larger, event-driven workflows. For more information see the Ceph docs on S3 notifications.
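As a rough sketch of that approach, a topic with a Kafka push endpoint can be created through the RGW endpoint and the bucket subscribed to sync-related events. The endpoint, broker, zonegroup, and especially the event name are assumptions here, so check the bucket notification documentation for your release before relying on them:
# Create a topic whose push endpoint is a Kafka broker (endpoint and broker are placeholders)\\naws --endpoint-url http://ceph-node-00:8088 sns create-topic --name replication-events --attributes \'{\\"push-endpoint\\": \\"kafka://kafka-broker:9092\\", \\"kafka-ack-level\\": \\"broker\\"}\'\\n\\n# Subscribe the bucket to replication-related events (event name assumed; verify for your release)\\naws --endpoint-url http://ceph-node-00:8088 s3api put-bucket-notification-configuration --bucket bucket1 --notification-configuration \'{\\"TopicConfigurations\\": [{\\"Id\\": \\"sync-events\\", \\"TopicArn\\": \\"arn:aws:sns:<zonegroup>::replication-events\\", \\"Events\\": [\\"s3:ObjectSynced:*\\"]}]}\'\\n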
I have multi-site replication set up between two Ceph clusters. They are part of a zonegroup named multizg, and we have bidirectional full-zone replication configured between zone1 and zone2.
# radosgw-admin sync info\\n{\\n \\"sources\\": [\\n {\\n \\"id\\": \\"all\\",\\n \\"source\\": {\\n \\"zone\\": \\"zone2\\",\\n \\"bucket\\": \\"*\\"\\n },\\n \\"dest\\": {\\n \\"zone\\": \\"zone1\\",\\n \\"bucket\\": \\"*\\"\\n },\\n \\"dests\\": [\\n {\\n \\"id\\": \\"all\\",\\n \\"source\\": {\\n \\"zone\\": \\"zone1\\",\\n \\"bucket\\": \\"*\\"\\n },\\n \\"dest\\": {\\n \\"zone\\": \\"zone2\\",\\n \\"bucket\\": \\"*\\"\\n },\\n...\\n
For detailed information about Ceph Object Storage multisite replication, see the blog series that covers this feature in-depth, from architecture to setup and fine-tuning:
multisite part1
multisite part2
multisite part3
multisite part4
multisite part5
multisite part6
multisite part7
multisite part8
A simple way to view these new replication headers is to use s3cmd
with the --debug
flag, which prints raw HTTP response headers from the Ceph Object Gateway. By filtering for rgw-
or x-amz-
lines, we can easily spot replication-related information.
Let\'s check it out. I uploaded an object into zone1
:
# s3cmd --host ceph-node-00:8088 put /etc/hosts s3://bucket1/file20\\nupload: \'/etc/hosts\' -> \'s3://bucket1/file20\' [1 of 1]\\n 640 of 640 100% in 0s 7.63 KB/s done\\n
When I check the object\'s status on the source zone where I uploaded the object, it’s in the PENDING
state, indicating the object is still replicating. Eventually, once replication is complete, the status will transition to COMPLETED
in the source zone and REPLICA
in the destination zone.
# s3cmd --host ceph-node-00:8088 --debug info s3://bucket1/file20 2>&1 | grep -B 2 \'rgw-\'\\n \'x-amz-replication-status\': \'PENDING\',\\n \'x-amz-request-id\': \'tx00000f2948c72a2d2fb8e-0067a5c961-35964-zone1\',\\n \'x-rgw-object-type\': \'Normal\'},\\n
Now, let’s check on the destination zone endpoint:
# s3cmd --host ceph-node-05:8088 --debug info s3://bucket1/file20 2>&1 | grep -B 2 \'rgw-\'\\n \'x-amz-replication-status\': \'REPLICA\',\\n \'x-amz-request-id\': \'tx00000a98cf7b6a584b95b-0067a5cac9-29779-zone2\',\\n \'x-rgw-object-type\': \'Normal\',\\n \'x-rgw-replicated-at\': \'Fri, 07 Feb 2025 08:50:07 GMT\',\\n \'x-rgw-replicated-from\': \'b6c9ca95-6683-42a5-9dff-ba209039c61b:bucket1:b6c9ca95-6683-42a5-9dff-ba209039c61b.32035.1\'},\\n
Here, the relevant headers tell us:
x-amz-replication-status: REPLICA
x-rgw-replicated-at: \'Fri, 07 Feb 2025 08:50:07 GMT\'
x-rgw-replicated-from: 8f8c3759-aaaf-4e6d-b346-...:bucket1:...
Let\'s check that the status of the object in the source site has moved into the COMPLETED state:
# s3cmd --host ceph-node-00:8088 --debug info s3://bucket1/file20 2>&1 | grep x-amz-replication-status\\n \'x-amz-replication-status\': \'COMPLETED\',\\n
This straightforward polling mechanism—via HEAD
or info requests—can be incorporated into application workflows to confirm full replication before taking further actions. Let’s check out a basic example.
Imagine a Content Delivery Network (CDN) scenario where you must replicate files globally to ensure low-latency access for end users across multiple geographic regions. An application in one region uploads media assets (images, videos, or static website content) that must be replicated to other RGW zones before we can make them available to the end users for consumption.
Here is a code snippet with an example of using the Python library boto3
to upload media content to a site, then poll the replication status of our newly uploaded media content by querying the replication status header. Once the object has been replicated we print out relevant information including source and destination RGW zones and replication latency.
Application Example output:
The new replication headers in Ceph Squid Object Storage mark a significant step forward in giving developers, DevOps teams, and storage administrators more granular control over and visibility into multisite replication. By querying the x-amz-replication-status, x-rgw-replicated-from, and x-rgw-replicated-at headers, applications can confirm that objects have fully synchronized before proceeding with downstream workflows. This simple yet powerful capability can streamline CDN distribution, data analytics pipelines, and other use cases that demand multisite consistency.
Note that some features described here may not be available before the Squid 19.2.2 release.
The authors would like to thank IBM for supporting the community with our time to create these posts.
A block is a small, fixed-sized piece of data, like a 512-byte chunk. Imagine breaking a book into many pages: each page is a \\"block\\". When you put all the pages together, you get a complete book, just like combining many smaller data blocks creates a larger storage unit.
Block-based storage is commonly used in computing, and in devices including:
Ceph RADOS Block Devices are a type of storage served by a Ceph cluster that works like a virtual physical storage drive. Instead of storing data on a single device, it spreads the data across multiple storage nodes and devices (called OSDs) in a Ceph cluster. This makes it efficient, scalable, and reliable.
Ceph Block Devices have some amazing features:
Clients use the librbd
library to talk to the components of a Ceph cluster and manage data efficiently.
In short, Ceph Block Devices provide fast, scalable, and reliable storage for modern computing needs, ensuring that data is always available, accurate, and safe.
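As a quick refresher before we get into migration, creating and inspecting an RBD image takes only a few commands; the pool and image names below are examples:
# Create a pool, initialize it for RBD, and create a 10 GiB image\\nceph osd pool create rbdpool\\nrbd pool init rbdpool\\nrbd create rbdpool/image1 --size 10G\\n\\n# Inspect the image\\nrbd info rbdpool/image1\\n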
As a storage administrator, you have the power to seamlessly move (live-migrate) RBD images within your Ceph cluster or to a different Ceph cluster. Think of this like moving a file from one folder to another on your computer, but in this case, it happens within or between large, distributed storage clusters.
Note that RBD images are also known as volumes, a term that is less likely to be confused with graphics files containing the likeness of Taylor Swift or cats.
Note: Linux KRBD kernel clients currently do not support live migration.
Want to migrate data from an external source or storage provider? No problem! You can:
When you start a live migration, here’s what happens behind the scenes:
Live migration of RBD images in Ceph allows you to move storage seamlessly between pools, RBD namespaces, and clusters in different formats with minimal downtime. Let\'s break it down into three simple steps, along with the necessary commands to execute them.
Before starting the migration, a new target image is created and linked to the source image.
Syntax:
rbd migration prepare SOURCE_POOL_NAME/SOURCE_IMAGE_NAME TARGET_POOL_NAME/TARGET_IMAGE_NAME\\n
Example:
rbd migration prepare source_pool1/source_image1 target_pool1/target_image1\\n
Initiate the import-only live migration process by running the rbd migration prepare
command with the --import-only
flag and either --source-spec
or --source-spec-path
option, passing a JSON file that describes how to access the source image data.
[ceph: root@rbd-client /]# cat testspec.json\\n {\\n \\"type\\": \\"raw\\",\\n \\"stream\\": {\\n \\"type\\": \\"s3\\",\\n \\"url\\": \\"http://host_ip:80/testbucket1/image.raw\\",\\n \\"access_key\\": \\"Access key\\",\\n \\"secret_key\\": \\"Secret Key\\"}\\n}\\n
Syntax:
rbd migration prepare --import-only --source-spec-path \\"JSON_FILE\\" TARGET_POOL_NAME/TARGET_IMAGE_NAME\\n
Example:
[ceph: root@rbd-client /]# rbd migration prepare --import-only --source-spec-path \\"testspec.json\\" target_pool/target_image\\n
Check the status of the migration with the rbd status
command:
[ceph: root@rbd-client /]# rbd status target_pool/target_image\\nWatchers: none\\nMigration:\\nsource: {\\"stream\\":{\\"access_key\\":\\"RLJOCP6345BGB38YQXI5\\",\\"secret_key\\":\\"oahWRB2ote2rnLy4dojYjDrsvaBADriDDgtSfk6o\\",\\"type\\":\\"s3\\",\\"url\\":\\"http://10.74.253.18:80/testbucket1/image.raw\\"},\\"type\\":\\"raw\\"}\\ndestination: targetpool1/sourceimage1 (b13865345e66)\\nstate: prepared\\n
After preparation is complete, Ceph starts deep copying all existing data from the source image to the target image.
Syntax:
rbd migration execute TARGET_POOL_NAME/TARGET_IMAGE_NAME\\n
Example:
rbd migration execute target_pool1/target_image1\\n
After the data has been fully transferred, commit or abort the migration.
Committing the migration removes all links between the source and target images.
Syntax:
rbd migration commit TARGET_POOL_NAME/TARGET_IMAGE_NAME\\n
Example:
rbd migration commit target_pool1/target_image1\\n
Migrations can be cancelled. Cancelling a migration will cause the following to happen:
Syntax:
rbd migration abort TARGET_POOL_NAME/TARGET_IMAGE_NAME\\n
Example:
rbd migration abort targetpool1/targetimage1\\n
The following example shows how to migrate data from one Ceph cluster to another, here named c1 and c2:
[ceph: root@rbd-client /]# cat /tmp/native_spec\\n{\\n \\"cluster_name\\": \\"c1\\",\\n \\"type\\": \\"native\\",\\n \\"pool_name\\": \\"pool1\\",\\n \\"image_name\\": \\"image1\\",\\n \\"snap_name\\": \\"snap1\\"\\n}\\n[ceph: root@rbd-client /]# rbd migration prepare --import-only --source-spec-path /tmp/native_spec c2pool1/c2image1 --cluster c2\\n[ceph: root@rbd-client /]# rbd migration execute c2pool1/c2image1 --cluster c2\\nImage migration: 100% complete...done.\\n[ceph: root@rbd-client /]# rbd migration commit c2pool1/c2image1 --cluster c2\\nCommit image migration: 100% complete...done.\\n
Live migration supports three primary formats:
The native format does not include the stream since it utilizes native Ceph operations. For example, to import from the image rbd/ns1/image1@snap1
, the source specification can be constructed as below:
{\\n\\"type\\": \\"native\\",\\n\\"pool_name\\": \\"rbd\\",\\n\\"pool_namespace\\": \\"ns1\\",\\n\\"image_name\\": \\"image1\\",\\n\\"snap_name\\": \\"snap1\\"\\n}\\n
The QCOW format describes a QEMU copy-on-write (QCOW) block device. QCOW v1 and v2 formats are currently supported, with the exception of certain features including compression, encryption, backing files, and external data files. Use the QCOW format with any supported stream source:
{\\n \\"type\\": \\"qcow\\",\\n \\"stream\\": {\\n \\"type\\": \\"file\\",\\n \\"file_path\\": \\"/mnt/image.qcow\\"\\n }\\n}\\n
{\\n \\"type\\": \\"raw\\",\\n \\"stream\\": {\\n \\"type\\": \\"file\\",\\n \\"file_path\\": \\"/mnt/image-head.raw\\"\\n },\\n \\"snapshots\\": [\\n {\\n \\"type\\": \\"raw\\",\\n \\"name\\": \\"snap1\\",\\n \\"stream\\": {\\n \\"type\\": \\"file\\",\\n \\"file_path\\": \\"/mnt/image-snap1.raw\\"\\n }\\n },\\n (optional oldest to newest ordering of snapshots)\\n}\\n
Multiple stream types are available for importing from various data sources:
Use a file
stream to import from a locally accessible POSIX file source.
{\\n <format-specific parameters>\\n \\"stream\\": {\\n \\"type\\": \\"file\\",\\n \\"file_path\\": \\"FILE_PATH\\"\\n }\\n}\\n
Use an HTTP
stream to import from an HTTP or HTTPS web server.
{\\n <format-specific parameters>\\n \\"stream\\": {\\n \\"type\\": \\"http\\",\\n \\"url\\": \\"URL_PATH\\"\\n }\\n}\\n
Use an s3
stream to import from an S3 bucket.
{\\n <format-specific parameters>\\n \\"stream\\": {\\n \\"type\\": \\"s3\\",\\n \\"url\\": \\"URL_PATH\\",\\n \\"access_key\\": \\"ACCESS_KEY\\",\\n \\"secret_key\\": \\"SECRET_KEY\\"\\n }\\n}\\n
Use an NBD
stream to import from a remote NBD export.
{\\n <format-specific parameters>\\n \\"stream\\": {\\n \\"type\\": \\"nbd\\",\\n \\"uri\\": \\"<nbd-uri>\\"\\n }\\n}\\n
The nbd-uri parameter must follow the NBD URI specification. The default NBD port is tcp/10809.
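For example, a TCP export and a UNIX-socket export could be referenced as follows; the host, export, and socket names are placeholders:
\\"uri\\": \\"nbd://nbd-server:10809/my-export\\"\\n\\"uri\\": \\"nbd+unix:///my-export?socket=/tmp/nbd.sock\\"\\n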
Disaster Recovery and Data Migration
Scenario: An organization runs mission-critical applications on a primary Ceph cluster in one data center. Due to an impending maintenance window, potential hardware failure, or a disaster event, they need to migrate RBD images to a secondary Ceph cluster in a different location.
Benefit: Live migration ensures that applications using RBD volumes can continue functioning with minimal downtime and no data loss during the transition to the secondary cluster.
Bursting and Workload Distribution
Scenario: An organization operates a Ceph cluster that accommodates routine workloads but occasionally requires extra capacity during peak usage. By migrating RBD images to an external Ceph cluster (possibly deployed in a cloud) they can temporarily scale out operations and then scale back.
Benefit: Dynamic workload balancing helps admins leverage external resources only when needed, reducing operational costs and improving scalability.
Data Center Migration
Scenario: An organization is migrating infrastructure from one physical data center to another due to an upgrade, consolidation, or relocation. All RBD images from the source Ceph cluster need to be moved to a destination Ceph cluster in the new location.
Benefit: Live migration minimizes disruptions to services during data center migrations, maintaining application availability.
Compliance and Data Sovereignty
Scenario: An organization must comply with local data residency regulations that require sensitive data to be stored within specific geographic boundaries. Data held in RBD images thus must be migrated from a general-purpose Ceph cluster to one dedicated to and within the regulated region.
Benefit: The live migration feature enables seamless relocation of RBD data without halting ongoing operations, ensuring compliance with regulations.
Multi-Cluster Load and Capacity Balancing
Scenario: An organization runs multiple Ceph clusters to handle high traffic workloads. To prevent overloading any single cluster, they redistribute RBD images among clusters as workload patterns shift.
Benefit: Live migration allows for efficient rebalancing of workloads across Ceph clusters, optimizing resource utilization and performance.
Dev/Test to Production Migration
Scenario: Developers run test environments on a dedicated Ceph cluster. After testing is complete, production-ready RBD images can be migrated to the production Ceph cluster without data duplication or downtime.
Benefit: Simplifies the process of promoting test data to production while maintaining data integrity.
Hardware Lifecycle Management
Scenario: A Ceph cluster is running on older hardware that is nearing the end of its lifecycle. The admin plans to migrate RBD images to a new Ceph cluster with upgraded hardware for better performance and reliability.
Benefit: Live migration facilitates a smooth transition from legacy to modern infrastructure without impacting application uptime.
Note: In many situations one can incrementally replace 100% of Ceph cluster hardware in situ without downtime or migration, but in others it may be desirable to stand up a new, independent cluster and migrate data between the two.
Global Data Replication
Scenario: An enterprise has Ceph clusters distributed across locations to improve latency for regional end users. RBD images can be migrated from one region to another based on data center additions or closures, changes in user traffic patterns, or business priorities.
Benefit: Enhances user experience by moving data closer to the point of consumption while maintaining data consistency.
Ceph live migration of RBD images provides a seamless and efficient way to move storage data and workloads without disrupting operations. By leveraging native Ceph operations and external stream sources, administrators can ensure smooth and flexible data migration processes.
The authors would like to thank IBM for supporting the community with our time to create these posts.
This is the first backport release in the Squid series. We recommend all users update to this release.
CephFS: The command fs subvolume create now allows tagging subvolumes by supplying the option --earmark with a unique identifier needed for NFS or SMB services. The earmark string for a subvolume is empty by default. To remove an already present earmark, an empty string can be assigned to it. Additionally, the commands ceph fs subvolume earmark set, ceph fs subvolume earmark get, and ceph fs subvolume earmark rm have been added to set, get and remove the earmark for a given subvolume.
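A hedged sketch of how this could look on the CLI is shown below. The create form with --earmark follows the release note, while the earmark subcommand arguments are assumptions that should be checked against the command help:
# Create a subvolume earmarked for SMB at creation time\\nceph fs subvolume create cephfs subvol1 --earmark smb\\n\\n# Inspect the earmark later (argument order assumed; see \'ceph fs subvolume earmark get -h\')\\nceph fs subvolume earmark get cephfs subvol1\\n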
CephFS: Expanded removexattr support for CephFS virtual extended attributes. Previously one had to use setxattr to restore the default in order to \\"remove\\". You may now properly use removexattr to remove. You can also now remove layout on the root inode, which then will restore the layout to the default.
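For example, with the file system mounted, removing a previously set directory layout now behaves as expected; the paths below are examples:
# Remove a custom layout from a directory so it inherits the default again\\nsetfattr -x ceph.dir.layout /mnt/cephfs/mydir\\n\\n# Removing the layout on the root inode restores the default layout\\nsetfattr -x ceph.dir.layout /mnt/cephfs\\n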
RADOS: A performance bottleneck in the balancer mgr module has been fixed.
Related Tracker: https://tracker.ceph.com/issues/68657
RADOS: Based on tests performed at scale on an HDD-based Ceph cluster, it was found that scheduling with mClock was not optimal with multiple OSD shards. For example, in the test cluster with multiple OSD node failures, the client throughput was found to be inconsistent across test runs coupled with multiple reported slow requests. However, the same test with a single OSD shard and with multiple worker threads yielded significantly better results in terms of consistency of client and recovery throughput across multiple test runs. Therefore, as an interim measure until the issue with multiple OSD shards (or multiple mClock queues per OSD) is investigated and fixed, the following change to the default HDD OSD shard configuration is made:
osd_op_num_shards_hdd = 1 (was 5)
osd_op_num_threads_per_shard_hdd = 5 (was 1)
For more details, see https://tracker.ceph.com/issues/66289.
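To confirm what your OSDs are using after an upgrade, or to pin the values explicitly, the standard config commands apply; note that shard count changes generally require an OSD restart to take effect:
ceph config get osd osd_op_num_shards_hdd\\nceph config get osd osd_op_num_threads_per_shard_hdd\\n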
mgr/REST: The REST manager module will trim requests based on the \'max_requests\' option. Without this feature, and in the absence of manual deletion of old requests, the accumulation of requests in the array can lead to Out Of Memory (OOM) issues, resulting in the Manager crashing.
doc/rgw/notification: add missing admin commands (pr#60609, Yuval Lifshitz)
squid: [RGW] Fix the handling of HEAD requests that do not comply with RFC standards (pr#59123, liubingrun)
squid: a series of optimizations for kerneldevice discard (pr#59065, Adam Kupczyk, Joshua Baergen, Gabriel BenHanokh, Matt Vandermeulen)
squid: Add Containerfile and build.sh to build it (pr#60229, Dan Mick)
squid: AsyncMessenger: Don\'t decrease l_msgr_active_connections if it is negative (pr#60447, Mohit Agrawal)
squid: blk/aio: fix long batch (64+K entries) submission (pr#58676, Yingxin Cheng, Igor Fedotov, Adam Kupczyk, Robin Geuze)
squid: blk/KernelDevice: using join() to wait thread end is more safe (pr#60616, Yite Gu)
squid: bluestore/bluestore_types: avoid heap-buffer-overflow in another way to keep code uniformity (pr#58816, Rongqi Sun)
squid: ceph-bluestore-tool: Fixes for multilple bdev label (pr#59967, Adam Kupczyk, Igor Fedotov)
squid: ceph-volume: add call to ceph-bluestore-tool zap-device
(pr#59968, Guillaume Abrioux)
squid: ceph-volume: add new class UdevData (pr#60091, Guillaume Abrioux)
squid: ceph-volume: add TPM2 token enrollment support for encrypted OSDs (pr#59196, Guillaume Abrioux)
squid: ceph-volume: do not convert LVs\'s symlink to real path (pr#58954, Guillaume Abrioux)
squid: ceph-volume: do source devices zapping if they\'re detached (pr#58964, Guillaume Abrioux, Igor Fedotov)
squid: ceph-volume: drop unnecessary call to get\\\\_single\\\\_lv()
(pr#60353, Guillaume Abrioux)
squid: ceph-volume: fix dmcrypt activation regression (pr#60734, Guillaume Abrioux)
squid: ceph-volume: fix generic activation with raw osds (pr#59598, Guillaume Abrioux)
squid: ceph-volume: fix OSD lvm/tpm2 activation (pr#59953, Guillaume Abrioux)
squid: ceph-volume: pass self.osd_id to create_id() call (pr#59622, Guillaume Abrioux)
squid: ceph-volume: switch over to new disk sorting behavior (pr#59623, Guillaume Abrioux)
squid: ceph.spec.in: we need jsonnet for all distroes for make check (pr#60075, Kyr Shatskyy)
squid: cephadm/services/ingress: fixed keepalived config bug (pr#58381, Bernard Landon)
Squid: cephadm: bootstrap should not have \\"This is a development version of cephadm\\" message (pr#60880, Shweta Bhosale)
squid: cephadm: emit warning if daemon\'s image is not to be used (pr#59929, Matthew Vernon)
squid: cephadm: fix apparmor profiles with spaces in the names (pr#58542, John Mulligan)
squid: cephadm: pull container images from quay.io (pr#60354, Guillaume Abrioux)
squid: cephadm: Support Docker Live Restore (pr#59933, Michal Nasiadka)
squid: cephadm: update default image and latest stable release (pr#59827, Adam King)
squid: cephfs,mon: fix bugs related to updating MDS caps (pr#59672, Rishabh Dave)
squid: cephfs-shell: excute cmd \'rmdir_helper\' reported error (pr#58810, teng jie)
squid: cephfs: Fixed a bug in the readdir_cache_cb function that may have us… (pr#58804, Tod Chen)
squid: cephfs_mirror: provide metrics for last successful snapshot sync (pr#59070, Jos Collin)
squid: cephfs_mirror: update peer status for invalid metadata in remote snapshot (pr#59406, Jos Collin)
squid: cephfs_mirror: use snapdiff api for incremental syncing (pr#58984, Jos Collin)
squid: client: calls to _ll_fh_exists() should hold client_lock (pr#59487, Venky Shankar)
squid: client: check mds down status before getting mds_gid_t from mdsmap (pr#58587, Yite Gu, Dhairya Parmar)
squid: cls/user: reset stats only returns marker when truncated (pr#60164, Casey Bodley)
squid: cmake: use ExternalProjects to build isa-l and isa-l_crypto libraries (pr#60107, Casey Bodley)
squid: common,osd: Use last valid OSD IOPS value if measured IOPS is unrealistic (pr#60660, Sridhar Seshasayee)
squid: common/dout: fix FTBFS on GCC 14 (pr#59055, Radoslaw Zarzynski)
squid: common/options: Change HDD OSD shard configuration defaults for mClock (pr#59973, Sridhar Seshasayee)
squid: corpus: update submodule with mark cls_rgw_reshard_entry forward_inco… (pr#58923, NitzanMordhai)
squid: crimson/os/seastore/cached_extent: add the "refresh" ability to lba mappings (pr#58957, Xuehan Xu)
squid: crimson/os/seastore/lba_manager: do batch mapping allocs when remapping multiple mappings (pr#58820, Xuehan Xu)
squid: crimson/os/seastore/onode: add hobject_t into Onode (pr#58830, Xuehan Xu)
squid: crimson/os/seastore/transaction_manager: consider inconsistency between backrefs and lbas acceptable when cleaning segments (pr#58837, Xuehan Xu)
squid: crimson/os/seastore: add checksum offload to RBM (pr#59298, Myoungwon Oh)
squid: crimson/os/seastore: add writer level stats to RBM (pr#58828, Myoungwon Oh)
squid: crimson/os/seastore: track transactions/conflicts/outstanding periodically (pr#58835, Yingxin Cheng)
squid: crimson/osd/pg_recovery: push the iteration forward after finding unfound objects when starting primary recoveries (pr#58958, Xuehan Xu)
squid: crimson: access coll_map under alien tp with a lock (pr#58841, Samuel Just)
squid: crimson: audit and correct epoch captured by IOInterruptCondition (pr#58839, Samuel Just)
squid: crimson: simplify obc loading by locking excl for load and demoting to needed lock (pr#58905, Matan Breizman, Samuel Just)
squid: debian pkg: record python3-packaging dependency for ceph-volume (pr#59202, Kefu Chai, Thomas Lamprecht)
squid: doc,mailmap: update my email / association to ibm (pr#60338, Patrick Donnelly)
squid: doc/ceph-volume: add spillover fix procedure (pr#59540, Zac Dover)
squid: doc/cephadm: add malformed-JSON removal instructions (pr#59663, Zac Dover)
squid: doc/cephadm: Clarify "Deploying a new Cluster" (pr#60809, Zac Dover)
squid: doc/cephadm: clean "Adv. OSD Service Specs" (pr#60679, Zac Dover)
squid: doc/cephadm: correct "ceph orch apply" command (pr#60432, Zac Dover)
squid: doc/cephadm: how to get exact size_spec from device (pr#59430, Zac Dover)
squid: doc/cephadm: link to "host pattern" matching sect (pr#60644, Zac Dover)
squid: doc/cephadm: Update operations.rst (pr#60637, rhkelson)
squid: doc/cephfs: add cache pressure information (pr#59148, Zac Dover)
squid: doc/cephfs: add doc for disabling mgr/volumes plugin (pr#60496, Rishabh Dave)
squid: doc/cephfs: edit "Disabling Volumes Plugin" (pr#60467, Zac Dover)
squid: doc/cephfs: edit "Layout Fields" text (pr#59021, Zac Dover)
squid: doc/cephfs: edit 3rd 3rd of mount-using-kernel-driver (pr#61080, Zac Dover)
squid: doc/cephfs: improve "layout fields" text (pr#59250, Zac Dover)
squid: doc/cephfs: improve cache-configuration.rst (pr#59214, Zac Dover)
squid: doc/cephfs: rearrange subvolume group information (pr#60435, Indira Sawant)
squid: doc/cephfs: s/mountpoint/mount point/ (pr#59294, Zac Dover)
squid: doc/cephfs: s/mountpoint/mount point/ (pr#59289, Zac Dover)
squid: doc/cephfs: use 'p' flag to set layouts or quotas (pr#60482, TruongSinh Tran-Nguyen)
squid: doc/dev/peering: Change acting set num (pr#59062, qn2060)
squid: doc/dev/release-checklist: check telemetry validation (pr#59813, Yaarit Hatuka)
squid: doc/dev/release-checklists.rst: enable rtd for squid (pr#59812, Neha Ojha)
squid: doc/dev/release-process.rst: New container build/release process (pr#60971, Dan Mick)
squid: doc/dev: add "activate latest release" RTD step (pr#59654, Zac Dover)
squid: doc/dev: instruct devs to backport (pr#61063, Zac Dover)
squid: doc/dev: remove "Stable Releases and Backports" (pr#60272, Zac Dover)
squid: doc/glossary.rst: add "Dashboard Plugin" (pr#60896, Zac Dover)
squid: doc/glossary: add "ceph-ansible" (pr#59007, Zac Dover)
squid: doc/glossary: add "flapping OSD" (pr#60864, Zac Dover)
squid: doc/glossary: add "object storage" (pr#59424, Zac Dover)
squid: doc/glossary: add "PLP" to glossary (pr#60503, Zac Dover)
squid: doc/governance: add exec council responsibilites (pr#60139, Zac Dover)
squid: doc/governance: add Zac Dover's updated email (pr#60134, Zac Dover)
squid: doc/install: Keep the name field of the created user consistent with … (pr#59756, hejindong)
squid: doc/man: edit ceph-bluestore-tool.rst (pr#59682, Zac Dover)
squid: doc/mds: improve wording (pr#59585, Piotr Parczewski)
squid: doc/mgr/dashboard: fix TLS typo (pr#59031, Mindy Preston)
squid: doc/rados/operations: Improve health-checks.rst (pr#59582, Anthony D'Atri)
squid: doc/rados/troubleshooting: Improve log-and-debug.rst (pr#60824, Anthony D'Atri)
squid: doc/rados: add "pgs not deep scrubbed in time" info (pr#59733, Zac Dover)
squid: doc/rados: add blaum_roth coding guidance (pr#60537, Zac Dover)
squid: doc/rados: add confval directives to health-checks (pr#59871, Zac Dover)
squid: doc/rados: add link to messenger v2 info in mon-lookup-dns.rst (pr#59794, Zac Dover)
squid: doc/rados: add osd_deep_scrub_interval setting operation (pr#59802, Zac Dover)
squid: doc/rados: correct "full ratio" note (pr#60737, Zac Dover)
squid: doc/rados: document unfound object cache-tiering scenario (pr#59380, Zac Dover)
squid: doc/rados: edit "Placement Groups Never Get Clean" (pr#60046, Zac Dover)
squid: doc/rados: fix sentences in health-checks (2 of x) (pr#60931, Zac Dover)
squid: doc/rados: fix sentences in health-checks (3 of x) (pr#60949, Zac Dover)
squid: doc/rados: make sentences agree in health-checks.rst (pr#60920, Zac Dover)
squid: doc/rados: standardize markup of "clean" (pr#60500, Zac Dover)
squid: doc/radosgw/multisite: fix Configuring Secondary Zones -> Updating the Period (pr#60332, Casey Bodley)
squid: doc/radosgw/qat-accel: Update and Add QATlib information (pr#58874, Feng, Hualong)
squid: doc/radosgw: Improve archive-sync-module.rst (pr#60852, Anthony D'Atri)
squid: doc/radosgw: Improve archive-sync-module.rst more (pr#60867, Anthony D'Atri)
squid: doc/radosgw: Improve config-ref.rst (pr#59578, Anthony D'Atri)
squid: doc/radosgw: improve qat-accel.rst (pr#59179, Anthony D\'Atri)
squid: doc/radosgw: s/Poliicy/Policy/ (pr#60707, Zac Dover)
squid: doc/radosgw: update rgw_dns_name doc (pr#60885, Zac Dover)
squid: doc/rbd: add namespace information for mirror commands (pr#60269, N Balachandran)
squid: doc/README.md - add ordered list (pr#59798, Zac Dover)
squid: doc/README.md: create selectable commands (pr#59834, Zac Dover)
squid: doc/README.md: edit \\"Build Prerequisites\\" (pr#59637, Zac Dover)
squid: doc/README.md: improve formatting (pr#59785, Zac Dover)
squid: doc/README.md: improve formatting (pr#59700, Zac Dover)
squid: doc/rgw/account: Handling notification topics when migrating an existing user into an account (pr#59491, Oguzhan Ozmen)
squid: doc/rgw/d3n: pass cache dir volume to extra_container_args (pr#59767, Mark Kogan)
squid: doc/rgw/notification: clarified the notification_v2 behavior upon upg… (pr#60662, Yuval Lifshitz)
squid: doc/rgw/notification: persistent notification queue full behavior (pr#59233, Yuval Lifshitz)
squid: doc/start: add supported Squid distros (pr#60557, Zac Dover)
squid: doc/start: add vstart install guide (pr#60461, Zac Dover)
squid: doc/start: fix "are are" typo (pr#60708, Zac Dover)
squid: doc/start: separate package chart from container chart (pr#60698, Zac Dover)
squid: doc/start: update os-recommendations.rst (pr#60766, Zac Dover)
squid: doc: Correct link to Prometheus docs (pr#59559, Matthew Vernon)
squid: doc: Document the Windows CI job (pr#60033, Lucian Petrut)
squid: doc: Document which options are disabled by mClock (pr#60671, Niklas Hambüchen)
squid: doc: documenting the feature that scrub clear the entries from damage… (pr#59078, Neeraj Pratap Singh)
squid: doc: explain the consequence of enabling mirroring through monitor co… (pr#60525, Jos Collin)
squid: doc: fix email (pr#60233, Ernesto Puerta)
squid: doc: fix typo (pr#59991, N Balachandran)
squid: doc: Harmonize 'mountpoint' (pr#59291, Anthony D'Atri)
squid: doc: s/Whereas,/Although/ (pr#60593, Zac Dover)
squid: doc: SubmittingPatches-backports - remove backports team (pr#60297, Zac Dover)
squid: doc: Update "Getting Started" to link to start not install (pr#59907, Matthew Vernon)
squid: doc: update Key Idea in cephfs-mirroring.rst (pr#60343, Jos Collin)
squid: doc: update nfs doc for Kerberos setup of ganesha in Ceph (pr#59939, Avan Thakkar)
squid: doc: update tests-integration-testing-teuthology-workflow.rst (pr#59548, Vallari Agrawal)
squid: doc:update e-mail addresses governance (pr#60084, Tobias Fischer)
squid: docs/rados/operations/stretch-mode: warn device class is not supported (pr#59099, Kamoltat Sirivadhna)
squid: global: Call getnam_r with a 64KiB buffer on the heap (pr#60127, Adam Emerson)
squid: librados: use CEPH_OSD_FLAG_FULL_FORCE for IoCtxImpl::remove (pr#59284, Chen Yuanrun)
squid: librbd/crypto/LoadRequest: clone format for migration source image (pr#60171, Ilya Dryomov)
squid: librbd/crypto: fix issue when live-migrating from encrypted export (pr#59145, Ilya Dryomov)
squid: librbd/migration/HttpClient: avoid reusing ssl_stream after shut down (pr#61095, Ilya Dryomov)
squid: librbd/migration: prune snapshot extents in RawFormat::list_snaps() (pr#59661, Ilya Dryomov)
squid: librbd: avoid data corruption on flatten when object map is inconsistent (pr#61168, Ilya Dryomov)
squid: log: save/fetch thread name infra (pr#60279, Milind Changire)
squid: Make mon addrs consistent with mon info (pr#60751, shenjiatong)
squid: mds/QuiesceDbManager: get requested state of members before iterating… (pr#58912, junxiang Mu)
squid: mds: CInode::item_caps used in two different lists (pr#56887, Dhairya Parmar)
squid: mds: encode quiesce payload on demand (pr#59517, Patrick Donnelly)
squid: mds: find a new head for the batch ops when the head is dead (pr#57494, Xiubo Li)
squid: mds: fix session/client evict command (pr#58727, Neeraj Pratap Singh)
squid: mds: only authpin on wrlock when not a locallock (pr#59097, Patrick Donnelly)
squid: mgr/balancer: optimize 'balancer status detail' (pr#60718, Laura Flores)
squid: mgr/cephadm/services/ingress Fix HAProxy to listen on IPv4 and IPv6 (pr#58515, Bernard Landon)
squid: mgr/cephadm: add "original_weight" parameter to OSD class (pr#59410, Adam King)
squid: mgr/cephadm: add --no-exception-when-missing flag to cert-store cert/key get (pr#59935, Adam King)
squid: mgr/cephadm: add command to expose systemd units of all daemons (pr#59931, Adam King)
squid: mgr/cephadm: bump monitoring stacks version (pr#58711, Nizamudeen A)
squid: mgr/cephadm: make ssh keepalive settings configurable (pr#59710, Adam King)
squid: mgr/cephadm: redeploy when some dependency daemon is add/removed (pr#58383, Redouane Kachach)
squid: mgr/cephadm: Update multi-site configs before deploying daemons on rgw service create (pr#60321, Aashish Sharma)
squid: mgr/cephadm: use host address while updating rgw zone endpoints (pr#59948, Aashish Sharma)
squid: mgr/client: validate connection before sending (pr#58887, NitzanMordhai)
squid: mgr/dashboard: add cephfs rename REST API (pr#60620, Yite Gu)
squid: mgr/dashboard: Add group field in nvmeof service form (pr#59446, Afreen Misbah)
squid: mgr/dashboard: add gw_groups support to nvmeof api (pr#59751, Nizamudeen A)
squid: mgr/dashboard: add gw_groups to all nvmeof endpoints (pr#60310, Nizamudeen A)
squid: mgr/dashboard: add restful api for creating crush rule with type of 'erasure' (pr#59139, sunlan)
squid: mgr/dashboard: Changes for Sign out text to Login out (pr#58988, Prachi Goel)
Squid: mgr/dashboard: Cloning subvolume not listing _nogroup if no subvolume (pr#59951, Dnyaneshwari talwekar)
squid: mgr/dashboard: custom image for kcli bootstrap script (pr#59879, Pedro Gonzalez Gomez)
squid: mgr/dashboard: Dashboard not showing Object/Overview correctly (pr#59038, Aashish Sharma)
squid: mgr/dashboard: Fix adding listener and null issue for groups (pr#60078, Afreen Misbah)
squid: mgr/dashboard: fix bucket get for s3 account owned bucket (pr#60466, Nizamudeen A)
squid: mgr/dashboard: fix ceph-users api doc (pr#59140, Nizamudeen A)
squid: mgr/dashboard: fix doc links in rgw-multisite (pr#60154, Pedro Gonzalez Gomez)
squid: mgr/dashboard: fix gateways section error:”404 - Not Found RGW Daemon not found: None” (pr#60231, Aashish Sharma)
squid: mgr/dashboard: fix group name bugs in the nvmeof API (pr#60348, Nizamudeen A)
squid: mgr/dashboard: fix handling NaN values in dashboard charts (pr#59961, Aashish Sharma)
squid: mgr/dashboard: fix lifecycle issues (pr#60378, Pedro Gonzalez Gomez)
squid: mgr/dashboard: Fix listener deletion (pr#60292, Afreen Misbah)
squid: mgr/dashboard: fix setting compression type while editing rgw zone (pr#59970, Aashish Sharma)
Squid: mgr/dashboard: Forbid snapshot name "." and any containing "/" (pr#59995, Dnyaneshwari Talwekar)
squid: mgr/dashboard: handle infinite values for pools (pr#61096, Afreen)
squid: mgr/dashboard: ignore exceptions raised when no cert/key found (pr#60311, Nizamudeen A)
squid: mgr/dashboard: Increase maximum namespace count to 1024 (pr#59717, Afreen Misbah)
squid: mgr/dashboard: introduce server side pagination for osds (pr#60294, Nizamudeen A)
squid: mgr/dashboard: mgr/dashboard: Select no device by default in EC profile (pr#59811, Afreen Misbah)
Squid: mgr/dashboard: multisite sync policy improvements (pr#59965, Naman Munet)
Squid: mgr/dashboard: NFS Export form fixes (pr#59900, Dnyaneshwari Talwekar)
squid: mgr/dashboard: Nvme mTLS support and service name changes (pr#59819, Afreen Misbah)
squid: mgr/dashboard: provide option to enable pool based mirroring mode while creating a pool (pr#58638, Aashish Sharma)
squid: mgr/dashboard: remove cherrypy_backports.py (pr#60632, Nizamudeen A)
Squid: mgr/dashboard: remove orch required decorator from host UI router (list) (pr#59851, Naman Munet)
squid: mgr/dashboard: Rephrase dedicated pool helper in rbd create form (pr#59721, Aashish Sharma)
Squid: mgr/dashboard: RGW multisite sync remove zones fix (pr#59825, Naman Munet)
squid: mgr/dashboard: rm nvmeof conf based on its daemon name (pr#60604, Nizamudeen A)
Squid: mgr/dashboard: service form hosts selection only show up to 10 entries (pr#59760, Naman Munet)
squid: mgr/dashboard: show non default realm sync status in rgw overview page (pr#60232, Aashish Sharma)
squid: mgr/dashboard: Show which daemons failed in CEPHADM_FAILED_DAEMON healthcheck (pr#59597, Aashish Sharma)
Squid: mgr/dashboard: sync policy's in Object >> Multi-site >> Sync-policy, does not show the zonegroup to which policy belongs to (pr#60346, Naman Munet)
Squid: mgr/dashboard: The subvolumes are missing from the dropdown menu on the "Create NFS export" page (pr#60356, Dnyaneshwari Talwekar)
Squid: mgr/dashboard: unable to edit pipe config for bucket level policy of bucket (pr#60293, Naman Munet)
squid: mgr/dashboard: Update nvmeof microcopies (pr#59718, Afreen Misbah)
squid: mgr/dashboard: update period after migrating to multi-site (pr#59964, Aashish Sharma)
squid: mgr/dashboard: update translations for squid (pr#60367, Nizamudeen A)
squid: mgr/dashboard: use grafana server instead of grafana-server in grafana 10.4.0 (pr#59722, Aashish Sharma)
Squid: mgr/dashboard: Wrong(half) uid is observed in dashboard when user created via cli contains $ in its name (pr#59693, Dnyaneshwari Talwekar)
squid: mgr/dashboard: Zone details showing incorrect data for data pool values and compression info for Storage Classes (pr#59596, Aashish Sharma)
Squid: mgr/dashboard: zonegroup level policy created at master zone did not sync to non-master zone (pr#59892, Naman Munet)
squid: mgr/nfs: generate user_id & access_key for apply_export(CephFS) (pr#59896, Avan Thakkar, avanthakkar, John Mulligan)
squid: mgr/orchestrator: fix encrypted flag handling in orch daemon add osd (pr#59473, Yonatan Zaken)
squid: mgr/rest: Trim requests array and limit size (pr#59372, Nitzan Mordechai)
squid: mgr/rgw: Adding a retry config while calling zone_create() (pr#59138, Kritik Sachdeva)
squid: mgr/rgwam: use realm/zonegroup/zone method arguments for period update (pr#59945, Aashish Sharma)
squid: mgr/volumes: add earmarking for subvol (pr#59894, Avan Thakkar)
squid: Modify container/ software to support release containers and the promotion of prerelease containers (pr#60962, Dan Mick)
squid: mon/ElectionLogic: tie-breaker mon ignore proposal from marked down mon (pr#58669, Kamoltat)
squid: mon/MonClient: handle ms_handle_fast_authentication return (pr#59306, Patrick Donnelly)
squid: mon/OSDMonitor: Add force-remove-snap mon command (pr#59402, Matan Breizman)
squid: mon/OSDMonitor: fix get_min_last_epoch_clean() (pr#55865, Matan Breizman)
squid: mon: Remove any pg_upmap_primary mapping during remove a pool (pr#58914, Mohit Agrawal)
squid: msg: insert PriorityDispatchers in sorted position (pr#58991, Casey Bodley)
squid: node-proxy: fix a regression when processing the RedFish API (pr#59997, Guillaume Abrioux)
squid: node-proxy: make the daemon discover endpoints (pr#58482, Guillaume Abrioux)
squid: objclass: deprecate cls_cxx_gather (pr#57819, Nitzan Mordechai)
squid: orch: disk replacement enhancement (pr#60486, Guillaume Abrioux)
squid: orch: refactor boolean handling in drive group spec (pr#59863, Guillaume Abrioux)
squid: os/bluestore: enable async manual compactions (pr#58740, Igor Fedotov)
squid: os/bluestore: Fix BlueFS allocating bdev label reserved location (pr#59969, Adam Kupczyk)
squid: os/bluestore: Fix ceph-bluestore-tool allocmap command (pr#60335, Adam Kupczyk)
squid: os/bluestore: Fix repair of multilabel when collides with BlueFS (pr#60336, Adam Kupczyk)
squid: os/bluestore: Improve documentation introduced by #57722 (pr#60893, Anthony D\'Atri)
squid: os/bluestore: Multiple bdev labels on main block device (pr#59106, Adam Kupczyk)
squid: os/bluestore: Mute warnings (pr#59217, Adam Kupczyk)
squid: os/bluestore: Warning added for slow operations and stalled read (pr#59464, Md Mahamudur Rahaman Sajib)
squid: osd/scheduler: add mclock queue length perfcounter (pr#59035, zhangjianwei2)
squid: osd/scrub: decrease default deep scrub chunk size (pr#59791, Ronen Friedman)
squid: osd/scrub: exempt only operator scrubs from max_scrubs limit (pr#59020, Ronen Friedman)
squid: osd/scrub: reduce osd_requested_scrub_priority default value (pr#59885, Ronen Friedman)
squid: osd: fix require_min_compat_client handling for msr rules (pr#59492, Samuel Just, Radoslaw Zarzynski)
squid: PeeringState.cc: Only populate want_acting when num_osds < bucket_max (pr#59083, Kamoltat)
squid: qa/cephadm: extend iscsi teuth test (pr#59934, Adam King)
squid: qa/cephfs: fix TestRenameCommand and unmount the clinet before failin… (pr#59398, Xiubo Li)
squid: qa/cephfs: ignore variant of MDS_UP_LESS_THAN_MAX (pr#58788, Patrick Donnelly)
squid: qa/distros: reinstall nvme-cli on centos 9 nodes (pr#59471, Adam King)
squid: qa/rgw/multisite: specify realm/zonegroup/zone args for 'account create' (pr#59603, Casey Bodley)
squid: qa/rgw: bump keystone/barbican from 2023.1 to 2024.1 (pr#61023, Casey Bodley)
squid: qa/rgw: fix s3 java tests by forcing gradle to run on Java 8 (pr#61053, J. Eric Ivancich)
squid: qa/rgw: force Hadoop to run under Java 1.8 (pr#61120, J. Eric Ivancich)
squid: qa/rgw: pull Apache artifacts from mirror instead of archive.apache.org (pr#61101, J. Eric Ivancich)
squid: qa/standalone/scrub: fix the searched-for text for snaps decode errors (pr#58967, Ronen Friedman)
squid: qa/standalone/scrub: increase status updates frequency (pr#59974, Ronen Friedman)
squid: qa/standalone/scrub: remove TEST_recovery_scrub_2 (pr#60287, Ronen Friedman)
squid: qa/suites/crimson-rados/perf: add ssh keys (pr#61109, Nitzan Mordechai)
squid: qa/suites/rados/thrash-old-clients: Add noscrub, nodeep-scrub to ignorelist (pr#58629, Kamoltat)
squid: qa/suites/rados/thrash-old-clients: test with N-2 releases on centos 9 (pr#58607, Laura Flores)
squid: qa/suites/rados/verify/validater: increase heartbeat grace timeout (pr#58785, Sridhar Seshasayee)
squid: qa/suites/rados: Cancel injectfull to allow cleanup (pr#59156, Brad Hubbard)
squid: qa/suites/rbd/iscsi: enable all supported container hosts (pr#60089, Ilya Dryomov)
squid: qa/suites: drop --show-reachable=yes from fs:valgrind tests (pr#59068, Jos Collin)
squid: qa/task: update alertmanager endpoints version (pr#59930, Nizamudeen A)
squid: qa/tasks/mgr/test_progress.py: deal with pre-exisiting pool (pr#58263, Kamoltat)
squid: qa/tasks/nvme_loop: update task to work with new nvme list format (pr#61026, Adam King)
squid: qa/upgrade: fix checks to make sure upgrade is still in progress (pr#59472, Adam King)
squid: qa: adjust expected io_opt in krbd_discard_granularity.t (pr#59232, Ilya Dryomov)
squid: qa: ignore container checkpoint/restore related selinux denials for c… (issue#67117, issue#66640, pr#58808, Venky Shankar)
squid: qa: load all dirfrags before testing altname recovery (pr#59521, Patrick Donnelly)
squid: qa: remove all bluestore signatures on devices (pr#60021, Guillaume Abrioux)
squid: qa: suppress __trans_list_add valgrind warning (pr#58790, Patrick Donnelly)
squid: RADOS: Generalize stretch mode pg temp handling to be usable without stretch mode (pr#59084, Kamoltat)
squid: rbd-mirror: use correct ioctx for namespace (pr#59771, N Balachandran)
squid: rbd: "rbd bench" always writes the same byte (pr#59502, Ilya Dryomov)
squid: rbd: amend "rbd {group,} rename" and "rbd mirror pool" command descriptions (pr#59602, Ilya Dryomov)
squid: rbd: handle --{group,image}-namespace in "rbd group image {add,rm}" (pr#61172, Ilya Dryomov)
squid: rgw/beast: optimize for accept when meeting error in listenning (pr#60244, Mingyuan Liang, Casey Bodley)
squid: rgw/http: finish_request() after logging errors (pr#59439, Casey Bodley)
squid: rgw/kafka: refactor topic creation to avoid rd_kafka_topic_name() (pr#59754, Yuval Lifshitz)
squid: rgw/lc: Fix lifecycle not working while bucket versioning is suspended (pr#61138, Trang Tran)
squid: rgw/multipart: use cls_version to avoid racing between part upload and multipart complete (pr#59678, Jane Zhu)
squid: rgw/multisite: metadata polling event based on unmodified mdlog_marker (pr#60792, Shilpa Jagannath)
squid: rgw/notifications: fixing radosgw-admin notification json (pr#59302, Yuval Lifshitz)
squid: rgw/notifications: free completion pointer using unique_ptr (pr#59671, Yuval Lifshitz)
squid: rgw/notify: visit() returns copy of owner string (pr#59226, Casey Bodley)
squid: rgw/rados: don't rely on IoCtx::get_last_version() for async ops (pr#60065, Casey Bodley)
squid: rgw: add s3select usage to log usage (pr#59120, Seena Fallah)
squid: rgw: decrement qlen/qactive perf counters on error (pr#59670, Mark Kogan)
squid: rgw: decrypt multipart get part when encrypted (pr#60130, sungjoon-koh)
squid: rgw: ignore zoneless default realm when not configured (pr#59445, Casey Bodley)
squid: rgw: load copy source bucket attrs in putobj (pr#59413, Seena Fallah)
squid: rgw: optimize bucket listing to skip past regions of namespaced entries (pr#61070, J. Eric Ivancich)
squid: rgw: revert account-related changes to get_iam_policy_from_attr() (pr#59221, Casey Bodley)
squid: rgw: RGWAccessKey::decode_json() preserves default value of 'active' (pr#60823, Casey Bodley)
squid: rgw: switch back to boost::asio for spawn() and yield_context (pr#60133, Casey Bodley)
squid: rgwlc: fix typo in getlc (ObjectSizeGreaterThan) (pr#59223, Matt Benjamin)
squid: RGW|BN: fix lifecycle test issue (pr#59010, Ali Masarwa)
squid: RGW|Bucket notification: fix for v2 topics rgw-admin list operation (pr#60774, Oshrey Avraham, Ali Masarwa)
squid: seastar: update submodule (pr#58955, Matan Breizman)
squid: src/ceph_release, doc: mark squid stable (pr#59537, Neha Ojha)
squid: src/crimson/osd/scrub: fix the null pointer error (pr#58885, junxiang Mu)
squid: src/mon/ConnectionTracker.cc: Fix dump function (pr#60003, Kamoltat)
squid: suites/upgrade/quincy-x: update the ignore list (pr#59624, Nitzan Mordechai)
squid: suites: adding ignore list for stray daemon (pr#58267, Nitzan Mordechai)
squid: suites: test should ignore osd_down warnings (pr#59147, Nitzan Mordechai)
squid: test/neorados: remove depreciated RemoteReads cls test (pr#58144, Laura Flores)
squid: test/rgw/notification: fixing backport issues in the tests (pr#60545, Yuval Lifshitz)
squid: test/rgw/notification: use real ip address instead of localhost (pr#59303, Yuval Lifshitz)
squid: test/rgw/notifications: don't check for full queue if topics expired (pr#59917, Yuval Lifshitz)
squid: test/rgw/notifications: fix test regression (pr#61119, Yuval Lifshitz)
squid: Test: osd-recovery-space.sh extends the wait time for "recovery toofull" (pr#59041, Nitzan Mordechai)
upgrade/cephfs/mds_upgrade_sequence: ignore osds down (pr#59865, Kamoltat Sirivadhna)
squid: rgw: Don't crash on exceptions from pool listing (pr#61306, Adam Emerson)
squid: container/Containerfile: replace CEPH_VERSION label for backward compact (pr#61583, Dan Mick)
squid: container/build.sh: fix up org vs. repo naming (pr#61584, Dan Mick)
squid: container/build.sh: don't require repo creds on NO_PUSH (pr#61585, Dan Mick)
Crimson is the project name for the new high-performance OSD architecture. Crimson is built on top of Seastar, an advanced, open-source C++ framework for high-performance server applications on modern hardware. Seastar implements I/O reactors in a shared-nothing architecture, using asynchronous computation primitives such as futures, promises, and coroutines. The I/O reactor threads are normally pinned to specific CPU cores in the system. However, to support interaction with legacy software (that is, blocking, non-reactor tasks), Seastar provides the mechanism of Alien threads, which bridge the non-reactor and reactor sides of the architecture. In Crimson, Alien threads are used to support BlueStore.
There are very good introductions to the project; in particular, check the videos by Sam Just and Matan Breizman on the Ceph community YouTube channel.
An important question from the performance point of view is the allocation of Seastar reactor threads to the available CPU cores. This is particularly important on modern NUMA (Non-Uniform Memory Access) architectures, where there is a latency penalty for accessing memory attached to a different CPU socket as opposed to local memory on the socket where the thread is running. We also want to ensure mutual exclusion between reactors and other non-reactor threads within the same CPU core, mainly because the Seastar reactor threads are non-blocking, whereas non-reactor threads are allowed to block.
As part of this PR, we introduced a new option, `--crimson-balance-cpu`, in the `vstart.sh` script to set the CPU allocation strategy for the OSD reactor threads.
By default, if the option is not given, vstart allocates the reactor threads in consecutive order across the available CPU cores.
It is worth mentioning that the `vstart.sh` script is used in developer mode only; it is very useful for experimenting, as in this case.
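For illustration, a single vstart invocation exercising the new option might look like the sketch below; the core counts and the `osd` balance mode are example values, and the full test driver shown later wraps this in a loop:

MDS=0 MON=1 OSD=8 MGR=1 /ceph/src/vstart.sh --new -x --localhost --without-dashboard \
    --crimson --crimson-smp 5 \
    --crimson-balance-cpu osd    # or "socket"; omit the flag for the default strategy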
The structure of this blog entry is as follows:
First, we briefly describe the hardware and the performance tests we executed, illustrated with some snippets. Readers familiar with Ceph might want to skip this section.
In the second section, we show the results of the performance tests, comparing the three CPU allocation strategies. We used the three backend classes supported by Crimson, which are:
Cyanstore: this is an in-memory, pure-reactor OSD class which does not exercise the physical drives in the system. The reason for using this OSD class is to saturate the memory access rate of the machine, in order to identify the highest I/O rate possible in Crimson without interference (that is, latencies) from physical drives.
Seastore: this is also a pure-reactor OSD class which exercises the physical NVMe drives in the machine. We expect that the overall performance of this class would be a fraction of that achieved by Cyanstore.
Bluestore: this is the default OSD class for Crimson, as well as for the classic OSD in Ceph. This class involves the participation of Alien threads, which is the Seastar technique for dealing with blocking thread pools.
The comparison results are interesting as they highlight both limitations and opportunities for performance optimisations.
In a nutshell, we want to measure performance using some typical client workloads (random 4K write, random 4K read, sequential 64K write, sequential 64K read) for a number of cluster configurations with a fixed number of OSDs, ranging over the number of I/O reactors (which implicitly ranges over the corresponding number of CPU cores). We want to compare across the existing object stores: Cyanstore (in memory), Seastore and Bluestore. The former two are "pure reactor", whilst the latter involves (blocking) Alien thread pools.
In terms of the client, we exercise an RBD volume of 10 GiB in size, using FIO for the typical workloads mentioned above. We synthesise the client results from the FIO .json output (I/O throughput and latency) and integrate them with measurements from the OSD process, typically CPU and memory utilisation (from the Linux top command). This workflow is illustrated in the following diagram.
In our actual experiments, we ranged over the number of OSDs (1, 3, 5, 8) as well as over the number of reactors (1, 2, 4, 6). Since the full set of results would be considerably large and would make reading this blog rather tedious, we decided to show only the representative configuration of 8 OSDs and 5 I/O reactors.
We used a single node cluster, with the following hardware and system configuration:
We build Ceph with the following options:
# ./do_cmake.sh -DWITH_SEASTAR=ON -DCMAKE_BUILD_TYPE=RelWithDebInfo
At a high level, we initiate the performance tests as follows:
/root/bin/run_balanced_crimson.sh -t cyan
This will run the test plan for the Cyanstore object backend, producing data for response curves over the three CPU allocation strategies. The argument `-t` is used to specify the object storage backend: `cyan`, `sea` and `blue` for Cyanstore, Seastore and Bluestore, respectively.
crimson_be_table["cyan"]="--cyanstore"
crimson_be_table["sea"]="--seastore --seastore-devs ${STORE_DEVS}"
crimson_be_table["blue"]="--bluestore --bluestore-devs ${STORE_DEVS}"
Once the execution completes, the results are archived in .zip files according to the workloads and saved in the output directory, which can be specified with the `-d` option (`/tmp` by default).
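Putting the two options together, a sweep over all three backends could be driven with a small loop like the following (the output directories are example paths):

# Run the full test plan once per Crimson backend, archiving results separately
for be in cyan sea blue; do
    /root/bin/run_balanced_crimson.sh -t ${be} -d /tmp/crimson_cpu_runs_${be}
done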
cyan_8osd_5reactor_8fio_bal_osd_rc_1procs_randread.zip cyan_8osd_5reactor_8fio_bal_socket_rc_1procs_seqread.zip cyan_8osd_6reactor_8fio_bal_osd_rc_1procs_randread.zip cyan_8osd_6reactor_8fio_bal_socket_rc_1procs_seqread.zip
cyan_8osd_5reactor_8fio_bal_osd_rc_1procs_randwrite.zip cyan_8osd_5reactor_8fio_bal_socket_rc_1procs_seqwrite.zip cyan_8osd_6reactor_8fio_bal_osd_rc_1procs_randwrite.zip cyan_8osd_6reactor_8fio_bal_socket_rc_1procs_seqwrite.zip
cyan_8osd_5reactor_8fio_bal_osd_rc_1procs_seqread.zip
Each archive contains the result output files and measurements from that workload execution.
- cyan_8osd_5reactor_8fio_bal_osd_rc_1procs_randread.json: the (combined) FIO output file, which contains the I/O throughput and latency measurements. It also has the CPU and MEM utilisation from the OSD process integrated.
- cyan_8osd_5reactor_8fio_bal_osd_rc_1procs_randread_cpu_avg.json: the OSD and FIO CPU and MEM utilisation averages. These have been collected from the OSD process and the FIO client process via top (30 samples over 5 minutes).
- cyan_8osd_5reactor_8fio_bal_osd_rc_1procs_randread_diskstat.json: the diskstat output. A sample is taken before and after the test; the .json contains the differences.
- cyan_8osd_5reactor_8fio_bal_osd_rc_1procs_randread_top.json: the output from top, parsed via jc to produce a .json. Note: jc does not yet support individual CPU core utilisation, so we have to rely on the overall CPU utilisation (per thread).
- new_cluster_dump.json: the output from the `ceph tell ${osd_i} dump_metrics` command, which contains the individual OSD performance metrics.
- FIO_cyan_8osd_5reactor_8fio_bal_osd_rc_1procs_randread_top_cpu.plot: the plot mechanically generated from the top output, showing the CPU utilisation over time for the FIO client.
- OSD_cyan_8osd_5reactor_8fio_bal_osd_rc_1procs_randread_top_mem.plot: the plot mechanically generated from the top output, showing the MEM utilisation over time for the OSD process.

To produce the post-processing and side-by-side comparison, the following script is run:
# /root/bin/pp_balanced_cpu_cmp.sh -d /tmp/_seastore_8osd_5_6_reactor_8fio_rc_cmp \
    -t sea -o seastore_8osd_5vs6_reactor_8fio_cpu_cmp.md
The arguments are: the input directory that contains the runs we want to compare, the type of object store backend, and the output .md to produce.
We will show the comparisons produced in the next section.
We end this section by looking behind the curtains of the above scripts, showing details on preconditioning the drives, creation of the cluster, execution of FIO and the metrics collected.
To ensure that the drives are in a consistent state, we run a write workload using FIO with the `steadystate` option. This option ensures that the drives reach a steady state before the actual performance tests are run. We precondition up to 70 percent of the total capacity of each drive.
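A minimal preconditioning invocation along these lines might look like the sketch below; the device name, runtime and steady-state thresholds are illustrative, not the exact job file (`rbd_fio_examples/randwrite64k.fio`) used in our runs:

# Illustrative preconditioning run: random writes until IOPS settle within 2% of the mean
fio --name=precondition --filename=/dev/nvme0n1p2 --ioengine=libaio --direct=1 \
    --rw=randwrite --bs=64k --iodepth=32 --size=70% \
    --steadystate=iops:2% --steadystate_duration=300 --steadystate_ramp_time=60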
We take a measurement of the diskstats before and after the test, and we use the `diskstat_diff.py` script to calculate the difference. The script is available in the ceph repository under `src/tools/contrib`.
# jc --pretty /proc/diskstats > /tmp/blue_8osd_6reactor_192at_8fio_socket_cond.json
# fio rbd_fio_examples/randwrite64k.fio && jc --pretty /proc/diskstats \
 | python3 diskstat_diff.py -d /tmp/ -a blue_8osd_6reactor_192at_8fio_socket_cond.json

Jobs: 8 (f=8): [w(8)][30.5%][w=24.3GiB/s][w=398k IOPS][eta 13m:41s]
nvme0n1p2: (groupid=0, jobs=8): err= 0: pid=375444: Fri Jan 31 11:43:35 2025
  write: IOPS=397k, BW=24.2GiB/s (26.0GB/s)(8742GiB/360796msec); 0 zone resets
  slat (nsec): min=1543, max=823010, avg=5969.62, stdev=2226.84
  clat (usec): min=57, max=50322, avg=5152.50, stdev=2982.28
  lat (usec): min=70, max=50328, avg=5158.47, stdev=2982.27
  clat percentiles (usec):
   | 1.00th=[ 281], 5.00th=[ 594], 10.00th=[ 1037], 20.00th=[ 2008],
   | 30.00th=[ 3032], 40.00th=[ 4080], 50.00th=[ 5145], 60.00th=[ 6194],
   | 70.00th=[ 7242], 80.00th=[ 8291], 90.00th=[ 9241], 95.00th=[ 9634],
   | 99.00th=[10028], 99.50th=[10421], 99.90th=[14091], 99.95th=[16188],
   | 99.99th=[19268]
  bw ( MiB/s): min=15227, max=24971, per=100.00%, avg=24845.68, stdev=88.47, samples=5768
  iops : min=243638, max=399547, avg=397527.12, stdev=1415.43, samples=5768
  lat (usec) : 100=0.01%, 250=0.61%, 500=3.18%, 750=2.90%, 1000=2.88%
  lat (msec) : 2=10.28%, 4=19.25%, 10=59.90%, 20=1.00%, 50=0.01%
  lat (msec) : 100=0.01%
  cpu : usr=19.80%, sys=15.60%, ctx=104026691, majf=0, minf=2647
  IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
  submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
  complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
  issued rwts: total=0,143224767,0,0 short=0,0,0,0 dropped=0,0,0,0
  latency : target=0, window=0, percentile=100.00%, depth=256
  steadystate : attained=yes, bw=24.3GiB/s (25.5GB/s), iops=398k, iops mean dev=1.215%

Run status group 0 (all jobs):
  WRITE: bw=24.2GiB/s (26.0GB/s), 24.2GiB/s-24.2GiB/s (26.0GB/s-26.0GB/s), io=8742GiB (9386GB), run=360796-360796msec
We use the `vstart.sh` script to create the cluster, with the appropriate options for Crimson.
# CPU allocation strategies
declare -A bal_ops_table
bal_ops_table["default"]=""
bal_ops_table["bal_osd"]=" --crimson-balance-cpu osd"
bal_ops_table["bal_socket"]="--crimson-balance-cpu socket"
We essentially traverse over the order of the CPU strategies, for each of the Crimson backends. In the snippet, we iterate over the number of OSDs and reactors, and set the CPU allocation strategy with the new option.
IMPORTANT: Notice that we set the list of CPU cores available to vstart with the `VSTART_CPU_CORES` variable. We use this to ensure we "reserve" some CPUs for the FIO client (since we are using a single-node cluster).
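For example, on a hypothetical 64-core host the reactor cores and the FIO cores could be kept disjoint along these lines (the actual ranges depend on the machine's topology):

# Cores 0-55 for the cluster (vstart/reactors), cores 56-63 reserved for the FIO client
export VSTART_CPU_CORES="0-55"
export FIO_CORES="56-63"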
# Run balanced vs default CPU core/reactor distribution in Crimson using either Cyan, Seastore or Bluestore
fun_run_bal_vs_default_tests() {
  local OSD_TYPE=$1
  local NUM_ALIEN_THREADS=7 # default
  local title=""

  for KEY in default bal_osd bal_socket; do
    for NUM_OSD in 8; do
      for NUM_REACTORS in 5 6; do
        title="(${OSD_TYPE}) $NUM_OSD OSD crimson, $NUM_REACTORS reactor, fixed FIO 8 cores, response latency "

        cmd="MDS=0 MON=1 OSD=${NUM_OSD} MGR=1 taskset -ac '${VSTART_CPU_CORES}' /ceph/src/vstart.sh \
          --new -x --localhost --without-dashboard\
          --redirect-output ${crimson_be_table[${OSD_TYPE}]} --crimson --crimson-smp ${NUM_REACTORS}\
          --no-restart ${bal_ops_table[${KEY}]}"
        # Alien setup for Bluestore, see below.

        test_name="${OSD_TYPE}_${NUM_OSD}osd_${NUM_REACTORS}reactor_8fio_${KEY}_rc"
        echo "${cmd}" | tee >> ${RUN_DIR}/${test_name}_cpu_distro.log
        echo $test_name
        eval "$cmd" >> ${RUN_DIR}/${test_name}_cpu_distro.log
        echo "Sleeping for 20 secs..."

        sleep 20
        fun_show_grid $test_name
        fun_run_fio $test_name
        /ceph/src/stop.sh --crimson
        sleep 60
      done
    done
  done
}
For Bluestore we have a special case: we set the number of alien threads to be 4 times the total number of reactor cores (the number of OSDs times crimson-smp).
if [ "$OSD_TYPE" == "blue" ]; then
    NUM_ALIEN_THREADS=$(( 4 * NUM_OSD * NUM_REACTORS ))
    title="${title} alien_num_threads=${NUM_ALIEN_THREADS}"
    cmd="${cmd} --crimson-alien-num-threads $NUM_ALIEN_THREADS"
    test_name="${OSD_TYPE}_${NUM_OSD}osd_${NUM_REACTORS}reactor_${NUM_ALIEN_THREADS}at_8fio_${KEY}_rc"
fi
Once the cluster is online, we create the pools and the RBD volume.
We first take some measurements of the cluster, then we create a single RBD pool and volume(s) as appropriate. We also show the status of the cluster, the pools and the PGs.
# Take some measurements
if pgrep crimson; then
  bin/ceph daemon -c /ceph/build/ceph.conf osd.0 dump_metrics > /tmp/new_cluster_dump.json
else
  bin/ceph daemon -c /ceph/build/ceph.conf osd.0 perf dump > /tmp/new_cluster_dump.json
fi

# Create the pools
bin/ceph osd pool create rbd
bin/ceph osd pool application enable rbd rbd
[ -z "$NUM_RBD_IMAGES" ] && NUM_RBD_IMAGES=1
for (( i=0; i<$NUM_RBD_IMAGES; i++ )); do
  bin/rbd create --size ${RBD_SIZE} rbd/fio_test_${i}
  rbd du fio_test_${i}
done
bin/ceph status
bin/ceph osd dump | grep 'replicated size'
# Show a pool's utilization statistics:
rados df
# Turn off auto scaler for existing and new pools - stops PGs being split/merged
bin/ceph osd pool set noautoscale
# Turn off balancer to avoid moving PGs
bin/ceph balancer off
# Turn off deep scrub
bin/ceph osd set nodeep-scrub
# Turn off scrub
bin/ceph osd set noscrub
Here is an example of the default pools shown after the cluster has been created. Notice the default replicated size, as Crimson does not support erasure coding yet.
pool 'rbd' created
enabled application 'rbd' on pool 'rbd'
NAME PROVISIONED USED
fio_test_0 10 GiB 0 B
  cluster:
    id: da51b911-7229-4eae-afb5-a9833b978a68
    health: HEALTH_OK

  services:
    mon: 1 daemons, quorum a (age 97s)
    mgr: x(active, since 94s)
    osd: 8 osds: 8 up (since 52s), 8 in (since 60s)

  data:
    pools: 2 pools, 33 pgs
    objects: 2 objects, 449 KiB
    usage: 214 MiB used, 57 TiB / 57 TiB avail
    pgs: 27.273% pgs unknown
         21.212% pgs not active
         17 active+clean
         9 unknown
         7 creating+peering

pool 1 '.mgr' replicated size 3 min_size 1 crush_rule 0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode off last_change 15 flags hashpspool,nopgchange,crimson stripe_width 0 pg_num_max 32 pg_num_min 1 application mgr read_balance_score 7.89
pool 2 'rbd' replicated size 3 min_size 1 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode off last_change 33 flags hashpspool,nopgchange,selfmanaged_snaps,crimson stripe_width 0 application rbd read_balance_score 1.50
POOL_NAME USED OBJECTS CLONES COPIES MISSING_ON_PRIMARY UNFOUND DEGRADED RD_OPS RD WR_OPS WR USED COMPR UNDER COMPR
.mgr 449 KiB 2 0 6 0 0 0 41 35 KiB 55 584 KiB 0 B 0 B
rbd 0 B 0 0 0 0 0 0 0 0 B 0 0 B 0 B 0 B

total_objects 2
total_used 214 MiB
total_avail 57 TiB
total_space 57 TiB
noautoscale is set, all pools now have autoscale off
nodeep-scrub is set
noscrub is set
We have written some basic infrastructure via stand-alone tools to drive FIO, the Linux flexible I/O exerciser. All of these tools are publicly available in my GitHub project repo here.
In essence, this basic infrastructure consists of:
A set of predefined FIO configuration files for the different workloads (random 4K write, random 4K read; sequential 64K write, sequential 64K read). These can be automatically generated on demand, especially for multiple clients, multiple RBD volumes, etc.
A set of performance test profiles, namely response latency curves which produce throughput and latency measurements for a range of I/O depths, with resource utilisation integrated. We can also produce quick latency target tests, which are useful to identify the maximum I/O throughput for a given latency target.
A set of monitoring routines to measure resource utilisation (CPU, MEM) of the FIO client and the OSD process. We use the `top` command and parse its output with `jc` to produce a .json file. We integrate this with the FIO output into a single .json file and generate gnuplot scripts dynamically. We also take a snapshot of the diskstats before and after the test and calculate the difference, and we aggregate the FIO traces into gnuplot charts.
We use the FIO `iodepth` option to control the number of I/O requests issued to the device. Since we are interested in response latency curves (a.k.a. hockey-stick performance curves), we traverse queue depths from one to sixty-four. We use a single job per RBD volume (but this could also be made variable if required).

# Option -w (WORKLOAD) is used as index for these:
declare -A m_s_iodepth=( [hockey]="1 2 4 8 16 24 32 40 52 64" ...)
declare -A m_s_numjobs=( [hockey]="1" ... )
# Prime the volume(s) with a write workload
RBD_NAME=fio_test_$i RBD_SIZE="64k" fio ${FIO_JOBS}rbd_prime.fio 2>&1 >/dev/null &
echo "== priming $RBD_NAME ==";
...
wait;
IMPORTANT: the attentive reader will notice the use of the `taskset` command to bind the FIO client to a set of CPU cores. This ensures that the FIO client does not interfere with the reactors of the OSD process. The order of execution of the workloads is important to ensure reproducibility.
for job in $RANGE_NUMJOBS; do
  for io in $RANGE_IODEPTH; do

    # Take diskstats measurements before FIO instances
    jc --pretty /proc/diskstats > ${DISK_STAT}
    ...
    for (( i=0; i<${NUM_PROCS}; i++ )); do

      export TEST_NAME=${TEST_PREFIX}_${job}job_${io}io_${BLOCK_SIZE_KB}_${map[${WORKLOAD}]}_p${i};
      echo "== $(date) == ($io,$job): ${TEST_NAME} ==";
      echo fio_${TEST_NAME}.json >> ${OSD_TEST_LIST}
      fio_name=${FIO_JOBS}${FIO_JOB_SPEC}${map[${WORKLOAD}]}.fio

      # Execute FIO
      LOG_NAME=${log_name} RBD_NAME=fio_test_${i} IO_DEPTH=${io} NUM_JOBS=${job} \
        taskset -ac ${FIO_CORES} fio ${fio_name} --output=fio_${TEST_NAME}.json \
        --output-format=json 2> fio_${TEST_NAME}.err &
      fio_id["fio_${i}"]=$!
      global_fio_id+=($!)
    done # loop NUM_PROCS
    sleep 30; # ramp up time
    ...
    fun_measure "${all_pids}" ${top_out_name} ${TOP_OUT_LIST} &
    ...
    wait;
    # Measure the diskstats after the completion of FIO
    jc --pretty /proc/diskstats | python3 /root/bin/diskstat_diff.py -a ${DISK_STAT}

    # Exit the loops if the latency disperses too much from the median
    if [ "$RESPONSE_CURVE" = true ] && [ "$RC_SKIP_HEURISTIC" = false ]; then
      mop=${mode[${WORKLOAD}]}
      covar=$(jq ".jobs | .[] | .${mop}.clat_ns.stddev/.${mop}.clat_ns.mean < 0.5 and \
        .${mop}.clat_ns.mean/1000000 < ${MAX_LATENCY}" fio_${TEST_NAME}.json)
      if [ "$covar" != "true" ]; then
        echo "== Latency std dev too high, exiting loops =="
        break 2
      fi
    fi
  done # loop IODEPTH
done # loop NUM_JOBS
The basic monitoring routine is shown below, which is executed concurrently as FIO progresses.
fun_measure() {
  local PID=$1 # comma separated list of pids
  local TEST_NAME=$2
  local TEST_TOP_OUT_LIST=$3

  top -b -H -1 -p "${PID}" -n ${NUM_SAMPLES} >> ${TEST_NAME}_top.out
  echo "${TEST_NAME}_top.out" >> ${TEST_TOP_OUT_LIST}
}
We have written a custom profile for top, so we get information about the parent process id, last CPU the thread was executed on, etc. (which are not normally shown by default). We also plan to extend jc to support individual CPU core utilisation.
We extended and implemented new tools in CBT (the Ceph Benchmarking Tool) as standalone tools, since they can be used both on a local laptop and on the client endpoints. Further proofs of concept are in progress.
In this section, we show the performance results for the three CPU allocation strategies across the three object storage backends. We show the results for the 8 OSDs and 5 reactors configuration.
It is interesting to point out that no single CPU allocation strategy has a significant advantage over the others; rather, different workloads seem to benefit from different CPU allocation strategies. The results are consistent across the different object storage backends for most of the workloads.
The response latency curves are extended with y-error bars describing the standard deviation of the latency. This is useful to observe how far the latency disperses from the average. For all the results shown, we disabled the heuristic mentioned above so that we see all the data points as requested (from iodepth 1 to 64).
For each workload, we show the comparison of the three CPU allocation strategies across the three object storage backends. At the end, we compare the results for a single CPU allocation strategy across the three object storage backends.
We first show the CPU and MEM utilisation for the OSD process and then the FIO client.
(Figures: for each workload and backend combination, paired charts of OSD CPU and OSD MEM utilisation, followed by FIO CPU and FIO MEM utilisation, comparing the three CPU allocation strategies.)
We briefly show the comparison of the default CPU allocation strategy across the three storage backends. We choose the default CPU allocation strategy as it is the one currently used in the field/community.
(Figures: response latency curves for Random read 4K, Random write 4K, Sequential read 64K and Sequential write 64K, comparing the three object storage backends under the default CPU allocation strategy.)
In this blog entry, we have shown the performance results for the three CPU allocation strategies across the three object storage backends. It is interesting that none of the CPU allocation strategies significantly outperforms the existing default strategy, which is a bit of a surprise. Please note that, in order to keep a disciplined methodology, we used the same Ceph build (with the commit hash cited above) across all the tests, to ensure only a single parameter is modified at each step, as appropriate. Hence, these results do not represent the latest Seastore development progress yet.
This is my very first blog entry, and I hope you have found it useful. I am very thankful to the Ceph community for their support and guidance in this, my first year of being part of such a vibrant community, especially to Matan Breizman, Yingxin Cheng, Aishwarya Mathuria, Josh Durgin, Neha Ojha and Bill Scales. I am looking forward to the next blog entry, in which we will take a deep dive into the internals of performance metrics in the Crimson OSD. We will try to use some flamegraphs on the major code landmarks, and leverage the existing tools to identify latencies per component.
In the data-driven world, the demand for faster, more efficient storage solutions is escalating. As businesses, cloud providers, and data centers look to handle ever-growing volumes of data, the performance of storage becomes a critical factor. One of the most promising innovations in this space is NVMe over TCP (NVMe/TCP aka NVMeoF), which allows the deployment of high-performance Non-Volatile Memory Express (NVMe) storage devices over traditional TCP/IP networks. This blog delves into Ceph and the performance of our newest block protocol: NVMe over TCP, its benefits, challenges, and the outlook for this technology. We will explore performance profiles and nodes populated with NVMe SSDs to detail a design optimized for high performance.
Before diving into performance specifics, let’s clarify the key technologies involved:
NVMe (Non-Volatile Memory Express) is a protocol designed to provide fast data access to storage media by leveraging the high-speed PCIe (Peripheral Component Interconnect Express) bus. NVMe reduces latency, improves throughput, and enhances overall storage performance compared to legacy storage like SATA and SAS, while maintaining a price point that is at most slightly increased on a $/TB basis. Comparatively speaking, when concerned with performance, scale and throughput, NVMe drives are the clear cost-performer in this arena.
TCP/IP (Transmission Control Protocol/Internet Protocol) is one of the pillars of modern networking. It is a reliable, connection-oriented protocol that ensures data is transmitted correctly across networks. TCP is known for its robustness and widespread use, making it an attractive option for connecting NVMe devices over long distances and in cloud environments.
Ceph brings NVMe over TCP to market, offering NVMe speed and low-latency access to networked storage without the need for specialized transports or hardware like Fibre Channel, InfiniBand or RDMA.
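Because the transport is plain TCP/IP, a standard Linux initiator can attach to a Ceph NVMe/TCP gateway with nothing more than nvme-cli; the addresses and subsystem NQN below are placeholders:

# Discover and connect to a subsystem exposed by an NVMe/TCP gateway (example values)
nvme discover -t tcp -a 192.168.1.10 -s 4420
nvme connect -t tcp -a 192.168.1.10 -s 4420 -n nqn.2016-06.io.spdk:cnode1
nvme list    # the namespace now appears as a regular /dev/nvmeXnY block device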
The performance of NVMe over TCP largely depends on the underlying network infrastructure, storage architecture and design, and the workload being handled. However, there are a few key factors to keep in mind:
Let’s define some important terms in the Ceph world to ensure that we see which parameters can move the needle for performance and scale.
OSD (Object Storage Daemon) is the object storage daemon for the Ceph software defined storage system. It manages data on physical storage drives with redundancy and provides access to that data over the network. For the purposes of this article, we can state that an OSD is the software service that manages disk IO for a given physical device.
Reactor / Reactor Core: this is an event handling model in software development that comprises an event loop running a single thread which handles IO requests for NVMe/TCP. By default, we begin with 4 reactor core threads, but this model is tunable via software parameters.
BDevs_per_cluster: BDev is short for block device; this driver is how the NVMe gateways talk to Ceph RBD images. This is important because, by default, the NVMe/TCP gateway leverages 32 BDevs in a single cluster context per librbd client (`bdevs_per_cluster=32`), or storage client connecting to the underlying volume. This tunable parameter can be adjusted to provide scaling all the way to a 1:1 ratio of NVMe volume to librbd client, creating an uncontested path to performance for a given volume at the expense of more compute resources.
Starting off strong, below we see how adding drives (OSDs) and nodes to a Ceph cluster increases I/O performance across the board. A 4-node Ceph cluster with 24 drives per node can provide over 450,000 IOPS with a 70:30 read/write profile, using a 16K block size with 32 FIO clients. That is over 100K IOPS per node on average! This trend scales linearly as nodes and drives are added, showing a top end of nearly 1,000,000 IOPS with a 12-node, 288-OSD cluster. It is noteworthy that the higher-end numbers are achieved with 12 reactors and 1 librbd client per namespace (`bdevs_per_cluster=1`), which demonstrates how the addition of librbd clients enables more throughput to the OSDs serving the underlying RBD images and their mapped NVMe namespaces.
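For reference, a 70:30 16K random-mix FIO job against an NVMe/TCP-attached namespace might be sketched as follows; the device path, queue depth, job count and runtime are illustrative rather than the exact parameters used in the test above:

# Illustrative 70:30 read/write, 16K block size workload on an attached namespace
fio --name=mixed_70_30 --filename=/dev/nvme1n1 --ioengine=libaio --direct=1 \
    --rw=randrw --rwmixread=70 --bs=16k --iodepth=32 --numjobs=4 \
    --time_based=1 --runtime=300 --group_reporting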
The next test below shows how tuning an environment to the underlying hardware can show massive improvements in software defined storage. We begin with a simple 4-node cluster, and show scale points of 16, 32, 64 and 96 OSDs. In this test the Ceph Object Storage Daemons have been mapped 1:1 directly to physical NVMe drives.
It may seem as though adding drives and nodes alone gains only a modicum of performance, but with software defined storage there is always a trade-off between server utilization and storage performance – in this case for the better. When the same cluster has the default reactor cores increased from 4 to 10 (thus consuming more CPU cycles), and `bdevs_per_cluster` is configured to increase software throughput via the addition of librbd clients, the performance nearly doubles. All this by simply tuning your environment to the underlying hardware and enabling Ceph to take advantage of this processing power.
The chart below shows the IOPS delivered at the three “t-shirt” sizes of tuned 4-node, 8-node and 12-node configurations, alongside a 4-node cluster with the defaults enabled for comparison. Again we see that, for <2ms latency workloads, Ceph scales linearly and in a dependable, predictable fashion. Note: as I/O becomes congested, at a certain point the workloads are still serviceable but with higher latency response times. Ceph continues to commit the required reads and writes, only plateauing once the existing platform design boundaries become saturated.
As storage needs continue to evolve, NVMe over TCP is positioned to become a key player in the high-performance storage landscape. With continued advancements in Ethernet speeds, TCP optimizations, and network infrastructure, NVMe over TCP will continue to offer compelling advantages for a wide range of applications, from enterprise data centers to edge computing environments.
Ceph is positioned to be the top performer in software defined storage for NVMe over TCP, by enabling not only high-performance, scale-out NVMe storage platforms, but also by enabling more performance on platform by user-controlled software enhancements and configuration.
Ceph’s NVMe over TCP Target offers a powerful, scalable, and cost-effective solution for high-performance storage networks.
The authors would like to thank IBM for supporting the community by giving us the time to create these posts.
SMB (Server Message Block) is a widely used network protocol that facilitates the sharing of files, printers, and other resources across a network. To seamlessly integrate SMB services within a Ceph environment, Ceph Squid introduces the powerful SMB Manager module, which enables users to deploy, manage, and control Samba services for SMB access to CephFS. This module offers a user-friendly interface for managing clusters of Samba services and SMB shares, with the flexibility to choose between two management methods: imperative and declarative. By enabling the SMB Manager module with the command `ceph mgr module enable smb`, administrators can efficiently streamline their SMB service operations, whether through the command line or via orchestration with YAML or JSON resource descriptions. With the new SMB Manager module, Ceph admins can effortlessly extend file services, providing robust SMB access to CephFS while enjoying enhanced control and scalability.
Admins can interact with the Ceph Manager SMB module using the following methods:
Imperative Method: Ceph commands to interact with the Ceph Manager SMB module.
Declarative Method: Resources specification in YAML or JSON format.
Create CephFS Volume/Subvolume
# ceph fs volume create cephfs\\n# ceph fs subvolumegroup create cephfs smb\\n# ceph fs subvolume create cephfs sv1 --group-name=smb --mode=0777\\n# ceph fs subvolume create cephfs sv2 --group-name=smb --mode=0777\\n
Enable SMB Management Module
# ceph mgr module enable smb\\n
Creating SMB Cluster/Share
# ceph smb cluster create smb1 user --define-user-pass=user1%passwd\\n# ceph smb share create smb1 share1 cephfs / --subvolume=smb/sv1\\n
Map a network drive from MS Windows clients
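For example, from a Windows command prompt the share created above could be mapped roughly as follows; the server name smb1-server.cephlab.example is a placeholder for the host where the orchestrator placed the Samba service:

net use Z: \\smb1-server.cephlab.example\share1 passwd /user:user1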
Create CephFS volume/subvolume
# ceph fs volume create cephfs\\n# ceph fs subvolumegroup create cephfs smb\\n# ceph fs subvolume create cephfs sv1 --group-name=smb --mode=0777\\n# ceph fs subvolume create cephfs sv2 --group-name=smb --mode=0777\\n
Enable SMB Management Module
# ceph mgr module enable smb\\n
Creating SMB Cluster/Share
# ceph smb apply -i - <<\'EOF\'\\n# --- Begin Embedded YAML\\n- resource_type: ceph.smb.cluster\\n cluster_id: smb1\\n auth_mode: user\\n user_group_settings:\\n - {source_type: resource, ref: ug1}\\n placement:\\n count: 1\\n- resource_type: ceph.smb.usersgroups\\n users_groups_id: ug1\\n values:\\n users:\\n - {name: user1, password: passwd}\\n - {name: user2, password: passwd}\\n groups: []\\n- resource_type: ceph.smb.share\\n cluster_id: smb1\\n share_id: share1\\n cephfs:\\n volume: cephfs\\n subvolumegroup: smb\\n subvolume: sv1\\n path: /\\n- resource_type: ceph.smb.share\\n cluster_id: smb1\\n share_id: share2\\n cephfs:\\n volume: cephfs\\n subvolumegroup: smb\\n subvolume: sv2\\n path: /\\n# --- End Embedded YAML\\nEOF\\n
Map a network drive from MS Windows clients
# ceph smb cluster create <cluster_id> {user} [--domain-realm=<domain_realm>] \\\\\\n [--domain-join-user-pass=<domain_join_user_pass>] \\\\\\n [--define-user-pass=<define_user_pass>] [--custom-dns=<custom_dns>]\\n
Example:
# ceph smb cluster create smb1 user --define_user_pass user1%passwd --placement label:smb --clustering default\\n
# ceph smb cluster create smb1 active-directory --domain_realm samba.qe --domain_join_user_pass Administrator%Redhat@123 --custom_dns 10.70.44.153 --placement label:smb --clustering default\\n
# ceph smb apply -i <input> [--format <value>]\\n
Example:
# ceph smb apply -i resources.yaml\\n
# ceph smb share create <cluster_id> <share_id> <cephfs_volume> <path> [<share_name>] [<subvolume>] [--readonly] [--format]\\n
Example:
# ceph smb share create smb1 share1 cephfs / --subvolume=smb/sv1\\n
Listing SMB Shares
# ceph smb share ls <cluster_id> [--format <value>]\\n
Example:
# ceph smb share ls smb1\\n
# ceph smb show [<resource_names>]\\n
Example:
# ceph smb show ceph.smb.cluster.smb1\\n
# ceph smb share rm <cluster_id> <share_id>\\n
Example:
# ceph smb share rm smb1 share1\\n
# ceph smb cluster rm <cluster_id>\\n
Example:
# ceph smb cluster rm smb1\\n
The Ceph SMB Manager module in Ceph Squid brings an innovative and efficient way to manage SMB services for CephFS file systems. Whether through imperative or declarative methods, users can easily create, manage, and control SMB clusters and shares. This integration simplifies the setup of Samba services, enhances scalability, and offers greater flexibility for administrators. With the ability to manage SMB access to CephFS seamlessly, users now have a streamlined process for providing secure and scalable file services.
The authors would like to thank IBM for supporting the community by facilitating our time to create these posts.
Deploying a production-ready object storage solution can be challenging, particularly when managing complex requirements including SSL/TLS encryption, optimal data placement, and multisite replication. During deployment, it’s easy to overlook configuration options that become crucial once the system is live in production.
Traditionally, configuring Ceph for high availability, security, and efficient data handling required users to manually adjust multiple parameters based on their needs, such as Multisite Replication, Encryption, and High Availability. This initial complexity made it tedious to achieve a production-ready Object Storage configuration.
To tackle these challenges, we have introduced several new features to Ceph\'s orchestrator that simplify the deployment of Ceph RGW and its associated services. Enhancing the Ceph Object Gateway and Ingress service specification files enables an out-of-the-box, production-ready RGW setup with just a few configuration steps. These enhancements include automated SSL/TLS configurations, virtual host bucket access support, erasure coding for cost-effective data storage, and more.
These improvements aim to provide administrators with a seamless deployment experience that ensures secure, scalable, and production-ready configurations for the Ceph Object Gateway and Ingress Service (load balancer).
In this blog post, we\'ll explore each of these new features, discuss the problems they solve, and demonstrate how they can be easily configured using cephadm
spec files to achieve a fully operational Ceph Object Gateway setup in minutes.
One of the major challenges in deploying RGW is ensuring seamless access to buckets using virtual host-style URLs. For applications and users that rely on virtual host bucket access, proper SSL/TLS certificates that include the necessary Subject Alternative Names (SANs) are crucial. To simplify this, we\'ve added the option to automatically generate self-signed certificates for the Object Gateway if the user does not provide custom certificates. These self-signed certificates include SAN entries that allow TLS/SSL to work seamlessly with virtual host bucket access.
Security is a top priority for any production-grade deployment, and the Ceph community has increasingly requested full TLS/SSL encryption from the client to the Object Gateway service. Previously, our ingress implementation only supported terminating SSL at the HAProxy level, which meant that communication between HAProxy and RGW could not be encrypted.
To address this, we\'ve added configurable options that allow users to choose whether to re-encrypt traffic between HAProxy and RGW or to use passthrough mode, where the TLS connection remains intact from the client to RGW. This flexibility allows users to achieve complete end-to-end encryption, ensuring sensitive data is always protected in transit.
In the past, Ceph multisite deployments involved running many commands to configure your Realm, zonegroup, and zone, and also establishing the relationship between the zones that will be involved in the Multisite replication. Thanks to the RGW manager module, the multisite bootstrap and configuration can now be done in two steps. There is an example in the Object Storage Replication blog post.
In the Squid release, we have also added the possibility of dedicating Object Gateways just to client traffic, configured through the cephadm
spec file with the RGW spec file option:
disable_multisite_sync_traffic: True\\n
The advantages of dedicating Ceph Object Gateways to specific tasks are covered in the blog post: Ceph Object Storage Multisite Replication Series. Part Three
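For reference, a minimal sketch of an RGW spec dedicated to client traffic with this option in place might look like the following; the service ID, placement label, and port are illustrative assumptions rather than values taken from this deployment:

service_type: rgw
service_id: client-traffic
placement:
  label: rgw
  count_per_host: 1
spec:
  rgw_realm: multisite
  rgw_zone: zone1
  rgw_frontend_port: 8000
  disable_multisite_sync_traffic: True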
Object Storage often uses Erasure Coding for the data pool to reduce the TCO of the object storage solution. We have included options for configuring erasure-coded (EC) pools in the spec file. This allows users to define the EC profile, device class, and failure domain for RGW data pools, which provides control over data placement and storage efficiency.
If you are new to Ceph and cephadm
, the Automating Ceph Cluster Deployments with Ceph: A Step-by-Step Guide Using Cephadm and Ansible (Part 1) blog post will give you a good overview of cephadm
and how we can define the desired state of Ceph services in a declarative YAML spec file to deploy and configure Ceph.
Below, we\'ll walk through the CLI commands required to deploy a production-ready RGW setup using the new features added to the cephadm
orchestrator.
The first step is to enable the RGW manager module. This module is required to manage RGW services through cephadm
.
# ceph mgr module enable rgw\\n
Next, we create a spec file for the Object Gateway service. This spec file includes realm, zone, and zonegroup settings, SSL/TLS, EC profile for the data pool, etc.
# cat << EOF > /root/rgw-client.spec\\nservice_type: rgw\\nservice_id: client\\nservice_name: rgw.client\\nplacement:\\n label: rgw\\n count_per_host: 1\\nnetworks:\\n - 192.168.122.0/24\\nspec:\\n rgw_frontend_port: 4443\\n rgw_realm: multisite\\n rgw_zone: zone1\\n rgw_zonegroup: multizg\\n generate_cert: true\\n ssl: true\\n zonegroup_hostnames:\\n - s3.cephlab.com\\n data_pool_attributes:\\n type: ec\\n k: 2\\n m: 2\\nextra_container_args:\\n - \\"--stop-timeout=120\\"\\nconfig:\\n rgw_exit_timeout_secs: \\"120\\"\\n rgw_graceful_stop: true\\nEOF\\n
In this spec file we specify that the RGW service should use erasure coding with a 2+2 profile (k: 2, m: 2)
for the data pool, which reduces storage costs compared to a replicated setup. We also generate a self-signed certificate (generate_cert: true)
for the RGW service to ensure secure SSL/TLS communication. With zonegroup_hostnames
, we enable virtual host bucket access using the specified domain bucket.s3.cephlab.com
. Thanks to the config parameter rgw_graceful_stop
, we configure graceful stopping of object gateway services. During a graceful stop, the service will wait until all client connections are closed (drained) subject to the specified 120 second timeout.
Once the spec file is created, we bootstrap RGW services. This step creates and deploys RGW services with the configuration specified in our spec file.
# ceph rgw realm bootstrap -i rgw-client.spec\\n
The realm bootstrap command will asynchronously apply the configuration defined in our spec file. Soon the RGW services will be up and running, and we can verify their status using the ceph orch ps command
.
# ceph orch ps --daemon_type rgw\\nNAME HOST PORTS STATUS REFRESHED AGE MEM USE MEM LIM VERSION IMAGE ID CONTAINER ID\\nrgw.client.ceph-node-05.yquamf ceph-node-05.cephlab.com 192.168.122.175:4443 running (32m) 94s ago 32m 91.2M - 19.2.0-53.el9cp fda78a7e8502 a0c39856ddd8\\nrgw.client.ceph-node-06.zfsutg ceph-node-06.cephlab.com 192.168.122.214:4443 running (32m) 94s ago 32m 92.9M - 19.2.0-53.el9cp fda78a7e8502 82c21d350cb7\\n
This output shows that the RGW services run on the specified nodes and are accessible via the configured 4443/tcp
port.
To verify that the RGW data pools are correctly configured with erasure coding, we can use the following command:
# ceph osd pool ls detail | grep data\\npool 24 \'zone1.rgw.buckets.data\' erasure profile zone1_zone_data_pool_ec_profile size 4 min_size 3 crush_rule 1 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 258 lfor 0/0/256 flags hashpspool stripe_width 8192 application rgw\\n
To get more details about the erasure code profile used for the data pool, we can run the following:
# ceph osd erasure-code-profile get zone1_zone_data_pool_ec_profile\\ncrush-device-class=\\ncrush-failure-domain=host\\ncrush-num-failure-domains=0\\ncrush-osds-per-failure-domain=0\\ncrush-root=default\\njerasure-per-chunk-alignment=false\\nk=2\\nm=2\\nplugin=jerasure\\ntechnique=reed_sol_van\\nw=8\\n
This confirms that the erasure code profile is configured with k=2
and m=2
and uses the Reed-Solomon technique.
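To put the cost claim in numbers: with k=2 and m=2, each object is stored as 2 data chunks plus 2 coding chunks, so usable capacity is k/(k+m) = 50% of raw while still tolerating the loss of any two chunks, whereas 3-way replication yields roughly 33% usable capacity for a similar level of protection.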
Finally, we must configure the ingress service to load balance traffic to multiple RGW daemons. We create a spec file for the ingress service:
# cat << EOF > rgw-ingress.yaml\\nservice_type: ingress\\nservice_id: rgw\\nplacement:\\n hosts:\\n - ceph-node-06.cephlab.com\\n - ceph-node-07.cephlab.com\\nspec:\\n backend_service: rgw.client\\n virtual_ip: 192.168.122.152/24\\n frontend_port: 443\\n monitor_port: 1967\\n use_tcp_mode_over_rgw: True\\nEOF\\n
This spec file sets up the ingress service with the virtual (floating) IP (VIP) address 192.168.122.152
and specifies that it should use TCP mode for communication with the Object Gateway, ensuring that SSL/TLS is maintained throughout. With the backend_service
we specify the RGW service we want to use as the backend for HAproxy, as it is possible for a Ceph cluster to run multiple, unrelated RGW services.
Our ingress service stack uses keepalived
for HA of the VIP, and HAproxy takes care of the load balancing:
# ceph orch ps --service_name ingress.rgw\\nNAME HOST PORTS STATUS REFRESHED AGE MEM USE MEM LIM VERSION IMAGE ID CONTAINER ID\\nhaproxy.rgw.ceph-node-06.vooxuh ceph-node-06.cephlab.com *:443,1967 running (58s) 46s ago 58s 5477k - 2.4.22-f8e3218 0d25561e922f 4cd458e1f6b0\\nhaproxy.rgw.ceph-node-07.krdmsb ceph-node-07.cephlab.com *:443,1967 running (56s) 46s ago 56s 5473k - 2.4.22-f8e3218 0d25561e922f 4d18247e7615\\nkeepalived.rgw.ceph-node-06.cwraia ceph-node-06.cephlab.com running (55s) 46s ago 55s 1602k - 2.2.8 6926947c161f 50fd6cf57187\\nkeepalived.rgw.ceph-node-07.svljiw ceph-node-07.cephlab.com running (53s) 46s ago 53s 1598k - 2.2.8 6926947c161f aaab5d79ffdd\\n
When we check the haproxy configuration on ceph-node-06
where the service is running, we confirm that we are using TCP passthrough for the backend configuration of our Object Gateway services.
# ssh ceph-node-06.cephlab.com cat /var/lib/ceph/93d766b0-ae6f-11ef-a800-525400ac92a7/haproxy.rgw.ceph-node-06.vooxuh/haproxy/haproxy.cfg | grep -A 10 \\"frontend frontend\\"\\n...\\nbackend backend\\n mode tcp\\n balance roundrobin\\n option ssl-hello-chk\\n server rgw.client.ceph-node-05.yquamf 192.168.122.175:4443 check weight 100 inter 2s\\n server rgw.client.ceph-node-06.zfsutg 192.168.122.214:4443 check weight 100 inter 2s\\n
To verify that the SSL/TLS configuration is working correctly, we can use curl
to test the endpoint. We can see that the CA is not trusted by our client system where we are running the curl command:
# curl https://192.168.122.152\\ncurl: (60) SSL certificate problem: unable to get local issuer certificate\\nMore details here: https://curl.se/docs/sslcerts.html\\ncurl failed to verify the legitimacy of the server and therefore could not\\nestablish a secure connection to it.\\n
To fix this, we need to add the cephadm root CA certificate to the trusted store of our client system:
# ceph orch cert-store get cert cephadm_root_ca_cert > /etc/pki/ca-trust/source/anchors/cephadm-root-ca.crt\\n# update-ca-trust\\n
After updating the trusted store, we can test again:
# curl https://s3.cephlab.com\\n<?xml version=\\"1.0\\" encoding=\\"UTF-8\\"?><ListAllMyBucketsResult xmlns=\\"http://s3.amazonaws.com/doc/2006-03-01/\\"><Owner><ID>anonymous</ID></Owner><Buckets></Buckets></ListAllMyBucketsResult>\\n
This confirms that the SSL/TLS self-signed certificate configuration works correctly and that the RGW service is accessible using HTTPS. As you can see, we have configured our DNS subdomain s3.cephlab.com
and the wildcard *.s3.cephlab.com
to point to our VIP address 192.168.122.152
. Also, it\'s important to mention that you can have more than one VIP address configured so not all the traffic goes through a single haproxy LB node; when using a list of VIP IPs, you need to use the option: virtual_ips_list
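For illustration, a variant of the earlier ingress spec that uses a list of VIPs instead of a single virtual_ip could look like the sketch below; the second address is an assumption added only to show the syntax:

service_type: ingress
service_id: rgw
placement:
  hosts:
    - ceph-node-06.cephlab.com
    - ceph-node-07.cephlab.com
spec:
  backend_service: rgw.client
  virtual_ips_list:
    - 192.168.122.152/24
    - 192.168.122.153/24
  frontend_port: 443
  monitor_port: 1967
  use_tcp_mode_over_rgw: True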
These new features in the cephadm
orchestrator represent significant steps forward in making Ceph RGW deployments more accessible, secure, and production-ready. By automating complex configurations such as SSL/TLS encryption, virtual host bucket access, multisite replication, and erasure coding, administrators can now deploy an RGW setup ready for production with minimal manual intervention.
For further details on the Squid release, check Laura Flores' blog post.
Note that some features described here may not be available before the Squid 19.2.2 release.
The authors would like to thank IBM for supporting the community by facilitating our time to create these posts.
In the previous post of this series, we discussed everything related to load-balancing our RGW S3 endpoints. We covered multiple load-balancing techniques, including the bundled Ceph-provided load balancer, the Ingress service
. In this fifth article in this series, we will discuss multisite sync policy in detail.
In Ceph releases beginning with Quincy, Ceph Object Storage provides granular bucket-level replication, unlocking many valuable features. Users can enable or disable sync per individual bucket, enabling precise control over replication workflows. This empowers full-zone replication while opting out of the replication of specific buckets, replicating a single source bucket to multi-destination buckets, and implementing symmetrical and directional data flow configurations. The following diagram shows an example of the sync policy feature in action:
With our previous synchronization model, we would do full zone sync, meaning that all data and metadata would be synced between zones. The new sync policy feature gives us new flexibility and granularity that allows us to configure per-bucket replication.
Bucket sync policies apply to archive zones. Sync involving an archive zone is not bidirectional: all objects can be replicated from the active zone to the archive zone, but objects cannot be moved from the archive zone back to the active zone, because the archive zone is read-only. We will cover the archive zone in detail in part six of the blog series.
Here is a list of features available in the Quincy and Reef releases:
Here are some of the sync policy concepts that we need to understand before we get our hands dirty. A sync policy comprises the following components:
A sync policy group can be in three states:
Enabled: sync is allowed and enabled. Replication begins as soon as the policy is enabled. For example, we can enable full zonegroup sync and then disable (forbid) it on a per-bucket basis.
Allowed: sync is permitted but will not start. For example, we can configure the zonegroup policy as allowed and then enable per-bucket policy sync.
Forbidden: sync, as defined by this group, is not permitted.
We can configure sync policies (groups, flows, and pipes) at the zonegroup and bucket levels. A bucket sync policy is always a subset of the policy defined for the zonegroup to which the bucket belongs. So if, for example, we don't allow a flow at the zonegroup level, it won't work even if it is allowed at the bucket level. There are further details on the expected behaviour in the official documentation.
The following section will explain use of the new multisite sync policy feature. By default, once we set up multisite replication as we did in the initial post of this series, all metadata and data are replicated among the zones that are part of the zonegroup. We will call this sync method legacy
during the remainder of the article.
As we explained in the previous section, a sync policy is made up of a group, flow, and pipe. We first configure a zonegroup policy that is very lax and will allow bi-directional traffic for all buckets on all zones. Once in place we will add per-bucket sync policies that by design are a subset of the zonegroup policy, with more stringent rulesets.
We begin by adding the zonegroup policy. We create a new group called group1
and set the status to allowed
. Recall from the previous section that the zonegroup will allow sync traffic to flow. The policy will be set to allowed
and not enabled
. Data synchronization will not happen at the zonegroup level when in the allowed
state; the idea is to enable synchronization on a per-bucket basis.
[root@ceph-node-00 ~]# radosgw-admin sync group create --group-id=group1 --status=allowed --rgw-realm=multisite --rgw-zonegroup=multizg\\n
We now create a symmetrical/bi-directional flow, allowing data sync in both directions from our zones: zone1
and zone2
.
[root@ceph-node-00 ~]# radosgw-admin sync group flow create --group-id=group1 --flow-id=flow-mirror --flow-type=symmetrical --zones=zone1,zone2\\n
Finally, we create a pipe. In the pipe, we specify the group-id to use, then set an asterisk wildcard for the source and destination buckets and zones, meaning that any zone and any bucket can act as the source or destination of the replicated data.
[root@ceph-node-00 ~]# radosgw-admin sync group pipe create --group-id=group1 --pipe-id=pipe1 --source-zones=\'*\' --source-bucket=\'*\' --dest-zones=\'*\' --dest-bucket=\'*\'\\n
Zonegroup sync policy modifications require a period update and commit; bucket sync policy modifications do not.
[root@ceph-node-00 ~]# radosgw-admin period update --commit\\n
Once we have committed the new period, all data sync in the zonegroup is going to stop because our zonegroup policy is set to Allowed
. If we had set it to enabled
, syncing would continue in the same way as with the initial multisite configuration we had.
Now we can enable sync on a per-bucket basis. We will create a bucket-level policy rule for the existing bucket testbucket
. Note that the bucket must exist before setting this policy, and admin commands that modify bucket policies must be run on the master zone. However, bucket sync policies do not require a period update. There is no need to change the data flow, as it is inherited from the zonegroup policy. A bucket policy flow will only be a subset of the flow defined in the zone group policy; the same happens with pipes.
We create the bucket:
[root@ceph-node-00 ~]# aws --endpoint https://s3.zone1.cephlab.com:443 s3 mb s3://testbucket\\nmake_bucket: testbucket\\n
Create a bucket sync group, using the --bucket
parameter to specify the bucket and setting the status to enabled
so that replication will be enabled for our bucket testbucket
[root@ceph-node-00 ~]# radosgw-admin sync group create --bucket=testbucket --group-id=testbucket-1 --status=enabled\\n
There is no need to specify a flow as we will inherit the flow from the zonegroup, so we need only to define a pipe for our bucket sync policy group called testbucket-1
. As soon as this command is applied, data sync replication will start for this bucket.
[root@ceph-node-00 ~]# radosgw-admin sync group pipe create --bucket=testbucket --group-id=testbucket-1 --pipe-id=test-pipe1 --source-zones=\'*\' --dest-zones=\'*\'\\n
NOTE: You can safely ignore the following warning:
WARNING: cannot find source zone id for name=*
With the sync group get
command you can review your group, flow, and pipe configurations. We run the command at the zonegroup level, where we see that the status is allowed
.
\\"allowed\\"\\n
And we run the sync group get
command at the bucket level supplying the --bucket
parameter. In this case, the status is Enabled
for testbucket
:
[root@ceph-node-00 ~]# radosgw-admin sync group get --bucket testbucket | jq .[0].val.status\\n\\"Enabled\\"\\n
Another helpful command is sync info
. With sync info
, we can preview what sync replication will be implemented with our current configuration. So, for example, with our current zonegroup sync policy in the allowed
state, no sync will happen at the zonegroup level, so the sync info command will not show any sources or destinations configured.
[root@ceph-node-00 ~]# radosgw-admin sync info\\n{\\n \\"sources\\": [],\\n \\"dests\\": [],\\n \\"hints\\": {\\n \\"sources\\": [],\\n \\"dests\\": []\\n },\\n \\"resolved-hints-1\\": {\\n \\"sources\\": [],\\n \\"dests\\": []\\n },\\n \\"resolved-hints\\": {\\n \\"sources\\": [],\\n \\"dests\\": []\\n }\\n}\\n
We can also use the sync info
command at the bucket level, using the --bucket
parameter because we have configured a bidirectional pipe. We are going to have as sources zone2
-> zone1
and as destinations zone1
-> zone2
. This means that replication on the testbucket
bucket happens in both directions. If we PUT an object to testbucket
from zone1
it will be replicated to zone2
, and if we PUT an object to zone2
it will be replicated to zone1
.
[root@ceph-node-00 ~]# radosgw-admin sync info --bucket testbucket\\n{\\n \\"sources\\": [\\n {\\n \\"id\\": \\"test-pipe1\\",\\n \\"source\\": {\\n \\"zone\\": \\"zone2\\",\\n \\"bucket\\": \\"testbucket:89c43fae-cd94-4f93-b21c-76cd1a64788d.34553.1\\"\\n },\\n \\"dest\\": {\\n \\"zone\\": \\"zone1\\",\\n \\"bucket\\": \\"testbucket:89c43fae-cd94-4f93-b21c-76cd1a64788d.34553.1\\"\\n },\\n \\"params\\": {\\n \\"source\\": {\\n \\"filter\\": {\\n \\"tags\\": []\\n }\\n },\\n \\"dest\\": {},\\n \\"priority\\": 0,\\n \\"mode\\": \\"system\\",\\n \\"user\\": \\"user1\\"\\n }\\n }\\n ],\\n \\"dests\\": [\\n {\\n \\"id\\": \\"test-pipe1\\",\\n \\"source\\": {\\n \\"zone\\": \\"zone1\\",\\n \\"bucket\\": \\"testbucket:89c43fae-cd94-4f93-b21c-76cd1a64788d.34553.1\\"\\n },\\n \\"dest\\": {\\n \\"zone\\": \\"zone2\\",\\n \\"bucket\\": \\"testbucket:89c43fae-cd94-4f93-b21c-76cd1a64788d.34553.1\\"\\n },\\n \\"params\\": {\\n \\"source\\": {\\n \\"filter\\": {\\n \\"tags\\": []\\n }\\n },\\n \\"dest\\": {},\\n \\"priority\\": 0,\\n \\"mode\\": \\"system\\",\\n \\"user\\": \\"user1\\"\\n }\\n }\\n ],\\n
So if, for example, we only look at the sources, you can see that they vary depending on the cluster in which we run the radosgw-admin
command. For example from cluster2
(ceph-node04
), we see zone1
as the source:
[root@ceph-node-00 ~]# ssh ceph-node-04 radosgw-admin sync info --bucket testbucket | jq \'.sources[].source, .sources[].dest\'\\n{\\n \\"zone\\": \\"zone1\\",\\n \\"bucket\\": \\"testbucket:66df8c0a-c67d-4bd7-9975-bc02a549f13e.45330.2\\"\\n}\\n{\\n \\"zone\\": \\"zone2\\",\\n \\"bucket\\": \\"testbucket:66df8c0a-c67d-4bd7-9975-bc02a549f13e.45330.2\\"\\n}\\n
In cluster1
(ceph-node-00
), we see zone2
as the source:
[root@ceph-node-00 ~]# radosgw-admin sync info --bucket testbucket | jq \'.sources[].source, .sources[].dest\'\\n{\\n \\"zone\\": \\"zone2\\",\\n \\"bucket\\": \\"testbucket:66df8c0a-c67d-4bd7-9975-bc02a549f13e.45330.2\\"\\n}\\n{\\n \\"zone\\": \\"zone1\\",\\n \\"bucket\\": \\"testbucket:66df8c0a-c67d-4bd7-9975-bc02a549f13e.45330.2\\"\\n}\\n
Let’s perform a quick test with the AWS CLI, to validate the configuration and confirm that replication is working for testbucket
. We PUT an object in zone1
and check that it is replicated to zone2
:
[root@ceph-node-00 ~]# aws --endpoint https://s3.zone1.cephlab.com:443 s3 cp /etc/hosts s3://testbucket/firsfile\\nupload: ../etc/hosts to s3://testbucket/firsfile\\n
We can check the sync has finished with the radosgw-admin bucket sync checkpoint
command:
[root@ceph-node-00 ~]# ssh ceph-node-04 radosgw-admin bucket sync checkpoint --bucket testbucket\\n2024-02-02T02:17:26.858-0500 7f3f38729800 1 bucket sync caught up with source:\\n local status: [, , , 00000000004.531.6, , , , , , , ]\\n remote markers: [, , , 00000000004.531.6, , , , , , , ]\\n2024-02-02T02:17:26.858-0500 7f3f38729800 0 bucket checkpoint complete\\n
An alternate way to check sync status is to use the radosgw-admin bucket sync status
command:
[root@ceph-node-00 ~]# radosgw-admin bucket sync status --bucket=testbucket\\n realm beeea955-8341-41cc-a046-46de2d5ddeb9 (multisite)\\n zonegroup 2761ad42-fd71-4170-87c6-74c20dd1e334 (multizg)\\n zone 66df8c0a-c67d-4bd7-9975-bc02a549f13e (zone1)\\n bucket :testbucket[66df8c0a-c67d-4bd7-9975-bc02a549f13e.37124.2])\\n current time 2024-02-02T09:07:42Z\\n\\n source zone 7b9273a9-eb59-413d-a465-3029664c73d7 (zone2)\\n source bucket :testbucket[66df8c0a-c67d-4bd7-9975-bc02a549f13e.37124.2])\\n incremental sync on 11 shards\\n bucket is caught up with source\\n
We see that the object is available in zone2
.
[root@ceph-node-00 ~]# aws --endpoint https://object.s3.zone2.dan.ceph.blue:443 s3 ls s3://testbucket/\\n2024-01-09 06:27:24 233 firsfile\\n
Because the replication is bidirectional, we PUT an object in zone2
, and it is replicated to zone1
:
[root@ceph-node-00 ~]# aws --endpoint https://object.s3.zone2.dan.ceph.blue:443 s3 cp /etc/hosts s3://testbucket/secondfile\\nupload: ../etc/hosts to s3://testbucket/secondfile\\n[root@ceph-node-00 ~]# aws --endpoint https://object.s3.zone1.dan.ceph.blue:443 s3 ls s3://testbucket/\\n2024-01-09 06:27:24 233 firsfile\\n2024-02-02 00:40:15 233 secondfile\\n
In part five of this series, we discussed Multisite Sync Policy and shared some hands-on examples of configuring granular bidirectional bucket replication. In part six, we will continue configuring multisite sync policies, including unidirectional replication with one source bucket and multiple destination buckets.
The authors would like to thank IBM for supporting the community by facilitating our time to create these posts.
In the previous episode of the series, we discussed configuring dedicated RGW services for client and replication requests. Additionally, we explored the performance enhancements the sync fairness feature offers. In the fourth article of this series, we will be talking about load balancing our freshly deployed RGW S3 endpoints to provide high availability and increased performance by distributing requests across individual RGW services.
In the previous installment, we configured four RGW instances: two dedicated to client S3 API requests and the rest to multisite replication requests. With this configuration, clients can connect to each RGW endpoint individually to use the HTTP RESTful S3 API. They could, for example, issue an S3 call like a LIST using the IP/FQDN of one of the nodes running an RGW service as the endpoint.
Here's an example with the AWS s3 client: $ aws --endpoint https://ceph-node02 s3 ls
. They will be able to access their buckets and data.
The problem is, what happens if ceph-node02
goes down? The user will start getting error messages and failed requests, even if the rest of the RGW services are running fine on the surviving nodes. To avoid this behaviour, providing high availability and increased performance, we need to configure a load balancer in front of our RGW services. Because the RGW endpoints are using the HTTP protocol, we have multiple well-known solutions to load balance HTTP requests. These include hardware-based commercial solutions as well as open-source software load balancers. We need to find a solution that will cover our performance needs depending on the size of our deployment and specific requirements. There are some great examples of different RadosGW load-balancing mechanisms in this github repository from Kyle Bader.
Each site\'s network infrastructure must offer ample bandwidth to support reading and writing of replicated objects or erasure-coded object shards. We recommend that the network fabric of each site has either zero (1:1) or minimal oversubscription (e.g., 2:1). One of the most used network topologies for Ceph cluster deployments is Leaf and Spine as it can provide the needed scalability.
Networking between zones participating in the same zone group will be utilized for asynchronous replication traffic. The inter-site bandwidth must be equal to or greater than ingest throughput to prevent synchronisation lag from growing and increasing the risk of data loss. Inter-site networking will not be relied on for read traffic or reconstitution of objects because all objects are locally durable. Path diversity is recommended for inter-site networking, as we generally speak of WAN connections. The inter-site networks should be routed (L3) instead of switched (L2 Extended Vlans) in order to provide independent networking stacks at each site. Finally, even if we are not doing so in our lab example, Ceph Object Gateway synchronization should be configured to use HTTPS endpoints to encrypt replication traffic with SSL/TLS in production.
Beginning with the Pacific release, Ceph provides a cephadm service called ingress
, which provides an easily deployable HA and load-balancing stack based on Keepalived and HAproxy.
The ingress service allows you to create a high-availability endpoint for RGW with a minimum of configuration options. The orchestrator will deploy and manage a combination of HAproxy and Keepalived to balance the load on the different configured floating virtual IPs.
The ingress service is deployed on multiple hosts; each host runs an HAproxy daemon and a Keepalived daemon.
By default, a single virtual IP address is automatically configured by Keepalived on one of the hosts. Having a single VIP means that all traffic for the load-balancer will flow through a single host. This is less than ideal for configurations that service a high number of client requests while maintaining high throughput. We recommend configuring one VIP address per ingress node. We can then, for example, configure round-robin DNS across all deployed VIPs to load balance requests across all VIPs. This provides the possibility to achieve higher throughput as we are using more than one host to load-balance client HTTP requests across our configured RGW services. Depending on the size and requirements of the deployment, the ingress service may not be adequate, and other more scalable solutions can be used to balance the requests, like BGP + ECMP.
In this post, we will configure the ingress load balancing service so we can load-balance S3 client HTTP requests across the public-facing RGW services running on nodes ceph-node-02
and ceph-node-03
in zone1
, and ceph-node-06
and ceph-node-07
in zone2
.
In the following diagram, we depict at a high level the new load balancer components we are adding to our previously deployed architecture. In this way we will provide HA and load balancing for S3 client requests.
The first step is to create, as usual, a cephadm service spec file. In this case, the service type will be ingress
. We specify our existing public RGW service name rgw.client-traffic
as well as the service_id
and backend_service
parameters. We can get the name of the cephadm services using the cephadm orch ls
command.
We will configure one VIP per ingress service daemon, and two nodes to manage the ingress service with VIPs per Ceph cluster. We will enable SSL/HTTPS for client connections terminating at the ingress service.
[root@ceph-node-00 ~]# ceph orch ls | grep rgw\\nrgw.client-traffic ?:8000 2/2 4m ago 3d count-per-host:1;label:rgw\\nrgw.multisite.zone1 ?:8000 2/2 9m ago 3d count-per-host:1;label:rgwsync\\n\\n[root@ceph-node-00 ~]# cat << EOF > rgw-ingress.yaml\\nservice_type: ingress\\nservice_id: rgw.client-traffic\\nplacement:\\n hosts:\\n - ceph-node-02.cephlab.com\\n - ceph-node-03.cephlab.com\\nspec:\\n backend_service: rgw.client-traffic\\n virtual_ips_list:\\n - 192.168.122.150/24\\n - 192.168.122.151/24\\n frontend_port: 443\\n monitor_port: 1967\\n ssl_cert: |\\n -----BEGIN CERTIFICATE-----\\n -----END CERTIFICATE-----\\n\\n -----BEGIN CERTIFICATE-----\\n -----END CERTIFICATE-----\\n -----BEGIN PRIVATE KEY-----\\n -----END PRIVATE KEY-----\\nEOF\\n\\n\\n[root@ceph-node-00 ~]# ceph orch apply -i rgw-ingress.yaml\\nScheduled ingress.rgw.client update...\\n
NOTE: From all the certificates that we add to the ssl_cert
list, the ingress service builds a single certificate file named haproxy.pem
. For the certificate to work, HAproxy requires that you add the certificates in the following order: cert.pem
first, then the chain certificate, and finally, the private key.
Soon we can see our HAproxy and Keepalived services running on ceph-node-[02/03]
:
[root@ceph-node-00 ~]# ceph orch ps | grep -i client\\nhaproxy.rgw.client.ceph-node-02.icdlxn ceph-node-02.cephlab.com *:443,1967 running (3d) 9m ago 3d 8904k - 2.4.22-f8e3218 0d25561e922f 9e3bc0e21b4b\\nhaproxy.rgw.client.ceph-node-03.rupwfe ceph-node-03.cephlab.com *:443,1967 running (3d) 9m ago 3d 9042k - 2.4.22-f8e3218 0d25561e922f 63cf75019c35\\nkeepalived.rgw.client.ceph-node-02.wvtzsr ceph-node-02.cephlab.com running (3d) 9m ago 3d 1774k - 2.2.8 6926947c161f 031802fc4bcd\\nkeepalived.rgw.client.ceph-node-03.rxqqio ceph-node-03.cephlab.com running (3d) 9m ago 3d 1778k - 2.2.8 6926947c161f 3d7539b1ab0f\\n
You can check the configuration of HAproxy from inside the container: it is using static round-robin load balancing between both of our client-facing RGWs configured as the backends. The frontend listens on port 443 with our certificate in the path /var/lib/haproxy/haproxy.pem
:
[root@ceph-node-02 ~]# podman exec -it ceph-haproxy-rgw-client-ceph-node-02-jpnuri cat /var/lib/haproxy/haproxy.cfg | grep -A 15 \\"frontend frontend\\"\\nfrontend frontend\\n bind *:443 ssl crt /var/lib/haproxy/haproxy.pem\\n default_backend backend\\n\\nbackend backend\\n option forwardfor\\n balance static-rr\\n option httpchk HEAD / HTTP/1.0\\n server rgw.client-traffic.ceph-node-02.yntfqb 192.168.122.94:8000 check weight 100\\n server rgw.client-traffic.ceph-node-03.enzkpy 192.168.122.180:8000 check weight 100\\n
For this example, we have configured basic DNS round robin using the CoreDNS loadbalance plugin. We are resolving s3.zone1.cephlab.com
across all configured ingress VIPs. As you can see with the following example, each request for s3.zone1.cephlab.com
resolves to a different Ingress VIP.
[root@ceph-node-00 ~]# ping -c 1 s3.zone1.cephlab.com\\nPING s3.cephlab.com (192.168.122.150) 56(84) bytes of data.\\n[root@ceph-node-00 ~]# ping -c 1 s3.zone1.cephlab.com\\nPING s3.cephlab.com (192.168.122.151) 56(84) bytes of data.\\n
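For reference, a minimal CoreDNS configuration that could produce this round-robin behaviour might look like the sketch below; the zone layout and forwarder are assumptions rather than values taken from the lab environment:

cephlab.com:53 {
    hosts {
        192.168.122.150 s3.zone1.cephlab.com
        192.168.122.151 s3.zone1.cephlab.com
        fallthrough
    }
    loadbalance round_robin
    forward . /etc/resolv.conf
}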
You can now point the S3 client to s3.zone1.cephlab.com
to access the RGW S3 API endpoint.
[root@ceph-node-00 ~]# aws --endpoint https://s3.zone1.cephlab.com:443 s3 ls\\n2024-01-04 13:44:00 firstbucket\\n
At this point, we have high availability and load balancing configured for zone1
. If we lose one server running the RGW service, client requests will be redirected to the remaining RGW service.
We need to do the same steps for the second Ceph cluster that hosts zone2
, so we will end up with a load-balanced endpoint per zone:
s3.zone1.cephlab.com\\ns3.zone2.cephlab.com\\n
As a final step, we could deploy a global load balancer (GLB). This is not part of the Ceph solution and should be provided by a third party; there are many DNS global load balancers available that implement various load balancing policies.
As we are using SSL/TLS on our per-site load balancers in the lab, if we were to configure a GLB we would need to implement TLS passthrough or re-encryption of client connections, so that connections remain encrypted from the client to the per-site load balancer. Using a GLB has significant advantages:
Taking advantage of the active/active nature of Ceph Object storage replication, you can provide users with a single S3 endpoint FQDN and then apply policy at the load balancer to send the user request to one site or the other. The load balancer could, for example, redirect the client to the S3 endpoint closest to their location.
If you need an active/passive disaster recovery approach, a GLB can enhance failover. Users will have a single S3 endpoint FQDN to use. During normal operations, they will always be redirected to the primary site. In case of site failure, the GLB will detect the failure of the primary site and redirect users transparently to the secondary site, enhancing user experience and reducing failover time.
In the following diagram we provide an example where we add a GLB with the FQDN s3.cephlab.com
. Clients connect to s3.cephlab.com
and will be redirected to one or the other site based on the applied policy at the GLB level.
In the load balancing ingress service examples we shared, we configured load balancing for S3 client endpoints, so the client HTTP requests are distributed among the available RGW services. We haven't yet discussed the RGWs serving multisite sync requests. In our previous installment, we configured two RGWs dedicated to multisite sync operations. How do we load-balance sync requests across the two RGWs if we don't have an ingress service or external load balancer configured?
RGW implements round-robin at the zonegroup and zone endpoint levels. We can configure a comma-separated list of RGW service IP addresses or hostnames, and the RGW service code will load-balance requests among the entries in the list.
Replication endpoints for our multizg
zone group:
[root@ceph-node-00 ~]# radosgw-admin zonegroup get | jq .endpoints\\n[\\n \\"http://ceph-node-04.cephlab.com:8000\\",\\n \\"http://ceph-node-05.cephlab.com:8000\\"\\n]\\n
Replication endpoints for our zone1 and zone2 zones:
[root@ceph-node-00 ~]# radosgw-admin zonegroup get | jq .zones[].endpoints\\n[\\n \\"http://ceph-node-00.cephlab.com:8000\\",\\n \\"http://ceph-node-01.cephlab.com:8000\\"\\n]\\n[\\n \\"http://ceph-node-04.cephlab.com:8000\\",\\n \\"http://ceph-node-05.cephlab.com:8000\\"\\n]\\n
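If these endpoint lists ever need to be adjusted by hand, for example after adding a dedicated sync RGW, a hedged sketch of the commands would be the following (hostnames are illustrative), remembering that zonegroup and zone changes must be committed with a period update:

radosgw-admin zonegroup modify --rgw-zonegroup=multizg \
    --endpoints=http://ceph-node-04.cephlab.com:8000,http://ceph-node-05.cephlab.com:8000
radosgw-admin zone modify --rgw-zone=zone2 \
    --endpoints=http://ceph-node-04.cephlab.com:8000,http://ceph-node-05.cephlab.com:8000
radosgw-admin period update --commit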
We can take another approach using a load balancer for multisite sync endpoints. For example, a dedicated ingress service or any other HTTP load balancer. If we take this approach, we would just have a single FQDN in the list of zonegroup and zone endpoints.
It depends. External load balancing could be better if the load balancer can offer at least the same throughput as round-robin across the configured dedicated RGW services. As an example, if our external load balancer is HAproxy running on a single VM with a single VIP and limited network throughput, we are better off using the RGW round-robin replication endpoint list option. For releases after a PR from early 2024 was merged, I would say that both options are ok. You need to trade the simplicity of just setting up a list of IPs for the endpoints, which is done for us automatically with the RGW manager module, against the more advanced features that a full-blown load balancer can offer.
In part four of this series, we discussed everything related to load-balancing our RGW S3 endpoints. We covered multiple load-balancing techniques, including the bundled Ceph-provided load balancer, the Ingress service
. In part five, we will detail the new Sync Policy feature that provides Object Multisite replication with a granular and flexible sync policy scheme.
The authors would like to thank IBM for supporting the community by facilitating our time to create these posts.
In part seven of this Ceph Multisite series, we introduce Archive Zone concepts and architecture. We will share a hands-on example of attaching an archive zone to a running Ceph Object Multisite cluster.
Archive your critical object data residing on Ceph using the Archive Zone feature.
The Archive Zone uses the multisite replication and S3 object versioning features. In this way, it will keep all versions of each object available even when deleted from the production site.
With the archive zone, you can have object immutability without the overhead of enabling object versioning in your production zones, saving the space that the replicas of the versioned S3 objects would consume in non-archive zones, which may well be deployed on faster yet more expensive storage devices.
This can protect your data against logical or physical errors. It can save users from logical failures, for example, accidental deletion of a bucket in a production zone. It can protect your data from massive hardware failures or complete production site failure.
As the archive zone provides an immutable copy of your production data, it can serve as a key component of a ransomware protection strategy.
You can control the storage space usage of an archive zone through lifecycle policies of the production buckets, where you can define the number of versions you would like to keep for an object.
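As an illustration of such a policy, the sketch below expires noncurrent versions of objects after 30 days; the bucket name and values are assumptions reused from earlier examples, and keeping a fixed number of versions with the NewerNoncurrentVersions field is only possible on releases that support it:

cat << EOF > lifecycle.json
{
  "Rules": [
    {
      "ID": "trim-noncurrent-versions",
      "Status": "Enabled",
      "Filter": {"Prefix": ""},
      "NoncurrentVersionExpiration": {"NoncurrentDays": 30}
    }
  ]
}
EOF
aws --endpoint https://s3.zone1.cephlab.com:443 s3api put-bucket-lifecycle-configuration \
    --bucket testbucket --lifecycle-configuration file://lifecycle.json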
We can select on a per-bucket basis the data to send/replicate to the archive zone. If for example, we have pre-production buckets that don’t have any valuable data we can disable the archive zone replication on those buckets.
The archive zone as a zone in a multisite zonegroup can have a different setup than production zones, including its own set of pools and replication rules.
Ceph archive zones have the following main characteristics:
The archive zone S3 endpoint for data recovery can be configured on a private network that is only accessible to the operations administrator team. If recovery of a production object is required, the request would need to go through that team.
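For instance, assuming the archive zone exposes an S3 endpoint such as the hypothetical s3.archive.cephlab.com on that private network, the operations team could locate and retrieve an older version of an object roughly as follows (bucket and object names reuse earlier examples, and the version ID is a placeholder):

aws --endpoint https://s3.archive.cephlab.com:443 s3api list-object-versions \
    --bucket testbucket --prefix firsfile
aws --endpoint https://s3.archive.cephlab.com:443 s3api get-object \
    --bucket testbucket --key firsfile --version-id <version-id> firsfile.recovered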
We can add an archive zone to a Ceph Object Storage single site configuration. With this configuration, we can attach the archive zone to the running single zone, single Ceph cluster, as depicted in the following figure:
Or we can attach our archive zone to a Ceph Object Storage multisite configuration. If, for example, we have a realm/zonegroup replicating between two zones, we can add a third zone representing a third Ceph cluster. This is the architecture that we are going to use in our example, building on our work in the previous posts where we set up a Ceph Multisite replication cluster. We are now going to add a third zone to our zonegroup, configured as an immutable archive zone. An example of this architecture is shown in the following diagram.
Let’s start with our archive zone configuration. We have a freshly deployed third Ceph cluster running on four nodes named ceph-node-[08-11].cephlab.com
.
[root@ceph-node-08 ~]# ceph orch host ls\\nHOST ADDR LABELS STATUS\\nceph-node-08.cephlab.com ceph-node-08 _admin,osd,mon,mgr,rgwsync \\nceph-node-09.cephlab.com 192.168.122.135 osd,mon,mgr,rgwsync\\nceph-node-10.cephlab.com 192.168.122.204 osd,mon,mgr,rgw \\nceph-node-11.cephlab.com 192.168.122.194 osd,rgw \\n4 hosts in cluster\\n
The archive zone is not currently configurable using the Manager rgw
module, so we must run radosgw-admin
commands to configure it. First, we pull the information from our already-deployed multisite
realm. We use the zonegroup endpoint and the access and secret keys for our RGW multisite synchronization user. If you need to check the details of your sync user, you can run: radosgw-admin user info --uid sysuser-multisite
.
[root@ceph-node-08]# radosgw-admin realm pull --rgw-realm=multisite --url=http://ceph-node-01.cephlab.com:8000 --access-key=X1BLKQE3VJ1QQ27ORQP4 --secret=kEam3Fq5Wgf24Ns1PZXQPdqb5CL3GlsAwpKJqRjg --default\\n\\n[root@ceph-node-08]# radosgw-admin period pull --url=http://ceph-node-01.cephlab.com:8000 --access-key=X1BLKQE3VJ1QQ27ORQP4 --secret=kEam3Fq5Wgf24Ns1PZXQPdqb5CL3GlsAwpKJqRjg\\n
Once we have pulled the realm and the period locally, our third cluster has all required realm and zonegroup configuration. If we run radosgw-admin zonegroup get
, we will see all details of our current multisite setup. Moving forward we will configure a new zone named archive
. We provide the list of endpoints (the dedicated sync RGWs that we are going to deploy on our new cluster), the access and secret keys for the sync user, and last but not least the tier type. This flag defines that the new zone will be created as an archive zone.
[root@ceph-node-08]# radosgw-admin zone create --rgw-zone=archive --rgw-zonegroup=multizg --endpoints=http://ceph-node-08.cephlab.com:8000,http://ceph-node-09.cephlab.com:8000 --access-key=X1BLKQE3VJ1QQ27ORQP4 --secret=kEam3Fq5Wgf24Ns1PZXQPdqb5CL3GlsAwpKJqRjg --tier-type=archive --default\\n
With the new zone in place, we can update the period to push the new zone configuration to the rest of the zones in the zonegroup:
[root@ceph-node-08]# radosgw-admin period update --commit\\n
Using cephadm, we deploy two RGW services that will replicate data from production zones. In this example, we use the cephadm RGW CLI instead of a spec file to showcase a different way to configure your Ceph services with cephadm. Both new RGWs that we spin up will belong to the archive zone. Using the --placement
argument, we configure two RGW services that will run on ceph-node-08
and ceph-node-09
, the same nodes we configured as our zone replication endpoints via our previous commands.
[root@ceph-node-08 ~]# ceph orch apply rgw multi.archive --realm=multisite --zone=archive --placement=\\"2 ceph-node-08.cephlab.com ceph-node-09.cephlab.com\\" --port=8000\\n\\nScheduled rgw.multi.archive update...\\n
We can check the RGWs have started correctly:
[root@ceph-node-08]# ceph orch ps | grep archive\\nrgw.multi.archive.ceph-node-08.hratsi ceph-node-08.cephlab.com *:8000 running (10m) 10m ago 10m 80.5M - 18.2.0-131.el9cp 463bf5538482 44608611b391\\nrgw.multi.archive.ceph-node-09.lubyaa ceph-node-09.cephlab.com *:8000 running (10m) 10m ago 10m 80.7M - 18.2.0-131.el9cp 463bf5538482 d39dbc9b3351\\n
Once the new RGWs spin up, the new pools for the archive zone are created for us. Remember that if we want to use erasure coding for our RGW data pool, this is the moment to create it, before we enable replication from production to the archive zone. Otherwise, the data pool is created with the default data protection strategy: replication with three copies (R3).
[root@ceph-node-08]# ceph osd lspools | grep archive\\n8 archive.rgw.log\\n9 archive.rgw.control\\n10 archive.rgw.meta\\n11 archive.rgw.buckets.index\\n
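For example, pre-creating an erasure-coded data pool for the archive zone before any data lands in it could look like the following sketch; the EC profile values and PG count are illustrative, and the pool name follows the usual <zone>.rgw.buckets.data convention:

# Illustrative only -- adjust k/m, failure domain, and PG count to your cluster
ceph osd erasure-code-profile set rgwec42 k=4 m=2 crush-failure-domain=host
ceph osd pool create archive.rgw.buckets.data 64 64 erasure rgwec42
ceph osd pool application enable archive.rgw.buckets.data rgw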
When we now check the sync status from one of our archive zone nodes, we see that there is currently no replication configured. This is because we are using sync policy
, and there is no zonegroup sync policy configured for the archive zone:
[root@ceph-node-08]# radosgw-admin sync status --rgw-zone=archive\\n realm beeea955-8341-41cc-a046-46de2d5ddeb9 (multisite)\\n zonegroup 2761ad42-fd71-4170-87c6-74c20dd1e334 (multizg)\\n zone bac4e4d7-c568-4676-a64c-f375014620ae (archive)\\n current time 2024-02-12T17:19:24Z\\nzonegroup features enabled: resharding\\n disabled: compress-encrypted\\n metadata sync syncing\\n full sync: 0/64 shards\\n incremental sync: 64/64 shards\\n metadata is caught up with master\\n data sync source: 66df8c0a-c67d-4bd7-9975-bc02a549f13e (zone1)\\n not syncing from zone\\n source: 7b9273a9-eb59-413d-a465-3029664c73d7 (zone2)\\n not syncing from zone\\n
Now we want to start replicating data to our archive zone, so we need to create a zonegroup policy. Recall from our previous post that we have a zonegroup policy configured to allow
replication at the zonegroup level, and then we configured replication on a per-bucket basis.
In this case, we will take a different approach with the archive zone. We are going to configure unidirectional sync at the zonegroup level, and set the policy status to enabled
so by default, all buckets in the zone zone1
will be replicated to the archive
zone.
As before, to create a sync policy we need a group, a flow, and a pipe. Let's create a new zonegroup sync policy group called grouparchive
:
[root@ceph-node-00 ~]# radosgw-admin sync group create --group-id=grouparchive --status=enabled \\n
We are creating a “directional” (unidirectional) flow that will replicate all data from zone1
to the archive
zone:
[root@ceph-node-00 ~]# radosgw-admin sync group flow create --group-id=grouparchive --flow-id=flow-archive --flow-type=directional --source-zone=zone1 --dest-zone=archive\\n
Finally, we create a pipe where we use a *
wildcard for all fields to avoid typing the full zone names. The *
represents all zones configured in the flow. We could have instead entered zone1
and archive
in the zone fields. The use of wildcards here helps avoid typos and generalizes the procedure.
[root@ceph-node-00 ~]# radosgw-admin sync group pipe create --group-id=grouparchive --pipe-id=pipe-archive --source-zones=\'*\' --source-bucket=\'*\' --dest-zones=\'*\' --dest-bucket=\'*\'\\n
Zonegroup sync policies always need to be committed:
[root@ceph-node-00 ~]# radosgw-admin period update --commit\\n
When we check the configured zonegroup policies, we now see two groups, group1
from our previous blog posts and grouparchive
that we created and configured just now:
[root@ceph-node-00 ~]# radosgw-admin sync group get\\n[\\n {\\n \\"key\\": \\"group1\\",\\n \\"val\\": {\\n \\"id\\": \\"group1\\",\\n \\"data_flow\\": {\\n \\"symmetrical\\": [\\n {\\n \\"id\\": \\"flow-mirror\\",\\n \\"zones\\": [\\n \\"zone1\\",\\n \\"zone2\\"\\n ]\\n }\\n ]\\n },\\n \\"pipes\\": [\\n {\\n \\"id\\": \\"pipe1\\",\\n \\"source\\": {\\n \\"bucket\\": \\"*\\",\\n \\"zones\\": [\\n \\"*\\"\\n ]\\n },\\n \\"dest\\": {\\n \\"bucket\\": \\"*\\",\\n \\"zones\\": [\\n \\"*\\"\\n ]\\n },\\n \\"params\\": {\\n \\"source\\": {\\n \\"filter\\": {\\n \\"tags\\": []\\n }\\n },\\n \\"dest\\": {},\\n \\"priority\\": 0,\\n \\"mode\\": \\"system\\",\\n \\"user\\": \\"\\"\\n }\\n }\\n ],\\n \\"status\\": \\"allowed\\"\\n }\\n },\\n {\\n \\"key\\": \\"grouparchive\\",\\n \\"val\\": {\\n \\"id\\": \\"grouparchive\\",\\n \\"data_flow\\": {\\n \\"directional\\": [\\n {\\n \\"source_zone\\": \\"zone1\\",\\n \\"dest_zone\\": \\"archive\\"\\n }\\n ]\\n },\\n \\"pipes\\": [\\n {\\n \\"id\\": \\"pipe-archive\\",\\n \\"source\\": {\\n \\"bucket\\": \\"*\\",\\n \\"zones\\": [\\n \\"*\\"\\n ]\\n },\\n \\"dest\\": {\\n \\"bucket\\": \\"*\\",\\n \\"zones\\": [\\n \\"*\\"\\n\\n ]\\n },\\n \\"params\\": {\\n \\"source\\": {\\n \\"filter\\": {\\n \\"tags\\": []\\n }\\n },\\n \\"dest\\": {},\\n \\"priority\\": 0,\\n \\"mode\\": \\"system\\",\\n \\"user\\": \\"\\"\\n }\\n }\\n ],\\n \\"status\\": \\"enabled\\"\\n }\\n }\\n]\\n
When we check any bucket from zone1
(here we choose the unidirectional
bucket, but it could be any other), we see that we now have a new sync policy configured with the ID pipe-archive
. This comes from the zonegroup policy we just applied because this is unidirectional. We run the command from ceph-node-00
in zone1
. We see only the dests
field populated, with the source
being zone1
and the destination being the archive
zone.
[root@ceph-node-00 ~]# radosgw-admin sync info --bucket unidirectional\\n{\\n \\"sources\\": [],\\n \\"dests\\": [\\n {\\n \\"id\\": \\"pipe-archive\\",\\n \\"source\\": {\\n \\"zone\\": \\"zone1\\",\\n \\"bucket\\": \\"unidirectional:66df8c0a-c67d-4bd7-9975-bc02a549f13e.36430.1\\"\\n },\\n \\"dest\\": {\\n \\"zone\\": \\"archive\\",\\n \\"bucket\\": \\"unidirectional:66df8c0a-c67d-4bd7-9975-bc02a549f13e.36430.1\\"\\n },\\n \\"params\\": {\\n \\"source\\": {\\n \\"filter\\": {\\n \\"tags\\": []\\n }\\n },\\n \\"dest\\": {},\\n \\"priority\\": 0,\\n \\"mode\\": \\"system\\",\\n \\"user\\": \\"\\"\\n }\\n },\\n {\\n \\"id\\": \\"test-pipe1\\",\\n \\"source\\": {\\n \\"zone\\": \\"zone1\\",\\n \\"bucket\\": \\"unidirectional:66df8c0a-c67d-4bd7-9975-bc02a549f13e.36430.1\\"\\n },\\n \\"dest\\": {\\n \\"zone\\": \\"zone2\\",\\n \\"bucket\\": \\"unidirectional:66df8c0a-c67d-4bd7-9975-bc02a549f13e.36430.1\\"\\n },\\n \\"params\\": {\\n \\"source\\": {\\n \\"filter\\": {\\n \\"tags\\": []\\n }\\n },\\n \\"dest\\": {},\\n \\"priority\\": 0,\\n \\"mode\\": \\"system\\",\\n \\"user\\": \\"user1\\"\\n }\\n },\\n
When we run the radosgw-admin sync status
command again, we see that the status for zone1
has changed from not syncing from zone
to synchronization enabled and data is caught up with source
.
[root@ceph-node-08 ~]# radosgw-admin sync status --rgw-zone=archive\\n realm beeea955-8341-41cc-a046-46de2d5ddeb9 (multisite)\\n zonegroup 2761ad42-fd71-4170-87c6-74c20dd1e334 (multizg)\\n zone bac4e4d7-c568-4676-a64c-f375014620ae (archive)\\n current time 2024-02-12T17:09:26Z\\nzonegroup features enabled: resharding\\n disabled: compress-encrypted\\n metadata sync syncing\\n full sync: 0/64 shards\\n incremental sync: 64/64 shards\\n metadata is caught up with master\\n data sync source: 66df8c0a-c67d-4bd7-9975-bc02a549f13e (zone1)\\n syncing\\n full sync: 0/128 shards\\n incremental sync: 128/128 shards\\n data is caught up with source\\n source: 7b9273a9-eb59-413d-a465-3029664c73d7 (zone2)\\n not syncing from zone\\n\\n
Now all data ingested into zone1
will be replicated to the archive
zone. With this configuration, we only have to set a unidirectional flow from zone1 to archive. If, for example, a new object is ingested into zone2, then because we have a bidirectional bucket sync policy for the unidirectional bucket, the object replication flow will be the following: zone2 → zone1 → archive.
We introduced the archive zone feature in part seven of this series. We shared a hands-on example of configuring an archive zone in a running Ceph Object multisite cluster. In the final post of this series, we will demonstrate how the Archive Zone feature can help you recover critical data from your production site.
The authors would like to thank IBM for supporting the community by facilitating our time to create these posts.
In the first part of this series, we explored the fundamentals of Ceph Object Storage and its policy-based archive to cloud/tape feature, which enables seamless data migration to remote S3-compatible storage classes. This feature is instrumental in offloading data to cost-efficient storage tiers, such as cloud or tape-based systems. However, in the past, the process has been unidirectional. Once objects are transitioned, retrieving them requires direct access to the cloud provider’s S3 endpoint. This limitation has introduced operational challenges, particularly when accessing archived or cold-tier data.
We are introducing policy-based data retrieval in the Ceph Object Storage ecosystem to address these gaps. This enhancement empowers administrators and operations teams to retrieve objects transitioned to cloud or tape tiers directly back into the Ceph cluster, aligning with operational efficiency and data accessibility needs.
Policy-based data retrieval transforms the usability of cloud-transitioned objects in Ceph. Whether the data resides in cost-efficient tape archives or high-latency/low-cost cloud tiers, this feature ensures that users can seamlessly access and manage their objects without relying on the external provider\'s S3 endpoints. This capability simplifies workflows and enhances compliance with operational policies and data lifecycle requirements.
This new functionality offers a dual approach to retrieving transitioned objects to remote cloud/tape S3-compatible endpoints:
S3 RestoreObject API Implementation: Similar to the AWS S3 RestoreObject
API, this feature allows users to retrieve objects manually using the S3 RestoreObject
API. The object restore operation can be permanent or temporary based on the retention period specified in the RestoreObject
API Call.
Read-Through Mode: By introducing a configurable --allow-read-through
capability in the cloud-tier storage class configuration, Ceph can serve read requests for transitioned objects. Upon receiving a GET
request, the system asynchronously retrieves the object from the cloud tier, stores it locally, and serves the data to the user. This eliminates the InvalidObjectState
error previously encountered for cloud-transitioned objects.
The restored data is treated as temporary and will exist in the Ceph cluster only for the duration specified during the restore request. Once the specified period expires, the restored data will be deleted, and the object will revert to a stub, preserving metadata and cloud transition configurations.
During the temporary restore period, the object is exempted from lifecycle rules that might otherwise move it to a different tier or delete it. This ensures uninterrupted access until the expiry date.
Restored objects are, by default, written to the STANDARD
storage class within the Ceph cluster. However, for temporary objects, the x-amz-storage-class
header will still return the original cloud-tier storage class. This is in line with AWS Glacier semantics, where restored objects’ storage class remains the same.
We are uploading an object called 2gb
to the on-prem Ceph cluster using a bucket called databucket
. In part one of this blog post series, we configured databucket
with a lifecycle policy that will tier/archive data into IBM COS after 30 days. We set up the AWS CLI client with a profile called tiering
to interact with Ceph Object Gateway S3 API endpoint.
aws --profile tiering --endpoint https://s3.cephlabs.com s3 cp 2gb s3://databucket upload: ./2gb to s3://databucket/2gb\\n
We can check the size of the uploaded object in the STANDARD
storage class within our on-prem Ceph cluster:
aws --profile tiering --endpoint https://s3.cephlabs.com s3api head-object --bucket databucket --key 2gb\\n{\\n \\"AcceptRanges\\": \\"bytes\\",\\n \\"LastModified\\": \\"2024-11-26T21:31:05+00:00\\",\\n \\"ContentLength\\": 2000000000,\\n \\"ETag\\": \\"\\\\\\"b459c232bfa8e920971972d508d82443-60\\\\\\"\\",\\n \\"ContentType\\": \\"binary/octet-stream\\",\\n \\"Metadata\\": {},\\n \\"PartsCount\\": 60\\n}\\n
After 30 days, the lifecycle transition kicks in, and the object is transitioned to the cloud tier. First, as an admin, we check with the radosgw-admin
command that lifecycle (LC) processing has completed, and then as a user, we use the S3 HeadObject
API call to query the status of the object:
# radosgw-admin lc list| jq .[1]\\n{\\n \\"bucket\\": \\":databucket:fcabdf4a-86f2-452f-a13f-e0902685c655.310403.1\\",\\n \\"shard\\": \\"lc.23\\",\\n \\"started\\": \\"Tue, 26 Nov 2024 21:32:15 GMT\\",\\n \\"status\\": \\"COMPLETE\\"\\n}\\n\\n# aws --profile tiering --endpoint https://s3.cephlabs.com s3api head-object --bucket databucket --key 2gb\\n{\\n \\"AcceptRanges\\": \\"bytes\\",\\n \\"LastModified\\": \\"2024-11-26T21:32:48+00:00\\",\\n \\"ContentLength\\": 0,\\n \\"ETag\\": \\"\\\\\\"b459c232bfa8e920971972d508d82443-60\\\\\\"\\",\\n \\"ContentType\\": \\"binary/octet-stream\\",\\n \\"Metadata\\": {},\\n \\"StorageClass\\": \\"ibm-cos\\"\\n}\\n
As an admin, we can use the radosgw-admin bucket stats command to check the space used. We can see that rgw.main
is empty, and our rgw.cloudtiered
placement is the only one with data stored.
# radosgw-admin bucket stats --bucket databucket | jq .usage\\n{\\n \\"rgw.main\\": {\\n \\"size\\": 0,\\n \\"size_actual\\": 0,\\n \\"size_utilized\\": 0,\\n \\"size_kb\\": 0,\\n \\"size_kb_actual\\": 0,\\n \\"size_kb_utilized\\": 0,\\n \\"num_objects\\": 0\\n },\\n \\"rgw.multimeta\\": {\\n \\"size\\": 0,\\n \\"size_actual\\": 0,\\n \\"size_utilized\\": 0,\\n \\"size_kb\\": 0,\\n \\"size_kb_actual\\": 0,\\n \\"size_kb_utilized\\": 0,\\n \\"num_objects\\": 0\\n },\\n \\"rgw.cloudtiered\\": {\\n \\"size\\": 1604857600,\\n \\"size_actual\\": 1604861952,\\n \\"size_utilized\\": 1604857600,\\n \\"size_kb\\": 1567244,\\n \\"size_kb_actual\\": 1567248,\\n \\"size_kb_utilized\\": 1567244,\\n \\"num_objects\\": 3\\n }\\n}\\n
Now that the object has transitioned to our IBM COS cloud tier, let's restore it to our Ceph cluster using the S3 RestoreObject
API call. In this example, we'll request a temporary restore and set the expiration to three days:
# aws --profile tiering --endpoint https://s3.cephlabs.com s3api restore-object --bucket databucket --key 2gb --restore-request Days=3\\n
If we attempt to get an object that is still being restored, we get an error message like this:
# aws --profile tiering --endpoint https://s3.cephlabs.com s3api get-object --bucket databucket --key 2gb /tmp/2gb\\nAn error occurred (RequestTimeout) when calling the GetObject operation (reached max retries: 2): restore is still in progress\\n
Using the S3 API, we can issue the HeadObject
call and check the status of the Restore
attribute. In this example, we can see that our restore from the IBM COS cloud endpoint to Ceph has finished, as ongoing-request
is set to false
. We have an expiry date for the object, as we used the RestoreObject
call with --restore-request Days=3
. Other things to check from this output: the occupied size of the object on our local Ceph cluster is 2GB, as expected once it is restored. Also, the storage class is ibm-cos
. As noted before for temporarily transitioned objects, even when using the STANDARD
Ceph RGW storage class, we still keep the ibm-cos
storage class. Now that the object has been restored, we can issue an S3 GET
API call from the client to access the object.
# aws --profile tiering --endpoint https://s3.cephlabs.com s3api head-object --bucket databucket --key 2gb\\n{\\n \\"AcceptRanges\\": \\"bytes\\",\\n \\"Restore\\": \\"ongoing-request=\\\\\\"false\\\\\\", expiry-date=\\\\\\"Thu, 28 Nov 2024 08:46:36 GMT\\\\\\"\\",\\n \\"LastModified\\": \\"2024-11-27T08:36:39+00:00\\",\\n \\"ContentLength\\": 2000000000,\\n \\"ETag\\": \\"\\\\\\"\\\\\\"0c4b59490637f76144bb9179d1f1db16-382\\\\\\"\\\\\\"\\",\\n \\"ContentType\\": \\"binary/octet-stream\\",\\n \\"Metadata\\": {},\\n \\"StorageClass\\": \\"ibm-cos\\"\\n}\\n\\n# aws --profile tiering --endpoint https://s3.cephlabs.com s3api get-object --bucket databucket --key 2gb /tmp/2gb\\n
Restored data in a permanent restore will remain in the Ceph cluster indefinitely, making it accessible as a regular object. Unlike temporary restores, no expiration period is defined, and the object will not revert to a stub after retrieval. This is suitable for scenarios where long-term access to the object is required without additional re-restoration steps.
Once permanently restored, the object is treated as a regular object within the Ceph cluster. All lifecycle rules (such as transition to cloud storage or expiration policies) are reapplied, and the restored object is fully integrated into the bucket's data lifecycle workflows.
By default, permanently restored objects are written to the STANDARD
storage class within the Ceph cluster. Unlike temporary restores, the object’s x-amz-storage-class
header will reflect the STANDARD
storage class, indicating permanent residency in the cluster.
Restore the object permanently by not supplying a number of days to the --restore-request
argument:
# aws --profile tiering --endpoint https://s3.cephlabs.com s3api restore-object --bucket databucket --key hosts2 --restore-request {}\\n
Verify the restored object: it's part of the STANDARD
storage class, so the object is a first-class citizen of the Ceph cluster, ready for integration into broader operational workflows.
# aws --profile tiering --endpoint https://s3.cephlabs.com s3api head-object --bucket databucket --key hosts2\\n{\\n \\"AcceptRanges\\": \\"bytes\\",\\n \\"LastModified\\": \\"2024-11-27T08:28:55+00:00\\",\\n \\"ContentLength\\": 304,\\n \\"ETag\\": \\"\\\\\\"01a72b8a9d073d6bcae565bd523a76c5\\\\\\"\\",\\n \\"ContentType\\": \\"binary/octet-stream\\",\\n \\"Metadata\\": {},\\n \\"StorageClass\\": \\"STANDARD\\"\\n}\\n
Objects accessed through the Read-Through Restore mechanism are restored temporarily into the Ceph cluster. When a GET
request is made for a cloud-transitioned object, the system retrieves the object from the cloud tier asynchronously and makes it available for a duration defined by the read_through_restore_days
value. After the expiry period, the restored data is deleted, and the object reverts to its stub state, retaining metadata and transition configurations.
Before enabling read-through mode, if we try to access a stub object in our local Ceph cluster that has been transitioned to a remote S3 endpoint via policy-based archival, we will get the following error message:
# aws --profile tiering --endpoint https://s3.cephlabs.com s3api get-object --bucket databucket --key 2gb6 /tmp/2gb6\\nAn error occurred (InvalidObjectState) when calling the GetObject operation: Read through is not enabled for this config\\n
So let's first enable read-through mode. As a Ceph admin, we need to modify our current ibm-cos
cloud-tier storage class and add two new tier-config parameters: --tier-config=allow_read_through=true,read_through_restore_days=3
:
# radosgw-admin zonegroup placement modify --rgw-zonegroup default \\\\\\n --placement-id default-placement --storage-class ibm-cos \\\\\\n --tier-config=allow_read_through=true,read_through_restore_days=3\\n
If you have not performed any previous multisite configuration, a default zone and zonegroup are created for you, and changes to the zone/zonegroup will not take effect until the Ceph Object Gateways (RGW daemons) are restarted. If you have created a realm for multisite, the zone/zonegroup changes will take effect once the changes are committed with radosgw-admin period update --commit
. In our case, it's enough to restart RGW daemons to apply changes:
# ceph orch restart rgw.default\\nScheduled to restart rgw.default.ceph02.fvqogr on host \'ceph02\'\\nScheduled to restart rgw.default.ceph03.ypphif on host \'ceph03\'\\nScheduled to restart rgw.default.ceph04.qinihj on host \'ceph04\'\\nScheduled to restart rgw.default.ceph06.rktjon on host \'ceph06\'\\n
Once read-through mode is enabled and the RGW services are restarted, when a GET
request is made for an object in the cloud tier, the object will automatically be restored to the Ceph cluster and served to the user.
# aws --profile tiering --endpoint https://s3.cephlabs.com s3api get-object --bucket databucket --key 2gb6 /tmp/2gb6\\n{\\n \\"AcceptRanges\\": \\"bytes\\",\\n \\"Restore\\": \\"ongoing-request=\\\\\\"false\\\\\\", expiry-date=\\\\\\"Thu, 28 Nov 2024 08:46:36 GMT\\\\\\"\\",\\n \\"LastModified\\": \\"2024-11-27T08:36:39+00:00\\",\\n \\"ContentLength\\": 2000000000,\\n \\"ETag\\": \\"\\\\\\"\\\\\\"0c4b59490637f76144bb9179d1f1db16-382\\\\\\"\\\\\\"\\",\\n \\"ContentType\\": \\"binary/octet-stream\\",\\n \\"Metadata\\": {},\\n \\"StorageClass\\": \\"ibm-cos\\"\\n}\\n
Ceph developers are improving the policy-based data retrieval feature with upcoming enhancements that include:
Support for the RestoreObject API to fetch objects, instead of GET, from S3 endpoints that use the Glacier API.
Policy-based data retrieval for Ceph Storage is a crucial addition that enhances the current object storage tiering capabilities. Feel free to share your thoughts or questions about this new feature on the ceph-users mailing list. We’d love to hear how you plan to use it or if you’d like to see any aspects enhanced.
For more information about RGW placement targets and storage classes, visit this page
For a related take on directing data to multiple RGW storage classes, view this presentation
The authors would like to thank IBM for supporting the community by facilitating our time to create these posts.
In the previous episode, we introduced Ceph Object Storage multisite features. We described the lab setup we will use in the following chapters to deploy and configure Ceph object multisite asynchronous replication.
Part two of this series will enumerate the steps to establish the initial multisite replication between our Ceph clusters, as depicted in the following diagram.
As part of the Quincy release, a new Manager module named rgw
was added to the Ceph orchestrator cephadm
. The rgw
manager module makes the configuration of multisite replication straightforward. This section will show you how to configure Ceph Object Storage multisite replication between two zones (each zone is an independent Ceph cluster) through the CLI using the new rgw
manager module.
We will start by creating an RGW module spec file for cluster1
.
We will use labels on our hosts to help define which nodes may host each service. In this case, for the replication RGW services, we set the rgwsync
label. Any host that has this label configured will start an RGW service with the specs defined in the file. The rest of the options included will take care of configuring the names for our realm, zonegroup and zone, and direct that the RGW services listen on port 8000/tcp.
[root@ceph-node-00 ~]# cat << EOF >> /root/rgw.spec\\nplacement:\\n label: rgwsync\\n count_per_host: 1\\nrgw_realm: multisite\\nrgw_zone: zone1\\nrgw_zonegroup: multizg\\nspec:\\n rgw_frontend_port: 8000\\nEOF\\n
In our first cluster, we want to run the sync RGW services on nodes ceph-node-00
and ceph-node-01
, so we need to label the corresponding nodes:
[root@ceph-node-00 ~]# ceph orch host label add ceph-node-00.cephlab.com rgwsync\\nAdded label rgwsync to host ceph-node-00.cephlab.com\\n[root@ceph-node-00 ~]# ceph orch host label add ceph-node-01.cephlab.com rgwsync\\nAdded label rgwsync to host ceph-node-01.cephlab.com\\n
Once the nodes have been labeled, we enable the RGW manager module and bootstrap the RGW multisite configuration. When bootstrapping the multisite config, the rgw
manager module will take care of the following steps:
[root@ceph-node-00 ~]# ceph mgr module enable rgw\\n[root@ceph-node-00 ~]# ceph rgw realm bootstrap -i rgw.spec\\nRealm(s) created correctly. Please use \'ceph rgw realm tokens\' to get the token.\\n
Let’s check the realm:
[root@ceph-node-00 ~]# radosgw-admin realm list\\n{\\n \\"default_info\\": \\"d85b6eef-2285-4072-8407-35e2ea7a17a2\\",\\n \\"realms\\": [\\n \\"multisite\\"\\n ]\\n}\\n
Multisite sync user:
[root@ceph01 ~]# radosgw-admin user list | grep sysuser\\n \\"Sysuser-multisite\\"\\n
Zone1 RGW RADOS pools:
[root@ceph01 ~]# ceph osd lspools | grep rgw\\n24 .rgw.root\\n25 zone1.rgw.log\\n26 zone1.rgw.control\\n27 zone1.rgw.meta\\n
Once we create the first bucket, the bucket index pool will be created automatically. Also, once we upload the first objects/data to a bucket in zone1
, the data pool will be created for us. By default, pools are created with a replication factor of 3, using the cluster's pre-defined CRUSH rule replicated_rule
. If we want to use Erasure Coding (EC) for the data pool, or customize, for example, the failure domain, we need to manually pre-create the pools with our customizations before we start uploading data into the first bucket.
NOTE: Don’t forget to double-check that your RGW pools have the right number of Placement Groups (PGs) to provide the required performance. We can choose to enable the PG autoscaler manager module with the bulk
flag set for each pool, or we can statically calculate the number of PGs our pools are going to need up front with the help of the PG calculator. We suggest a target of 200 PG replicas per OSD, the "PG ratio".
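As a sketch, with example pool name and values, either approach could be applied to the data pool once it exists:

# Hint the PG autoscaler that this pool will hold the bulk of the data
ceph osd pool set zone1.rgw.buckets.data bulk true
# Or set the PG count statically after working it out with the PG calculator
ceph osd pool set zone1.rgw.buckets.data pg_num 128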
NOTE: Only RGW data pools can be configured with erasure coding. The rest of the RGW pools must be configured with a replication scheme, by default with size=3
.
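A minimal sketch of pre-creating the zone1 data pool as erasure-coded, with example profile values and PG count, might look like this:

ceph osd erasure-code-profile set rgwec k=4 m=2 crush-failure-domain=host
ceph osd pool create zone1.rgw.buckets.data 128 128 erasure rgwec
ceph osd pool application enable zone1.rgw.buckets.data rgw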
The RGW services are up and running, serving the S3 endpoint on port 8000:
[root@ceph-node-00 ~]# curl http://ceph-node-00:8000\\n<?xml version=\\"1.0\\" encoding=\\"UTF-8\\"?><ListAllMyBucketsResult xmlns=\\"http://s3.amazonaws.com/doc/2006-03-01/\\"><Owner><ID>anonymous</ID><DisplayName></DisplayName></Owner><Buckets></Buckets></ListAllMyBucketsResult>\\n
The RGW manager module creates a token with encoded information of our deployment. Other Ceph clusters that want to be added as a replicated zone to our multisite configuration can import this token into the RGW manager module and have replication configured and running with a single command.
We can check the contents of the token with the ceph rgw realm tokens
command and decode it with the base64
command. As you can see, it provides the required information for the secondary zone to connect to the primary zonegroup and pull the realm and zonegroup configuration.
[root@ceph-node-00 ~]# TOKEN=$(ceph rgw realm tokens | jq .[0].token | sed \'s/\\"//g\')\\n[root@ceph-node-00 ~]# echo $TOKEN | base64 -d\\n{\\n \\"realm_name\\": \\"multisite\\",\\n \\"realm_id\\": \\"d85b6eef-2285-4072-8407-35e2ea7a17a2\\",\\n \\"endpoint\\": \\"http://ceph-node-00.cephlab.com:8000\\",\\n \\"access_key\\": \\"RUB7U4C6CCOMG3EM9QGF\\",\\n \\"secret\\": \\"vg8XFPehb21Y8oUMB9RS0XXXXH2E1qIDIhZzpC\\"\\n}\\n
You can see from the prompt that we have switched to our second Ceph cluster, having copied the token from our first cluster and defined the rest of the parameters similarly to the first cluster.
[root@ceph-node-04 ~]# cat rgw2.spec\\nplacement:\\n label: rgwsync\\n count_per_host: 1\\nrgw_zone: zone2\\nrgw_realm_token: ewogICAgInJlYWxtX25hbWUiOiAibXVsdGlzaXRlIiwKICAgICJyZWFsbV9pZCI6ICIxNmM3OGJkMS0xOTIwLTRlMjMtOGM3Yi1lYmYxNWQ5ODI0NTgiLAogICAgImVuZHBvaW50IjogImh0dHA6Ly9jZXBoLW5vZGUtMDEuY2VwaGxhYi5jb206ODAwMCIsCiAgICAiYWNjZXNzX2tleSI6ICIwOFlXQ0NTNzEzUU9LN0pQQzFRUSIsCiAgICAic2VjcmV0IjogImZUZGlmTXpDUldaSXgwajI0ZEw4VGppRUFtOHpRdE01ZGNScXEyTjYiCn0=\\nspec:\\n rgw_frontend_port: 8000\\n
We label the hosts that will run the Ceph RGW sync services:
[root@ceph-node-04 ~]# ceph orch host label add ceph-node-04.cephlab.com rgwsync\\nAdded label rgwsync to host ceph-node-04.cephlab.com\\n[root@ceph-node-04 ~]# ceph orch host label add ceph-node-05.cephlab.com rgwsync\\nAdded label rgwsync to host ceph-node-05.cephlab.com\\n
Enable the module, and run the ceph rgw zone create
command with the spec file we created a moment ago:
[root@ceph02 ~]# ceph mgr module enable rgw\\n[root@ceph02 ~]# ceph rgw zone create -i rgw2.spec --start-radosgw\\nZones zone2 created successfully\\n
The rgw
manager module will take care of pulling the realm and zonegroup periods using the access and secret keys from the multisite sync user. Finally, it will create zone2
and do a final period update so all zones have the latest configuration changes in place with zone2
added to zonegroup multizg
. In the following output from the radosgw-admin zonegroup get
command, we can see the zonegroup endpoints. We can also see that zone1
is the master zone for our zonegroup and the corresponding endpoints for zone1
and zone2
.
[root@ceph-node-00 ~]# radosgw-admin zonegroup get\\n{\\n \\"id\\": \\"2761ad42-fd71-4170-87c6-74c20dd1e334\\",\\n \\"name\\": \\"multizg\\",\\n \\"api_name\\": \\"multizg\\",\\n \\"is_master\\": true,\\n \\"endpoints\\": [\\n \\"http://ceph-node-04.cephlab.com:8000\\",\\n \\"http://ceph-node-05.cephlab.com:8000\\"\\n ],\\n \\"hostnames\\": [],\\n \\"hostnames_s3website\\": [],\\n \\"master_zone\\": \\"66df8c0a-c67d-4bd7-9975-bc02a549f13e\\",\\n \\"zones\\": [\\n {\\n \\"id\\": \\"66df8c0a-c67d-4bd7-9975-bc02a549f13e\\",\\n \\"name\\": \\"zone1\\",\\n \\"endpoints\\": [\\n \\"http://ceph-node-00.cephlab.com:8000\\",\\n \\"http://ceph-node-01.cephlab.com:8000\\"\\n ],\\n \\"log_meta\\": false,\\n \\"log_data\\": true,\\n \\"bucket_index_max_shards\\": 11,\\n \\"read_only\\": false,\\n \\"tier_type\\": \\"\\",\\n \\"sync_from_all\\": true,\\n \\"sync_from\\": [],\\n \\"redirect_zone\\": \\"\\",\\n \\"supported_features\\": [\\n \\"compress-encrypted\\",\\n \\"resharding\\"\\n ]\\n },\\n {\\n \\"id\\": \\"7b9273a9-eb59-413d-a465-3029664c73d7\\",\\n \\"name\\": \\"zone2\\",\\n \\"endpoints\\": [\\n \\"http://ceph-node-04.cephlab.com:8000\\",\\n \\"http://ceph-node-05.cephlab.com:8000\\"\\n ],\\n \\"log_meta\\": false,\\n \\"log_data\\": true,\\n \\"bucket_index_max_shards\\": 11,\\n \\"read_only\\": false,\\n \\"tier_type\\": \\"\\",\\n \\"sync_from_all\\": true,\\n \\"sync_from\\": [],\\n \\"redirect_zone\\": \\"\\",\\n \\"supported_features\\": [\\n \\"compress-encrypted\\",\\n \\"resharding\\"\\n ]\\n }\\n ],\\n \\"placement_targets\\": [\\n {\\n \\"name\\": \\"default-placement\\",\\n \\"tags\\": [],\\n \\"storage_classes\\": [\\n \\"STANDARD\\"\\n ]\\n }\\n ],\\n \\"default_placement\\": \\"default-placement\\",\\n \\"realm_id\\": \\"beeea955-8341-41cc-a046-46de2d5ddeb9\\",\\n \\"sync_policy\\": {\\n \\"groups\\": []\\n },\\n \\"enabled_features\\": [\\n \\"resharding\\"\\n ]\\n}\\n
To verify that replication is working, let’s create a user and a bucket:
[root@ceph-node-00 ~]# radosgw-admin user create --uid=\'user1\' --display-name=\'First User\' --access-key=\'S3user1\' --secret-key=\'S3user1key\'\\n\\n[root@ceph-node-00 ~]# aws configure\\nAWS Access Key ID [None]: S3user1\\nAWS Secret Access Key [None]: S3user1key\\nDefault region name [None]: multizg\\nDefault output format [None]: json\\n[root@ceph-node-00 ~]# aws --endpoint http://s3.cephlab.com:80 s3 ls\\n[root@ceph-node-00 ~]# aws --endpoint http://s3.cephlab.com:80 s3 mb s3://firstbucket\\nmake_bucket: firstbucket\\n[root@ceph-node-00 ~]# aws --endpoint http://s3.cephlab.com:80 s3 cp /etc/hosts s3://firstbucket\\nupload: ../etc/hosts to s3://firstbucket/hosts\\n
If we check from our second Ceph cluster, zone2
, we can see that all metadata has been replicated, and that all users and buckets that we created in zone1
are now present in zone2
.
NOTE: In this example, we will use the radosgw-admin
command to check, but we could also use S3 API commands pointing the AWS client to the IP/hostname of an RGW within the second zone.
[root@ceph-node-04 ~]# radosgw-admin user list\\n[\\n \\"dashboard\\",\\n \\"user1\\",\\n \\"sysuser-multisite\\"\\n]\\n[root@ceph-node-04 ~]# radosgw-admin bucket stats --bucket testbucket | jq .bucket\\n\\"testbucket\\"\\n
To check replication status, we can use the radosgw-admin sync status
command. For example:
[root@ceph-node-00 ~]# radosgw-admin sync status\\n realm beeea955-8341-41cc-a046-46de2d5ddeb9 (multisite)\\n zonegroup 2761ad42-fd71-4170-87c6-74c20dd1e334 (multizg)\\n zone 66df8c0a-c67d-4bd7-9975-bc02a549f13e (zone1)\\n current time 2024-01-05T22:51:17Z\\nzonegroup features enabled: resharding\\n disabled: compress-encrypted\\n metadata sync no sync (zone is master)\\n data sync source: 7b9273a9-eb59-413d-a465-3029664c73d7 (zone2)\\n syncing\\n full sync: 0/128 shards\\n incremental sync: 128/128 shards\\n data is caught up with source\\n\\n
As a recap, in part two of this multisite series, we have gone through the steps of deploying Ceph Object Storage multisite replication between two sites/zones using the rgw
manager module. This is just our first building block, as our target is to have a full-blown deployment including the much-needed load balancers.
In part three of the series, we will continue fine-tuning our multisite replication setup by dedicating specific RGW services to each type of request: client-facing or multisite replication.
The authors would like to thank IBM for supporting the community by facilitating our time to create these posts.
Ceph offers object storage tiering capabilities to optimize cost and performance by seamlessly moving data between storage classes. These tiers can be configured locally within an on-premise infrastructure or extended to include cloud-based storage classes, providing a flexible and scalable solution for diverse workloads. With policy-based automation, administrators can define lifecycle policies to migrate data between high-performance storage and cost-effective archival tiers, ensuring the right balance of speed, durability, and cost-efficiency.
Local storage classes in Ceph allow organizations to tier data between fast NVMe or SAS/SATA SSD-based pools and economical HDD or QLC-based pools within their on-premises Ceph cluster. This is particularly beneficial for applications requiring varying performance levels or scenarios where data \\"ages out\\" of high-performance requirements and can be migrated to slower, more economical storage.
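As a rough sketch of how such a local tier is defined (the storage class name and data pool are examples, not part of this series), an additional storage class can be registered on the default placement target and mapped to its own data pool:

radosgw-admin zonegroup placement add --rgw-zonegroup default \
    --placement-id default-placement --storage-class CHEAP_HDD
radosgw-admin zone placement add --rgw-zone default \
    --placement-id default-placement --storage-class CHEAP_HDD \
    --data-pool default.rgw.cheaphdd.data

A bucket lifecycle rule can then transition objects to CHEAP_HDD in the same way the cloud transition is configured later in this post.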
In addition to local tiering, Ceph offers policy-based data archival and retrieval capabilities that integrate with S3-compatible platforms for off-premises data management. Organizations can use this feature to archive data to cloud-based tiers such as IBM Cloud Object Storage, AWS S3, Azure Blob or S3 Tape Endpoints for long-term retention, disaster recovery, or cost-optimized cold storage. By leveraging policy-based automation, Ceph ensures that data is moved to the cloud or other destination based on predefined lifecycle rules, enhancing its value in hybrid cloud strategies.
Initially, Ceph's policy-based data archival (cloud sync) to S3-compatible platforms offered a uni-directional data flow, where data could only be archived from local storage pools to the designated cloud storage tier. While this allowed users to leverage cost-effective cloud platforms for cold storage or long-term data retention, the lack of retrieval capabilities limited the solution’s flexibility in data management. This meant that once data was archived to cloud storage, it could no longer be actively retrieved or re-integrated into local workflows directly through Ceph.
Ceph Squid introduced policy-based data retrieval, which marks a significant evolution in its capabilities and is now available as a Tech Preview. This enhancement enables users to retrieve S3 cloud or tape transitioned objects directly into their on-prem Ceph environment, eliminating the limitations of the previous uni-directional flow. Data can be restored as temporary or permanent objects.
This retrieval of objects can be done in two different ways:
An S3 RestoreObject API request
GET requests on transitioned objects, to restore them to the Ceph cluster transparently
In this release, we don't support object retrieval from S3 cloud/tape endpoints that use the distinct Glacier API, for example IBM Deep Archive. This feature enhancement is targeted for the Tentacle release of Ceph.
In this section, we will configure and set up the Policy-Based Data Archival feature of Ceph. We will discuss using data lifecycle policies to transition cold data to an offsite, cost-effective storage class by archiving it to IBM Cloud Object Storage (COS).
Note: The RGW Storage Classes we describe here should not be conflated with Kubernetes PV/PVC Storage Classes.
The table below provides a summary of the various lifecycle policies that the Ceph Object Gateway supports:
Policy Type | Description | Example Use Case |
---|---|---|
Expiration | Deletes objects after a specified duration | Removing temporary files automatically after 30 days |
Noncurrent Version Expiration | Deletes noncurrent versions of objects after a specified duration in versioned buckets | Managing storage costs by removing old versions of objects |
Abort Incomplete Multipart Upload | Cancels multipart uploads that are not completed within a specified duration | Free up storage by cleaning up incomplete uploads |
Transition Between Storage Classes | Moves objects between different storage classes within the same Ceph cluster after a duration | Moving data from SSD / replicated to HDD / EC storage after 90 days |
NewerNoncurrentVersions Filter | Filters noncurrent versions newer than a specified count for expiration or transition actions | Retaining only the last three noncurrent versions of an object |
ObjectSizeGreaterThan Filter | Applies the lifecycle rule only to objects larger than a specified size | Moving large video files to a lower-cost storage class |
ObjectSizeLessThan Filter | Applies the lifecycle rule only to objects smaller than a specified size | Archiving small log files after a certain period |
In addition to specifying policies, lifecycle rules can be filtered using tags or prefixes, allowing for more granular control over which objects are affected. Tags can identify specific subsets of objects based on per-object tagging, while prefixes help target objects based on their key names.
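For instance, a rule scoped by prefix and tag could look like the following sketch; the prefix, tag, day count, and target storage class are example values:

{
  "Rules": [
    {
      "ID": "Tier tagged log objects after 60 days",
      "Filter": {
        "And": {
          "Prefix": "logs/",
          "Tags": [ { "Key": "tier", "Value": "cold" } ]
        }
      },
      "Status": "Enabled",
      "Transitions": [ { "Days": 60, "StorageClass": "ibm-cos" } ]
    }
  ]
}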
First, we configure the remote S3 cloud service as the future destination of our on-prem transitioned objects. In our example, we will create an IBM COS bucket named ceph-s3-tier
.
It is important to note that we need to create a service credential for our bucket with HMAC keys enabled.
Create a new storage class on the default placement target within the default zonegroup; we use the radosgw-admin --tier-type=cloud-s3
parameter to configure the storage class against our previously configured bucket in COS S3.
# radosgw-admin zonegroup placement add --rgw-zonegroup=default --placement-id=default-placement --storage-class=ibm-cos --tier-type=cloud-s3\\n
Note: Ceph allows one to create storage classes with arbitrary names, but some clients and client libraries accept only AWS storage class names or behave uniquely when the storage class is, e.g., GLACIER
.
We can verify the available storage classes in the default zone group and placement target:
# radosgw-admin zonegroup get --rgw-zonegroup=default | jq .placement_targets[0].storage_classes\\n[\\n \\"STANDARD_IA\\",\\n \\"STANDARD\\",\\n \\"ibm-cos\\"\\n]\\n
Next, we use the radosgw-admin
command to configure the cloud-s3
storage class with specific parameters from our IBM COS bucket: endpoint, region, and account credentials:
# radosgw-admin zonegroup placement modify --rgw-zonegroup default --placement-id default-placement --storage-class ibm-cos --tier-config=endpoint=https://s3.eu-de.cloud-object-storage.appdomain.cloud,access_key=YOUR_ACCESS_KEY,secret=YOUR_SECRET_KEY,target_path=\\"ceph-s3-tier\\",multipart_sync_threshold=44432,multipart_min_part_size=44432,retain_head_object=true,region=eu-de\\n
Once the COS cloud-S3 storage class is in place, we will switch roles to a consumer of the Ceph Object Storage S3 API and configure a lifecycle policy through the RGW S3 API endpoint. Our user is named tiering
, and we have the AWS CLI pre-configured with the credentials for the tiering user.
# aws --profile tiering --endpoint https://s3.cephlabs.com s3 mb s3://databucket\\n# aws --profile tiering --endpoint https://s3.cephlabs.com s3 cp /etc/hosts s3://databucket\\n
We will attach a JSON lifecycle policy to the previously created bucket. For instance, the bucket databucket
will have the following policy, transitioning all objects older than 30 days to the COS storage class:
{\\n \\"Rules\\": [\\n {\\"ID\\": \\"Transition objects from Ceph to COS that are older than 30 days\\",\\n \\"Prefix\\": \\"\\",\\n \\"Status\\": \\"Enabled\\",\\n \\"Transitions\\": [\\n {\\n \\"Days\\": 30,\\n \\"StorageClass\\": \\"ibm-cos\\"\\n }\\n ]\\n }\\n ]\\n}\\n
As an S3 API consumer, we will use the AWS S3 CLI to apply the bucket lifecycle configuration we saved to a local file called ibm-cos-lc.json
:
# aws --profile tiering --endpoint https://s3.cephlabs.com s3api put-bucket-lifecycle-configuration --lifecycle-configuration file://ibm-cos-lc.json --bucket databucket\\n
Verify that the policy is applied:
# aws --profile tiering --endpoint https://s3.cephlabs.com s3api get-bucket-lifecycle-configuration --bucket databucket\\n
We can also check that Ceph / RGW have registered this new LC policy with the following radosgw-admin
command. The status is UNINITIAL
, as this LC has never been processed; once processed, it will move to the COMPLETED
state:
# radosgw-admin lc list | jq .[1]\\n{\\n \\"bucket\\": \\":databucket:fcabdf4a-86f2-452f-a13f-e0902685c655.310403.1\\",\\n \\"shard\\": \\"lc.23\\",\\n \\"started\\": \\"Thu, 01 Jan 1970 00:00:00 GMT\\",\\n \\"status\\": \\"UNINITIAL\\"\\n}\\n
We can get further details of the rule applied to the bucket with the following command:
# radosgw-admin lc get --bucket databucket\\n{\\n \\"prefix_map\\": {\\n \\"\\": {\\n \\"status\\": true,\\n \\"dm_expiration\\": false,\\n \\"expiration\\": 0,\\n \\"noncur_expiration\\": 0,\\n \\"mp_expiration\\": 0,\\n \\"transitions\\": {\\n \\"ibm-cos\\": {\\n \\"days\\": 30\\n }\\n },\\n }\\n }\\n}\\n
Important WARNING: changing this parameter is ONLY for LC testing purposes. Do not change it on a production Ceph cluster, and remember to reset as appropriate!
We can speed up the testing of lifecycle policies by enabling a debug interval for the lifecycle process. In this setting, each "day" in the bucket lifecycle configuration is equivalent to 60 seconds, so a three-day expiration period is effectively three minutes:
# ceph config set client.rgw rgw_lc_debug_interval 60\\n# ceph orch restart rgw.default\\n
If we now run radosgw-admin lc list
we should see the lifecycle policy for our transition bucket in a completed state:
[root@ceph01 ~]# radosgw-admin lc list| jq .[1]\\n{\\n \\"bucket\\": \\":databucket:fcabdf4a-86f2-452f-a13f-e0902685c655.310403.1\\",\\n \\"shard\\": \\"lc.23\\",\\n \\"started\\": \\"Mon, 25 Nov 2024 10:43:31 GMT\\",\\n \\"status\\": \\"COMPLETE\\"\\n}\\n
If we list the objects available in the transition bucket on our on-premise cluster, we can see that the object size is 0
. This is because they have been transitioned to the cloud. However, the metadata / head of the object is still available locally because we set the retain_head_object=true
parameter when creating the cloud storage class:
# aws --profile tiering --endpoint https://s3.cephlabs.com s3 ls s3://databucket\\n2024-11-25 05:41:33 0 hosts\\n
If we check the object attributes using the s3api get-object-attributes
call we can see that the storage class for this object is now ibm-cos
, so this object has been successfully transitioned into the S3 cloud provider:
# aws --profile tiering --endpoint https://s3.cephlabs.com s3api get-object-attributes --object-attributes StorageClass ObjectSize --bucket databucket --key hosts\\n{\\n \\"LastModified\\": \\"2024-11-25T10:41:33+00:00\\",\\n \\"StorageClass\\": \\"ibm-cos\\",\\n \\"ObjectSize\\": 0\\n}\\n
If we check in IBM COS with the AWS CLI S3 client, using the endpoint and profile of the IBM COS user, we can see that the objects are available in the IBM COS bucket. Due to API limitations, the original object modification time and ETag cannot be preserved, but they are stored as metadata attributes on the destination objects.
aws --profile cos --endpoint https://s3.eu-de.cloud-object-storage.appdomain.cloud s3api head-object --bucket ceph-s3-tier --key databucket/hosts | jq .\\n{\\n \\"AcceptRanges\\": \\"bytes\\",\\n \\"LastModified\\": \\"2024-11-25T10:41:33+00:00\\",\\n \\"ContentLength\\": 304,\\n \\"ETag\\": \\"\\\\\\"01a72b8a9d073d6bcae565bd523a76c5\\\\\\"\\",\\n \\"ContentType\\": \\"binary/octet-stream\\",\\n \\"Metadata\\": {\\n \\"rgwx-source-mtime\\": \\"1732529733.944271939\\",\\n \\"rgwx-versioned-epoch\\": \\"0\\",\\n \\"rgwx-source\\": \\"rgw\\",\\n \\"rgwx-source-etag\\": \\"01a72b8a9d073d6bcae565bd523a76c5\\",\\n \\"rgwx-source-key\\": \\"hosts\\"\\n }\\n}\\n
To avoid collisions across buckets, the source bucket name is prepended to the target object name. If the object is versioned, the object version ID is appended to the end.
Below is the sample object name format:
s3://<target_path>/<source_bucket_name>/<source_object_name>(-<source_object_version_id>)\\n
Semantics similar to those of LifecycleExpiration apply to versioned and locked objects. If the object is current, after transitioning to the cloud it is made noncurrent, with a delete marker created. If the object is noncurrent and locked, its transition is skipped.
This blog covered transitioning cold data to a more cost-effective storage class using tiering and lifecycle policies and archiving it to IBM Cloud Object Storage (COS). In the next blog, we will explore how to restore archived data to the Ceph cluster when needed. We will introduce the key technical concepts and provide detailed configuration steps to help you implement cloud restore, ensuring your cold data remains accessible when required.
For more information about RGW placement targets and storage classes, visit this page
For a related take on directing data to multiple RGW storage classes, view this presentation
The authors would like to thank IBM for supporting the community by facilitating our time to create these posts.
Throughout this series of articles, we will provide hands-on examples to help you set up and configure some of the most critical replication features of the Ceph Object Storage solution. This will include the new Object Storage multisite enhancements released in the Reef release.
At a high level, these are the topics we will cover in this series:
When discussing Replication, Disaster Recovery, Backup and Restore, we have multiple strategies available that provide us with different SLAs for data and application recovery (RTO / RPO). For instance, synchronous replication provides the lowest RPO, which means zero data loss. Ceph can provide synchronous replication between sites by stretching the Ceph cluster among the data centers. On the other hand, asynchronous replication will assume a non-zero RPO. In Ceph, async multisite replication involves replicating the data to another Ceph cluster. Each Ceph storage modality (object, block, and file) has its own asynchronous replication mechanism. This blog series will cover geo-dispersed object storage multisite asynchronous replication.
Before getting our hands wet with the deployment details, let\'s begin with a quick overview of what Ceph Object Storage (RGW) provides: enterprise grade, highly mature object geo-replication capabilities. The RGW multisite replication feature facilitates asynchronous object replication across single or multi-zone deployments. Ceph Object Storage operates efficiently over WAN connections using asynchronous replication with eventual consistency.
Ceph Object Storage Multisite Replication provides many benefits for businesses that must store and manage large amounts of data across multiple locations. Here are some of the key benefits of using Ceph Object Storage Multisite Replication:
Ceph Object Storage clusters can be geographically dispersed, which improves data availability and reduces the risk of data loss due to hardware failure, natural disasters or other events. There are no network latency requirements as we are doing eventually consistent async replication.
Replication is Active/Active for data (object) access. Multiple end users can simultaneously read/write from/to their closest RGW (S3) endpoint location. In other words, the replication is bidirectional. This enables users to access data more quickly and reduce downtime.
Notably, only the designated master zone in the zonegroup accepts metadata updates. For example, when creating users and buckets, all metadata modifications on non-master zones will be forwarded to the configured master. If the master fails, a manual master zone failover must be triggered.
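For context, a manual failover typically consists of promoting the surviving zone to master and committing the period; a hedged sketch, run against the surviving zone's cluster, looks like this:

# Promote the secondary zone (zone2 in our naming) to master and default
radosgw-admin zone modify --rgw-zone=zone2 --master --default
# Commit the updated period so the change propagates
radosgw-admin period update --commit
# Restart the RGW services so they pick up the new period (service name is a placeholder)
ceph orch restart rgw.<rgw_service_name>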
With multisite replication, businesses can quickly scale their storage infrastructure by adding new sites or clusters. This allows businesses to store and manage large amounts of data without worrying about running out of storage capacity or performance.
A Ceph Object Storage multisite cluster consists of realms, zonegroups, and zones:
A realm defines a global namespace across multiple Ceph storage clusters
Zonegroups can have one or more zones
Next, we have zones. These are the lowest level of the Ceph multisite configuration, and they’re represented by one or more object gateways within a single Ceph cluster.
As you can see in the following diagram, Ceph Object Storage multisite replication happens at the zone level. We have a single realm and two zonegroups. The realm's global object namespace ensures unique object IDs across zonegroups and zones.
Each bucket is owned by the zone group where it was created, and its object data will only be replicated to other zones in that zonegroup. Any request for data in that bucket sent to other zonegroups will be redirected to the zonegroup where the bucket resides.
Within a Ceph Object Storage cluster, you can have one or more realms. Each realm is an independent global object namespace, meaning each realm will have its own sets of users, buckets, and objects. For example, you can't have two buckets with the same name within a single realm. In Ceph Object Storage, there is also the concept of tenants to isolate S3 namespaces, but that discussion is out of our scope here. You can find more information on this page
The following diagram shows an example where we have two different Realms, thus two independent namespaces. Each realm has its zonegroup and replication zones.
Each zone represents a Ceph cluster, and you can have one or more zones in a zonegroup. Multisite replication, when configured, will happen between zones. In this series of blogs, we will configure only two zones in a zone group, but you can configure a larger number of replicated zones in a single zonegroup.
With the latest 6.1 release, Ceph Object Storage introduces the “Multisite Sync Policy” feature, which provides granular bucket-level replication, gives the user greater flexibility and reduced costs, and unlocks an array of valuable replication features:
Users can enable or disable sync per individual bucket, enabling precise control over replication workflows.
Full-zone replication while opting out to replicate specific buckets
Replicating a single source bucket with multi-destination buckets
Implementing symmetrical and directional data flow configurations per bucket
The following diagram shows an example of the sync policy feature in action.
As part of the Quincy release, a new Ceph Manager module called rgw
was added to the ceph orchestrator cephadm
. The rgw
manager module makes the configuration of multisite replication straightforward. This section will show you how to configure Ceph Object Storage multisite replication between two zones (each zone is an independent Ceph Cluster) through the CLI using the new rgw
manager module.
NOTE: In Reef and later releases, multisite configuration can also be performed using the Ceph UI/Dashboard. We don’t use the UI in this guide, but if you are interested, you can find more information here.
In our setup, we are going to configure our multisite replication with the following logical layout: we have a realm called multisite
, and this realm contains a single zonegroup called multizg
. Inside the zonegroup, we have two zones, named zone1
and zone2
. Each zone represents a Ceph cluster in a geographically distributed datacenter. The following diagram is a logical representation of our multisite configuration.
As this is a lab deployment, this is a downsized example. Each Ceph cluster comprises four nodes with six OSDs each. We configure four RGW services (one per node) for each cluster. Two RGWs will serve S3 client requests, and the remaining RGW services will be responsible for multisite replication operations. Ceph Object Storage multisite replication data is transmitted to the other site through the RGW services using the HTTP protocol. The advantage of this is that at the networking layer, we only need to enable/allow HTTP communication between the Ceph clusters (zones) for which we want to configure multisite.
The following diagram shows the final architecture we will be configuring step by step.
In our example, we will terminate the SSL connection from the client at the per-site load balancer level. The RGW services will use plain HTTP for all the involved endpoints.
When configuring TLS/SSL, we can terminate the encrypted connection from the client to the S3 endpoint at the load balancer level or at the RGW service level. It is possible to do both, re-encrypting the connection from the load balancer to the RGWs, though that scenario is not currently supported by the Ceph ingress service.
The second post will enumerate the steps to establish multisite replication between our Ceph clusters, as depicted in the following diagram.
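As a quick preview of what part two will walk through, the rgw manager module lets us bootstrap the realm, zonegroup, and primary zone from a small spec file. The following is only a rough sketch based on the layout described above; the file name, exact spec fields, and the realm-token exchange used to join zone2 depend on your release, so treat it as illustrative rather than definitive:

# On the first cluster (zone1): create realm/zonegroup/zone and deploy the RGWs
# cat << EOF > /root/rgw-zone1.yaml
rgw_realm: multisite
rgw_zonegroup: multizg
rgw_zone: zone1
placement:
  hosts:
    - ceph-node-00.cephlab.com
    - ceph-node-01.cephlab.com
spec:
  rgw_frontend_port: 8000
EOF
# ceph rgw realm bootstrap -i /root/rgw-zone1.yaml
# Print the realm token that the second cluster will use when creating zone2
# ceph rgw realm tokens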
But before starting the configuration of Ceph Object Storage multisite replication, we need to provide a bit more context regarding our initial state. We have two Ceph clusters deployed, the first cluster with nodes ceph-node-00
to ceph-node-03
and the second cluster with nodes from ceph-node-04
to ceph-node-07
.
[root@ceph-node-00 ~]# ceph orch host ls\\nHOST ADDR LABELS STATUS\\nceph-node-00.cephlab.com 192.168.122.12 _admin,osd,mon,mgr\\nceph-node-01.cephlab.com 192.168.122.179 osd,mon,mgr\\nceph-node-02.cephlab.com 192.168.122.94 osd,mon,mgr\\nceph-node-03.cephlab.com 192.168.122.180 osd\\n4 hosts in cluster\\n[root@ceph-node-04 ~]# ceph orch host ls\\nHOST ADDR LABELS STATUS\\nceph-node-04.cephlab.com 192.168.122.138 _admin,osd,mon,mgr\\nceph-node-05.cephlab.com 192.168.122.175 osd,mon,mgr\\nceph-node-06.cephlab.com 192.168.122.214 osd,mon,mgr\\nceph-node-07.cephlab.com 192.168.122.164 osd\\n4 hosts in cluster\\n
The core Ceph services have been deployed, plus Ceph's observability stack, but there is no RGW service deployed. Ceph services are running containerized on RHEL with the help of Podman.
[root@ceph-node-00 ~]# ceph orch ls\\nNAME PORTS RUNNING REFRESHED AGE PLACEMENT \\nalertmanager ?:9093,9094 1/1 6m ago 3w count:1 \\nceph-exporter 4/4 6m ago 3w * \\ncrash 4/4 6m ago 3w * \\ngrafana ?:3000 1/1 6m ago 3w count:1 \\nmgr 3/3 6m ago 3w label:mgr \\nmon 3/3 6m ago 3w label:mon \\nnode-exporter ?:9100 4/4 6m ago 3w * \\nosd.all-available-devices 4 6m ago 3w label:osd \\nprometheus ?:9095 1/1 6m ago 3w count:1\\n[root@ceph-node-00 ~]# ceph version\\nceph version 18.2.0-131.el9cp (d2f32f94f1c60fec91b161c8a1f200fca2bb8858) reef (stable)\\n[root@ceph-node-00 ~]# podman inspect cp.icr.io/cp/ibm-ceph/ceph-7-rhel9 | jq .[].Labels.summary\\n\\"Provides the latest IBM Storage Ceph 7 in a fully featured and supported base image.\\"\\n# cat /etc/redhat-release \\nRed Hat Enterprise Linux release 9.2 (Plow)\\n
As a recap, in part one of this multisite series, we have gone through an overview of Ceph Object Storage multisite replication features and architecture, setting the stage to start configuring the multisite replication in part two of the series.
The authors would like to thank IBM for supporting the community by facilitating our time to create these posts.
In the era of big data, managing vast amounts of storage efficiently and reliably is a critical challenge for enterprises. Ceph has become a leading software defined storage solution known for its flexibility, scalability, and robustness. Building on this foundation, Ceph elevates these capabilities, offering seamless integration with enterprise environments and advanced tools for efficiently managing petabytes of data.
This blog post series will delve into the automated deployment of Ceph clusters using Ceph's state-of-the-art orchestrator, cephadm
. Additionally, for those automating their infrastructure with Ansible, we will share an example using an Infrastructure-as-Code approach with the help of Jinja2 templates and Ansible.
Infrastructure as Code (IaC) revolutionizes infrastructure management by treating infrastructure setups as code. This allows us to apply software development practices such as version control, testing, and continuous integration to infrastructure management, reducing the risk of errors and speeding up deployment and scaling.
With Ceph, tools like Ansible and cephadm
are perfect examples of IaC in action. They allow administrators to define the desired state of their Ceph clusters in code, making it easier to deploy, manage, and scale these clusters across different environments.
As Ceph became more popular and clusters rapidly grew, the need for an effective orchestration tool became increasingly critical. Over the years, several tools have been developed to simplify and automate the deployment and management of Ceph clusters. Let’s take a brief look at them:
ceph-deploy
was one of the first tools introduced to ease the deployment of Ceph clusters. As a lightweight command-line utility, ceph-deploy
allowed administrators to quickly set up a basic Ceph cluster by automating many manual steps in configuring Ceph daemons like MONs, OSDs, and MGRs.
ceph-ansible
marked a significant step forward by integrating Ceph deployment with Ansible, a popular open-source automation tool. This approach embraced the principles of Infrastructure as Code (IaC), allowing administrators to define the entire Ceph cluster configuration in Ansible playbooks.
cephadm
The current bundled Ceph orchestrator, which we will cover in detail in the next section.
Unlike its predecessors, cephadm
deploys all Ceph daemons as containers using Docker or Podman. This containerized approach ensures consistency across different environments and simplifies the management of dependencies, making deploying, upgrading, and scaling Ceph clusters easier.
Cephadm's use of a declarative spec file to define the cluster's desired state marks a significant improvement in how Ceph clusters are managed. Administrators can now describe their entire cluster configuration in advance, and Cephadm continuously ensures that the cluster matches this desired state. This process is also known as convergence.
In addition to its powerful deployment capabilities, Cephadm integrates with the Ceph Dashboard, provides built-in monitoring and alerting, and supports automated upgrades, making it the most comprehensive and user-friendly orchestration tool in the Ceph ecosystem to date.
Modern IT environments increasingly require repeatedly deploying and scaling storage clusters across different environments: development, testing, and production. This is where Cephadm comes to the rescue. By automating the deployment and management of Ceph clusters, Cephadm eliminates the manual, error-prone processes traditionally involved in setting up distributed storage systems.
cephadm
’s use of a declarative service spec file allows administrators to define the entire configuration of a Ceph cluster in a single, reusable file that is amenable to revision control. This spec file can describe everything from the placement of OSDs, Monitors, and Managers to the setup and configuration of File, Block, and Object Services. By applying this spec file, cephadm
can automatically converge the cluster to match the desired state, ensuring consistency across multiple deployments.
It’s important to note that Cephadm covers the deployment and lifecycle of Ceph cluster services. Still, not all day-two operations of specific services, like creating a Ceph Object Storage (RGW) user, are currently covered by Cephadm.
cephadm
fits perfectly into an Infrastructure as Code (IaC) paradigm. IaC treats infrastructure configurations like software code, storing them in version control, automating their application, and enabling continuous delivery pipelines. With cephadm
, the spec file acts as the code that defines your storage infrastructure.
For example, you could store your cephadm
spec files in a version control system like Git with optional CI/CD pipelines. When changes are made to the cluster configuration, they are committed and pushed, triggering automated pipelines that deploy or update the Ceph cluster based on the updated spec file. This approach streamlines deployments and ensures that your storage infrastructure is always in sync with your application and service needs.
Note that specific cephadm
configuration changes require restarting the corresponding service, which must be coordinated with an automation tool for the changes to take effect once applied.
Below is an example of a Cephadm spec file that enables a complete Ceph cluster deployment during the bootstrap process. This basic example is designed to get you started; a production deployment would require further customization of the spec file.
service_type: host\\nhostname: ceph1\\naddr: 10.10.0.2\\nlocation:\\n root: default\\n datacenter: DC1\\nlabels:\\n- osd\\n- mon\\n- mgr\\n---\\nservice_type: host\\nhostname: ceph2\\naddr: 10.10.0.3\\nlocation:\\n datacenter: DC1\\nlabels:\\n- osd\\n- mon\\n- rgw\\n---\\nservice_type: host\\nhostname: ceph3\\naddr: 10.10.0.4\\nlocation:\\n datacenter: DC1\\nlabels:\\n- osd\\n- mds\\n- mon\\n---\\nservice_type: mon\\nplacement:\\n label: \\"mon\\"\\n---\\nservice_type: mgr\\nservice_name: mgr\\nplacement:\\n label: \\"mgr\\"\\n---\\nservice_type: osd\\nservice_id: all-available-devices\\nservice_name: osd.all-available-devices\\nspec:\\n data_devices:\\n all: true\\n limit: 1\\nplacement:\\n label: \\"osd\\"\\n---\\nservice_type: rgw\\nservice_id: objectgw\\nservice_name: rgw.objectgw\\nplacement:\\n count: 2\\n label: \\"rgw\\"\\nspec:\\n rgw_frontend_port: 8080\\n rgw_frontend_extra_args:\\n - \\"tcp_nodelay=1\\"\\n---\\nservice_type: ingress\\nservice_id: rgw.external-traffic\\nplacement:\\n label: \\"rgw\\"\\nspec:\\n backend_service: rgw.objectgw\\n virtual_ips_list:\\n - 172.18.8.191/24\\n - 172.18.8.192/24\\n frontend_port: 8080\\n monitor_port: 1967\\n
The first service_type
is host
, which enumerates all hosts in the cluster, including their hostnames and IP addresses. The location
field indicates the host's position within the Ceph CRUSH topology, a hierarchical structure that informs data placement and retrieval across the cluster. Check out this document for more info.
By setting specific labels on the host, Cephadm can efficiently schedule and deploy containerized Ceph services on desired nodes, with a given node sometimes having more than one label and thus hosting more than one Ceph service. This ensures resource isolation and reduces the number of nodes required, optimizing resource usage and cutting costs in production environments.
Note that the hosts we are adding to the cluster need a set of prerequisites configured to successfully join the Ceph cluster. This blog series will also cover automating the deployment of these prerequisites.
service_type: host\\nhostname: ceph1\\naddr: 10.10.0.2\\nlocation:\\n root: default\\n datacenter: DC1\\nlabels:\\n- osd\\n- mon\\n- mgr\\n
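As for the prerequisites mentioned above, the automation will be covered later in the series. As a rough, hypothetical sketch, preparing a new host and adding it to the cluster usually boils down to something like this (the package names are an assumption and vary per distribution):

# On the new host: container runtime, time sync, LVM, and Python
# dnf install -y podman chrony lvm2 python3
# systemctl enable --now chronyd

# From a node with the admin keyring: distribute the cluster SSH key, then add the host
# ceph cephadm get-pub-key > ~/ceph.pub
# ssh-copy-id -f -i ~/ceph.pub root@ceph2
# ceph orch host add ceph2 10.10.0.3
# ceph orch host label add ceph2 osd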
After that, we have the Monitor and Manager service deployments. We have a simple configuration for these, using only the placement parameter
. With the placement
parameter, we tell cephadm
that it can deploy the Monitor service on any host with the mon
label.
---\\nservice_type: mon\\nplacement:\\n label: \\"mon\\"\\n---\\nservice_type: mgr\\nservice_name: mgr\\nplacement:\\n label: \\"mgr\\"\\n
Next, we have the osd
service type. The cephadm
OSD service type is incredibly flexible: it allows you to define almost any OSD configuration you can imagine. For full details on the OSD service spec, check out this document
In our example, we take one of the most straightforward approaches possible: we tell Cephadm to use as OSDs all free/usable media devices available on the hosts, again using the placement parameter. It will only configure OSD devices on nodes that have the osd
label.
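For reference, this is the OSD section from the full spec file shown earlier; the limit filter caps how many matching devices are consumed per host:

service_type: osd
service_id: all-available-devices
service_name: osd.all-available-devices
spec:
  data_devices:
    all: true
    limit: 1
placement:
  label: "osd"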
In this next section, we configure the cephfs
service to deploy the Ceph shared file system, including metadata service (MDS) daemons. For all service spec configuration options, check out this document
Finally, we populate the rgw
service type to set up Ceph Object Gateway (RGW) services. The RGW services provide an S3- and Swift-compatible HTTP RESTful endpoint for clients. In this example, in the placement section we set the count of RGW services to 2
. This means that the Cephadm scheduler will look to schedule two RGW daemons on two available hosts that have the rgw
label set. The tcp_nodelay=1
frontend option disables Nagle's algorithm, which can improve latency for RGW operations on small objects.
---\\nservice_type: mds\\nservice_id: cephfs\\nplacement:\\n count: 2\\n label: \\"mds\\"\\n---\\nservice_type: rgw\\nservice_id: objectgw\\nservice_name: rgw.objectgw\\nplacement:\\n count: 2\\n label: \\"rgw\\"\\nrgw_realm: \\nrgw_zone: \\nrgw_zonegroup: \\nspec:\\n rgw_frontend_port: 8080\\n rgw_frontend_extra_args:\\n - \\"tcp_nodelay=1\\"\\n
Ceph also provides an out-of-the-box load balancer based on haproxy
and keepalived
called the ingress service, a term that may be familiar to Kubernetes admins. In this example, we use the ingress service to balance client S3 requests among the RGW daemons running in our cluster, providing the object service with HA and load balancing. Detailed information is here
We use the rgw
label to colocate the haproxy
/keepalived
daemons with RGW services. We then set the list of floating Virtual IP addresses (VIPs) that the clients will use to access the S3 endpoint API with the virtual_ips_list
spec parameter.
---\\nservice_type: ingress\\nservice_id: rgw.external-traffic\\nplacement:\\n label: \\"rgw\\"\\nspec:\\n backend_service: rgw.objectgw\\n virtual_ips_list:\\n - 172.18.8.191/24\\n - 172.18.8.192/24\\n frontend_port: 8080\\n monitor_port: 1967\\n
Once we have all the service specs defined and ready, we need to pass the spec file to the cephadm
bootstrap command to get our cluster deployed and configured as we have described in our file. Here is an example of the bootstrap command using the --apply-spec
parameter to pass our cluster specification file:
# cephadm bootstrap \\\\\\n --registry-json /root/registry.json \\\\\\n --dashboard-password-noupdate \\\\\\n --ssh-user=cephadm \\\\\\n --mon-ip \\\\\\n --apply-spec /root/cluster-spec.yaml\\n
In the next installment of this series, we’ll explore how to leverage Jinja2 (J2) templating and Ansible in tandem with cephadm
service spec files. This approach will demonstrate how to build an Infrastructure as Code (IaC) framework for Ceph cluster deployments, facilitating a streamlined Continuous Delivery (CI/CD) pipeline with Git as the single source of truth for Ceph configuration management.
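As a small, hypothetical taste of that approach before the next post, the host entries from the spec above could be generated from an Ansible inventory with a Jinja2 template along these lines:

# cluster-spec.yaml.j2 (hypothetical template rendered by Ansible)
{% for host in ceph_hosts %}
---
service_type: host
hostname: {{ host.name }}
addr: {{ host.addr }}
labels:
{% for label in host.labels %}
- {{ label }}
{% endfor %}
{% endfor %}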
The authors would like to thank IBM for supporting the community by facilitating our time to create these posts.
In the previous post in this series, we discussed Multisite Sync Policy and shared hands-on examples of granular bucket bi-directional replication. In today's blog, part six, we will configure additional multisite sync policies, including unidirectional replication with one source to many destination buckets.
In the previous article, we explored a bucket sync policy with a bidirectional configuration. Now let's explore an example of how to enable unidirectional sync between two buckets. Again, to give a bit of context, we currently have our zonegroup sync policy set to allowed
, and a bidirectional flow configured at the zonegroup
level. With the zonegroup sync policy allowing us to configure replication at per-bucket granularity, we can start with our unidirectional replication configuration.
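For context, that zonegroup-wide policy was created in the previous post with commands along these lines (reproduced only as a reminder; the group and flow IDs shown here are placeholders):

# radosgw-admin sync group create --group-id=group1 --status=allowed
# radosgw-admin sync group flow create --group-id=group1 --flow-id=flow-mirror --flow-type=symmetrical --zones=zone1,zone2
# radosgw-admin sync group pipe create --group-id=group1 --pipe-id=pipe1 --source-zones='*' --source-bucket='*' --dest-zones='*' --dest-bucket='*'
# radosgw-admin period update --commit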
We create the unidirectional bucket, then create a sync group with the id unidirectiona-1
, then set the status to Enabled
. When we set the status of the sync group policy to enabled
, replication will begin once the pipe has been applied to the bucket.
[root@ceph-node-00 ~]# aws --endpoint https://object.s3.zone1.dan.ceph.blue:443 s3 mb s3://unidirectional\\nmake_bucket: unidirectional\\n[root@ceph-node-00 ~]# radosgw-admin sync group create --bucket=unidirectional --group-id=unidirectiona-1 --status=enabled\\n
Once the sync group is in place, we need to create a pipe for our bucket. In this example, we specify the source and destination zones: the source will be zone1
and the destination zone2
. In this way, we create a unidirectional replication pipe for bucket unidirectional
with data replicated only in one direction: zone1
—> zone2
.
[root@ceph-node-00 ~]# radosgw-admin sync group pipe create --bucket=unidirectional --group-id=unidirectiona-1 --pipe-id=test-pipe1 --source-zones=\'zone1\' --dest-zones=\'zone2\'\\n
With sync info
, we can check the flow of bucket replication. You can see that the sources field is empty as we are running the command from a node in zone1
, and we are not receiving data from an external source. After all, from the zone where we are running the command, we are doing unidirectional replication, so we are sending data to a destination. We can see that the source is zone1
, and the destination is zone2
for the unidirectional
bucket.
[root@ceph-node-00 ~]# radosgw-admin sync info --bucket unidirectional\\n{\\n \\"sources\\": [],\\n \\"dests\\": [\\n {\\n \\"id\\": \\"test-pipe1\\",\\n \\"source\\": {\\n \\"zone\\": \\"zone1\\",\\n \\"bucket\\": \\"unidirectional:89c43fae-cd94-4f93-b21c-76cd1a64788d.34955.1\\"\\n },\\n \\"dest\\": {\\n \\"zone\\": \\"zone2\\",\\n \\"bucket\\": \\"unidirectional:89c43fae-cd94-4f93-b21c-76cd1a64788d.34955.1\\"\\n….\\n}\\n
When we run the same command in zone2
, we see the same information, but now the sources field shows that we are receiving data from zone1
. The unidirectional bucket in zone2
is not sending out any replication data, which is why the destination field is empty in the output of the sync info
command.
[root@ceph-node-04 ~]# radosgw-admin sync info --bucket unidirectional\\n{\\n \\"sources\\": [\\n {\\n \\"id\\": \\"test-pipe1\\",\\n \\"source\\": {\\n \\"zone\\": \\"zone1\\",\\n \\"bucket\\": \\"unidirectional:66df8c0a-c67d-4bd7-9975-bc02a549f13e.36430.1\\"\\n },\\n \\"dest\\": {\\n \\"zone\\": \\"zone2\\",\\n \\"bucket\\": \\"unidirectional:66df8c0a-c67d-4bd7-9975-bc02a549f13e.36430.1\\"\\n },\\n….\\n \\"dests\\": [],\\n
Once we have our configuration ready for action, let’s do some checking to see that everything is working as expected. Let’s PUT three files to zone1
:
[root@ceph-node-00 ~]# for i in {1..3}; do aws --endpoint https://object.s3.zone1.dan.ceph.blue:443 s3 cp /etc/hosts s3://unidirectional/fil${i}; done\\nupload: ../etc/hosts to s3://unidirectional/fil1\\nupload: ../etc/hosts to s3://unidirectional/fil2\\nupload: ../etc/hosts to s3://unidirectional/fil3\\n
We can check that they have been synced to zone2
:
[root@ceph-node-00 ~]# aws --endpoint https://object.s3.zone2.dan.ceph.blue:443 s3 ls s3://unidirectional/\\n2024-02-02 17:56:09 233 fil1\\n2024-02-02 17:56:10 233 fil2\\n2024-02-02 17:56:11 233 fil3\\n
Now let’s check what happens when we PUT an object to zone2
. We shouldn’t see the file replicated to zone1
, as our replication configuration for the bucket is unidirectional.
[root@ceph-node-00 ~]# aws --endpoint https://object.s3.zone2.dan.ceph.blue:443 s3 cp /etc/hosts s3://unidirectional/fil4\\nupload: ../etc/hosts to s3://unidirectional/fil4\\n[root@ceph-node-00 ~]# aws --endpoint https://object.s3.zone2.dan.ceph.blue:443 s3 ls s3://unidirectional/\\n2024-02-02 17:56:09 233 fil1\\n2024-02-02 17:56:10 233 fil2\\n2024-02-02 17:56:11 233 fil3\\n2024-02-02 17:57:49 233 fil4\\n
Checking zone1 after a while, we can see that the file is not there, meaning it did not get replicated from zone2, as expected.
[root@ceph-node-00 ~]# aws --endpoint https://object.s3.zone1.dan.ceph.blue:443 s3 ls s3://unidirectional\\n2024-02-02 17:56:09 233 fil1\\n2024-02-02 17:56:10 233 fil2\\n2024-02-02 17:56:11 233 fil3\\n
In this example we are going to modify the previous unidirectional sync policy by adding a new replication target bucket named backupbucket
. Once we set the sync policy, every object uploaded to bucket unidirectional
in zone1
will be replicated to buckets unidirectional
and backupbucket
in zone2
.
To get started, let's create the bucket backupbucket
:
[root@ceph-node-00 ~]# aws --endpoint https://object.s3.zone2.dan.ceph.blue:443 s3 mb s3://backupbucket\\nmake_bucket: backupbucket\\n
We will add to our existing sync group policy a new pipe that targets backupbucket
. We created the group sync policy in our previous unidirectional
example.
Again, we specify the source and destination zones, so our sync will be unidirectional. The main difference is that now we are specifying a destination bucket named backupbucket
with the --dest-bucket
parameter.
[root@ceph-node-00 ~]# radosgw-admin sync group pipe create --bucket=unidirectional --group-id=unidirectiona-1 --pipe-id=test-pipe2 --source-zones=\'zone1\' --dest-zones=\'zone2\' --dest-bucket=backupbucket\\n
Again, let's check the sync info output, which shows us a representation of the replication flow we have configured. The sources field is empty because in zone1
we are not receiving data from any other source. In destinations we now have two different pipes
. The first test-pipe1
we created in our previous example. The second pipe has backupbucket
set as the replication destination in zone2
.
[root@ceph-node-00 ~]# radosgw-admin sync info --bucket unidirectional\\n{\\n \\"sources\\": [],\\n \\"dests\\": [\\n {\\n \\"id\\": \\"test-pipe1\\",\\n \\"source\\": {\\n \\"zone\\": \\"zone1\\",\\n \\"bucket\\": \\"unidirectional:66df8c0a-c67d-4bd7-9975-bc02a549f13e.36430.1\\"\\n },\\n \\"dest\\": {\\n \\"zone\\": \\"zone2\\",\\n \\"bucket\\": \\"unidirectional:66df8c0a-c67d-4bd7-9975-bc02a549f13e.36430.1\\"\\n },\\n \\"params\\": {\\n \\"source\\": {\\n \\"filter\\": {\\n \\"tags\\": []\\n }\\n },\\n \\"dest\\": {},\\n \\"priority\\": 0,\\n \\"mode\\": \\"system\\",\\n \\"user\\": \\"user1\\"\\n }\\n },\\n {\\n \\"id\\": \\"test-pipe2\\",\\n \\"source\\": {\\n \\"zone\\": \\"zone1\\",\\n \\"bucket\\": \\"unidirectional:66df8c0a-c67d-4bd7-9975-bc02a549f13e.36430.1\\"\\n },\\n \\"dest\\": {\\n \\"zone\\": \\"zone2\\",\\n \\"bucket\\": \\"backupbucket\\"\\n },\\n \\"params\\": {\\n \\"source\\": {\\n \\"filter\\": {\\n \\"tags\\": []\\n }\\n },\\n \\"dest\\": {},\\n \\"priority\\": 0,\\n \\"mode\\": \\"system\\",\\n \\"user\\": \\"user1\\"\\n }\\n }\\n ],\\n \\"hints\\": {\\n \\"sources\\": [],\\n \\"dests\\": [\\n \\"backupbucket\\"\\n ]\\n },\\n
Let's check it out: from our previous example, we had zone1
with three files:
[root@ceph-node-00 ~]# aws --endpoint https://object.s3.zone1.dan.ceph.blue:443 s3 ls s3://unidirectional/\\n2024-02-02 17:56:09 233 fil1\\n2024-02-02 17:56:10 233 fil2\\n2024-02-02 17:56:11 233 fil3\\n
In zone2
we have four files; fil4
will not be replicated to zone1
because replication is unidirectional.
[root@ceph-node-00 ~]# aws --endpoint https://object.s3.zone2.dan.ceph.blue:443 s3 ls s3://unidirectional/\\n2024-02-02 17:56:09 233 fil1\\n2024-02-02 17:56:10 233 fil2\\n2024-02-02 17:56:11 233 fil3\\n2024-02-02 17:57:49 233 fil4\\n
Let's add three more files to zone1
. We expect these to be replicated to the unidirectional
bucket and backupbucket
in zone2
:
[root@ceph-node-00 ~]# for i in {5..7}; do aws --endpoint https://object.s3.zone1.dan.ceph.blue:443 s3 cp /etc/hosts s3://unidirectional/fil${i}; done\\nupload: ../etc/hosts to s3://unidirectional/fil5\\nupload: ../etc/hosts to s3://unidirectional/fil6\\nupload: ../etc/hosts to s3://unidirectional/fil7\\n[root@ceph-node-00 ~]# aws --endpoint https://object.s3.zone1.dan.ceph.blue:443 s3 ls s3://unidirectional\\n2024-02-02 17:56:09 233 fil1\\n2024-02-02 17:56:10 233 fil2\\n2024-02-02 17:56:11 233 fil3\\n2024-02-02 18:03:51 233 fil5\\n2024-02-02 18:04:37 233 fil6\\n2024-02-02 18:09:08 233 fil7\\n[root@ceph-node-00 ~]# aws --endpoint http://object.s3.zone2.dan.ceph.blue:80 s3 ls s3://unidirectional\\n2024-02-02 17:56:09 233 fil1\\n2024-02-02 17:56:10 233 fil2\\n2024-02-02 17:56:11 233 fil3\\n2024-02-02 17:57:49 233 fil4\\n2024-02-02 18:03:51 233 fil5\\n2024-02-02 18:04:37 233 fil6\\n2024-02-02 18:09:08 233 fil7\\n[root@ceph-node-00 ~]# aws --endpoint http://object.s3.zone2.dan.ceph.blue:80 s3 ls s3://backupbucket\\n2024-02-02 17:56:09 233 fil1\\n2024-02-02 17:56:10 233 fil2\\n2024-02-02 17:56:11 233 fil3\\n2024-02-02 18:03:51 233 fil5\\n2024-02-02 18:04:37 233 fil6\\n2024-02-02 18:09:08 233 fil7\\n
Excellent, everything is working as expected. We have all objects replicated to all buckets, except fil4
. This is expected as the file was uploaded to zone2
, and our replication is unidirectional, so there is no sync from zone2
to zone1
.
What will sync info
tell us if we query backupbucket
? This bucket is referenced only in another bucket policy, but bucket backupbucket
doesn't have a sync policy of its own:
[root@ceph-node-00 ~]# ssh ceph-node-04 radosgw-admin sync info --bucket backupbucket\\n{\\n \\"sources\\": [],\\n \\"dests\\": [],\\n \\"hints\\": {\\n \\"sources\\": [\\n \\"unidirectional:66df8c0a-c67d-4bd7-9975-bc02a549f13e.36430.1\\"\\n ],\\n \\"dests\\": []\\n },\\n \\"resolved-hints-1\\": {\\n \\"sources\\": [\\n {\\n \\"id\\": \\"test-pipe2\\",\\n \\"source\\": {\\n \\"zone\\": \\"zone1\\",\\n \\"bucket\\": \\"unidirectional:66df8c0a-c67d-4bd7-9975-bc02a549f13e.36430.1\\"\\n },\\n \\"dest\\": {\\n \\"zone\\": \\"zone2\\",\\n \\"bucket\\": \\"backupbucket\\"\\n },\\n
For this situation, RGW uses hints: even though backupbucket is not directly involved in the unidirectional
bucket's sync policy, it is referenced by a hint.
Note that in the output, we have resolved hints, which means that the bucket backupbucket
found out about bucket unidirectional
syncing to it indirectly, and not from its own policy: the policy for backupbucket
itself is empty.
One important consideration that can be a bit confusing is that metadata is always synced to the other zone independently of the bucket sync policy, so every user and bucket, even if not configured for replication, will show up in all the zones that belong to the zonegroup.
Just as an example, let's create a new bucket called newbucket
:
[root@ceph-node-00 ~]# aws --endpoint http://object.s3.zone2.dan.ceph.blue:80 s3 mb s3://newbucket\\nmake_bucket: newbucket\\n
We confirm that this bucket doesn’t have any replication configured:
[root@ceph-node-00 ~]# radosgw-admin bucket sync checkpoint --bucket newbucket\\nSync is disabled for bucket newbucket\\n
But all metadata syncs to the secondary zone so that the bucket will appear in zone2
. In any case, the data inside the bucket won’t be replicated.
[root@ceph-node-00 ~]# aws --endpoint https://object.s3.zone2.dan.ceph.blue:443 s3 ls | grep newbucket\\n2024-02-02 02:22:31 newbucket\\n
Another thing to notice is that objects uploaded before a sync policy is configured for a bucket won’t get synced to the other zone until we upload an object after enabling the bucket sync. In the following example, the bucket catches up once we upload a new object after the sync policy is in place:
[root@ceph-node-00 ~]# aws --endpoint https://object.s3.zone1.dan.ceph.blue:443 s3 ls s3://objectest1/\\n2024-02-02 04:03:47 233 file1\\n2024-02-02 04:03:50 233 file2\\n2024-02-02 04:03:53 233 file3\\n2024-02-02 04:27:19 233 file4\\n\\n[root@ceph-node-00 ~]# ssh ceph-node-04 radosgw-admin bucket sync checkpoint --bucket objectest1\\n2024-02-02T04:17:15.596-0500 7fc00c51f800 1 waiting to reach incremental sync..\\n2024-02-02T04:17:17.599-0500 7fc00c51f800 1 waiting to reach incremental sync..\\n2024-02-02T04:17:19.601-0500 7fc00c51f800 1 waiting to reach incremental sync..\\n2024-02-02T04:17:21.603-0500 7fc00c51f800 1 waiting to reach incremental sync..\\n\\n[root@ceph-node-00 ~]# aws --endpoint https://object.s3.zone1.dan.ceph.blue:443 s3 cp /etc/hosts s3://objectest1/file4\\nupload: ../etc/hosts to s3://objectest1/file4\\n[root@ceph-node-00 ~]# radosgw-admin bucket sync checkpoint --bucket objectest1\\n2024-02-02T04:27:29.975-0500 7fce4cf11800 1 bucket sync caught up with source:\\n local status: [00000000001.569.6, , 00000000001.47.6, , , , 00000000001.919.6, 00000000001.508.6, , , ]\\n remote markers: [00000000001.569.6, , 00000000001.47.6, , , , 00000000001.919.6, 00000000001.508.6, , , ]\\n[root@ceph-node-00 ~]# aws --endpoint https://object.s3.zone2.dan.ceph.blue:443 s3 ls s3://objectest1\\n2024-02-02 04:03:47 233 file1\\n2024-02-02 04:03:50 233 file2\\n2024-02-02 04:03:53 233 file3\\n2024-02-02 04:27:19 233 file4\\n
Objects created, modified, or deleted when the bucket sync policy was in the allowed
or forbidden
states will not automatically sync when the policy is enabled again.
We need to run the bucket sync run
command to sync these objects and get the bucket in both zones in sync. For example, we disable sync for bucket objectest1
, and PUT a couple of objects in zone1
that aren't replicated to zone2
even after we enable the replication again.
[root@ceph-node-00 ~]# radosgw-admin sync group create --bucket=objectest1 --group-id=objectest1-1 --status=forbidden\\n[root@ceph-node-00 ~]# aws --endpoint https://object.s3.zone1.dan.ceph.blue:443 s3 cp /etc/hosts s3://objectest1/file5\\nupload: ../etc/hosts to s3://objectest1/file5\\n[root@ceph-node-00 ~]# aws --endpoint https://object.s3.zone1.dan.ceph.blue:443 s3 cp /etc/hosts s3://objectest1/file6\\nupload: ../etc/hosts to s3://objectest1/file6\\n[root@ceph-node-00 ~]# radosgw-admin sync group create --bucket=objectest1 --group-id=objectest1-1 --status=enabled\\n[root@ceph-node-00 ~]# aws --endpoint http://object.s3.zone2.dan.ceph.blue:80 s3 ls s3://objectest1\\n2024-02-02 04:03:47 233 file1\\n2024-02-02 04:03:50 233 file2\\n2024-02-02 04:03:53 233 file3\\n2024-02-02 04:27:19 233 file4\\n[root@ceph-node-00 ~]# aws --endpoint https://object.s3.zone1.dan.ceph.blue:443 s3 ls s3://objectest1/\\n2024-02-02 04:03:47 233 file1\\n2024-02-02 04:03:50 233 file2\\n2024-02-02 04:03:53 233 file3\\n2024-02-02 04:27:19 233 file4\\n2024-02-02 04:44:45 233 file5\\n2024-02-02 04:45:38 233 file6\\n
To get the buckets back in sync, we use the radosgw-admin bucket sync run
command from the destination zone.
[root@ceph-node-00 ~]# ssh ceph-node-04 radosgw-admin bucket sync run --source-zone zone1 --bucket objectest1\\n[root@ceph-node-00 ~]# aws --endpoint http://object.s3.zone2.dan.ceph.blue:80 s3 ls s3://objectest1\\n2024-02-02 04:03:47 233 file1\\n2024-02-02 04:03:50 233 file2\\n2024-02-02 04:03:53 233 file3\\n2024-02-02 04:27:19 233 file4\\n2024-02-02 04:44:45 233 file5\\n2024-02-02 04:45:38 233 file6\\n
We continued discussing Multisite Sync Policy in part six of this series. We shared some hands-on examples of configuring multisite sync policies, including unidirectional replication with one source to many destination buckets. In the final post of this series we will introduce the Archive Zone feature, which maintains an immutable copy of all versions of all the objects from our production zones.
The authors would like to thank IBM for supporting the community by facilitating our time to create these posts.
In part eight of this Ceph Multisite series, we will continue exploring the Archive Zone. Using a hands-on example of recovering data from the archive zone, we will cover in detail how the archive zone works.
Let's start with a visual representation of the recovery workflow for the archive zone. Once covered, we will follow the same steps with a hands-on example.
The following diagram shows the behaviour of the Archive zone when a user PUTs an object into the production zone.
1. When the user PUTs object1 into the production zone, the object replicates into the archive zone as the current version.
2. When the user modifies object1, the modification is replicated to the archive zone. The modified object is now the current version, and the original object is still available in the archive zone thanks to S3 versioning.
3. If the user modifies object1 again, the same will happen as in step 2, and we will have three versions of the object available in the archive zone.

Continuing with the example depicted above, let's check how we could recover data from a logical failure.
1. The user deletes object1 in the production zone. The object is not deleted from the archive zone.
2. When the application tries to access object1, it fails. The application is down, and panic ensues.
3. We recover object1's latest version from the archive zone into the production cluster.
CLI tool for our testing. First, we create a specific user for our tests, so in our zone1
cluster, we run:
# radosgw-admin user create --uid=archuser --display-name=\\"S3 user to test the archive zone\\" --access-key=archuser --secret-key=archuser\\n{\\n \\"user_id\\": \\"archuser\\",\\n \\"display_name\\": \\"S3 user to test the archive zone\\",\\n \\"email\\": \\"\\",\\n \\"suspended\\": 0,\\n \\"max_buckets\\": 1000,\\n \\"subusers\\": [],\\n \\"keys\\": [\\n {\\n \\"user\\": \\"archuser\\",\\n \\"access_key\\": \\"archuser\\",\\n \\"secret_key\\": \\"archuser\\"\\n }\\n ],\\n \\"swift_keys\\": [],\\n \\"caps\\": [],\\n \\"op_mask\\": \\"read, write, delete\\",\\n \\"default_placement\\": \\"\\",\\n \\"default_storage_class\\": \\"\\",\\n \\"placement_tags\\": [],\\n \\"bucket_quota\\": {\\n
Now we configure the AWS client with this user:
# aws configure\\nAWS Access Key ID [None]: archuser\\nAWS Secret Access Key [None]: archuser\\nDefault region name [None]: multizg\\nDefault output format [None]: text\\n
We will also create a couple of aliases to make our life easier.
Aliases for zone1
and the archive
zone:
# alias s3apiarchive=\'aws --endpoint=https://object.s3.archive.dan.ceph.blue:443 s3api\'\\n# alias s3apizone1=\'aws --endpoint=https://object.s3.zone1.dan.ceph.blue:443 s3api\'\\n
We wish to use rclone
, so let's download and install the appropriate rclone
package:
# yum install https://downloads.rclone.org/v1.62.0/rclone-v1.62.0-linux-amd64.rpm -y\\n
Next we configure the rclone
client with our production zone endpoint and the archive zone endpoint. This way, with the use of rclone
we can recover data from the archive zone if required:
cat <<EOF >rclone.conf\\n[zone1]\\ntype = s3\\nprovider = Other\\naccess_key_id = archuser\\nsecret_access_key = archuser\\nendpoint = https://object.s3.zone1.dan.ceph.blue:443\\nlocation_constraint = multizg\\nacl = bucket-owner-full-control\\n[archive]\\ntype = s3\\nprovider = Ceph\\naccess_key_id = archuser\\nsecret_access_key = archuser\\nendpoint = https://object.s3.archive.dan.ceph.blue:443\\nlocation_constraint = multizg\\nacl = bucket-owner-full-control\\nEOF\\n\\n
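rclone reads ~/.config/rclone/rclone.conf by default, so either copy the file there or point to it explicitly with --config. A quick sanity check that both remotes answer could look like this:

# rclone --config ./rclone.conf lsd zone1:
# rclone --config ./rclone.conf lsd archive: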
Next we create some test files and capture their MD5 checksums so we can compare later:
# echo \\"This is file 1\\" > /tmp/test-file-1\\n# echo \\"This is file 2\\" > /tmp/test-file-2\\n# echo \\"This is file 3\\" > /tmp/test-file-3\\n# md5sum /tmp/test-file-1\\n88c16a56754e0f17a93d269ae74dde9b /tmp/test-file-1\\n# md5sum /tmp/test-file-2\\ndb06069ef1c9f40986ffa06db4fe8fd7 /tmp/test-file-2\\n# md5sum /tmp/test-file-3\\n95227e10e2c33771e1c1379b17330c86 /tmp/test-file-3\\n
We have our client ready; let’s check out the archive zone.
Create a new bucket and verify the bucket has been created in all RGW zones:
# s3apizone1 create-bucket --bucket my-bucket\\n# s3apizone1 list-buckets\\nBUCKETS 2023-03-15T12:03:54.315000+00:00 my-bucket\\nOWNER S3 user to test the archive zone archuser\\n# s3apiarchive list-buckets\\nBUCKETS 2023-03-15T12:03:54.315000+00:00 my-bucket\\nOWNER S3 user to test the archive zone archuser\\n
Verify that object versioning is not yet configured, as it is enabled lazily:
# s3apizone1 get-bucket-versioning --bucket my-bucket\\n# s3apiarchive get-bucket-versioning --bucket my-bucket\\n
Upload a new object to our bucket my-bucket
.
# rclone copy /tmp/test-file-1 zone1:my-bucket\\n
Verify that S3 versioning has been enabled in the archive zone but not in zone1
:
# s3apiarchive get-bucket-versioning --bucket my-bucket\\n{\\n \\"Status\\": \\"Enabled\\",\\n \\"MFADelete\\": \\"Disabled\\"\\n}\\n# s3apizone1 get-bucket-versioning --bucket my-bucket\\n
Verify that the object version ID is null in the master and secondary zones but not in the archive zone:
# s3apizone1 list-object-versions --bucket my-bucket\\n{\\n \\"Versions\\": [\\n {\\n \\"ETag\\": \\"\\\\\\"88c16a56754e0f17a93d269ae74dde9b\\\\\\"\\",\\n \\"Size\\": 15,\\n \\"StorageClass\\": \\"STANDARD\\",\\n \\"Key\\": \\"test-file-1\\",\\n \\"VersionId\\": \\"null\\",\\n \\"IsLatest\\": true,\\n \\"LastModified\\": \\"2023-03-15T12:07:12.914000+00:00\\",\\n \\"Owner\\": {\\n \\"DisplayName\\": \\"S3 user to test the archive zone\\",\\n \\"ID\\": \\"archuser\\"\\n }\\n }\\n ]\\n}\\n# s3apiarchive list-object-versions --bucket my-bucket\\n{\\n \\"Versions\\": [\\n {\\n \\"ETag\\": \\"\\\\\\"88c16a56754e0f17a93d269ae74dde9b\\\\\\"\\",\\n \\"Size\\": 15,\\n \\"StorageClass\\": \\"STANDARD\\",\\n \\"Key\\": \\"test-file-1\\",\\n \\"VersionId\\": \\"6DRlC7fKtpmkvHA9zknhFA87RjyilTV\\",\\n \\"IsLatest\\": true,\\n \\"LastModified\\": \\"2023-03-15T12:07:12.914000+00:00\\",\\n \\"Owner\\": {\\n \\"DisplayName\\": \\"S3 user to test the archive zone\\",\\n \\"ID\\": \\"archuser\\"\\n }\\n }\\n ]\\n}\\n
Modify the object in the master zone and verify that a new version is created in the RGW archive zone:
# rclone copyto /tmp/test-file-2 zone1:my-bucket/test-file-1\\n# rclone ls zone1:my-bucket\\n 15 test-file-1\\n
Verify a new version has been created in the RGW archive zone:
# s3apiarchive list-object-versions --bucket my-bucket\\n{\\n \\"Versions\\": [\\n {\\n \\"ETag\\": \\"\\\\\\"db06069ef1c9f40986ffa06db4fe8fd7\\\\\\"\\",\\n \\"Size\\": 15,\\n \\"StorageClass\\": \\"STANDARD\\",\\n \\"Key\\": \\"test-file-1\\",\\n \\"VersionId\\": \\"mXoINEnZsSCDNaWwCDELVysUbnMqNqx\\",\\n \\"IsLatest\\": true,\\n \\"LastModified\\": \\"2023-03-15T12:13:27.057000+00:00\\",\\n \\"Owner\\": {\\n \\"DisplayName\\": \\"S3 user to test the archive zone\\",\\n \\"ID\\": \\"archuser\\"\\n }\\n },\\n {\\n \\"ETag\\": \\"\\\\\\"88c16a56754e0f17a93d269ae74dde9b\\\\\\"\\",\\n \\"Size\\": 15,\\n \\"StorageClass\\": \\"STANDARD\\",\\n \\"Key\\": \\"test-file-1\\",\\n \\"VersionId\\": \\"6DRlC7fKtpmkvHA9zknhFA87RjyilTV\\",\\n \\"IsLatest\\": false,\\n \\"LastModified\\": \\"2023-03-15T12:07:12.914000+00:00\\",\\n \\"Owner\\": {\\n \\"DisplayName\\": \\"S3 user to test the archive zone\\",\\n \\"ID\\": \\"archuser\\"\\n }\\n }\\n ]\\n}\\n
We can check the ETag: it will match the MD5sum of the object. This is only the case when neither multipart upload nor object encryption is used.
# md5sum /tmp/test-file-2\\ndb06069ef1c9f40986ffa06db4fe8fd7 /tmp/test-file-2\\n# md5sum /tmp/test-file-1\\n88c16a56754e0f17a93d269ae74dde9b /tmp/test-file-1\\n
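If you prefer to read the ETag straight from the S3 API rather than from list-object-versions, head-object returns it as well. With the aliases defined earlier, something like the following works; the version ID is the first version shown above:

# Current version in zone1 (should report the md5sum of test-file-2)
# s3apizone1 head-object --bucket my-bucket --key test-file-1 --query ETag
# A specific version in the archive zone, selected by its VersionId
# s3apiarchive head-object --bucket my-bucket --key test-file-1 --version-id 6DRlC7fKtpmkvHA9zknhFA87RjyilTV --query ETag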
Let’s upload one more version of the object:
# rclone copyto /tmp/test-file-3 zone1:my-bucket/test-file-1\\n
In the primary zone we only have one version, the current version of the object:
# rclone --s3-versions lsl zone1:my-bucket\\n\\n 15 2023-03-15 07:59:10.779573336 test-file-1\\n
But in the Archive zone we have all three versions available:
# rclone --s3-versions lsl archive:my-bucket\\n 15 2023-03-15 07:59:10.779573336 test-file-1\\n 15 2023-03-15 07:59:03.782438991 test-file-1-v2023-03-15-121327-057\\n 15 2023-03-15 07:58:58.135330567 test-file-1-v2023-03-15-120712-914\\n
Now let’s delete test-file-1
from my-bucket
in zone1
, then recover the object from the archive zone:
# rclone delete zone1:my-bucket/test-file-1\\n# rclone --s3-versions lsl zone1:my-bucket\\n# rclone --s3-versions lsl archive:my-bucket\\n 15 2023-03-15 07:59:10.779573336 test-file-1\\n 15 2023-03-15 07:59:03.782438991 test-file-1-v2023-03-15-121327-057\\n 15 2023-03-15 07:58:58.135330567 test-file-1-v2023-03-15-120712-914\\n
The object has been deleted from zone1
, but all versions are still available in the archive zone. If we recover the latest version test-file-1
, it should match the MD5 checksum of our test-file-3
:
# rclone copyto archive:my-bucket/test-file-1 zone1:my-bucket/test-file-1\\n# rclone copyto zone1:my-bucket/test-file-1 /tmp/recovered-file1\\n# md5sum /tmp/recovered-file1\\n95227e10e2c33771e1c1379b17330c86 /tmp/recovered-file1\\n# md5sum /tmp/test-file-3\\n95227e10e2c33771e1c1379b17330c86 /tmp/test-file-3\\n
Now let's explore the case where we want to recover a specific version of the object, identified by its timestamp, for example 2023-03-15-121327-057
.
# rclone --s3-versions copyto archive:my-bucket/test-file-1-v2023-03-15-121327-057 zone1:my-bucket/test-file-1\\n# rclone copyto zone1:my-bucket/test-file-1 /tmp/recovered-file1\\n# md5sum /tmp/recovered-file1\\ndb06069ef1c9f40986ffa06db4fe8fd7 /tmp/recovered-file1\\n# md5sum /tmp/test-file-2\\ndb06069ef1c9f40986ffa06db4fe8fd7 /tmp/test-file-2\\n
This takes us to the end of a hands-on example of working with an archive zone and, with the help of rclone
, seamlessly recovering data.
We introduced the archive zone feature in part eight of this series. We shared a hands-on example of recovering data from an archive zone. This takes us to the end of this Ceph Object Storage Multisite series.
We hope this content has been helpful for your Ceph endeavours.
The authors would like to thank IBM for supporting the community by facilitating our time to create these posts.
Efficient multitenant environment management is critical in large-scale object storage systems. In the Ceph Squid release, we’ve introduced a transformative feature as a Tech Preview: Identity and Access Management (IAM) accounts.
This enhancement brings self-service resource management to Ceph Object Storage and significantly reduces administrative overhead for Ceph administrators by enabling hands-off multitenancy management.
IAM accounts allow tenants to independently manage their resources—users, groups, roles, policies, and buckets—using an API interface modeled after AWS IAM. For Ceph administrators, this means delegating day-to-day operational responsibilities to tenants while retaining system-wide control.
The IAM API is modeled after AWS IAM and is offered through the Object Gateway API endpoint. In this way, IAM account administrators (root account users) don't need access to or permissions on the Ceph internal radosgw-admin CLI or the Admin Ops API, enhancing responsibility delegation while maintaining security.
With the addition of IAM accounts, we have a new user persona that represents the tenant admin: the IAM Root Account User.
Although outside this post's scope, we provide an example RGW spec file that sets up the RGW services and the ingress service (load balancer) for the RGW endpoints.
# cat << EOF > /root/rgw-ha.spec\\n---\\nservice_type: ingress\\nservice_id: rgw.rgwsrv\\nservice_name: ingress.rgw.rgwsrv\\nplacement:\\n count: 2\\n hosts:\\n - ceph-node-02.cephlab.com\\n - ceph-node-03.cephlab.com\\nspec:\\n backend_service: rgw.rgwsrv\\n first_virtual_router_id: 50\\n frontend_port: 80\\n monitor_port: 1497\\n virtual_ip: 192.168.122.100\\n---\\nservice_type: rgw\\nservice_id: rgwsrv\\nservice_name: rgw.rgwsrv\\nplacement:\\n count: 3\\n hosts:\\n - ceph-node-03.cephlab.com\\n - ceph-node-00.cephlab.com\\n - ceph-node-01.cephlab.com\\nspec:\\n rgw_frontend_port: 8080\\n rgw_realm: realm1\\n rgw_zone: zone1\\n rgw_zonegroup: zonegroup1\\nEOF\\n \\n# ceph orch apply -i /root/rgw-ha.spec\\n
This config provides us with a working virtual IP endpoint at 192.168.122.100, which the hostname s3.zone1.cephlab.com resolves to.
Let’s walk through the steps to configure an IAM account, create users, and apply permissions.
We will use the radosgw-admin
CLI to create an IAM account for an analytics web application team; you could also use the Admin Ops API.
In the example, we first create the IAM account and then define the resources this specific IAM account will have available from the global RGW/Object Storage system.
# radosgw-admin account create --account-name=analytic_app\\n
This command creates an account named analytic_app
. The account is initialized with default quotas and limits, which may be adjusted afterward. When using IAM accounts, an RGW Account ID gets created that will be part of the principal ARN when we need to reference it, for example: arn:aws:iam::RGW00889737169837717:user/name
.
Example output:
{\\n \\"id\\": \\"RGW00889737169837717\\",\\n \\"tenant\\": \\"analytics\\",\\n \\"name\\": \\"analytic_app\\",\\n \\"max_users\\": 1000,\\n ...\\n}\\n
As the RGW admin, in this example, we adjust the maximum number of users for the account:
# radosgw-admin account modify --max-users 10 --account-name=analytic_app\\n
This ensures the IAM account can create up to ten users. Depending on your needs, you can also manage the maximum number of groups, keys, policies, buckets, and so on.
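For example, something along these lines adjusts other per-account limits; the exact flag names may vary slightly between releases, so check the radosgw-admin help output:

# radosgw-admin account modify --account-name=analytic_app --max-buckets=500 --max-groups=20 --max-access-keys=4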
As part of creating the IAM account, we can enable and define quotas for the account to control resource usage. In this example we configure the account's maximum storage usage to 20GB, and we can also configure other quotas related to the object count per bucket:
# radosgw-admin quota set --quota-scope=account --account-name=analytic_app --max-size=20G\\n# radosgw-admin quota enable --quota-scope=account --account-id=RGW00889737169837717\\n
Creating the Account Root User for our new IAM Account
Each IAM account is managed by a root user, who has default permissions over all resources within the account. Like normal users and roles, accounts and account root users must be created by an administrator using radosgw-admin or the Admin Ops API.
To create the account root user for the analytic_app
account, we run the following command:
# radosgw-admin user create --uid=root_analytics_web --display-name=root_analytics_web --account-id=RGW00889737169837717 --account-root --gen-secret --gen-access-key\\n
Example output:
{\\n \\"user_id\\": \\"root_analytics_web\\",\\n \\"access_key\\": \\"1EHAKZAXKPV6LU65QS2R\\",\\n \\"secret_key\\": \\"AgXK1BqPOP25pt0HvERDts2yZtFNfF4Mm8mCnoJX\\",\\n ...\\n}\\n
The root account user is now ready to create and manage users, groups, roles, and permissions within the IAM account. These resources are administered through the IAM API exposed by the RGW endpoint. At this point, the RGW admin may provide the credentials of the root user of the IAM account to the person responsible for the account. That person can then perform all administrative operations related to their account using the IAM API, which is entirely hands-off for the RGW administrator/operator.
Here is a list of some operations that the IAM root account can perform without the intervention of an RGW admin:
Create, modify, and delete IAM users and their access keys
Create IAM groups and manage group membership
Attach managed policies or embed inline policies on users, groups, and roles
Create IAM roles that users or services can assume through STS
Create and manage the S3 buckets and objects that belong to the account
Now we will configure the AWS CLI using the access and secret keys of the IAM root account generated in the previous step. The IAM API is available by default on the Ceph Object Gateway (RGW) endpoints. In this example, we have s3.zone1.cephlab.com
as the load-balanced endpoint providing access to the API.
# dnf install awscli -y\\n# aws configure\\nAWS Access Key ID [****************dmin]: 1EHAKZAXKPV6LU65QS2R\\nAWS Secret Access Key [****************dmin]: AgXK1BqPOP25pt0HvERDts2yZtFNfF4Mm8mCnoJX\\nDefault region name [multizg]: zonegroup1\\nDefault output format [json]: json\\n# aws configure set endpoint_url http://s3.zone1.cephlab.com\\n
Add a new IAM user named analytics_frontend
to the analytics IAM account:
# aws iam create-user --user-name analytics_frontend\\n
Assign an access key and secret key to the new user:
# aws iam create-access-key --user-name analytics_frontend\\n\\n
At this point, the user cannot access any S3 resources. In the next step, we will grant the user access. Here is an example of trying to access the S3 namespace as a newly created account user (via its configured AWS CLI profile) before any policy is attached:
# aws --profile analytics_backend s3 ls\\nargument of type \'NoneType\' is not iterable\\n# aws --profile analytics_backend s3 ls s3://staticfront/\\nargument of type \'NoneType\' is not iterable\\n
We have multiple ways to grant the new IAM user access to the various resources available in the account, for example, IAM, S3, and SNS resources:
Feature | Managed Policies | Inline Policies | Assume Role |
---|---|---|---|
Definition | Reusable policies that can be attached to multiple users, groups, or roles. | Policies created for and attached to a single user, group, or role. | Temporary access is granted to a user or service to perform specific tasks. |
Reusability | Can be shared across multiple accounts or identities within the RGW IAM system. | Specific to the identity they are attached to and cannot be reused. | Roles can be reused by multiple identities that need temporary access. |
Ease of Management | Easier to manage due to centralized policy definition. | Requires individual updates for each identity they are attached to. | It is easier to manage since roles are centrally defined and assumed as needed. |
Flexibility | Ideal for common permissions that apply to many users or groups. | Best for unique permissions tailored to specific use cases or users. | Highly flexible for scenarios requiring time-limited, task-specific access. |
Use Case | Example: Granting read-only or full access to S3 buckets across multiple users. | Example: Granting a specific user access to a unique S3 bucket. | Example: Allowing a service to temporarily assume access to an S3 bucket for processing. |
In this first example, we use a managed policy, policy/AmazonS3FullAccess
, to allow the analytics_frontend
user full access to IAM Account S3 resources:
# aws iam attach-user-policy --user-name analytics_frontend --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess\\n
Once we have attached the managed policy, we can create the S3 resources of the IAM account, for example:
# aws --profile analytics_frontend s3 mb s3://staticfront\\nmake_bucket: staticfront\\n
First, create an IAM group to manage permissions for users requiring similar access. In this case, we are creating a group for the frontend-monitoring team.
# aws iam create-group --group-name frontend-monitoring\\n
Attach a Policy to the Group: In this example, we will attach an S3 read-only access policy to the group so that all users inherit the permissions and can access the S3 resources in read-only mode. No modifications to the S3 dataset are allowed.
# aws iam attach-group-policy --group-name frontend-monitoring --policy-arn arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess\\n
Check that the policy is successfully attached to the group:
# aws iam list-attached-group-policies --group-name frontend-monitoring\\n{\\n \\"AttachedPolicies\\": [\\n {\\n \\"PolicyName\\": \\"AmazonS3ReadOnlyAccess\\",\\n \\"PolicyArn\\": \\"arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess\\"\\n }\\n ]\\n}\\n
Create individual IAM users with their keys who will be group members.
# aws iam create-user --user-name mon_user1\\n# aws iam create-user --user-name mon_user2\\n# aws iam create-access-key --user-name mon_user1\\n# aws iam create-access-key --user-name mon_user2\\n
Add the users created in the previous step to the frontend-monitoring
group so they inherit the permissions.
# aws iam add-user-to-group --group-name frontend-monitoring --user-name mon_user1\\n# aws iam add-user-to-group --group-name frontend-monitoring --user-name mon_user2\\n
Confirm that both users are part of the group:
# aws iam get-group --group-name frontend-monitoring\\n{\\n \\"Users\\": [\\n {\\n \\"Path\\": \\"/\\",\\n \\"UserName\\": \\"mon_user1\\",\\n \\"UserId\\": \\"fe09d373-08e8-4b61-bffa-6f65eaf11e56\\",\\n \\"Arn\\": \\"arn:aws:iam::RGW60952341557974488:user/mon_user1\\"\\n },\\n {\\n \\"Path\\": \\"/\\",\\n \\"UserName\\": \\"mon_user2\\",\\n \\"UserId\\": \\"29c57263-1293-4bdf-90e4-a784859f12ef\\",\\n \\"Arn\\": \\"arn:aws:iam::RGW60952341557974488:user/mon_user2\\"\\n }\\n ],\\n \\"Group\\": {\\n \\"Path\\": \\"/\\",\\n \\"GroupName\\": \\"frontend-monitoring\\",\\n \\"GroupId\\": \\"a453d5af-4e25-401c-be76-b4075419cc94\\",\\n \\"Arn\\": \\"arn:aws:iam::RGW60952341557974488:group/frontend-monitoring\\"\\n }\\n}\\n
This example demonstrates creating and attaching an inline policy to a specific user in IAM. Inline policies define permissions for a single user and are directly embedded into their identity. While this example focuses on the PutUserPolicy
operation, the same approach applies to groups (PutGroupPolicy) and roles (PutRolePolicy) if you need to manage permissions for those entities.
We begin by creating a user who will be assigned the custom inline policy.
# aws iam create-user --user-name static_ro\\n# aws iam create-access-key --user-name static_ro\\n
We create a JSON file containing the policy document to define the Custom Inline Policy. This policy permits users to perform read-only operations on a specific S3 bucket and its objects.
# cat << EOF > analytics_policy_web_ro.json\\n{\\n \\"Version\\": \\"2012-10-17\\",\\n \\"Statement\\": [\\n {\\n \\"Effect\\": \\"Allow\\",\\n \\"Action\\": [\\n \\"s3:GetObject\\",\\n \\"s3:ListBucket\\",\\n \\"s3:ListBucketMultipartUploads\\"\\n ],\\n \\"Resource\\": [\\n \\"arn:aws:s3:::staticfront/*\\", \\n \\"arn:aws:s3:::staticfront\\" \\n ]\\n }\\n ]\\n}\\nEOF\\n
Example policy overview: the policy allows the read-only actions s3:GetObject, s3:ListBucket, and s3:ListBucketMultipartUploads on the staticfront bucket and the objects inside it; any action not listed is implicitly denied.
Attach the policy to the user using the put-user-policy command.
# aws iam put-user-policy --user-name static_ro --policy-name analytics-static-ro --policy-document file://analytics_policy_web_ro.json\\n
List the inline policies attached to the user to confirm the policy was successfully applied.
# aws iam list-user-policies --user-name static_ro\\n{\\n \\"PolicyNames\\": [\\n \\"analytics-static-ro\\"\\n ]\\n}\\n
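To double-check the effect, assuming an AWS CLI profile named static_ro has been configured with the keys generated above, read operations against staticfront should succeed while writes should be rejected:

# Allowed by the inline policy (read-only actions on staticfront)
# aws --profile static_ro s3 ls s3://staticfront/
# Not allowed: PutObject is not in the policy, so this should fail with an AccessDenied error
# aws --profile static_ro s3 cp /etc/hosts s3://staticfront/should-fail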
Since this topic is too extensive for a single post, we won't cover IAM roles and the ability of IAM users to assume roles with the help of STS in this article. In our next post about the new IAM account feature in Squid, we will explore additional exciting features, including Roles with STS and cross-account access for sharing datasets between accounts.
For further details on IAM, explore the IAM API documentation and the Account documentation.
For further details on the Squid release, check Laura Flores' blog post.
The authors would like to thank IBM for supporting the community by facilitating our time to create these posts.
In the previous episode of the series, we went through an example of configuring Ceph Object Storage multisite replication with the help of the rgw
manager module.
We have set up two RGWs for each Ceph cluster. By default, the RGW services manage both client S3 requests and replication requests among the sites. The RGW services share their resources and processing time between both tasks. To improve this configuration, we can assign a specific subset of RGWs to manage client S3 requests and another subset of RGWs to manage multisite replication requests between the two Ceph clusters.
Using this approach is not mandatory but will provide the following benefits:
Because we have a dedicated set of resources for public and multisite replication, we can scale the subsets of client-facing and replication RGWs independently depending on where we need higher performance, like increased throughput or lower latency.
Segregated RGWs can avoid sync replication stalling because the RGWs are busy with client-facing tasks or vice-versa.
Improved troubleshooting: dedicated subsets of RGWs can improve the troubleshooting experience as we can target the RGWs to investigate depending on the specific issue. Also, when reading through the debug logs of the RGW services, replication messages don’t get mixed in with client messages and vice versa.
Because we are using different sets of RGWs, we could use different networks with different security levels, firewall rules, OS security, etc. For example:
The public-facing RGWs could be using Network A.
The replication RGWs could be using Network B.
When configuring a multisite deployment, it is common practice to dedicate specific RGW services to client operations and other RGW services to multisite replication.
By default, all RGWs participate in multisite replication. Two steps are needed to exclude an RGW from participating in the multisite replication sync.
Set this Ceph option for RGWs: ceph config set ${KEY_ID} rgw_run_sync_thread false
. When false, this option prevents the object store's gateways from transmitting multisite replication data.
The previous parameter only tells the RGW not to send replication data, but it can keep receiving. To also avoid receiving, we need to remove the RGWs from the zonegroup and zone replication endpoints.
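If the client-facing RGWs ever end up listed as replication endpoints, the zone endpoint list can be reset to only the replication-dedicated RGWs, roughly as follows (shown for zone1; repeat with the corresponding endpoints for zone2 and always commit the period afterwards):

# radosgw-admin zone modify --rgw-zone=zone1 --endpoints=http://ceph-node-00.cephlab.com:8000,http://ceph-node-01.cephlab.com:8000
# radosgw-admin period update --commit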
In the previous chapter, we configured two RGWs per Ceph cluster, which are currently serving both client S3 requests and replication request traffic. In the following steps, we will configure two additional RGWs per cluster for a total of four RGWs within each cluster. Out of these four RGWs, two will be dedicated to serving client requests and the other two will be dedicated to serving multisite replication. The diagram below illustrates what we are aiming to achieve.
We will use labels to control the scheduling and placement of RGW services. In this case, the label we will use for the public-facing RGWs is rgw
.
[root@ceph-node-00 ~]# ceph orch host label add ceph-node-02.cephlab.com rgw\\nAdded label rgw to host ceph-node-02.cephlab.com\\n[root@ceph-node-00 ~]# ceph orch host label add ceph-node-03.cephlab.com rgw\\nAdded label rgw to host ceph-node-03.cephlab.com\\n
We create an RGW spec file for the public-facing RGWs. In this example, we use the same CIDR network for all RGW services. We could, however, configure different network CIDRs for the different sets of RGWs we deploy if needed. We use the same realm, zonegroup and zone as the services we already have running, as we want all RGWs to belong to the same realm namespace.
[root@ceph-node-00 ~]# cat << EOF >> /root/rgw-client.spec
service_type: rgw
service_id: client-traffic
placement:
  label: rgw
  count_per_host: 1
networks:
- 192.168.122.0/24
spec:
  rgw_frontend_port: 8000
  rgw_realm: multisite
  rgw_zone: zone1
  rgw_zonegroup: multizg
EOF
We apply the spec file and check that we now have four RGW services running: two for multisite replication and two for client traffic.
[root@ceph-node-00 ~]# ceph orch apply -i spec-rgw.yaml
Scheduled rgw.rgw-client-traffic update…
[root@ceph-node-00 ~]# ceph orch ps | grep rgw
rgw.multisite.zone1.ceph-node-00.mwvvel ceph-node-00.cephlab.com *:8000 running (2h) 6m ago 2h 190M - 18.2.0-131.el9cp 463bf5538482 dda6f58469e9
rgw.multisite.zone1.ceph-node-01.fwqfcc ceph-node-01.cephlab.com *:8000 running (2h) 6m ago 2h 184M - 18.2.0-131.el9cp 463bf5538482 10a45a616c44
rgw.client-traffic.ceph-node-02.ozdapg ceph-node-02.cephlab.com 192.168.122.94:8000 running (84s) 79s ago 84s 81.1M - 18.2.0-131.el9cp 463bf5538482 0bc65ad993b1
rgw.client-traffic.ceph-node-03.udxlvd ceph-node-03.cephlab.com 192.168.122.180:8000 running (82s) 79s ago 82s 18.5M - 18.2.0-131.el9cp 463bf5538482 8fc7d6b06b54
As we mentioned at the start of this section, to disable replication traffic on an RGW, we need to ensure two things:
So the first thing to do is disable the rgw_run_sync_thread option using the ceph config command. We specify the service name client.rgw.client-traffic to apply the change to both of our client-facing RGWs simultaneously. We first check the current value of rgw_run_sync_thread and confirm that it is set to true by default.
[root@ceph-node-00 ~]# ceph config get client.rgw.client-traffic rgw_run_sync_thread
true
We now set the parameter to false so that the sync threads are disabled for this set of RGWs.
[root@ceph-node-00 ~]# ceph config set client.rgw.client-traffic rgw_run_sync_thread false
[root@ceph-node-00 ~]# ceph config get client.rgw.client-traffic rgw_run_sync_thread
false
The second step is ensuring the new RGWs we deployed are not listed as replication endpoints in the zonegroup configuration. We shouldn't see ceph-node-02 or ceph-node-03 listed as endpoints under zone1:
[root@ceph-node-00 ~]# radosgw-admin zonegroup get | jq '.zones[]|.name,.endpoints'
"zone1"
[
  "http://ceph-node-00.cephlab.com:8000",
  "http://ceph-node-01.cephlab.com:8000"
]
"zone2"
[
  "http://ceph-node-04.cephlab.com:8000",
  "http://ceph-node-05.cephlab.com:8000"
]
Note that the JSON parsing utility jq must be installed for this task.
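On the EL-based hosts used in this lab, jq can usually be installed with the distribution package manager, for example:

[root@ceph-node-00 ~]# dnf install -y jq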
After confirming this, we have finished this part of the configuration: the cluster now has dedicated services running for each type of request, client requests and replication requests.
You would need to repeat the same steps to apply the same configuration to our second cluster, zone2.
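As a rough sketch, and assuming hypothetical host names ceph-node-06 and ceph-node-07 for the additional client-facing RGW hosts in zone2, the equivalent steps on the second cluster would look like this:

# Label the hosts that will run the client-facing RGWs in zone2
ceph orch host label add ceph-node-06.cephlab.com rgw
ceph orch host label add ceph-node-07.cephlab.com rgw

# Apply an RGW spec like the one above, but with rgw_zone: zone2
ceph orch apply -i /root/rgw-client.spec

# Disable the sync threads for the zone2 client-facing RGWs
ceph config set client.rgw.client-traffic rgw_run_sync_thread false

# Verify that the new hosts are not listed as zone2 replication endpoints
radosgw-admin zonegroup get | jq '.zones[]|.name,.endpoints'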
The Reef release introduced an improvement in Object Storage Multisite Replication known as "Replication Sync Fairness". This improvement addresses the issue faced by earlier releases where replication work was not distributed optimally. In prior releases, one RGW would take the lock for replication operations and the other RGW services would find it difficult to obtain the lock. This resulted in multisite replication not scaling linearly when adding additional RGW services. To improve the distribution of replication work, significant improvements were made in the Quincy release. However, with Sync Fairness replication in Reef, replication data and metadata are evenly distributed among all RGW services, enabling them to collaborate more efficiently in replication tasks.
Thanks to the IBM Storage DFG team, which ran scale testing to highlight and verify the improvements introduced by the sync fairness feature. During the testing, the DFG team compared Ceph Reef with Quincy and Pacific when ingesting objects with multisite replication configured.
The results below, provided by DFG, compare the degree of participation by each syncing RGW in each test case. The graphs plot the avgcount (number of objects and bytes fetched by data sync) polled every fifteen minutes. An optimal result is one where all syncing RGWs evenly share the load.
In this example, note how one of the Pacific RGWs (the blue lines labeled RHCS 5.3) processed objects in the 13M range (18M for secondary sync) while the other two RGWs were handling 5 million and 1.5 million, resulting in longer sync times: more than 24 hours. The Reef RGWs (the green lines labeled RHCS 7), however, all stay within close range of each other. Each processes 5M to 7M objects, and syncing is achieved more quickly, well under 19 hours.
The closer the lines of the same color are in the graph, the better the sync participation is. As you can see, for Reef, the green lines are very close to each other, meaning that the replication workload is evenly distributed among the three sync RGWs configured for the test.
In the following graph, we show how much time it took for each release to sync the full workload (small objects) to the other zone: the less time, the better. We can see that Reef, here labeled 7, provides substantially improved sync times.
To summarize, in part three of this series, we discussed configuring dedicated RGW services for public and replication requests. Additionally, we have explored the performance enhancements the sync fairness feature offers. We will delve into load balancing our client-facing RGW endpoints in part four.
The authors would like to thank IBM for supporting the community by facilitating our time to create these posts.
This is the eighth, and expected to be the last, backport release in the Quincy series. We recommend all users update to this release.
v17.2.8 ships RPM packages built for CentOS 9 instead of CentOS 8.
v17.2.8 container images, now based on CentOS 9, may be incompatible with older kernels (e.g., Ubuntu 18.04) due to differences in thread creation methods. Users upgrading to v17.2.8 container images on older OS versions may encounter crashes during pthread_create. We recommend upgrading your OS to avoid this unsupported combination.
Users should expect the el8 rpm subdirectory to be empty, and "dnf" commands are expected to fail with 17.2.8. They can either use the 17.2.8 RPM packages for CentOS 8/el8 provided by CERN as a community member, or stay at 17.2.7 by following the instructions at https://docs.ceph.com/en/latest/install/get-packages/#rhel; in that case, the ceph.repo file should point to https://download.ceph.com/rpm-17.2.7/el8 instead of https://download.ceph.com/rpm-quincy/el8
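For reference, a minimal sketch of such a ceph.repo entry on an el8 host might look like the following; the exact layout and the gpgkey URL should be verified against the get-packages documentation linked above:

# Quoting 'EOF' keeps $basearch literal so yum/dnf can expand it
cat << 'EOF' > /etc/yum.repos.d/ceph.repo
[ceph]
name=Ceph packages for $basearch
baseurl=https://download.ceph.com/rpm-17.2.7/el8/$basearch
enabled=1
gpgcheck=1
gpgkey=https://download.ceph.com/keys/release.asc
EOF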
These CERN packages come with no warranty and have not been tested by the Ceph community; the software they contain has been tested by Ceph according to the supported [platforms](https://docs.ceph.com/en/latest/start/os-recommendations/#platforms). The repository for el8 builds is hosted by CERN on [Linux @ CERN](https://linuxsoft.cern.ch/repos/ceph-ext-quincy8el-stable/). The public part of the GPG key used to sign the packages is available at RPM-GPG-KEY-Ceph-Community.
The get_pool_is_selfmanaged_snaps_mode C++ API has been deprecated due to being prone to false negative results. Its safer replacement is pool_is_in_selfmanaged_snaps_mode.
When diffing against the beginning of time (fromsnapname == NULL) in fast-diff mode (whole_object == true with the fast-diff image feature enabled and valid), diff-iterate is now guaranteed to execute locally if an exclusive lock is available. This brings a dramatic performance improvement for QEMU live disk synchronization and backup use cases.
The --image-id option has been added to the rbd children CLI command, so it can be run for images in the trash.
The RBD_IMAGE_OPTION_CLONE_FORMAT option has been exposed in Python bindings via the clone_format optional parameter to the clone, deep_copy and migration_prepare methods.
The RBD_IMAGE_OPTION_FLATTEN option has been exposed in Python bindings via the flatten optional parameter to the deep_copy and migration_prepare methods.
.github: sync the list of paths for rbd label, expand tests label to qa/* (pr#57726, Ilya Dryomov)
[quincy] qa/multisite: stabilize multisite testing (pr#60479, Shilpa Jagannath, Casey Bodley)
[quincy] RGW backports (pr#51806, Soumya Koduri, Casey Bodley)
[rgw][lc][rgw_lifecycle_work_time] adjust timing if the configured end time is less than the start time (pr#54874, Oguzhan Ozmen)
Add Containerfile and build.sh to build it (pr#60230, Dan Mick)
admin/doc-requirements: bump Sphinx to 5.0.2 (pr#55204, Nizamudeen A)
batch backport of #50743, #55342, #48557 (pr#55593, John Mulligan, Afreen, Laura Flores)
blk/aio: fix long batch (64+K entries) submission (pr#58674, Igor Fedotov, Adam Kupczyk, Robin Geuze)
bluestore/bluestore_types: avoid heap-buffer-overflow in another way to keep code uniformity (pr#58818, Rongqi Sun)
bluestore/bluestore_types: check \'it\' valid before using (pr#56889, Rongqi Sun)
build: Make boost_url a list (pr#58316, Adam Emerson, Kefu Chai)
centos 9 related backports for RBD (pr#58565, Casey Bodley, Ilya Dryomov)
ceph-menv:fix typo in README (pr#55164, yu.wang)
ceph-node-proxy not present, not part of container (pr#60337, Dan Mick)
ceph-volume: add missing import (pr#56260, Guillaume Abrioux)
ceph-volume: create LVs when using partitions (pr#58221, Guillaume Abrioux)
ceph-volume: fix a bug in _check_generic_reject_reasons (pr#54706, Kim Minjong)
ceph-volume: fix a regression in raw list
(pr#54522, Guillaume Abrioux)
ceph-volume: Fix migration from WAL to data with no DB (pr#55496, Igor Fedotov)
ceph-volume: Fix unbound var in disk.get_devices() (pr#59651, Zack Cerza)
ceph-volume: fix zap_partitions() in devices.lvm.zap (pr#55480, Guillaume Abrioux)
ceph-volume: fixes fallback to stat in is_device and is_partition (pr#54630, Teoman ONAY)
ceph-volume: Revert \\"ceph-volume: fix raw list for lvm devices\\" (pr#54430, Matthew Booth, Guillaume Abrioux)
ceph-volume: use \'no workqueue\' options with dmcrypt (pr#55336, Guillaume Abrioux)
ceph-volume: use importlib from stdlib on Python 3.8 and up (pr#58006, Guillaume Abrioux, Kefu Chai)
ceph-volume: Use safe accessor to get TYPE info (pr#56322, Dillon Amburgey)
ceph.spec.in: add support for openEuler OS (pr#56366, liuqinfei)
ceph.spec.in: we need jsonnet for all distroes for make check (pr#60074, Kyr Shatskyy)
ceph_test_rados_api_misc: adjust LibRadosMiscConnectFailure.ConnectTimeout timeout (pr#58128, Lucian Petrut)
cephadm: add a --dry-run option to cephadm shell (pr#54221, John Mulligan)
cephadm: add tcmu-runner to logrotate config (pr#55966, Adam King)
cephadm: add timemaster to timesync services list (pr#56308, Florent Carli)
cephadm: Adding support to configure public_network cfg section (pr#55959, Redouane Kachach)
cephadm: allow ports to be opened in firewall during adoption, reconfig, redeploy (pr#55960, Adam King)
cephadm: disable ms_bind_ipv4 if we will enable ms_bind_ipv6 (pr#58760, Dan van der Ster, Joshua Blanch)
cephadm: fix host-maintenance command always exiting with a failure (pr#58755, John Mulligan)
cephadm: make custom_configs work for tcmu-runner container (pr#53425, Adam King)
cephadm: pin pyfakefs version for tox tests (pr#56763, Adam King)
cephadm: remove restriction for crush device classes (pr#56087, Seena Fallah)
cephadm: run tcmu-runner through script to do restart on failure (pr#55975, Adam King, Raimund Sacherer, Teoman ONAY, Ilya Dryomov)
cephadm: support for CA signed keys (pr#55965, Adam King)
cephadm: turn off cgroups_split setting when bootstrapping with --no-cgroups-split (pr#58761, Adam King)
cephadm: use importlib.metadata for querying ceph_iscsi\'s version (pr#58637, Kefu Chai)
cephfs-mirror: various fixes (pr#56702, Jos Collin)
cephfs: Fixed a bug in the readdir_cache_cb function that may have us… (pr#58806, Tod Chen)
cephfs: upgrade cephfs-shell\'s path wherever necessary (pr#54186, Rishabh Dave)
client, mds: update mtime and change attr for snapdir when snaps are created, deleted and renamed (issue#54501, pr#50730, Venky Shankar)
client/fuse: handle case of renameat2 with non-zero flags (pr#55010, Leonid Usov, Shachar Sharon)
client: always refresh mds feature bits on session open (issue#63188, pr#54244, Venky Shankar)
client: call _getattr() for -ENODATA returned _getvxattr() calls (pr#54405, Jos Collin)
client: disallow unprivileged users to escalate root privileges (pr#60314, Xiubo Li, Venky Shankar)
client: fix leak of file handles (pr#56121, Xavi Hernandez)
client: queue a delay cap flushing if there are ditry caps/snapcaps (pr#54465, Xiubo Li)
cloud sync: fix crash due to objs on cr stack (pr#51136, Yehuda Sadeh)
cls/cas/cls_cas_internal: Initialize \'hash\' value before decoding (pr#59236, Nitzan Mordechai)
cmake/modules/BuildRocksDB.cmake: inherit parent\'s CMAKE_CXX_FLAGS (pr#55501, Kefu Chai)
cmake/rgw: librgw tests depend on ALLOC_LIBS (pr#54796, Casey Bodley)
cmake: use or turn off liburing for rocksdb (pr#54123, Casey Bodley, Patrick Donnelly)
common/admin_socket: add a command to raise a signal (pr#54356, Leonid Usov)
common/dout: fix FTBFS on GCC 14 (pr#59057, Radoslaw Zarzynski)
common/Formatter: dump inf/nan as null (pr#60064, Md Mahamudur Rahaman Sajib)
common/StackStringStream: update pointer to newly allocated memory in overflow() (pr#57363, Rongqi Sun)
common/weighted_shuffle: don\'t feed std::discrete_distribution with all-zero weights (pr#55154, Radosław Zarzyński)
common: intrusive_lru destructor add (pr#54557, Ali Maredia)
common: fix compilation warnings in numa.cc (pr#58704, Radoslaw Zarzynski)
common: resolve config proxy deadlock using refcounted pointers (pr#54374, Patrick Donnelly)
Do not duplicate query-string in ops-log (pr#57132, Matt Benjamin)
do not evict clients if OSDs are laggy (pr#52271, Dhairya Parmar, Laura Flores)
doc/architecture.rst - fix typo (pr#55385, Zac Dover)
doc/architecture.rst: improve rados definition (pr#55344, Zac Dover)
doc/architecture: correct typo (pr#56013, Zac Dover)
doc/architecture: improve some paragraphs (pr#55400, Zac Dover)
doc/architecture: remove pleonasm (pr#55934, Zac Dover)
doc/ceph-volume: add spillover fix procedure (pr#59542, Zac Dover)
doc/ceph-volume: explain idempotence (pr#54234, Zac Dover)
doc/ceph-volume: improve front matter (pr#54236, Zac Dover)
doc/cephadm - edit t11ing (pr#55483, Zac Dover)
doc/cephadm/services: remove excess rendered indentation in osd.rst (pr#54324, Ville Ojamo)
doc/cephadm/upgrade: ceph-ci containers are hosted by quay.ceph.io (pr#58682, Casey Bodley)
doc/cephadm: add default monitor images (pr#57210, Zac Dover)
doc/cephadm: add malformed-JSON removal instructions (pr#59665, Zac Dover)
doc/cephadm: add note about ceph-exporter (Quincy) (pr#55520, Zac Dover)
doc/cephadm: correct nfs config pool name (pr#55604, Zac Dover)
doc/cephadm: edit \\"Using Custom Images\\" (pr#58942, Zac Dover)
doc/cephadm: edit troubleshooting.rst (1 of x) (pr#54284, Zac Dover)
doc/cephadm: edit troubleshooting.rst (2 of x) (pr#54321, Zac Dover)
doc/cephadm: explain different methods of cephadm delivery (pr#56176, Zac Dover)
doc/cephadm: fix typo in set ssh key command (pr#54389, Piotr Parczewski)
doc/cephadm: how to get exact size_spec from device (pr#59432, Zac Dover)
doc/cephadm: improve host-management.rst (pr#56112, Anthony D\'Atri)
doc/cephadm: Improve multiple files (pr#56134, Anthony D\'Atri)
doc/cephadm: Quincy default images procedure (pr#57239, Zac Dover)
doc/cephadm: remove downgrade reference from upgrade docs (pr#57087, Adam King)
doc/cephfs/client-auth.rst: correct ``fs authorize cephfs1 /dir1 clie… (pr#55247, 叶海丰)
doc/cephfs: add cache pressure information (pr#59150, Zac Dover)
doc/cephfs: add doc for disabling mgr/volumes plugin (pr#60498, Rishabh Dave)
doc/cephfs: disambiguate \\"Reporting Free Space\\" (pr#56873, Zac Dover)
doc/cephfs: disambiguate two sentences (pr#57705, Zac Dover)
doc/cephfs: edit \\"Cloning Snapshots\\" in fs-volumes.rst (pr#57667, Zac Dover)
doc/cephfs: edit \\"is mount helper present\\" (pr#58580, Zac Dover)
doc/cephfs: edit \\"Layout Fields\\" text (pr#59023, Zac Dover)
doc/cephfs: edit \\"Pinning Subvolumes...\\" (pr#57664, Zac Dover)
doc/cephfs: edit add-remove-mds (pr#55649, Zac Dover)
doc/cephfs: edit front matter in client-auth.rst (pr#57123, Zac Dover)
doc/cephfs: edit front matter in mantle.rst (pr#57793, Zac Dover)
doc/cephfs: edit fs-volumes.rst (1 of x) (pr#57419, Zac Dover)
doc/cephfs: edit fs-volumes.rst (1 of x) followup (pr#57428, Zac Dover)
doc/cephfs: edit fs-volumes.rst (2 of x) (pr#57544, Zac Dover)
doc/cephfs: edit mount-using-fuse.rst (pr#54354, Jaanus Torp)
doc/cephfs: edit vstart warning text (pr#57816, Zac Dover)
doc/cephfs: fix \\"file layouts\\" link (pr#58877, Zac Dover)
doc/cephfs: fix \\"OSD capabilities\\" link (pr#58894, Zac Dover)
doc/cephfs: fix architecture link to correct relative path (pr#56341, molpako)
doc/cephfs: improve \\"layout fields\\" text (pr#59252, Zac Dover)
doc/cephfs: improve cache-configuration.rst (pr#59216, Zac Dover)
doc/cephfs: improve ceph-fuse command (pr#56969, Zac Dover)
doc/cephfs: note regarding start time time zone (pr#53577, Milind Changire)
doc/cephfs: rearrange subvolume group information (pr#60437, Indira Sawant)
doc/cephfs: refine client-auth (1 of 3) (pr#56781, Zac Dover)
doc/cephfs: refine client-auth (2 of 3) (pr#56843, Zac Dover)
doc/cephfs: refine client-auth (3 of 3) (pr#56852, Zac Dover)
doc/cephfs: s/mountpoint/mount point/ (pr#59296, Zac Dover)
doc/cephfs: s/mountpoint/mount point/ (pr#59288, Zac Dover)
doc/cephfs: s/subvolumegroups/subvolume groups (pr#57744, Zac Dover)
doc/cephfs: separate commands into sections (pr#57670, Zac Dover)
doc/cephfs: streamline a paragraph (pr#58776, Zac Dover)
doc/cephfs: take Anthony\'s suggestion (pr#58361, Zac Dover)
doc/cephfs: update cephfs-shell link (pr#58372, Zac Dover)
doc/cephfs: Update disaster-recovery-experts.rst to mention Slack (pr#55045, Dhairya Parmar)
doc/cephfs: use \'p\' flag to set layouts or quotas (pr#60484, TruongSinh Tran-Nguyen)
doc/config: edit \\"ceph-conf.rst\\" (pr#54464, Zac Dover)
doc/dev/peering: Change acting set num (pr#59064, qn2060)
doc/dev/release-process.rst: note new \'project\' arguments (pr#57645, Dan Mick)
doc/dev: add \\"activate latest release\\" RTD step (pr#59656, Zac Dover)
doc/dev: add formatting to basic workflow (pr#58739, Zac Dover)
doc/dev: edit \\"Principles for format change\\" (pr#58577, Zac Dover)
doc/dev: edit internals.rst (pr#55853, Zac Dover)
doc/dev: fix spelling in crimson.rst (pr#55738, Zac Dover)
doc/dev: Fix typos in encoding.rst (pr#58306, N Balachandran)
doc/dev: improve basic-workflow.rst (pr#58939, Zac Dover)
doc/dev: link to ceph.io leads list (pr#58107, Zac Dover)
doc/dev: osd_internals/snaps.rst: add clone_overlap doc (pr#56524, Matan Breizman)
doc/dev: refine \\"Concepts\\" (pr#56661, Zac Dover)
doc/dev: refine \\"Concepts\\" 2 of 3 (pr#56726, Zac Dover)
doc/dev: refine \\"Concepts\\" 3 of 3 (pr#56730, Zac Dover)
doc/dev: refine \\"Concepts\\" 4 of 3 (pr#56741, Zac Dover)
doc/dev: remove \\"Stable Releases and Backports\\" (pr#60274, Zac Dover)
doc/dev: repair broken image (pr#57009, Zac Dover)
doc/dev: s/to asses/to assess/ (pr#57424, Zac Dover)
doc/dev: update leads list (pr#56604, Zac Dover)
doc/dev: update leads list (pr#56590, Zac Dover)
doc/dev_guide: add needs-upgrade-testing label info (pr#58731, Zac Dover)
doc/developer_guide: update doc about installing teuthology (pr#57751, Rishabh Dave)
doc/glossary.rst: add \\"Monitor Store\\" (pr#54744, Zac Dover)
doc/glossary.rst: add \\"OpenStack Swift\\" and \\"Swift\\" (pr#57943, Zac Dover)
doc/glossary: add \\"ceph-ansible\\" (pr#59009, Zac Dover)
doc/glossary: add \\"ceph-fuse\\" entry (pr#58945, Zac Dover)
doc/glossary: add \\"Crimson\\" entry (pr#56074, Zac Dover)
doc/glossary: add \\"librados\\" entry (pr#56236, Zac Dover)
doc/glossary: add \\"object storage\\" (pr#59426, Zac Dover)
doc/glossary: Add \\"OMAP\\" to glossary (pr#55750, Zac Dover)
doc/glossary: add \\"PLP\\" to glossary (pr#60505, Zac Dover)
doc/glossary: add \\"Prometheus\\" (pr#58979, Zac Dover)
doc/glossary: add \\"Quorum\\" to glossary (pr#54510, Zac Dover)
doc/glossary: Add \\"S3\\" (pr#57984, Zac Dover)
doc/glossary: Add link to CRUSH paper (pr#55558, Zac Dover)
doc/glossary: improve \\"BlueStore\\" entry (pr#54266, Zac Dover)
doc/glossary: improve \\"MDS\\" entry (pr#55850, Zac Dover)
doc/glossary: improve OSD definitions (pr#55614, Zac Dover)
doc/governance: add Zac Dover\'s updated email (pr#60136, Zac Dover)
doc/install: add manual RADOSGW install procedure (pr#55881, Zac Dover)
doc/install: fix typos in openEuler-installation doc (pr#56414, Rongqi Sun)
doc/install: Keep the name field of the created user consistent with … (pr#59758, hejindong)
doc/install: update \\"update submodules\\" (pr#54962, Zac Dover)
doc/man/8/mount.ceph.rst: add more mount options (pr#55755, Xiubo Li)
doc/man/8/radosgw-admin: add get lifecycle command (pr#57161, rkhudov)
doc/man: add missing long option switches (pr#57708, Patrick Donnelly)
doc/man: edit \\"manipulating the omap key\\" (pr#55636, Zac Dover)
doc/man: edit ceph-bluestore-tool.rst (pr#59684, Zac Dover)
doc/man: edit ceph-osd description (pr#54552, Zac Dover)
doc/man: supplant \\"wsync\\" with \\"nowsync\\" as the default (pr#60201, Zac Dover)
doc/mds: improve wording (pr#59587, Piotr Parczewski)
doc/mgr/dashboard: fix TLS typo (pr#59033, Mindy Preston)
doc/mgr: credit John Jasen for Zabbix 2 (pr#56685, Zac Dover)
doc/mgr: document lack of MSWin NFS 4.x support (pr#55033, Zac Dover)
doc/mgr: edit \\"Overview\\" in dashboard.rst (pr#57337, Zac Dover)
doc/mgr: edit \\"Resolve IP address to hostname before redirect\\" (pr#57297, Zac Dover)
doc/mgr: explain error message - dashboard.rst (pr#57110, Zac Dover)
doc/mgr: remove ceph-exporter (Quincy) (pr#55518, Zac Dover)
doc/mgr: remove Zabbix 1 information (pr#56799, Zac Dover)
doc/mgr: update zabbix information (pr#56632, Zac Dover)
doc/rados/configuration/bluestore-config-ref: Fix lowcase typo (pr#54695, Adam Kupczyk)
doc/rados/configuration/osd-config-ref: fix typo (pr#55679, Pierre Riteau)
doc/rados/operations: add EC overhead table to erasure-code.rst (pr#55245, Anthony D\'Atri)
doc/rados/operations: document ceph balancer status detail
(pr#55264, Laura Flores)
doc/rados/operations: Fix off-by-one errors in control.rst (pr#55232, tobydarling)
doc/rados/operations: Improve crush_location docs (pr#56595, Niklas Hambüchen)
doc/rados/operations: Improve health-checks.rst (pr#59584, Anthony D\'Atri)
doc/rados/operations: remove vanity cluster name reference from crush… (pr#58949, Anthony D\'Atri)
doc/rados/operations: rephrase OSDs peering (pr#57158, Piotr Parczewski)
doc/rados: add \\"change public network\\" procedure (pr#55800, Zac Dover)
doc/rados: add \\"pgs not deep scrubbed in time\\" info (pr#59735, Zac Dover)
doc/rados: add bucket rename command (pr#57028, Zac Dover)
doc/rados: add confval directives to health-checks (pr#59873, Zac Dover)
doc/rados: add link to messenger v2 info in mon-lookup-dns.rst (pr#59796, Zac Dover)
doc/rados: add link to pg blog post (pr#55612, Zac Dover)
doc/rados: add options to network config ref (pr#57917, Zac Dover)
doc/rados: add osd_deep_scrub_interval setting operation (pr#59804, Zac Dover)
doc/rados: add PG definition (pr#55631, Zac Dover)
doc/rados: add pg-states and pg-concepts to tree (pr#58051, Zac Dover)
doc/rados: add stop monitor command (pr#57852, Zac Dover)
doc/rados: add stretch_rule workaround (pr#58183, Zac Dover)
doc/rados: credit Prashant for a procedure (pr#58259, Zac Dover)
doc/rados: document manually passing search domain (pr#58433, Zac Dover)
doc/rados: document unfound object cache-tiering scenario (pr#59382, Zac Dover)
doc/rados: edit \\"client can\'t connect...\\" (pr#54655, Zac Dover)
doc/rados: edit \\"Everything Failed! Now What?\\" (pr#54666, Zac Dover)
doc/rados: edit \\"monitor store failures\\" (pr#54660, Zac Dover)
doc/rados: edit \\"Placement Groups Never Get Clean\\" (pr#60048, Zac Dover)
doc/rados: edit \\"recovering broken monmap\\" (pr#54602, Zac Dover)
doc/rados: edit \\"troubleshooting-mon\\" (pr#54503, Zac Dover)
doc/rados: edit \\"understanding mon_status\\" (pr#54580, Zac Dover)
doc/rados: edit \\"Using the Monitor\'s Admin Socket\\" (pr#54577, Zac Dover)
doc/rados: edit t-mon \\"common issues\\" (1 of x) (pr#54419, Zac Dover)
doc/rados: edit t-mon \\"common issues\\" (2 of x) (pr#54422, Zac Dover)
doc/rados: edit t-mon \\"common issues\\" (3 of x) (pr#54439, Zac Dover)
doc/rados: edit t-mon \\"common issues\\" (4 of x) (pr#54444, Zac Dover)
doc/rados: edit t-mon \\"common issues\\" (5 of x) (pr#54456, Zac Dover)
doc/rados: edit t-mon.rst text (pr#54350, Zac Dover)
doc/rados: edit t-shooting-mon.rst (pr#54428, Zac Dover)
doc/rados: edit troubleshooting-osd.rst (pr#58273, Zac Dover)
doc/rados: edit troubleshooting-pg.rst (pr#54229, Zac Dover)
doc/rados: explain replaceable parts of command (pr#58061, Zac Dover)
doc/rados: fix broken links (pr#55681, Zac Dover)
doc/rados: fix outdated value for ms_bind_port_max (pr#57049, Pierre Riteau)
doc/rados: followup to PR#58057 (pr#58163, Zac Dover)
doc/rados: format \\"initial troubleshooting\\" (pr#54478, Zac Dover)
doc/rados: format Q&A list in t-mon.rst (pr#54346, Zac Dover)
doc/rados: format Q&A list in tshooting-mon.rst (pr#54367, Zac Dover)
doc/rados: format sections in tshooting-mon.rst (pr#54639, Zac Dover)
doc/rados: improve \\"Ceph Subsystems\\" (pr#54703, Zac Dover)
doc/rados: improve \\"scrubbing\\" explanation (pr#54271, Zac Dover)
doc/rados: improve formatting of log-and-debug.rst (pr#54747, Zac Dover)
doc/rados: improve leader/peon monitor explanation (pr#57960, Zac Dover)
doc/rados: link to pg setting commands (pr#55937, Zac Dover)
doc/rados: ops/pgs: s/power of 2/power of two (pr#54701, Zac Dover)
doc/rados: parallelize t-mon headings (pr#54462, Zac Dover)
doc/rados: PR#57022 unfinished business (pr#57266, Zac Dover)
doc/rados: remove dual-stack docs (pr#57074, Zac Dover)
doc/rados: remove PGcalc from docs (pr#55902, Zac Dover)
doc/rados: remove redundant pg repair commands (pr#57041, Zac Dover)
doc/rados: repair stretch-mode.rst (pr#54763, Zac Dover)
doc/rados: restore PGcalc tool (pr#56058, Zac Dover)
doc/rados: revert \\"doc/rados/operations: document ceph balancer status detail
\\" (pr#55359, Laura Flores)
doc/rados: s/cepgsqlite/cephsqlite/ (pr#57248, Zac Dover)
doc/rados: standardize markup of \\"clean\\" (pr#60502, Zac Dover)
doc/rados: update \\"stretch mode\\" (pr#54757, Michael Collins)
doc/rados: update common.rst (pr#56269, Zac Dover)
doc/rados: update config for autoscaler (pr#55439, Zac Dover)
doc/rados: update how to install c++ header files (pr#58309, Pere Diaz Bou)
doc/rados: update PG guidance (pr#55461, Zac Dover)
doc/radosgw - edit admin.rst \\"set user rate limit\\" (pr#55151, Zac Dover)
doc/radosgw/admin.rst: use underscores in config var names (pr#54934, Ville Ojamo)
doc/radosgw/multisite: fix Configuring Secondary Zones -> Updating the Period (pr#60334, Casey Bodley)
doc/radosgw: add confval directives (pr#55485, Zac Dover)
doc/radosgw: add gateway starting command (pr#54834, Zac Dover)
doc/radosgw: admin.rst - edit \\"Create a Subuser\\" (pr#55021, Zac Dover)
doc/radosgw: admin.rst - edit \\"Create a User\\" (pr#55005, Zac Dover)
doc/radosgw: admin.rst - edit sections (pr#55018, Zac Dover)
doc/radosgw: disambiguate version-added remarks (pr#57142, Zac Dover)
doc/radosgw: edit \\"Add/Remove a Key\\" (pr#55056, Zac Dover)
doc/radosgw: edit \\"Enable/Disable Bucket Rate Limit\\" (pr#55261, Zac Dover)
doc/radosgw: edit \\"read/write global rate limit\\" admin.rst (pr#55272, Zac Dover)
doc/radosgw: edit \\"remove a subuser\\" (pr#55035, Zac Dover)
doc/radosgw: edit \\"Usage\\" admin.rst (pr#55322, Zac Dover)
doc/radosgw: edit admin.rst \\"Get Bucket Rate Limit\\" (pr#55254, Zac Dover)
doc/radosgw: edit admin.rst \\"get user rate limit\\" (pr#55158, Zac Dover)
doc/radosgw: edit admin.rst \\"set bucket rate limit\\" (pr#55243, Zac Dover)
doc/radosgw: edit admin.rst - quota (pr#55083, Zac Dover)
doc/radosgw: edit admin.rst 1 of x (pr#55001, Zac Dover)
doc/radosgw: edit compression.rst (pr#54986, Zac Dover)
doc/radosgw: edit front matter - role.rst (pr#54855, Zac Dover)
doc/radosgw: edit multisite.rst (pr#55672, Zac Dover)
doc/radosgw: edit sections (pr#55028, Zac Dover)
doc/radosgw: fix formatting (pr#54754, Zac Dover)
doc/radosgw: Fix JSON typo in Principal Tag example code snippet (pr#54643, Daniel Parkes)
doc/radosgw: fix verb disagreement - index.html (pr#55339, Zac Dover)
doc/radosgw: format \\"Create a Role\\" (pr#54887, Zac Dover)
doc/radosgw: format commands in role.rst (pr#54906, Zac Dover)
doc/radosgw: format POST statements (pr#54850, Zac Dover)
doc/radosgw: Improve dynamicresharding.rst (pr#54369, Anthony D\'Atri)
doc/radosgw: Revert \\"doc/rgw/lua: add info uploading a (pr#55526, Zac Dover)
doc/radosgw: update link in rgw-cache.rst (pr#54806, Zac Dover)
doc/radosgw: update S3 action list (pr#57366, Zac Dover)
doc/radosgw: use \'confval\' directive for reshard config options (pr#57025, Casey Bodley)
doc/radosrgw: edit admin.rst (pr#55074, Zac Dover)
doc/rbd/rbd-exclusive-locks: mention incompatibility with advisory locks (pr#58865, Ilya Dryomov)
doc/rbd: \\"rbd flatten\\" doesn\'t take encryption options in quincy (pr#56272, Ilya Dryomov)
doc/rbd: add namespace information for mirror commands (pr#60271, N Balachandran)
doc/rbd: minor changes to the rbd man page (pr#56257, N Balachandran)
doc/README.md - add ordered list (pr#59800, Zac Dover)
doc/README.md: create selectable commands (pr#59836, Zac Dover)
doc/README.md: edit \\"Build Prerequisites\\" (pr#59639, Zac Dover)
doc/README.md: improve formatting (pr#59702, Zac Dover)
doc/rgw/d3n: pass cache dir volume to extra_container_args (pr#59769, Mark Kogan)
doc/rgw/notification: persistent notification queue full behavior (pr#59235, Yuval Lifshitz)
doc/rgw/notifications: specify which event types are enabled by default (pr#54501, Yuval Lifshitz)
doc/rgw: edit admin.rst - rate limit management (pr#55129, Zac Dover)
doc/rgw: fix Attributes index in CreateTopic example (pr#55433, Casey Bodley)
doc/security: remove old GPG information (pr#56915, Zac Dover)
doc/security: update CVE list (pr#57019, Zac Dover)
doc/src: add inline literals (``) to variables (pr#57938, Zac Dover)
doc/src: invadvisable is not a word (pr#58191, Doug Whitfield)
doc/start: Add Beginner\'s Guide (pr#57823, Zac Dover)
doc/start: add links to Beginner\'s Guide (pr#58204, Zac Dover)
doc/start: add Slack invite link (pr#56042, Zac Dover)
doc/start: add vstart install guide (pr#60463, Zac Dover)
doc/start: Edit Beginner\'s Guide (pr#57846, Zac Dover)
doc/start: explain \\"OSD\\" (pr#54560, Zac Dover)
doc/start: fix typo in hardware-recommendations.rst (pr#54481, Anthony D\'Atri)
doc/start: fix wording & syntax (pr#58365, Piotr Parczewski)
doc/start: improve MDS explanation (pr#56467, Zac Dover)
doc/start: improve MDS explanation (pr#56427, Zac Dover)
doc/start: link to mon map command (pr#56411, Zac Dover)
doc/start: remove \\"intro.rst\\" (pr#57950, Zac Dover)
doc/start: remove mention of Centos 8 support (pr#58391, Zac Dover)
doc/start: s/http/https/ in links (pr#57872, Zac Dover)
doc/start: s/intro.rst/index.rst/ (pr#57904, Zac Dover)
doc/start: update mailing list links (pr#58685, Zac Dover)
doc/start: update release names (pr#54573, Zac Dover)
doc: add description of metric fields for cephfs-top (pr#55512, Neeraj Pratap Singh)
doc: add supported file types in cephfs-mirroring.rst (pr#54823, Jos Collin)
doc: Amend dev mailing list subscribe instructions (pr#58698, Paulo E. Castro)
doc: cephadm/services/osd: fix typo (pr#56231, Lorenz Bausch)
doc: clarify availability vs integrity (pr#58132, Gregory O\'Neill)
doc: clarify superuser note for ceph-fuse (pr#58616, Patrick Donnelly)
doc: clarify use of location: in host spec (pr#57648, Matthew Vernon)
doc: Correct link to \\"Device management\\" (pr#58490, Matthew Vernon)
doc: Correct link to Prometheus docs (pr#59561, Matthew Vernon)
doc: correct typo (pr#57885, Matthew Vernon)
doc: discuss the standard multi-tenant CephFS security model (pr#53559, Greg Farnum)
doc: Document the Windows CI job (pr#60035, Lucian Petrut)
doc: documenting the feature that scrub clear the entries from damage… (pr#59080, Neeraj Pratap Singh)
doc: explain the consequence of enabling mirroring through monitor co… (pr#60527, Jos Collin)
doc: fix email (pr#60235, Ernesto Puerta)
doc: fix typo (pr#59993, N Balachandran)
doc: Fixes two typos and grammatical errors. Signed-off-by: Sina Ahma… (pr#54776, Sina Ahmadi)
doc: Improve doc/radosgw/placement.rst (pr#58975, Anthony D\'Atri)
doc: specify correct fs type for mkfs (pr#55283, Vladislav Glagolev)
doc: SubmittingPatches-backports - remove backports team (pr#60299, Zac Dover)
doc: Update \\"Getting Started\\" to link to start not install (pr#59909, Matthew Vernon)
doc: Update dynamicresharding.rst (pr#54330, Aliaksei Makarau)
doc: update rgw admin api req params for get user info (pr#55072, Ali Maredia)
doc: update tests-integration-testing-teuthology-workflow.rst (pr#59550, Vallari Agrawal)
doc:start.rst fix typo in hw-recs (pr#55506, Eduardo Roldan)
doc:update e-mail addresses governance (pr#60086, Tobias Fischer)
docs/rados/operations/stretch-mode: warn device class is not supported (pr#59101, Kamoltat Sirivadhna)
docs/rados: remove incorrect ceph command (pr#56496, Taha Jahangir)
docs/radosgw: edit admin.rst \\"enable/disable user rate limit\\" (pr#55195, Zac Dover)
docs/rbd: fix typo in arg name (pr#56263, N Balachandran)
docs: Add information about OpenNebula integration (pr#54939, Daniel Clavijo)
docs: removed centos 8 and added squid to the build matrix (pr#58903, Yuri Weinstein)
global: Call getnam_r with a 64KiB buffer on the heap (pr#60124, Adam Emerson)
install-deps.sh, do_cmake.sh: almalinux is another el flavour (pr#58523, Dan van der Ster)
install-deps: save and restore user\'s XDG_CACHE_HOME (pr#56991, luo rixin)
kv/RocksDBStore: Configure compact-on-deletion for all CFs (pr#57404, Joshua Baergen)
librados: make querying pools for selfmanaged snaps reliable (pr#55025, Ilya Dryomov)
librados: use CEPH_OSD_FLAG_FULL_FORCE for IoCtxImpl::remove (pr#59283, Chen Yuanrun)
librbd/crypto: fix issue when live-migrating from encrypted export (pr#59144, Ilya Dryomov)
librbd/migration: prune snapshot extents in RawFormat::list_snaps() (pr#59659, Ilya Dryomov)
librbd: account for discards that truncate in ObjectListSnapsRequest (pr#56212, Ilya Dryomov)
librbd: Append one journal event per image request (pr#54819, Ilya Dryomov, Joshua Baergen)
librbd: create rbd_trash object during pool initialization and namespace creation (pr#57604, Ramana Raja)
librbd: diff-iterate shouldn\'t crash on an empty byte range (pr#58210, Ilya Dryomov)
librbd: disallow group snap rollback if memberships don\'t match (pr#58208, Ilya Dryomov)
librbd: don\'t crash on a zero-length read if buffer is NULL (pr#57569, Ilya Dryomov)
librbd: don\'t report HOLE_UPDATED when diffing against a hole (pr#54950, Ilya Dryomov)
librbd: fix regressions in ObjectListSnapsRequest (pr#54861, Ilya Dryomov)
librbd: fix split() for SparseExtent and SparseBufferlistExtent (pr#55664, Ilya Dryomov)
librbd: improve rbd_diff_iterate2() performance in fast-diff mode (pr#55257, Ilya Dryomov)
librbd: make diff-iterate in fast-diff mode aware of encryption (pr#58342, Ilya Dryomov)
librbd: make group and group snapshot IDs more random (pr#57090, Ilya Dryomov)
librbd: return ENOENT from Snapshot::get_timestamp for nonexistent snap_id (pr#55473, John Agombar)
librgw: teach librgw about rgw_backend_store (pr#59315, Matt Benjamin)
log: Make log_max_recent have an effect again (pr#48310, Joshua Baergen)
make-dist: don\'t use --continue option for wget (pr#55092, Casey Bodley)
MClientRequest: properly handle ceph_mds_request_head_legacy for ext_num_retry, ext_num_fwd, owner_uid, owner_gid (pr#54411, Alexander Mikhalitsyn)
mds,qa: some balancer debug messages (<=5) not printed when debug_mds is >=5 (pr#53551, Patrick Donnelly)
mds/MDBalancer: ignore queued callbacks if MDS is not active (pr#54494, Leonid Usov)
mds/MDSRank: Add set_history_slow_op_size_and_threshold for op_tracker (pr#53358, Yite Gu)
mds: add a command to dump directory information (pr#55986, Jos Collin, Zhansong Gao)
mds: add debug logs during setxattr ceph.dir.subvolume (pr#56061, Milind Changire)
mds: adjust pre_segments_size for MDLog when trimming segments for st… (issue#59833, pr#54034, Venky Shankar)
mds: allow lock state to be LOCK_MIX_SYNC in replica for filelock (pr#56050, Xiubo Li)
mds: change priority of mds rss perf counter to useful (pr#55058, sp98)
mds: disable `defer_client_eviction_on_laggy_osds\' by default (issue#64685, pr#56195, Venky Shankar)
mds: do not simplify fragset (pr#54892, Milind Changire)
mds: do remove the cap when seqs equal or larger than last issue (pr#58296, Xiubo Li)
mds: dump locks when printing mutation ops (pr#52976, Patrick Donnelly)
mds: ensure next replay is queued on req drop (pr#54315, Patrick Donnelly)
mds: fix session/client evict command (issue#68132, pr#58724, Venky Shankar, Neeraj Pratap Singh)
mds: log message when exiting due to asok command (pr#53549, Patrick Donnelly)
mds: prevent scrubbing for standby-replay MDS (pr#58799, Neeraj Pratap Singh)
mds: replacing bootstrap session only if handle client session message (pr#53363, Mer Xuanyi)
mds: revert standby-replay trimming changes (pr#54717, Patrick Donnelly)
mds: set the correct WRLOCK flag always in wrlock_force() (pr#58773, Xiubo Li)
mds: set the loner to true for LOCK_EXCL_XSYN (pr#54910, Xiubo Li)
mds: try to choose a new batch head in request_clientup() (pr#58843, Xiubo Li)
mds: use variable g_ceph_context directly in MDSAuthCaps (pr#52820, Rishabh Dave)
MDSAuthCaps: print better error message for perm flag in MDS caps (pr#54946, Rishabh Dave)
mgr/BaseMgrModule: Optimize CPython Call in Finish Function (pr#57585, Nitzan Mordechai)
mgr/cephadm: Add \\"networks\\" parameter to orch apply rgw (pr#55318, Teoman ONAY)
mgr/cephadm: add \\"original_weight\\" parameter to OSD class (pr#59412, Adam King)
mgr/cephadm: add ability for haproxy, prometheus, grafana to bind on specific ip (pr#58753, Adam King)
mgr/cephadm: add is_host\\\\_functions to HostCache (pr#55964, Adam King)
mgr/cephadm: Adding extra arguments support for RGW frontend (pr#55963, Adam King, Redouane Kachach)
mgr/cephadm: allow draining host without removing conf/keyring files (pr#55973, Adam King)
mgr/cephadm: catch CancelledError in asyncio timeout handler (pr#56086, Adam King)
mgr/cephadm: ceph orch add fails when ipv6 address is surrounded by square brackets (pr#56079, Teoman ONAY)
mgr/cephadm: cleanup iscsi keyring upon daemon removal (pr#58757, Adam King)
mgr/cephadm: don\'t use image tag in orch upgrade ls (pr#55974, Adam King)
mgr/cephadm: fix flake8 test failures (pr#58077, Nizamudeen A)
mgr/cephadm: fix placement with label and host pattern (pr#56088, Adam King)
mgr/cephadm: fix reweighting of OSD when OSD removal is stopped (pr#56083, Adam King)
mgr/cephadm: Fix unfound progress events (pr#58758, Prashant D)
mgr/cephadm: fixups for asyncio based timeout (pr#55556, Adam King)
mgr/cephadm: make client-keyring deploying ceph.conf optional (pr#58754, Adam King)
mgr/cephadm: make setting --cgroups=split configurable for adopted daemons (pr#58759, Gilad Sid)
mgr/cephadm: pick correct IPs for ingress service based on VIP (pr#55970, Redouane Kachach, Adam King)
mgr/cephadm: refresh public_network for config checks before checking (pr#56492, Adam King)
mgr/cephadm: support for regex based host patterns (pr#56222, Adam King)
mgr/cephadm: support for removing host entry from crush map during host removal (pr#56081, Adam King)
mgr/cephadm: update timestamp on repeat daemon/service events (pr#56080, Adam King)
mgr/dashboard/frontend:Ceph dashboard supports multiple languages (pr#56360, TomNewChao)
mgr/dashboard: add Table Schema to grafonnet (pr#56737, Aashish Sharma)
mgr/dashboard: allow tls 1.2 with a config option (pr#53779, Nizamudeen A)
mgr/dashboard: change deprecated grafana URL in daemon logs (pr#55545, Nizamudeen A)
mgr/dashboard: Consider null values as zero in grafana panels (pr#54540, Aashish Sharma)
mgr/dashboard: debugging make check failure (pr#56128, Nizamudeen A)
mgr/dashboard: disable dashboard v3 in quincy (pr#54250, Nizamudeen A)
mgr/dashboard: exclude cloned-deleted RBD snaps (pr#57221, Ernesto Puerta)
mgr/dashboard: fix duplicate grafana panels when on mgr failover (pr#56930, Avan Thakkar)
mgr/dashboard: fix duplicate grafana panels when on mgr failover (pr#56270, Avan Thakkar)
mgr/dashboard: fix e2e failure related to landing page (pr#55123, Pedro Gonzalez Gomez)
mgr/dashboard: fix error while accessing roles tab when policy attached (pr#55516, Nizamudeen A, Afreen)
mgr/dashboard: fix rgw port manipulation error in dashboard (pr#54176, Nizamudeen A)
mgr/dashboard: fix the jsonschema issue in install-deps (pr#55543, Nizamudeen A)
mgr/dashboard: get rgw port from ssl_endpoint (pr#55248, Nizamudeen A)
mgr/dashboard: make ceph logo redirect to dashboard (pr#56558, Afreen)
mgr/dashboard: rbd image hide usage bar when disk usage is not provided (pr#53809, Pedro Gonzalez Gomez)
mgr/dashboard: remove green tick on old password field (pr#53385, Nizamudeen A)
mgr/dashboard: remove unnecessary failing hosts e2e (pr#53459, Pedro Gonzalez Gomez)
mgr/dashboard: replace deprecated table panel in grafana with a newer table panel (pr#56680, Aashish Sharma)
mgr/dashboard: replace piechart plugin charts with native pie chart panel (pr#56655, Aashish Sharma)
mgr/dashboard: rm warning/error threshold for cpu usage (pr#56441, Nizamudeen A)
mgr/dashboard: sanitize dashboard user creation (pr#56551, Pedro Gonzalez Gomez)
mgr/dashboard: Show the OSDs Out and Down panels as red whenever an OSD is in Out or Down state in Ceph Cluster grafana dashboard (pr#54539, Aashish Sharma)
mgr/dashboard: upgrade from old \'graph\' type panels to the new \'timeseries\' panel (pr#56653, Aashish Sharma)
mgr/k8sevents: update V1Events to CoreV1Events (pr#57995, Nizamudeen A)
mgr/Mgr.cc: clear daemon health metrics instead of removing down/out osd from daemon state (pr#58512, Cory Snyder)
mgr/nfs: Don\'t crash ceph-mgr if NFS clusters are unavailable (pr#58284, Anoop C S, Ponnuvel Palaniyappan)
mgr/pg_autoscaler: add check for norecover flag (pr#57568, Aishwarya Mathuria)
mgr/prometheus: s/pkg_resources.packaging/packaging/ (pr#58627, Adam King, Kefu Chai)
mgr/rbd_support: fix recursive locking on CreateSnapshotRequests lock (pr#54290, Ramana Raja)
mgr/rest: Trim requests array and limit size (pr#59370, Nitzan Mordechai)
mgr/snap_schedule: add support for monthly snapshots (pr#54894, Milind Changire)
mgr/snap_schedule: make fs argument mandatory if more than one filesystem exists (pr#54090, Milind Changire)
mgr/snap_schedule: restore yearly spec to lowercase y (pr#57445, Milind Changire)
mgr/snap_schedule: support subvol and group arguments (pr#55210, Milind Changire)
mgr/stats: initialize mx_last_updated in FSPerfStats (pr#57442, Jos Collin)
mgr/vol: handle case where clone index entry goes missing (pr#58558, Rishabh Dave)
mgr/volumes: fix subvolume group rm
error message (pr#54206, neeraj pratap singh, Neeraj Pratap Singh)
mgr: add throttle policy for DaemonServer (pr#54012, ericqzhao)
mgr: don\'t dump global config holding gil (pr#50193, Mykola Golub)
mgr: fix a race condition in DaemonServer::handle_report() (pr#54555, Radoslaw Zarzynski)
mgr: remove out&down osd from mgr daemons (pr#54534, shimin)
mon/ConfigMonitor: Show localized name in \\"config dump --format json\\" output (pr#53886, Sridhar Seshasayee)
mon/ConnectionTracker.cc: disregard connection scores from mon_rank = -1 (pr#55166, Kamoltat)
mon/LogMonitor: Use generic cluster log level config (pr#57521, Prashant D)
mon/MonClient: handle ms_handle_fast_authentication return (pr#59308, Patrick Donnelly)
mon/Monitor: during shutdown don\'t accept new authentication and crea… (pr#55597, Nitzan Mordechai)
mon/OSDMonitor: Add force-remove-snap mon command (pr#59403, Matan Breizman)
mon/OSDMonitor: fix get_min_last_epoch_clean() (pr#55868, Matan Breizman, Adam C. Emerson)
mon/OSDMonitor: fix rmsnap command (pr#56430, Matan Breizman)
mon: add exception handling to ceph health mute (pr#55117, Daniel Radjenovic)
mon: add proxy to cache tier options (pr#50551, tan changzhi)
mon: fix health store size growing infinitely (pr#55549, Wei Wang)
mon: fix inconsistencies in class param (pr#59278, Victoria Mackie)
mon: fix mds metadata lost in one case (pr#54317, shimin)
mon: stuck peering since warning is misleading (pr#57407, shreyanshjain7174)
msg/async: Encode message once features are set (pr#59442, Aishwarya Mathuria)
msg/AsyncMessenger: re-evaluate the stop condition when woken up in \'wait()\' (pr#53718, Leonid Usov)
msg: update MOSDOp() to use ceph_tid_t instead of long (pr#55425, Lucian Petrut)
nofail option in fstab not supported (pr#52986, Leonid Usov)
os/bluestore: allow use BtreeAllocator (pr#59498, tan changzhi)
os/bluestore: enable async manual compactions (pr#58742, Igor Fedotov)
os/bluestore: expand BlueFS log if available space is insufficient (pr#57243, Pere Diaz Bou)
os/bluestore: fix crash caused by dividing by 0 (pr#57198, Jrchyang Yu)
os/bluestore: fix free space update after bdev-expand in NCB mode (pr#55776, Igor Fedotov)
os/bluestore: fix the problem of l_bluefs_log_compactions double recording (pr#57196, Wang Linke)
os/bluestore: get rid off resulting lba alignment in allocators (pr#54877, Igor Fedotov)
os/bluestore: set rocksdb iterator bounds for Bluestore::_collection_list() (pr#57622, Cory Snyder)
os/bluestore: Warning added for slow operations and stalled read (pr#59468, Md Mahamudur Rahaman Sajib)
os/store_test: Retune tests to current code (pr#56138, Adam Kupczyk)
os: introduce ObjectStore::refresh_perf_counters() method (pr#55133, Igor Fedotov)
osd/ECTransaction: Remove incorrect asserts in generate_transactions (pr#59132, Mark Nelson)
osd/OSD: introduce reset_purged_snaps_last (pr#53973, Matan Breizman)
osd/OSDMap: Check for uneven weights & != 2 buckets post stretch mode (pr#52458, Kamoltat)
osd/scrub: increasing max_osd_scrubs to 3 (pr#55174, Ronen Friedman)
osd/SnapMapper: fix _lookup_purged_snap (pr#56815, Matan Breizman)
osd/TrackedOp: Fix TrackedOp event order (pr#59109, YiteGu)
osd: always send returnvec-on-errors for client\'s retry (pr#59378, Radoslaw Zarzynski)
osd: avoid watcher remains after \\"rados watch\\" is interrupted (pr#58845, weixinwei)
osd: bring the missed fmt::formatter for snapid_t to address FTBFS (pr#54175, Radosław Zarzyński)
osd: CEPH_OSD_OP_FLAG_BYPASS_CLEAN_CACHE flag is passed from ECBackend (pr#57620, Md Mahamudur Rahaman Sajib)
osd: do not assert on fast shutdown timeout (pr#55134, Igor Fedotov)
osd: don\'t require RWEXCL lock for stat+write ops (pr#54594, Alice Zhao)
osd: ensure async recovery does not drop a pg below min_size (pr#54549, Samuel Just)
osd: fix for segmentation fault on OSD fast shutdown (pr#57614, Md Mahamudur Rahaman Sajib)
osd: fix use-after-move in build_incremental_map_msg() (pr#54269, Ronen Friedman)
osd: improve OSD robustness (pr#54785, Igor Fedotov)
osd: log the number of extents for sparse read (pr#54605, Xiubo Li)
osd: make _set_cache_sizes ratio aware of cache_kv_onode_ratio (pr#55235, Raimund Sacherer)
osd: Report health error if OSD public address is not within subnet (pr#55698, Prashant D)
override client features (pr#58227, Patrick Donnelly)
pybind/mgr/devicehealth: replace SMART data if exists for same DATETIME (pr#54880, Patrick Donnelly)
pybind/mgr/devicehealth: skip legacy objects that cannot be loaded (pr#56480, Patrick Donnelly)
pybind/mgr/mirroring: drop mon_host from peer_list (pr#55238, Jos Collin)
pybind/mgr/pg_autoscaler: Cut back osdmap.get_pools calls (pr#54904, Kamoltat)
pybind/mgr/volumes: log mutex locks to help debug deadlocks (pr#53917, Kotresh HR)
pybind/mgr: disable sqlite3/python autocommit (pr#57199, Patrick Donnelly)
pybind/mgr: reopen database handle on blocklist (pr#52461, Patrick Donnelly)
pybind/rbd: don\'t produce info on errors in aio_mirror_image_get_info() (pr#54054, Ilya Dryomov)
pybind/rbd: expose CLONE_FORMAT and FLATTEN image options (pr#57308, Ilya Dryomov)
python-common/drive_group: handle fields outside of \'spec\' even when \'spec\' is provided (pr#55962, Adam King)
python-common/drive_selection: fix limit with existing devices (pr#56085, Adam King)
python-common/drive_selection: lower log level of limit policy message (pr#55961, Adam King)
python-common: fix osdspec_affinity check (pr#56084, Guillaume Abrioux)
python-common: handle \\"anonymous_access: false\\" in to_json of Grafana spec (pr#58756, Adam King)
qa/cephadm: testing for extra daemon/container features (pr#55958, Adam King)
qa/cephfs: add mgr debugging (pr#56417, Patrick Donnelly)
qa/cephfs: add probabilistic ignorelist for pg_health (pr#56667, Patrick Donnelly)
qa/cephfs: CephFSTestCase.create_client() must keyring (pr#56837, Rishabh Dave)
qa/cephfs: fix build failure for mdtest project (pr#53826, Rishabh Dave)
qa/cephfs: fix ior project build failure (pr#53824, Rishabh Dave)
qa/cephfs: handle non-numeric values for json.loads() (pr#54187, Rishabh Dave)
qa/cephfs: ignorelist clog of MDS_UP_LESS_THAN_MAX (pr#56404, Patrick Donnelly)
qa/cephfs: no reliance on centos (pr#59037, Venky Shankar)
qa/cephfs: switch to python3 for centos stream 9 (pr#53626, Xiubo Li)
qa/distros: backport update from rhel 8.4 -> 8.6 (pr#54902, David Galloway)
qa/distros: replace centos 8 references with centos 9 in the rados suite (pr#58520, Laura Flores)
qa/orch: drop centos 8 and rhel 8.6 for orch suite tests (pr#58769, Adam King, Laura Flores, Guillaume Abrioux, Casey Bodley)
qa/rgw: adapt tests to centos 9 (pr#58601, Mark Kogan, Casey Bodley, Ali Maredia, Yuval Lifshitz)
qa/rgw: barbican uses branch stable/2023.1 (pr#56818, Casey Bodley)
qa/suites/fs/nfs: use standard health ignorelist (pr#56393, Patrick Donnelly)
qa/suites/fs: skip check-counters for iogen workload (pr#58278, Ramana Raja)
qa/suites/krbd: drop pre-single-major and move \\"layering only\\" coverage (pr#57463, Ilya Dryomov)
qa/suites/krbd: stress test for recovering from watch errors for -o exclusive (pr#58855, Ilya Dryomov)
qa/suites/rados/singleton: add POOL_APP_NOT_ENABLED to ignorelist (pr#57488, Laura Flores)
qa/suites/rbd/iscsi: enable all supported container hosts (pr#60087, Ilya Dryomov)
qa/suites/rbd: add test to check rbd_support module recovery (pr#54292, Ramana Raja)
qa/suites/rbd: override extra_system_packages directly on install task (pr#57764, Ilya Dryomov)
qa/suites/upgrade/quincy-p2p: run librbd python API tests from quincy tip (pr#55554, Yuri Weinstein)
qa/suites: add \\"mon down\\" log variations to ignorelist (pr#58762, Laura Flores)
qa/suites: drop --show-reachable=yes from fs:valgrind tests (pr#59067, Jos Collin)
qa/tasks/ceph_manager.py: Rewrite test_pool_min_size (pr#55882, Kamoltat)
qa/tasks/cephfs/test_misc: switch duration to timeout (pr#55745, Xiubo Li)
qa/tasks/qemu: Fix OS version comparison (pr#58169, Zack Cerza)
qa/test_nfs: fix test failure when cluster does not exist (pr#56753, John Mulligan)
qa/tests: added client-upgrade-quincy-squid tests (pr#58445, Yuri Weinstein)
qa/workunits/rados: enable crb and install generic package for c9 (pr#59330, Laura Flores)
qa/workunits/rbd/cli_generic.sh: narrow race window when checking that rbd_support module command fails after blocklisting the module\'s client (pr#54770, Ramana Raja)
qa/workunits/rbd: avoid caching effects in luks-encryption.sh (pr#58852, Ilya Dryomov, Or Ozeri)
qa/workunits: fix test_dashboard_e2e.sh: no spec files found (pr#53857, Nizamudeen A)
qa: account for rbd_trash object in krbd_data_pool.sh + related ceph{,adm} task fixes (pr#58539, Ilya Dryomov)
qa: add a YAML to ignore MGR_DOWN warning (pr#57564, Dhairya Parmar)
qa: add diff-continuous and compare-mirror-image tests to rbd and krbd suites respectively (pr#55929, Ramana Raja)
qa: Add tests to validate synced images on rbd-mirror (pr#55763, Ilya Dryomov, Ramana Raja)
qa: adjust expected io_opt in krbd_discard_granularity.t (pr#59230, Ilya Dryomov)
qa: assign file system affinity for replaced MDS (issue#61764, pr#54038, Venky Shankar)
qa: barbican: restrict python packages with upper-constraints (pr#59325, Tobias Urdin)
qa: bump up scrub status command timeout (pr#55916, Milind Changire)
qa: cleanup snapshots before subvolume delete (pr#58333, Milind Changire)
qa: correct usage of DEBUGFS_META_DIR in dedent (pr#56166, Venky Shankar)
qa: fix error reporting string in assert_cluster_log (pr#55392, Dhairya Parmar)
qa: Fix fs/full suite (pr#55828, Kotresh HR)
qa: fix krbd_msgr_segments and krbd_rxbounce failing on 8.stream (pr#57029, Ilya Dryomov)
qa: fix rank_asok() to handle errors from asok commands (pr#55301, Neeraj Pratap Singh)
qa: ignore container checkpoint/restore related selinux denials for c… (issue#67119, issue#66640, pr#58807, Venky Shankar)
qa: increase the http postBuffer size and disable sslVerify (pr#53629, Xiubo Li)
qa: lengthen shutdown timeout for thrashed MDS (pr#53554, Patrick Donnelly)
qa: move nfs (mgr/nfs) related tests to fs suite (pr#53907, Dhairya Parmar, Venky Shankar)
qa: remove error string checks and check w/ return value (pr#55944, Venky Shankar)
qa: remove vstart runner from radosgw_admin task (pr#55098, Ali Maredia)
qa: run kernel_untar_build with newer tarball (pr#54712, Milind Changire)
qa: set mds config with config set
for a particular test (issue#57087, pr#56168, Venky Shankar)
qa: unmount clients before damaging the fs (pr#57526, Patrick Donnelly)
qa: Wait for purge to complete (pr#53911, Kotresh HR)
rados: Set snappy as default value in ms_osd_compression_algorithm (pr#57406, shreyanshjain7174)
RadosGW API: incorrect bucket quota in response to HEAD /{bucket}/?usage (pr#53438, shreyanshjain7174)
radosgw-admin: don\'t crash on --placement-id without --storage-class (pr#53473, Casey Bodley)
radosgw-admin: fix segfault on pipe modify without source/dest zone specified (pr#51257, caisan)
rbd-mirror: clean up stale pool replayers and callouts better (pr#57305, Ilya Dryomov)
rbd-mirror: use correct ioctx for namespace (pr#59774, N Balachandran)
rbd-nbd: fix resize of images mapped using netlink (pr#55317, Ramana Raja)
rbd-nbd: fix stuck with disable request (pr#54255, Prasanna Kumar Kalever)
rbd: \\"rbd bench\\" always writes the same byte (pr#59500, Ilya Dryomov)
rbd: amend \\"rbd {group,} rename\\" and \\"rbd mirror pool\\" command descriptions (pr#59600, Ilya Dryomov)
Revert \\"exporter: user only counter dump/schema commands for extacting counters\\" (pr#54169, Casey Bodley)
Revert \\"quincy: ceph_fs.h: add separate owner\\\\_{u,g}id fields\\" (pr#54108, Venky Shankar)
RGW - Get quota on OPs with a bucket (pr#52935, Daniel Gryniewicz)
rgw : fix add initialization for RGWGC::process() (pr#59338, caolei)
rgw/admin/notifications: support admin operations on topics with tenants (pr#59322, Yuval Lifshitz)
rgw/amqp: store CA location string in connection object (pr#54170, Yuval Lifshitz)
rgw/auth/s3: validate x-amz-content-sha256 for empty payloads (pr#59359, Casey Bodley)
rgw/auth: Add service token support for Keystone auth (pr#54445, Tobias Urdin)
rgw/auth: Fix the return code returned by AuthStrategy (pr#54795, Pritha Srivastava)
rgw/auth: ignoring signatures for HTTP OPTIONS calls (pr#60458, Tobias Urdin)
rgw/beast: Enable SSL session-id reuse speedup mechanism (pr#56119, Mark Kogan)
rgw/crypt: apply rgw_crypt_default_encryption_key by default (pr#52795, Casey Bodley)
rgw/iam: admin/system users ignore iam policy parsing errors (pr#54842, Casey Bodley)
rgw/kafka/amqp: fix race conditionn in async completion handlers (pr#54737, Yuval Lifshitz)
rgw/kafka: remove potential race condition between creation and deletion of endpoint (pr#51797, Yuval Lifshitz)
rgw/kafka: set message timeout to 5 seconds (pr#56163, Yuval Lifshitz)
rgw/keystone: EC2Engine uses reject() for ERR_SIGNATURE_NO_MATCH (pr#53763, Casey Bodley)
rgw/keystone: use secret key from EC2 for sigv4 streaming mode (pr#57899, Casey Bodley)
rgw/lua: add lib64 to the package search path (pr#59342, Yuval Lifshitz)
rgw/lua: fix CopyFrom crash (pr#59336, Yuval Lifshitz)
rgw/multisite: fix sync_error_trim command (pr#59347, Shilpa Jagannath)
rgw/notification: Kafka persistent notifications not retried and removed even when the broker is down (pr#56145, kchheda3)
rgw/notification: remove non x-amz-meta-* attributes from bucket notifications (pr#53374, Juan Zhu)
rgw/notifications/test: fix rabbitmq and kafka issues in centos9 (pr#58313, Yuval Lifshitz)
rgw/notifications: cleanup all coroutines after sending the notification (pr#59353, Yuval Lifshitz)
rgw/putobj: RadosWriter uses part head object for multipart parts (pr#55622, Casey Bodley)
rgw/rest: fix url decode of post params for iam/sts/sns (pr#55357, Casey Bodley)
rgw/rgw-gap-list: refactoring and adding more error checking (pr#59320, Michael J. Kidd)
rgw/rgw-orphan-list: refactor and add more checks to the tool (pr#59321, Michael J. Kidd)
rgw/s3: DeleteObjects response uses correct delete_marker flag (pr#54165, Casey Bodley)
rgw/s3: ListObjectsV2 returns correct object owners (pr#54162, Casey Bodley)
rgw/sts: AssumeRole no longer writes to user metadata (pr#52049, Casey Bodley)
rgw/sts: changing identity to boost::none, when role policy (pr#59345, Pritha Srivastava)
rgw/sts: modify max_session_duration using update role REST API/ radosgw-admin command (pr#48082, Pritha Srivastava)
RGW/STS: when generating keys, take the trailing null character into account (pr#54128, Oguzhan Ozmen)
rgw/swift: preserve dashes/underscores in swift user metadata names (pr#56616, Juan Zhu, Ali Maredia)
rgw: 'bucket check' deletes index of multipart meta when its pending_map is nonempty (pr#54017, Huber-ming)
rgw: add crypt attrs for iam policy to PostObj and Init/CompleteMultipart (pr#59344, Casey Bodley)
rgw: add headers to guide cache update in 304 response (pr#55095, Casey Bodley, Ilsoo Byun)
rgw: Add missing empty checks to the split string in is_string_in_set() (pr#56348, Matt Benjamin)
rgw: add versioning info to radosgw-admin bucket stats output (pr#54190, J. Eric Ivancich, Cory Snyder)
rgw: address crash and race in RGWIndexCompletionManager (pr#50538, J. Eric Ivancich)
RGW: allow user disabling presigned urls in rgw configuration (pr#56447, Marc Singer)
rgw: avoid use-after-move in RGWDataSyncSingleEntryCR ctor (pr#59319, Casey Bodley)
rgw: beast frontend checks for local_endpoint() errors (pr#54166, Casey Bodley)
rgw: catches nobjects_begin() exceptions (pr#59360, lichaochao)
rgw: cmake configure error on fedora-37/rawhide (pr#59313, Kaleb S. KEITHLEY)
rgw: CopyObject works with x-amz-copy-source-if-* headers (pr#50519, Wang Hao)
rgw: d3n: fix valgrind reported leak related to libaio worker threads (pr#54851, Mark Kogan)
rgw: disable RGWDataChangesLog::add_entry() when log_data is off (pr#59314, Casey Bodley)
rgw: do not copy olh attributes in versioning suspended bucket (pr#55607, Juan Zhu)
rgw: Drain async_processor request queue during shutdown (pr#53471, Soumya Koduri)
rgw: Erase old storage class attr when the object is rewrited using r… (pr#50520, zhiming zhang)
rgw: Fix Browser POST content-length-range min value (pr#52937, Robin H. Johnson)
rgw: fix issue with concurrent versioned deletes leaving behind olh entries (pr#59357, Cory Snyder)
rgw: fix ListOpenIDConnectProviders XML format (pr#57131, caolei)
rgw: fix multipart upload object leaks due to re-upload (pr#51976, J. Eric Ivancich, Yixin Jin, Matt Benjamin, Daniel Gryniewicz)
rgw: fix rgw cache invalidation after unregister_watch() error (pr#54015, lichaochao)
rgw: Get canonical storage class when storage class is empty in (pr#59317, zhiming zhang)
rgw: handle old clients with transfer-encoding: chunked (pr#57133, Marcus Watts)
rgw: invalidate and retry keystone admin token (pr#59076, Tobias Urdin)
rgw: make incomplete multipart upload part of bucket check efficient (pr#57405, J. Eric Ivancich)
rgw: modify string match_wildcards with fnmatch (pr#57907, zhipeng li, Adam Emerson)
rgw: multisite data log flag not used (pr#52054, J. Eric Ivancich)
rgw: object lock avoids 32-bit truncation of RetainUntilDate (pr#54675, Casey Bodley)
rgw: remove potentially conflicting definition of dout_subsys (pr#53462, J. Eric Ivancich)
rgw: RGWSI_SysObj_Cache::remove() invalidates after successful delete (pr#55718, Casey Bodley)
rgw: s3 object lock avoids overflow in retention date (pr#52606, Casey Bodley)
rgw: set requestPayment in slave zone (pr#57149, Huber-ming)
rgw: SignatureDoesNotMatch for certain RGW Admin Ops endpoints w/v4 auth (pr#54792, David.Hall)
RGW: Solving the issue of not populating etag in Multipart upload result (pr#51446, Ali Masarwa)
rgw: swift: tempurl fixes for ceph (pr#59355, Casey Bodley, Adam Emerson, Marcus Watts)
rgw: Update \\"CEPH_RGW_DIR_SUGGEST_LOG_OP\\" for remove entries (pr#50539, Soumya Koduri)
rgw: update options yaml file so LDAP uri isn\'t an invalid example (pr#56722, J. Eric Ivancich)
rgw: Use STANDARD storage class in objects appending operation when the (pr#59316, zhiming zhang)
rgw: use unique_ptr for flat_map emplace in BucketTrimWatche (pr#52995, Vedansh Bhartia)
rgw: when there are a large number of multiparts, the unorder list result may miss objects (pr#59337, J. Eric Ivancich)
rgwfile: fix lock_guard decl (pr#59350, Matt Benjamin)
rgwlc: fix compat-decoding of cls_rgw_lc_get_entry_ret (pr#59312, Matt Benjamin)
rgwlc: permit lifecycle to reduce data conditionally in archive zone (pr#54873, Matt Benjamin)
run-make-check: use get_processors in run-make-check script (pr#58871, John Mulligan)
src/ceph-volume/ceph_volume/devices/lvm/listing.py : lvm list filters with vg name (pr#58999, Pierre Lemay)
src/common/options: Correct typo in rgw.yaml.in (pr#55446, Anthony D'Atri)
src/mon/Monitor: Fix set_elector_disallowed_leaders (pr#54004, Kamoltat)
src/mount: kernel mount command returning misleading error message (pr#55299, Neeraj Pratap Singh)
test/cls_lock: expired lock before unlock and start check (pr#59272, Nitzan Mordechai)
test/lazy-omap-stats: Convert to boost::regex (pr#59523, Brad Hubbard)
test/librbd: clean up unused TEST_COOKIE variable (pr#58548, Rongqi Sun)
test/pybind: replace nose with pytest (pr#55060, Casey Bodley)
test/rgw/notifications: fix kafka consumer shutdown issue (pr#59340, Yuval Lifshitz)
test/rgw: increase timeouts in unittest_rgw_dmclock_scheduler (pr#55789, Casey Bodley)
test/store_test: enforce sync compactions for spillover tests (pr#59532, Igor Fedotov)
test/store_test: fix deferred writing test cases (pr#55779, Igor Fedotov)
test/store_test: fix DeferredWrite test when prefer_deferred_size=0 (pr#56201, Igor Fedotov)
test/store_test: get rid off assert_death (pr#55775, Igor Fedotov)
test/store_test: refactor spillover tests (pr#55216, Igor Fedotov)
test: Create ParallelPGMapper object before start threadpool (pr#58921, Mohit Agrawal)
Test: osd-recovery-space.sh extends the wait time for "recovery toofull" (pr#59042, Nitzan Mordechai)
tools/ceph_objectstore_tool: action_on_all_objects_in_pg to skip pgmeta (pr#54692, Matan Breizman)
tools/ceph_objectstore_tool: Support get/set/superblock (pr#55014, Matan Breizman)
Tools/rados: Improve Error Messaging for Object Name Resolution (pr#55598, Nitzan Mordechai)
tools/rbd: make 'children' command support --image-id (pr#55618, Mykola Golub)
win32_deps_build.sh: change Boost URL (pr#55085, Lucian Petrut)
Many clients are transforming their IT infrastructure to enterprise platforms because their mission critical applications are demanding a cloud native experience on premises with the following:
As IT leaders build out their enterprise platforms, they have the following requirements for the underlying enterprise storage platform:
Clients have issued requests for proposals (RFPs) to vendors for enterprise storage platforms and evaluated the responses. IT leaders have learned that only Ceph can meet their requirements for multiprotocol, software-defined enterprise storage platforms. None of the other alternatives can deliver all three protocols (block via NVMe-oF, file via NFS and SMB, and object via S3) from a single software-defined platform.
Clients that have implemented enterprise storage platforms on Ceph have reported 50% lower total cost of ownership (TCO) and 67% faster deployment times.
A global IBM client was struggling with their legacy HDFS environment due to the tight coupling of compute and storage, along with limits around scalability, erasure coding support, hardware alternatives, and security. The client replaced HDFS with IBM Storage Ceph, using the open-source S3A interface, erasure coding, and encryption at rest and in flight, all running on open-compute-style hardware of their choice. The client had a parallel effort to modernize their analytics environment, so IBM Storage Ceph support for Iceberg, Parquet, Trino, and Apache Spark was also a benefit. In the end, the transition from HDFS to IBM Storage Ceph reduced their TCO by 50%.
Another government agency IBM client was seeking to modernize their legacy infrastructure and applications with a new enterprise storage platform. The client was struggling with a legacy storage platform that was difficult to expand, difficult to secure, difficult to manage and expensive to maintain. As the client containerized their cloud applications that serve 35 million users, they needed S3 object storage to store large amounts of unstructured data. The client turned to IBM Storage Ceph as their new enterprise storage platform. The open standard S3 APIs made it much easier to onboard new applications and services ultimately reducing deployment times by 67%.
A third global IBM client is planning to eventually migrate all their workloads to NVMe over TCP starting with their block workloads running on VMware. The client wants to move away from proprietary initiators that lock them in and toward more open alternatives where they have the flexibility to change vendors and improve business agility. The client also wants to significantly improve the security compared to legacy block solutions by using mutual challenge handshake authentication protocol (CHAP), transport layer security (TLS) inflight encryption, and host IP tables. The improved agility and security along with lower TCO are the compelling reasons this client is building an enterprise storage platform with IBM Storage Ceph.
IBM employees and clients continue to make large contributions to the Ceph community to help mature the technology to maintain Ceph as the leading enterprise storage platform. Please join us in the community.
The potential of agentic AI to transform customer and employee experiences has never been greater. AI agents are fast becoming indispensable in the modern enterprise landscape. Yet, as compelling as the use cases are for agentic AI, they rely on one critical factor: access to vast quantities of accurate, timely data.
For AI applications to generate meaningful insights and usable results, they need access to data on a scale that has never been required before. This data feeds into model training, enhances retrieval-augmented generation (RAG), and enables high-quality inference in production environments. To support this, enterprises need a data architecture that is not only scalable and affordable but also seamlessly integrated to get data where it needs to be without friction.
Enterprise data is typically scattered across a range of locations, from on-premises servers and edge storage to data lakes, warehouses, and cloud storage environments. Each of these storage environments serves a unique function, and each has its own set of protocols and constraints. Moving data from these various storage locations to the cloud, where it can be effectively processed and leveraged by AI models, is a complex and often resource-intensive process. To maximize the value of this data for AI applications, it must be transported reliably, efficiently, and automatically.
Vultr, in collaboration with NetApp, is transforming this process by introducing an innovative cloud storage solution designed specifically to handle the demands of AI data workflows. By seamlessly migrating data to the cloud and routing it precisely where it needs to go, Vultr’s new solution simplifies the data pipeline for AI applications. In this new setup, data can move automatically from storage to processing environments, making it ready for use by AI models in a more streamlined and efficient way than ever before.
At the core of Vultr's solution is Kubernetes-compatible NVMe storage, designed to support high-throughput data movement and delivery. This state-of-the-art storage infrastructure enables users to quickly and affordably transfer data where it’s needed, whether for training or inference, while also providing a straightforward control panel to manage data flows.
For AI deployments, Vultr’s NVMe storage integrates seamlessly with Kubernetes-based applications, feeding data directly to containerized models running on Vultr’s cloud GPU training clusters. This architecture enables fine-tuning of open-source AI models in a controlled and efficient manner. Once fine-tuned, models can then be deployed for inference using Vultr’s Serverless Inference, allowing for real-time data processing and instant output generation.
With this infrastructure, organizations can pull data from its source, use it to fine-tune models, and infer results in production – all within a unified environment. This streamlined process saves time, reduces costs, and simplifies operations, enabling companies to focus on improving their AI applications without worrying about complex data logistics.
In addition to providing flexibility and speed, Vultr’s new cloud storage solution is also designed to address one of the most pressing concerns in today’s data-driven world: data residency. With regulatory requirements becoming increasingly stringent, companies must ensure that data does not leave its region of origin without proper oversight. This is especially important for AI applications, as any breach of data residency rules can lead to serious legal and compliance issues.
Vultr’s solution incorporates Vultr Managed Apache Kafka and a managed vector database, enabling regional data to be pulled directly into localized data stores. This approach keeps data within its origin region while still making it accessible for AI applications, ensuring compliance with residency rules. As a result, organizations can tailor their AI deployments for specific regions, maintaining compliance without compromising on the quality or accuracy of their AI outputs.
To fully harness the power of AI, organizations must rethink their approach to data lifecycle management. In the past, data was something that enterprises recorded, secured, and stored for later use. The primary concerns were compliance, security, and long-term preservation. However, in the AI-driven landscape, data lifecycle management must prioritize data availability, accessibility, and flow. AI applications require real-time data, continuously updated and instantly accessible, to deliver efficient results and improved user experiences.
The new data lifecycle is about creating a constant flow from storage to fine-tuning clusters to inference models. Vultr’s solution accomplishes this by enabling fast, rules-based data migration across various environments. This flexibility ensures that data is always available where it’s needed, supporting continuous training, adaptation, and deployment of AI models. Vultr simplifies AI adoption and deployment, making it easier for organizations to embrace AI without overhauling their existing infrastructure.
Automated Data Migration: Vultr’s solution automatically migrates data to its required location, whether for training, fine-tuning, or inference. This reduces manual intervention and minimizes latency, allowing AI models to operate more effectively.
High-Throughput NVMe Storage: With Kubernetes-compatible NVMe storage, data flows seamlessly into Vultr’s cloud GPU training clusters and serverless inference environments, enabling high-speed, high-volume processing for AI applications.
Enhanced Compliance with Data Residency: By leveraging Vultr Managed Apache Kafka and localized vector databases, the solution ensures data residency compliance, allowing enterprises to operate regionally without fear of regulatory breaches.
Scalability and Affordability: Vultr’s architecture is built for scalability, making it suitable for both small-scale and enterprise-level AI applications. The solution provides an affordable way to manage data flow without compromising on performance, helping organizations adopt AI cost-effectively.
Simple Control Panel: The intuitive control panel allows users to manage data flows with ease, setting rules for automated migration and ensuring that data is always where it needs to be for AI processing.
Vultr’s new cloud storage solution represents a shift toward data architectures that fully support AI applications, making data accessible, compliant, and ready to power new digital experiences. As AI continues to evolve, so too will the infrastructure that supports it. With Vultr Cloud Storage, enterprises can pioneer a new Cloud Storage and data lifecycle model to fuel AI innovation.
In the fast-evolving world of object storage, seamless data replication across clusters is crucial for ensuring data availability, redundancy, and disaster recovery. In Ceph, this is achieved through the RADOS Gateway (RGW) multisite replication feature. However, setting up and managing RGW multisite configurations through the command line can be a time-consuming process that involves executing a long series of complex commands—sometimes as many as 20 to 25.
To simplify this workflow, we've developed a user-friendly 4-step wizard, now accessible through a single button in the Ceph dashboard’s RGW multisite page. This new wizard significantly reduces the setup time for RGW multisite replication to just a few steps while ensuring that users can configure realms, zonegroups, and zones efficiently and with minimal effort.
The command-line interface (CLI) is a powerful tool, but when it comes to RGW multisite replication, its complexity can become a roadblock. Configuring realms, zonegroups, and zones involves running a multitude of commands, each of which needs to be executed in a precise order. From creating realms and defining zonegroups with their respective endpoints to setting up zones and configuring system users, every step has to be done carefully. Any misstep can lead to replication failures or misconfigurations.
With the new wizard, we’ve drastically reduced the setup complexity. The wizard takes care of these steps for you in an intuitive, guided process. What previously required up to 25 CLI commands can now be achieved in just 3 to 4 steps—saving both time and effort while also lowering the risk of misconfiguration.
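For context, the manual path on the primary cluster looks roughly like the following. This is a condensed, hedged sketch only; the realm, zonegroup, zone, user names, and endpoints are placeholders, and a real setup involves additional steps on the secondary zone:
radosgw-admin realm create --rgw-realm=myrealm --default
radosgw-admin zonegroup create --rgw-zonegroup=us --endpoints=http://rgw1:80 --master --default
radosgw-admin zone create --rgw-zonegroup=us --rgw-zone=us-east --endpoints=http://rgw1:80 --master --default
radosgw-admin user create --uid=sync-user --display-name="Synchronization User" --system
radosgw-admin zone modify --rgw-zone=us-east --access-key=<system-access-key> --secret=<system-secret-key>
radosgw-admin period update --commit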
The RGW multisite wizard in the Ceph dashboard is designed to streamline the entire configuration process. Here’s how it works:
In the first step, you’re prompted to enter the realm name, zonegroup name, and zonegroup endpoints. These are the fundamental elements of any multisite setup. The zonegroup endpoints refer to the cluster addresses that will serve as part of the replication ecosystem.
Figure: Step 1 - Entering realm and zonegroup information.
Next, you’ll define the zone name and its corresponding endpoints, as well as create the system user that will operate within this zone. The system user is crucial for managing access and permissions in the replication process.
Figure: Step 2 - Configuring the zone and creating a system user.
If you’ve added another cluster in the multi-cluster setup, the third step presents an option to select that cluster. This step allows you to replicate the configuration automatically to the secondary cluster. If no additional cluster has been added or you do not wish to select a cluster at the moment, you can skip this step.
Figure: Step 3 - Selecting a replication cluster and entering replication zone name
Figure: Secondary cluster added in multi-cluster setup
The final step serves as a review page where you can verify all the values entered in the previous steps. If no additional cluster is added for replication, submitting this step generates a token. This token contains the realm name, access keys, and endpoints, and it can be manually imported into the secondary cluster using the realm pull command.
However, if a secondary cluster is already present in the multi-cluster setup, you can select the cluster from the list, and the wizard will automatically import the realm token into the secondary cluster, completing the process seamlessly.
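If you take the manual route with the generated token values, pulling the realm on the secondary cluster typically looks something like this (a hedged sketch; the URL and system user keys come from the primary zone):
radosgw-admin realm pull --url=http://<primary-rgw>:80 --access-key=<system-access-key> --secret=<system-secret-key>
radosgw-admin period pull --url=http://<primary-rgw>:80 --access-key=<system-access-key> --secret=<system-secret-key>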
Figure: Step 4 - Reviewing the configuration
You can see step-by-step progress while the wizard is being submitted.
On completion of the wizard, you can verify the configuration in the secondary cluster, which in our case should look something like this:
Figure: Configuration in the secondary cluster
To verify the sync status, you can visit the Objects > Overview page.
Figure: Sync status in the primary cluster
This new wizard simplifies RGW multisite configuration in several critical use cases:
By replicating data across geographically distant clusters, organizations can ensure that they are protected from data loss in case of a regional failure. The wizard makes it easy to set up DR scenarios in just a few clicks.
Enterprises that need to maintain multiple copies of their data across different locations can now do so without spending hours on the command line. The wizard enables easy setup of data redundancy policies between clusters.
For users managing multiple Ceph clusters, this tool provides an efficient way to replicate configurations and data without manual intervention or the complexity of multi-step CLI commands. With the wizard’s ability to handle automatic realm imports between clusters, the entire process becomes frictionless.
When deploying new Ceph clusters and configuring RGW multisite replication, time is of the essence. The wizard cuts down the time required to configure a new multi-site deployment, making it ideal for administrators who need to get their systems up and running quickly.
The introduction of the RGW multisite replication wizard marks a significant improvement in the way Ceph users can manage multisite configurations. By reducing the complexity of a process that previously required up to 25 commands into a simple, intuitive 4-step wizard, we’ve made it easier than ever to set up and manage multisite replication in Ceph. Whether you’re setting up a disaster recovery plan, ensuring data redundancy, or managing multiple clusters, this tool empowers users with a streamlined, error-free process that gets the job done in a fraction of the time.
We encourage you to explore this new feature in the Ceph dashboard and experience firsthand how it can transform your RGW multisite management workflows.
Squid is the 19th stable release of Ceph.
This is the first stable release of Ceph Squid.
ATTENTION:
iSCSI users are advised that the upstream developers of Ceph encountered a bug during an upgrade from Ceph 19.1.1 to Ceph 19.2.0. Read Tracker Issue 68215 before attempting an upgrade to 19.2.0.
Contents:
RADOS
Dashboard
CephFS
RBD
RGW
Crimson/Seastore
ceph: a new --daemon-output-file switch is available for ceph tell commands to dump output to a file local to the daemon. For commands which produce large amounts of output, this avoids a potential spike in memory usage on the daemon, allows for faster streaming writes to a file local to the daemon, and reduces time holding any locks required to execute the command. For analysis, it is necessary to manually retrieve the file from the host running the daemon. Currently, only --format=json|json-pretty are supported.
cls_cxx_gather is marked as deprecated.
Tracing: The blkin tracing feature (see https://docs.ceph.com/en/reef/dev/blkin/) is now deprecated in favor of Opentracing (https://docs.ceph.com/en/reef/dev/developer_guide/jaegertracing/) and will be removed in a later release.
PG dump: The default output of ceph pg dump --format json has changed. The default JSON format produces a rather massive output in large clusters and isn't scalable, so we have removed the 'network_ping_times' section from the output. Details in the tracker: https://tracker.ceph.com/issues/57460
CephFS: it is now possible to pause write I/O and metadata mutations on a tree in the file system using a new suite of subvolume quiesce commands. This is implemented to support crash-consistent snapshots for distributed applications. Please see the relevant section in the documentation on CephFS subvolumes for more information.
CephFS: The MDS now evicts clients which are not advancing their request tids, since this causes a large buildup of session metadata, resulting in the MDS going read-only due to the RADOS operation exceeding the size threshold. The mds_session_metadata_threshold config controls the maximum size that the (encoded) session metadata can grow to.
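To inspect or adjust the threshold, the standard config commands apply; the value below is purely illustrative, not a recommendation:
ceph config get mds mds_session_metadata_threshold
ceph config set mds mds_session_metadata_threshold 16777216   # example: 16 MiB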
CephFS: A new "mds last-seen" command is available for querying the last time an MDS was in the FSMap, subject to a pruning threshold.
CephFS: For clusters with multiple CephFS file systems, all the snap-schedule commands now expect the '--fs' argument.
CephFS: The period specifier m now implies minutes and the period specifier M now implies months. This has been made consistent with the rest of the system.
CephFS: Running the command "ceph fs authorize" for an existing entity now upgrades the entity's capabilities instead of printing an error. It can now also change read/write permissions in a capability that the entity already holds. If the capability passed by the user is the same as one of the capabilities that the entity already holds, idempotency is maintained.
CephFS: Two FS names can now be swapped, optionally along with their IDs, using the "ceph fs swap" command. The function of this API is to facilitate file system swaps for disaster recovery. In particular, it avoids situations where a named file system is temporarily missing, which would prompt a higher-level storage operator (like Rook) to recreate the missing file system. See https://docs.ceph.com/en/latest/cephfs/administration/#file-systems docs for more information.
CephFS: Before running the command "ceph fs rename", the file system to be renamed must be offline and the config "refuse_client_session" must be set for it. The config "refuse_client_session" can be removed/unset and the file system can be brought back online after the rename operation is complete.
CephFS: Disallow delegating preallocated inode ranges to clients. Config mds_client_delegate_inos_pct defaults to 0 which disables async dirops in the kclient.
CephFS: MDS log trimming is now driven by a separate thread which tries to trim the log every second (mds_log_trim_upkeep_interval config). Also, a couple of configs govern how much time the MDS spends in trimming its logs. These configs are mds_log_trim_threshold and mds_log_trim_decay_rate.
CephFS: Full support for subvolumes and subvolume groups is now available for snap_schedule Manager module.
CephFS: The subvolume snapshot clone command now depends on the config option snapshot_clone_no_wait which is used to reject the clone operation when all the cloner threads are busy. This config option is enabled by default, which means that if no cloner threads are free, the clone request errors out with EAGAIN. The value of the config option can be fetched by using ceph config get mgr mgr/volumes/snapshot_clone_no_wait and it can be disabled by using ceph config set mgr mgr/volumes/snapshot_clone_no_wait false.
CephFS: Commands ceph mds fail and ceph fs fail now require a confirmation flag when some MDSs exhibit health warning MDS_TRIM or MDS_CACHE_OVERSIZED. This is to prevent accidental MDS failover causing further delays in recovery.
CephFS: fixes to the implementation of the root_squash mechanism enabled via cephx mds caps on a client credential require a new client feature bit, client_mds_auth_caps. Clients using credentials with root_squash without this feature will trigger the MDS to raise a HEALTH_ERR on the cluster, MDS_CLIENTS_BROKEN_ROOTSQUASH. See the documentation on this warning and the new feature bit for more information.
CephFS: Expanded removexattr support for CephFS virtual extended attributes. Previously one had to use setxattr to restore the default in order to "remove". You may now properly use removexattr to remove. You can also now remove the layout on the root inode, which will restore the layout to the default.
CephFS: cephfs-journal-tool is guarded against running on an online file system. The 'cephfs-journal-tool --rank <fs_name>:<mds_rank> journal reset' and 'cephfs-journal-tool --rank <fs_name>:<mds_rank> journal reset --force' commands require '--yes-i-really-really-mean-it'.
CephFS: \\"ceph fs clone status\\" command will now print statistics about clone progress in terms of how much data has been cloned (in both percentage as well as bytes) and how many files have been cloned.
CephFS: \\"ceph status\\" command will now print a progress bar when cloning is ongoing. If clone jobs are more than the cloner threads, it will print one more progress bar that shows total amount of progress made by both ongoing as well as pending clones. Both progress are accompanied by messages that show number of clone jobs in the respective categories and the amount of progress made by each of them.
cephfs-shell: The cephfs-shell utility is now packaged for RHEL 9 / CentOS 9 as required python dependencies are now available in EPEL9.
The CephFS automatic metadata load (sometimes called "default") balancer is now disabled by default. The new file system flag balance_automate can be used to toggle it on or off. It can be enabled or disabled via ceph fs set <fs_name> balance_automate <bool>.
It is now possible to rotate an entity's key with the new ceph auth rotate command. Previously, this was only possible by deleting and then recreating the key.
Dashboard: Rearranged Navigation Layout: The navigation layout has been reorganized for improved usability and easier access to key features.
Dashboard: CephFS Improvements
Support for managing CephFS snapshots and clones, as well as snapshot schedule management
Manage authorization capabilities for CephFS resources
Helpers on mounting a CephFS volume
Dashboard: RGW Improvements
Support for managing bucket policies
Add/Remove bucket tags
ACL Management
Several UI/UX Improvements to the bucket form
MGR/REST: The REST manager module will trim requests based on the 'max_requests' option. Without this feature, and in the absence of manual deletion of old requests, the accumulation of requests in the array can lead to Out Of Memory (OOM) issues, resulting in the Manager crashing.
MGR: An OpTracker to help debug mgr module issues is now available.
Monitoring: Grafana dashboards are now loaded into the container at runtime rather than building a grafana image with the grafana dashboards. Official Ceph grafana images can be found in quay.io/ceph/grafana
Monitoring: RGW S3 Analytics: A new Grafana dashboard is now available, enabling you to visualize per bucket and user analytics data, including total GETs, PUTs, Deletes, Copies, and list metrics.
The mon_cluster_log_file_level and mon_cluster_log_to_syslog_level options have been removed. Henceforth, users should use the new generic option mon_cluster_log_level to control the cluster log level verbosity for the cluster log file as well as for all external entities.
RADOS: A POOL_APP_NOT_ENABLED health warning will now be reported if the application is not enabled for the pool, irrespective of whether the pool is in use or not. Always tag a pool with an application using the ceph osd pool application enable command to avoid the POOL_APP_NOT_ENABLED health warning being reported for that pool. The user might temporarily mute this warning using ceph health mute POOL_APP_NOT_ENABLED.
RADOS: The get_pool_is_selfmanaged_snaps_mode C++ API has been deprecated due to being prone to false negative results. Its safer replacement is pool_is_in_selfmanaged_snaps_mode.
RADOS: For bug 62338 (https://tracker.ceph.com/issues/62338), we did not choose to condition the fix on a server flag in order to simplify backporting. As a result, in rare cases it may be possible for a PG to flip between two acting sets while an upgrade to a version with the fix is in progress. If you observe this behavior, you should be able to work around it by completing the upgrade or by disabling async recovery by setting osd_async_recovery_min_cost to a very large value on all OSDs until the upgrade is complete: ceph config set osd osd_async_recovery_min_cost 1099511627776
RADOS: A detailed version of the balancer status CLI command in the balancer module is now available. Users may run ceph balancer status detail to see more details about which PGs were updated in the balancer's last optimization. See https://docs.ceph.com/en/latest/rados/operations/balancer/ for more information.
RADOS: Read balancing may now be managed automatically via the balancer manager module. Users may choose between two new modes: upmap-read, which offers upmap and read optimization simultaneously, or read, which may be used to only optimize reads. For more detailed information see https://docs.ceph.com/en/latest/rados/operations/read-balancer/#online-optimization.
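For example, switching the balancer to the combined mode and reviewing its last optimization might look like this (a minimal sketch using the commands named above):
ceph balancer on
ceph balancer mode upmap-read
ceph balancer status detail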
RADOS: BlueStore has been optimized for better performance in snapshot-intensive workloads.
RADOS: BlueStore RocksDB LZ4 compression is now enabled by default to improve average performance and "fast device" space usage.
RADOS: A new CRUSH rule type, MSR (Multi-Step Retry), allows for more flexible EC configurations.
RADOS: Scrub scheduling behavior has been improved.
RBD: When diffing against the beginning of time (fromsnapname == NULL) in fast-diff mode (whole_object == true with the fast-diff image feature enabled and valid), diff-iterate is now guaranteed to execute locally if exclusive lock is available. This brings a dramatic performance improvement for QEMU live disk synchronization and backup use cases.
RBD: The try-netlink mapping option for rbd-nbd has become the default and is now deprecated. If the NBD netlink interface is not supported by the kernel, then the mapping is retried using the legacy ioctl interface.
RBD: The option --image-id has been added to the rbd children CLI command, so it can be run for images in the trash.
RBD: The Image::access_timestamp and Image::modify_timestamp Python APIs now return timestamps in UTC.
RBD: Support for cloning from non-user type snapshots is added. This is intended primarily as a building block for cloning new groups from group snapshots created with the rbd group snap create command, but has also been exposed via the new --snap-id option for the rbd clone command.
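A hedged sketch of that workflow; the pool, group, and image names are placeholders, and the snapshot ID has to be looked up first:
rbd group snap create mypool/mygroup@groupsnap1
rbd snap ls --all mypool/img1                      # note the ID of the group snapshot entry
rbd clone --snap-id <id> mypool/img1 mypool/img1-clone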
RBD: The output of the rbd snap ls --all command now includes the original type for trashed snapshots.
RBD: The RBD_IMAGE_OPTION_CLONE_FORMAT option has been exposed in Python bindings via the clone_format optional parameter to the clone, deep_copy and migration_prepare methods.
RBD: The RBD_IMAGE_OPTION_FLATTEN option has been exposed in Python bindings via the flatten optional parameter to the deep_copy and migration_prepare methods.
RBD: The rbd-wnbd driver has gained the ability to multiplex image mappings. Previously, each image mapping spawned its own rbd-wnbd daemon, which led to an excessive amount of TCP sessions and other resources being consumed, eventually exceeding Windows limits. With this change, a single rbd-wnbd daemon is spawned per host and most OS resources are shared between image mappings. Additionally, the ceph-rbd service starts much faster.
RGW: GetObject and HeadObject requests now return an x-rgw-replicated-at header for replicated objects. This timestamp can be compared against the Last-Modified header to determine how long the object took to replicate.
RGW: S3 multipart uploads using Server-Side Encryption now replicate correctly in multi-site. Previously, the replicas of such objects were corrupted on decryption. A new tool, radosgw-admin bucket resync encrypted multipart, can be used to identify these original multipart uploads. The LastModified timestamp of any identified object is incremented by 1ns to cause peer zones to replicate it again. For multi-site deployments that make any use of Server-Side Encryption, we recommend running this command against every bucket in every zone after all zones have upgraded.
RGW: Introducing a new data layout for the Topic metadata associated with S3 Bucket Notifications, where each Topic is stored as a separate RADOS object and the bucket notification configuration is stored in a bucket attribute. This new representation supports multisite replication via metadata sync and can scale to many topics. This is on by default for new deployments, but is not enabled by default on upgrade. Once all radosgws have upgraded (on all zones in a multisite configuration), the notification_v2 zone feature can be enabled to migrate to the new format. See https://docs.ceph.com/en/squid/radosgw/zone-features for details. The "v1" format is now considered deprecated and may be removed after 2 major releases.
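Once every zone has been upgraded, enabling the feature might look something like the following (hedged; the zonegroup name is a placeholder, and the exact procedure is described in the zone-features documentation linked above):
radosgw-admin zonegroup modify --rgw-zonegroup=default --enable-feature=notification_v2
radosgw-admin period update --commit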
RGW: New tools have been added to radosgw-admin for identifying and correcting issues with versioned bucket indexes. Historical bugs with the versioned bucket index transaction workflow made it possible for the index to accumulate extraneous "book-keeping" olh entries and plain placeholder entries. In some specific scenarios where clients made concurrent requests referencing the same object key, it was likely that a lot of extra index entries would accumulate. When a significant number of these entries are present in a single bucket index shard, they can cause high bucket listing latencies and lifecycle processing failures. To check whether a versioned bucket has unnecessary olh entries, users can now run radosgw-admin bucket check olh. If the --fix flag is used, the extra entries will be safely removed. A distinct issue from the one described thus far, it is also possible that some versioned buckets are maintaining extra unlinked objects that are not listable from the S3/Swift APIs. These extra objects are typically a result of PUT requests that exited abnormally, in the middle of a bucket index transaction - so the client would not have received a successful response. Bugs in prior releases made these unlinked objects easy to reproduce with any PUT request that was made on a bucket that was actively resharding. Besides the extra space that these hidden, unlinked objects consume, there can be another side effect in certain scenarios, caused by the nature of the failure mode that produced them, where a client of a bucket that was a victim of this bug may find the object associated with the key to be in an inconsistent state. To check whether a versioned bucket has unlinked entries, users can now run radosgw-admin bucket check unlinked. If the --fix flag is used, the unlinked objects will be safely removed. Finally, a third issue made it possible for versioned bucket index stats to be accounted inaccurately. The tooling for recalculating versioned bucket stats also had a bug, and was not previously capable of fixing these inaccuracies. This release resolves those issues and users can now expect that the existing radosgw-admin bucket check command will produce correct results. We recommend that users with versioned buckets, especially those that existed on prior releases, use these new tools to check whether their buckets are affected and to clean them up accordingly.
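For example, checking and cleaning up a single bucket might look like this (hedged; the bucket name is a placeholder):
radosgw-admin bucket check olh --bucket=mybucket --fix
radosgw-admin bucket check unlinked --bucket=mybucket --fix
radosgw-admin bucket check --bucket=mybucket --fix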
RGW: The User Accounts feature unlocks several new AWS-compatible IAM APIs for the self-service management of users, keys, groups, roles, policy and more. Existing users can be adopted into new accounts. This process is optional but irreversible. See https://docs.ceph.com/en/squid/radosgw/account and https://docs.ceph.com/en/squid/radosgw/iam for details.
RGW: On startup, radosgw and radosgw-admin now validate the rgw_realm config option. Previously, they would ignore invalid or missing realms and go on to load a zone/zonegroup in a different realm. If startup fails with a "failed to load realm" error, fix or remove the rgw_realm option.
RGW: The radosgw-admin commands realm create and realm pull no longer set the default realm without --default.
RGW: Fixed an S3 Object Lock bug with PutObjectRetention requests that specify a RetainUntilDate after the year 2106. This date was truncated to 32 bits when stored, so a much earlier date was used for object lock enforcement. This does not affect PutBucketObjectLockConfiguration, where a duration is given in Days. The RetainUntilDate encoding is fixed for new PutObjectRetention requests, but cannot repair the dates of existing object locks. Such objects can be identified with a HeadObject request based on the x-amz-object-lock-retain-until-date response header.
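One way to inspect the stored retention date is a HeadObject call, for example via the AWS CLI (a hedged sketch; the bucket, key, and endpoint are placeholders). The ObjectLockRetainUntilDate field in the response corresponds to the x-amz-object-lock-retain-until-date header mentioned above:
aws --endpoint-url http://rgw.example.com s3api head-object --bucket mybucket --key mykey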
S3 Get/HeadObject now supports the query parameter partNumber to read a specific part of a completed multipart upload.
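For example, with the AWS CLI (hedged; the names and endpoint are placeholders):
aws --endpoint-url http://rgw.example.com s3api head-object --bucket mybucket --key bigobject --part-number 2
aws --endpoint-url http://rgw.example.com s3api get-object --bucket mybucket --key bigobject --part-number 2 part2.bin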
RGW: The SNS CreateTopic API now enforces the same topic naming requirements as AWS: Topic names must be made up of only uppercase and lowercase ASCII letters, numbers, underscores, and hyphens, and must be between 1 and 256 characters long.
RGW: Notification topics are now owned by the user that created them. By default, only the owner can read/write their topics. Topic policy documents are now supported to grant these permissions to other users. Preexisting topics are treated as if they have no owner, and any user can read/write them using the SNS API. If such a topic is recreated with CreateTopic, the issuing user becomes the new owner. For backward compatibility, all users still have permission to publish bucket notifications to topics owned by other users. A new configuration parameter, rgw_topic_require_publish_policy, can be enabled to deny sns:Publish permissions unless explicitly granted by topic policy.
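As a hedged example, a topic owner could attach a policy document that allows another user to publish to the topic, using the standard SNS SetTopicAttributes call (the endpoint, topic ARN, and policy file are placeholders; the policy file would contain a statement allowing the sns:Publish action for the desired principal):
aws --endpoint-url http://rgw.example.com sns set-topic-attributes --topic-arn arn:aws:sns:default::mytopic --attribute-name Policy --attribute-value file://topic-policy.json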
RGW: Fixed an issue with persistent notifications so that changes made to a topic while notifications are already queued are reflected when those notifications are delivered. If a user set up a topic with an incorrect configuration (for example, password or SSL settings) that caused delivery to the broker to fail, the incorrect topic attribute can now be modified, and the new configuration will be used on the retry attempt.
RGW: in bucket notifications, the principalId inside ownerIdentity now contains the complete user ID, prefixed with the tenant ID.
The basic channel in telemetry now captures pool flags, which allows us to better understand feature adoption, such as Crimson. To opt in to telemetry, run ceph telemetry on.
Before starting, make sure your cluster is stable and healthy (no down or recovering OSDs). (This is optional, but recommended.) You can disable the autoscaler for all pools during the upgrade using the noautoscale flag.
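For example, to pause the PG autoscaler for all pools for the duration of the upgrade and re-enable it afterwards:
ceph osd pool set noautoscale
ceph osd pool unset noautoscale   # run after the upgrade completes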
Note:
You can monitor the progress of your upgrade at each stage with the ceph versions command, which will tell you what Ceph version(s) are running for each type of daemon.
If your cluster is deployed with cephadm (first introduced in Octopus), then the upgrade process is entirely automated. To initiate the upgrade,
ceph orch upgrade start --image quay.io/ceph/ceph:v19.2.0
The same process is used to upgrade to future minor releases.
Upgrade progress can be monitored with
ceph orch upgrade status
Upgrade progress can also be monitored with ceph -s (which provides a simple progress bar) or more verbosely with
ceph -W cephadm
The upgrade can be paused or resumed with
ceph orch upgrade pause   # to pause
ceph orch upgrade resume  # to resume
or canceled with
ceph orch upgrade stop
Note that canceling the upgrade simply stops the process; there is no ability to downgrade back to Quincy or Reef.
Note:
If your cluster is running Quincy (17.2.x) or later, you might choose to first convert it to use cephadm so that the upgrade to Squid is automated (see above). For more information, see https://docs.ceph.com/en/squid/cephadm/adoption/.
If your cluster is running Quincy (17.2.x) or later, systemd unit file names have changed to include the cluster fsid. To find the correct systemd unit file name for your cluster, run the following command:
systemctl -l | grep <daemon type>
Example:
$ systemctl -l | grep mon | grep active
ceph-6ce0347c-314a-11ee-9b52-000af7995d6c@mon.f28-h21-000-r630.service loaded active running Ceph mon.f28-h21-000-r630 for 6ce0347c-314a-11ee-9b52-000af7995d6c
Set the noout flag for the duration of the upgrade. (Optional, but recommended.)
ceph osd set noout
Upgrade monitors by installing the new packages and restarting the monitor daemons. For example, on each monitor host
systemctl restart ceph-mon.target
Once all monitors are up, verify that the monitor upgrade is complete by looking for the squid string in the mon map. The command
ceph mon dump | grep min_mon_release
should report:
min_mon_release 19 (squid)
If it does not, that implies that one or more monitors haven't been upgraded and restarted, and/or the quorum does not include all monitors.
Upgrade ceph-mgr daemons by installing the new packages and restarting all manager daemons. For example, on each manager host,
systemctl restart ceph-mgr.target
Verify the ceph-mgr daemons are running by checking ceph -s:
ceph -s

...
services:
  mon: 3 daemons, quorum foo,bar,baz
  mgr: foo(active), standbys: bar, baz
...
Upgrade all OSDs by installing the new packages and restarting the ceph-osd daemons on all OSD hosts
systemctl restart ceph-osd.target
Upgrade all CephFS MDS daemons. For each CephFS file system,
Disable standby_replay:
ceph fs set <fs_name> allow_standby_replay false
Reduce the number of ranks to 1. (Make note of the original number of MDS daemons first if you plan to restore it later.)
ceph status # ceph fs set <fs_name> max_mds 1
Wait for the cluster to deactivate any non-zero ranks by periodically checking the status
ceph status
Take all standby MDS daemons offline on the appropriate hosts with
systemctl stop ceph-mds@<daemon_name>
Confirm that only one MDS is online and is rank 0 for your FS
ceph status
Upgrade the last remaining MDS daemon by installing the new packages and restarting the daemon
systemctl restart ceph-mds.target
Restart all standby MDS daemons that were taken offline
systemctl start ceph-mds.target
Restore the original value of max_mds for the volume
ceph fs set <fs_name> max_mds <original_max_mds>
Upgrade all radosgw daemons by upgrading packages and restarting daemons on all hosts
systemctl restart ceph-radosgw.target
Complete the upgrade by disallowing pre-Squid OSDs and enabling all new Squid-only functionality
ceph osd require-osd-release squid
If you set noout at the beginning, be sure to clear it with
ceph osd unset noout
Consider transitioning your cluster to use the cephadm deployment and orchestration framework to simplify cluster management and future upgrades. For more information on converting an existing cluster to cephadm, see https://docs.ceph.com/en/squid/cephadm/adoption/.
Verify the cluster is healthy with ceph health. If your cluster is running Filestore, and you are upgrading directly from Quincy to Squid, a deprecation warning is expected. This warning can be temporarily muted using the following command
ceph health mute OSD_FILESTORE
Consider enabling the telemetry module to send anonymized usage statistics and crash information to the Ceph upstream developers. To see what would be reported (without actually sending any information to anyone),
ceph telemetry preview-all
If you are comfortable with the data that is reported, you can opt-in to automatically report the high-level cluster metadata with
ceph telemetry on
The public dashboard that aggregates Ceph telemetry can be found at https://telemetry-public.ceph.com/.
You must first upgrade to Quincy (17.2.z) or Reef (18.2.z) before upgrading to Squid.
We express our gratitude to all members of the Ceph community who contributed by proposing pull requests, testing this release, providing feedback, and offering valuable suggestions.
If you are interested in helping test the next release, Tentacle, please join us at the #ceph-at-scale Slack channel.
The Squid release would not be possible without the contributions of the community:
Aashish Sharma ▪ Abhishek Lekshmanan ▪ Adam C. Emerson ▪ Adam King ▪ Adam Kupczyk ▪ Afreen Misbah ▪ Aishwarya Mathuria ▪ Alexander Indenbaum ▪ Alexander Mikhalitsyn ▪ Alexander Proschek ▪ Alex Wojno ▪ Aliaksei Makarau ▪ Alice Zhao ▪ Ali Maredia ▪ Ali Masarwa ▪ Alvin Owyong ▪ Andreas Schwab ▪ Ankush Behl ▪ Anoop C S ▪ Anthony D Atri ▪ Anton Turetckii ▪ Aravind Ramesh ▪ Arjun Sharma ▪ Arun Kumar Mohan ▪ Athos Ribeiro ▪ Avan Thakkar ▪ barakda ▪ Bernard Landon ▪ Bill Scales ▪ Brad Hubbard ▪ caisan ▪ Casey Bodley ▪ chentao.2022 ▪ Chen Xu Qiang ▪ Chen Yuanrun ▪ Christian Rohmann ▪ Christian Theune ▪ Christopher Hoffman ▪ Christoph Grüninger ▪ Chunmei Liu ▪ cloudbehl ▪ Cole Mitchell ▪ Conrad Hoffmann ▪ Cory Snyder ▪ cuiming_yewu ▪ Cyril Duval ▪ daegon.yang ▪ daijufang ▪ Daniel Clavijo Coca ▪ Daniel Gryniewicz ▪ Daniel Parkes ▪ Daniel Persson ▪ Dan Mick ▪ Dan van der Ster ▪ David.Hall ▪ Deepika Upadhyay ▪ Dhairya Parmar ▪ Didier Gazen ▪ Dillon Amburgey ▪ Divyansh Kamboj ▪ Dmitry Kvashnin ▪ Dnyaneshwari ▪ Dongsheng Yang ▪ Doug Whitfield ▪ dpandit ▪ Eduardo Roldan ▪ ericqzhao ▪ Ernesto Puerta ▪ ethanwu ▪ Feng Hualong ▪ Florent Carli ▪ Florian Weimer ▪ Francesco Pantano ▪ Frank Filz ▪ Gabriel Adrian Samfira ▪ Gabriel BenHanokh ▪ Gal Salomon ▪ Gilad Sid ▪ Gil Bregman ▪ gitkenan ▪ Gregory O\'Neill ▪ Guido Santella ▪ Guillaume Abrioux ▪ gukaifeng ▪ haoyixing ▪ hejindong ▪ Himura Kazuto ▪ hosomn ▪ hualong feng ▪ HuangWei ▪ igomon ▪ Igor Fedotov ▪ Ilsoo Byun ▪ Ilya Dryomov ▪ imtzw ▪ Ionut Balutoiu ▪ ivan ▪ Ivo Almeida ▪ Jaanus Torp ▪ jagombar ▪ Jakob Haufe ▪ James Lakin ▪ Jane Zhu ▪ Javier ▪ Jayanth Reddy ▪ J. Eric Ivancich ▪ Jiffin Tony Thottan ▪ Jimyeong Lee ▪ Jinkyu Yi ▪ John Mulligan ▪ Jos Collin ▪ Jose J Palacios-Perez ▪ Josh Durgin ▪ Josh Salomon ▪ Josh Soref ▪ Joshua Baergen ▪ jrchyang ▪ Juan Miguel Olmo Martínez ▪ junxiang Mu ▪ Justin Caratzas ▪ Kalpesh Pandya ▪ Kamoltat Sirivadhna ▪ kchheda3 ▪ Kefu Chai ▪ Ken Dreyer ▪ Kim Minjong ▪ Konstantin Monakhov ▪ Konstantin Shalygin ▪ Kotresh Hiremath Ravishankar ▪ Kritik Sachdeva ▪ Laura Flores ▪ Lei Cao ▪ Leonid Usov ▪ lichaochao ▪ lightmelodies ▪ limingze ▪ liubingrun ▪ LiuBingrun ▪ liuhong ▪ Liu Miaomiao ▪ liuqinfei ▪ Lorenz Bausch ▪ Lucian Petrut ▪ Luis Domingues ▪ Luís Henriques ▪ luo rixin ▪ Manish M Yathnalli ▪ Marcio Roberto Starke ▪ Marc Singer ▪ Marcus Watts ▪ Mark Kogan ▪ Mark Nelson ▪ Matan Breizman ▪ Mathew Utter ▪ Matt Benjamin ▪ Matthew Booth ▪ Matthew Vernon ▪ mengxiangrui ▪ Mer Xuanyi ▪ Michaela Lang ▪ Michael Fritch ▪ Michael J. Kidd ▪ Michael Schmaltz ▪ Michal Nasiadka ▪ Mike Perez ▪ Milind Changire ▪ Mindy Preston ▪ Mingyuan Liang ▪ Mitsumasa KONDO ▪ Mohamed Awnallah ▪ Mohan Sharma ▪ Mohit Agrawal ▪ molpako ▪ Mouratidis Theofilos ▪ Mykola Golub ▪ Myoungwon Oh ▪ Naman Munet ▪ Neeraj Pratap Singh ▪ Neha Ojha ▪ Nico Wang ▪ Niklas Hambüchen ▪ Nithya Balachandran ▪ Nitzan Mordechai ▪ Nizamudeen A ▪ Nobuto Murata ▪ Oguzhan Ozmen ▪ Omri Zeneva ▪ Or Friedmann ▪ Orit Wasserman ▪ Or Ozeri ▪ Parth Arora ▪ Patrick Donnelly ▪ Patty8122 ▪ Paul Cuzner ▪ Paulo E. 
Castro ▪ Paul Reece ▪ PC-Admin ▪ Pedro Gonzalez Gomez ▪ Pere Diaz Bou ▪ Pete Zaitcev ▪ Philip de Nier ▪ Philipp Hufnagl ▪ Pierre Riteau ▪ pilem94 ▪ Pinghao Wu ▪ Piotr Parczewski ▪ Ponnuvel Palaniyappan ▪ Prasanna Kumar Kalever ▪ Prashant D ▪ Pritha Srivastava ▪ QinWei ▪ qn2060 ▪ Radoslaw Zarzynski ▪ Raimund Sacherer ▪ Ramana Raja ▪ Redouane Kachach ▪ RickyMaRui ▪ Rishabh Dave ▪ rkhudov ▪ Ronen Friedman ▪ Rongqi Sun ▪ Roy Sahar ▪ Sachin Punadikar ▪ Sage Weil ▪ Sainithin Artham ▪ sajibreadd ▪ samarah ▪ Samarah ▪ Samuel Just ▪ Sascha Lucas ▪ sayantani11 ▪ Seena Fallah ▪ Shachar Sharon ▪ Shilpa Jagannath ▪ shimin ▪ ShimTanny ▪ Shreyansh Sancheti ▪ sinashan ▪ Soumya Koduri ▪ sp98 ▪ spdfnet ▪ Sridhar Seshasayee ▪ Sungmin Lee ▪ sunlan ▪ Super User ▪ Suyashd999 ▪ Suyash Dongre ▪ Taha Jahangir ▪ tanchangzhi ▪ Teng Jie ▪ tengjie5 ▪ Teoman Onay ▪ tgfree ▪ Theofilos Mouratidis ▪ Thiago Arrais ▪ Thomas Lamprecht ▪ Tim Serong ▪ Tobias Urdin ▪ tobydarling ▪ Tom Coldrick ▪ TomNewChao ▪ Tongliang Deng ▪ tridao ▪ Vallari Agrawal ▪ Vedansh Bhartia ▪ Venky Shankar ▪ Ville Ojamo ▪ Volker Theile ▪ wanglinke ▪ wangwenjuan ▪ wanwencong ▪ Wei Wang ▪ weixinwei ▪ Xavi Hernandez ▪ Xinyu Huang ▪ Xiubo Li ▪ Xuehan Xu ▪ XueYu Bai ▪ xuxuehan ▪ Yaarit Hatuka ▪ Yantao xue ▪ Yehuda Sadeh ▪ Yingxin Cheng ▪ yite gu ▪ Yonatan Zaken ▪ Yongseok Oh ▪ Yuri Weinstein ▪ Yuval Lifshitz ▪ yu.wang ▪ Zac Dover ▪ Zack Cerza ▪ zhangjianwei ▪ Zhang Song ▪ Zhansong Gao ▪ Zhelong Zhao ▪ Zhipeng Li ▪ Zhiwei Huang ▪ 叶海丰 ▪ 胡玮文
The Cephalocon Conference t-shirt is a perennial favorite and is literally worn as a badge of honor around the world. And the design on the shirt is what makes it so special!
How would you like to be honored as the creator of the design adorning this year’s objet d’art, and receive a complimentary registration to this year’s event at CERN in Geneva, Switzerland this December, in recognition?
You don’t need to be an artist or a graphic designer, as we are looking for simple conceptual renderings of your design: scan in a hand-drawn image or sketch with your favorite tool. All we ask is that it be original art (we need to avoid licensing issues). Also, please limit the design to black and white if possible, or at most one additional color, to be budget friendly.
To submit your idea for consideration, please email your drawing file (PDF or JPG) to cephalocon24@ceph.io. All submissions must be received no later than Friday, August 16th - so get those creative juices flowing!!
The Conference planning team will review and announce the winner when the Conference Schedule is announced in September.
2023’s Image for reference, in case you need inspiration
This is the fourth backport release in the Reef series. We recommend that all users update to this release.
An early build of this release was accidentally exposed and packaged as 18.2.3 by the Debian project in April. That 18.2.3 release should not be used. The official release was re-tagged as v18.2.4 to avoid further confusion.
v18.2.4 container images, now based on CentOS 9, may be incompatible on older kernels (e.g., Ubuntu 18.04) due to differences in thread creation methods. Users upgrading to v18.2.4 container images on older OS versions may encounter crashes during pthread_create. For workarounds, refer to the related tracker. However, we recommend upgrading your OS to avoid this unsupported combination. Related tracker: https://tracker.ceph.com/issues/66989
RBD: When diffing against the beginning of time (fromsnapname == NULL) in fast-diff mode (whole_object == true with the fast-diff image feature enabled and valid), diff-iterate is now guaranteed to execute locally if exclusive lock is available. This brings a dramatic performance improvement for QEMU live disk synchronization and backup use cases.
RADOS: The get_pool_is_selfmanaged_snaps_mode C++ API has been deprecated due to being prone to false negative results. Its safer replacement is pool_is_in_selfmanaged_snaps_mode.
RBD: The option --image-id has been added to the rbd children CLI command, so it can be run for images in the trash.
(reef) node-proxy: improve http error handling in fetch_oob_details (pr#55538, Guillaume Abrioux)
[rgw][lc][rgw_lifecycle_work_time] adjust timing if the configured end time is less than the start time (pr#54866, Oguzhan Ozmen)
add checking for rgw frontend init (pr#54844, zhipeng li)
admin/doc-requirements: bump Sphinx to 5.0.2 (pr#55191, Nizamudeen A)
backport of fixes for 63678 and 63694 (pr#55104, Redouane Kachach)
backport rook/mgr recent changes (pr#55706, Redouane Kachach)
ceph-menv:fix typo in README (pr#55163, yu.wang)
ceph-volume: add missing import (pr#56259, Guillaume Abrioux)
ceph-volume: fix a bug in _check_generic_reject_reasons (pr#54705, Kim Minjong)
ceph-volume: Fix migration from WAL to data with no DB (pr#55497, Igor Fedotov)
ceph-volume: fix mpath device support (pr#53539, Guillaume Abrioux)
ceph-volume: fix zap_partitions() in devices.lvm.zap (pr#55477, Guillaume Abrioux)
ceph-volume: fixes fallback to stat in is_device and is_partition (pr#54629, Teoman ONAY)
ceph-volume: update functional testing (pr#56857, Guillaume Abrioux)
ceph-volume: use 'no workqueue' options with dmcrypt (pr#55335, Guillaume Abrioux)
ceph-volume: Use safe accessor to get TYPE info (pr#56323, Dillon Amburgey)
ceph.spec.in: add support for openEuler OS (pr#56361, liuqinfei)
ceph.spec.in: remove command-with-macro line (pr#57357, John Mulligan)
cephadm/nvmeof: scrape nvmeof prometheus endpoint (pr#56108, Avan Thakkar)
cephadm: Add mount for nvmeof log location (pr#55819, Roy Sahar)
cephadm: Add nvmeof to autotuner calculation (pr#56100, Paul Cuzner)
cephadm: add timemaster to timesync services list (pr#56307, Florent Carli)
cephadm: adjust the ingress ha proxy health check interval (pr#56286, Jiffin Tony Thottan)
cephadm: create ceph-exporter sock dir if it\'s not present (pr#56102, Adam King)
cephadm: fix get_version for nvmeof (pr#56099, Adam King)
cephadm: improve cephadm pull usage message (pr#56292, Adam King)
cephadm: remove restriction for crush device classes (pr#56106, Seena Fallah)
cephadm: rm podman-auth.json if removing last cluster (pr#56105, Adam King)
cephfs-shell: remove distutils Version classes because they\'re deprecated (pr#54119, Venky Shankar, Jos Collin)
cephfs-top: include the missing fields in --dump output (pr#54520, Jos Collin)
client/fuse: handle case of renameat2 with non-zero flags (pr#55002, Leonid Usov, Shachar Sharon)
client: append to buffer list to save all result from wildcard command (pr#53893, Rishabh Dave, Jinmyeong Lee, Jimyeong Lee)
client: call _getattr() for -ENODATA returned _getvxattr() calls (pr#54404, Jos Collin)
client: fix leak of file handles (pr#56122, Xavi Hernandez)
client: Fix return in removexattr for xattrs from system.namespace (pr#55803, Anoop C S)
client: queue a delay cap flushing if there are ditry caps/snapcaps (pr#54466, Xiubo Li)
client: readdir_r_cb: get rstat for dir only if using rbytes for size (pr#53359, Pinghao Wu)
cmake/arrow: don\'t treat warnings as errors (pr#57375, Casey Bodley)
cmake/modules/BuildRocksDB.cmake: inherit parent\'s CMAKE_CXX_FLAGS (pr#55502, Kefu Chai)
cmake: use or turn off liburing for rocksdb (pr#54122, Casey Bodley, Patrick Donnelly)
common/options: Set LZ4 compression for bluestore RocksDB (pr#55197, Mark Nelson)
common/weighted_shuffle: don\'t feed std::discrete_distribution with all-zero weights (pr#55153, Radosław Zarzyński)
common: resolve config proxy deadlock using refcounted pointers (pr#54373, Patrick Donnelly)
DaemonServer.cc: fix config show command for RGW daemons (pr#55077, Aishwarya Mathuria)
debian: add ceph-exporter package (pr#56541, Shinya Hayashi)
debian: add missing bcrypt to ceph-mgr .requires to fix resulting package dependencies (pr#54662, Thomas Lamprecht)
doc/architecture.rst - fix typo (pr#55384, Zac Dover)
doc/architecture.rst: improve rados definition (pr#55343, Zac Dover)
doc/architecture: correct typo (pr#56012, Zac Dover)
doc/architecture: improve some paragraphs (pr#55399, Zac Dover)
doc/architecture: remove pleonasm (pr#55933, Zac Dover)
doc/cephadm - edit t11ing (pr#55482, Zac Dover)
doc/cephadm/services: Improve monitoring.rst (pr#56290, Anthony D\'Atri)
doc/cephadm: correct nfs config pool name (pr#55603, Zac Dover)
doc/cephadm: improve host-management.rst (pr#56111, Anthony D\'Atri)
doc/cephadm: Improve multiple files (pr#56130, Anthony D\'Atri)
doc/cephfs/client-auth.rst: correct ``fs authorize cephfs1 /dir1 clie… (pr#55246, 叶海丰)
doc/cephfs: edit add-remove-mds (pr#55648, Zac Dover)
doc/cephfs: fix architecture link to correct relative path (pr#56340, molpako)
doc/cephfs: Update disaster-recovery-experts.rst to mention Slack (pr#55044, Dhairya Parmar)
doc/crimson: cleanup duplicate seastore description (pr#55730, Rongqi Sun)
doc/dev: backport zipapp docs to reef (pr#56161, Zac Dover)
doc/dev: edit internals.rst (pr#55852, Zac Dover)
doc/dev: edit teuthology workflow (pr#56002, Zac Dover)
doc/dev: fix spelling in crimson.rst (pr#55737, Zac Dover)
doc/dev: osd_internals/snaps.rst: add clone_overlap doc (pr#56523, Matan Breizman)
doc/dev: refine \\"Concepts\\" (pr#56660, Zac Dover)
doc/dev: refine \\"Concepts\\" 2 of 3 (pr#56725, Zac Dover)
doc/dev: refine \\"Concepts\\" 3 of 3 (pr#56729, Zac Dover)
doc/dev: refine \\"Concepts\\" 4 of 3 (pr#56740, Zac Dover)
doc/dev: update leads list (pr#56603, Zac Dover)
doc/dev: update leads list (pr#56589, Zac Dover)
doc/glossary.rst: add \\"Monitor Store\\" (pr#54743, Zac Dover)
doc/glossary: add \\"Crimson\\" entry (pr#56073, Zac Dover)
doc/glossary: add \\"librados\\" entry (pr#56235, Zac Dover)
doc/glossary: Add \\"OMAP\\" to glossary (pr#55749, Zac Dover)
doc/glossary: Add link to CRUSH paper (pr#55557, Zac Dover)
doc/glossary: improve \\"MDS\\" entry (pr#55849, Zac Dover)
doc/glossary: improve OSD definitions (pr#55613, Zac Dover)
doc/install: add manual RADOSGW install procedure (pr#55880, Zac Dover)
doc/install: update \\"update submodules\\" (pr#54961, Zac Dover)
doc/man/8/mount.ceph.rst: add more mount options (pr#55754, Xiubo Li)
doc/man: edit \\"manipulating the omap key\\" (pr#55635, Zac Dover)
doc/man: edit ceph-osd description (pr#54551, Zac Dover)
doc/mgr: credit John Jasen for Zabbix 2 (pr#56684, Zac Dover)
doc/mgr: document lack of MSWin NFS 4.x support (pr#55032, Zac Dover)
doc/mgr: update zabbix information (pr#56631, Zac Dover)
doc/rados/configuration/bluestore-config-ref: Fix lowcase typo (pr#54694, Adam Kupczyk)
doc/rados/configuration/osd-config-ref: fix typo (pr#55678, Pierre Riteau)
doc/rados/operations: add EC overhead table to erasure-code.rst (pr#55244, Anthony D\'Atri)
doc/rados/operations: Fix off-by-one errors in control.rst (pr#55231, tobydarling)
doc/rados/operations: Improve crush_location docs (pr#56594, Niklas Hambüchen)
doc/rados: add \\"change public network\\" procedure (pr#55799, Zac Dover)
doc/rados: add link to pg blog post (pr#55611, Zac Dover)
doc/rados: add PG definition (pr#55630, Zac Dover)
doc/rados: edit \\"client can\'t connect...\\" (pr#54654, Zac Dover)
doc/rados: edit \\"Everything Failed! Now What?\\" (pr#54665, Zac Dover)
doc/rados: edit \\"monitor store failures\\" (pr#54659, Zac Dover)
doc/rados: edit \\"recovering broken monmap\\" (pr#54601, Zac Dover)
doc/rados: edit \\"understanding mon_status\\" (pr#54579, Zac Dover)
doc/rados: edit \\"Using the Monitor\'s Admin Socket\\" (pr#54576, Zac Dover)
doc/rados: fix broken links (pr#55680, Zac Dover)
doc/rados: format sections in tshooting-mon.rst (pr#54638, Zac Dover)
doc/rados: improve \\"Ceph Subsystems\\" (pr#54702, Zac Dover)
doc/rados: improve formatting of log-and-debug.rst (pr#54746, Zac Dover)
doc/rados: link to pg setting commands (pr#55936, Zac Dover)
doc/rados: ops/pgs: s/power of 2/power of two (pr#54700, Zac Dover)
doc/rados: remove PGcalc from docs (pr#55901, Zac Dover)
doc/rados: repair stretch-mode.rst (pr#54762, Zac Dover)
doc/rados: restore PGcalc tool (pr#56057, Zac Dover)
doc/rados: update \\"stretch mode\\" (pr#54756, Michael Collins)
doc/rados: update common.rst (pr#56268, Zac Dover)
doc/rados: update config for autoscaler (pr#55438, Zac Dover)
doc/rados: update PG guidance (pr#55460, Zac Dover)
doc/radosgw - edit admin.rst \\"set user rate limit\\" (pr#55150, Zac Dover)
doc/radosgw/admin.rst: use underscores in config var names (pr#54933, Ville Ojamo)
doc/radosgw: add confval directives (pr#55484, Zac Dover)
doc/radosgw: add gateway starting command (pr#54833, Zac Dover)
doc/radosgw: admin.rst - edit \\"Create a Subuser\\" (pr#55020, Zac Dover)
doc/radosgw: admin.rst - edit \\"Create a User\\" (pr#55004, Zac Dover)
doc/radosgw: admin.rst - edit sections (pr#55017, Zac Dover)
doc/radosgw: edit \\"Add/Remove a Key\\" (pr#55055, Zac Dover)
doc/radosgw: edit \\"Enable/Disable Bucket Rate Limit\\" (pr#55260, Zac Dover)
doc/radosgw: edit \\"read/write global rate limit\\" admin.rst (pr#55271, Zac Dover)
doc/radosgw: edit \\"remove a subuser\\" (pr#55034, Zac Dover)
doc/radosgw: edit \\"Usage\\" admin.rst (pr#55321, Zac Dover)
doc/radosgw: edit admin.rst \\"Get Bucket Rate Limit\\" (pr#55253, Zac Dover)
doc/radosgw: edit admin.rst \\"get user rate limit\\" (pr#55157, Zac Dover)
doc/radosgw: edit admin.rst \\"set bucket rate limit\\" (pr#55242, Zac Dover)
doc/radosgw: edit admin.rst - quota (pr#55082, Zac Dover)
doc/radosgw: edit admin.rst 1 of x (pr#55000, Zac Dover)
doc/radosgw: edit compression.rst (pr#54985, Zac Dover)
doc/radosgw: edit front matter - role.rst (pr#54854, Zac Dover)
doc/radosgw: edit multisite.rst (pr#55671, Zac Dover)
doc/radosgw: edit sections (pr#55027, Zac Dover)
doc/radosgw: fix formatting (pr#54753, Zac Dover)
doc/radosgw: Fix JSON typo in Principal Tag example code snippet (pr#54642, Daniel Parkes)
doc/radosgw: fix verb disagreement - index.html (pr#55338, Zac Dover)
doc/radosgw: format \\"Create a Role\\" (pr#54886, Zac Dover)
doc/radosgw: format commands in role.rst (pr#54905, Zac Dover)
doc/radosgw: format POST statements (pr#54849, Zac Dover)
doc/radosgw: list supported plugins-compression.rst (pr#54995, Zac Dover)
doc/radosgw: update link in rgw-cache.rst (pr#54805, Zac Dover)
doc/radosrgw: edit admin.rst (pr#55073, Zac Dover)
doc/rbd: add clone mapping command (pr#56208, Zac Dover)
doc/rbd: add map information for clone images to rbd-encryption.rst (pr#56186, N Balachandran)
doc/rbd: minor changes to the rbd man page (pr#56256, N Balachandran)
doc/rbd: repair ordered list (pr#55732, Zac Dover)
doc/releases: edit reef.rst (pr#55064, Zac Dover)
doc/releases: specify dashboard improvements (pr#55049, Laura Flores, Zac Dover)
doc/rgw: edit admin.rst - rate limit management (pr#55128, Zac Dover)
doc/rgw: fix Attributes index in CreateTopic example (pr#55432, Casey Bodley)
doc/start: add Slack invite link (pr#56041, Zac Dover)
doc/start: explain \\"OSD\\" (pr#54559, Zac Dover)
doc/start: improve MDS explanation (pr#56466, Zac Dover)
doc/start: improve MDS explanation (pr#56426, Zac Dover)
doc/start: link to mon map command (pr#56410, Zac Dover)
doc/start: update release names (pr#54572, Zac Dover)
doc: add description of metric fields for cephfs-top (pr#55511, Neeraj Pratap Singh)
doc: Add NVMe-oF gateway documentation (pr#55724, Orit Wasserman)
doc: add supported file types in cephfs-mirroring.rst (pr#54822, Jos Collin)
doc: adding documentation for secure monitoring stack configuration (pr#56104, Redouane Kachach)
doc: cephadm/services/osd: fix typo (pr#56230, Lorenz Bausch)
doc: Fixes two typos and grammatical errors. Signed-off-by: Sina Ahma… (pr#54775, Sina Ahmadi)
doc: fixing doc/cephfs/fs-volumes (pr#56648, Neeraj Pratap Singh)
doc: remove releases docs (pr#56567, Patrick Donnelly)
doc: specify correct fs type for mkfs (pr#55282, Vladislav Glagolev)
doc: update rgw admin api req params for get user info (pr#55071, Ali Maredia)
doc:start.rst fix typo in hw-recs (pr#55505, Eduardo Roldan)
docs/rados: remove incorrect ceph command (pr#56495, Taha Jahangir)
docs/radosgw: edit admin.rst \\"enable/disable user rate limit\\" (pr#55194, Zac Dover)
docs/rbd: fix typo in arg name (pr#56262, N Balachandran)
docs: Add information about OpenNebula integration (pr#54938, Daniel Clavijo)
librados: make querying pools for selfmanaged snaps reliable (pr#55026, Ilya Dryomov)
librbd: account for discards that truncate in ObjectListSnapsRequest (pr#56213, Ilya Dryomov)
librbd: Append one journal event per image request (pr#54818, Ilya Dryomov, Joshua Baergen)
librbd: don\'t report HOLE_UPDATED when diffing against a hole (pr#54951, Ilya Dryomov)
librbd: fix regressions in ObjectListSnapsRequest (pr#54862, Ilya Dryomov)
librbd: fix split() for SparseExtent and SparseBufferlistExtent (pr#55665, Ilya Dryomov)
librbd: improve rbd_diff_iterate2() performance in fast-diff mode (pr#55427, Ilya Dryomov)
librbd: return ENOENT from Snapshot::get_timestamp for nonexistent snap_id (pr#55474, John Agombar)
make-dist: don\'t use --continue option for wget (pr#55091, Casey Bodley)
MClientRequest: properly handle ceph_mds_request_head_legacy for ext_num_retry, ext_num_fwd, owner_uid, owner_gid (pr#54407, Alexander Mikhalitsyn)
mds,cephfs_mirror: add labelled per-client and replication metrics (issue#63945, pr#55640, Venky Shankar, Jos Collin)
mds/client: check the cephx mds auth access in client side (pr#54468, Xiubo Li, Ramana Raja)
mds/MDBalancer: ignore queued callbacks if MDS is not active (pr#54493, Leonid Usov)
mds/MDSRank: Add set_history_slow_op_size_and_threshold for op_tracker (pr#53357, Yite Gu)
mds: accept human readable values for quotas (issue#55940, pr#53333, Venky Shankar, Dhairya Parmar, dparmar18)
mds: add a command to dump directory information (pr#55987, Jos Collin, Zhansong Gao)
mds: add balance_automate fs setting (pr#54952, Patrick Donnelly)
mds: add debug logs during setxattr ceph.dir.subvolume (pr#56062, Milind Changire)
mds: allow all types of mds caps (pr#52581, Rishabh Dave)
mds: allow lock state to be LOCK_MIX_SYNC in replica for filelock (pr#56049, Xiubo Li)
mds: change priority of mds rss perf counter to useful (pr#55057, sp98)
mds: check file layout in mknod (pr#56031, Xue Yantao)
mds: check relevant caps for fs include root_squash (pr#57343, Patrick Donnelly)
mds: disable `defer_client_eviction_on_laggy_osds\' by default (issue#64685, pr#56196, Venky Shankar)
mds: do not evict clients if OSDs are laggy (pr#52268, Dhairya Parmar, Laura Flores)
mds: do not simplify fragset (pr#54895, Milind Changire)
mds: ensure next replay is queued on req drop (pr#54313, Patrick Donnelly)
mds: ensure snapclient is synced before corruption check (pr#56398, Patrick Donnelly)
mds: fix issuing redundant reintegrate/migrate_stray requests (pr#54467, Xiubo Li)
mds: just wait the client flushes the snap and dirty buffer (pr#55743, Xiubo Li)
mds: optionally forbid to use standby for another fs as last resort (pr#53340, Venky Shankar, Mykola Golub, Luís Henriques)
mds: relax certain asserts in mdlog replay thread (issue#57048, pr#56016, Venky Shankar)
mds: reverse MDSMap encoding of max_xattr_size/bal_rank_mask (pr#55669, Patrick Donnelly)
mds: revert standby-replay trimming changes (pr#54716, Patrick Donnelly)
mds: scrub repair does not clear earlier damage health status (pr#54899, Neeraj Pratap Singh)
mds: set the loner to true for LOCK_EXCL_XSYN (pr#54911, Xiubo Li)
mds: skip sr moves when target is an unlinked dir (pr#56672, Patrick Donnelly, Dan van der Ster)
mds: use explicitly sized types for network and disk encoding (pr#55742, Xiubo Li)
MDSAuthCaps: minor improvements (pr#54185, Rishabh Dave)
MDSAuthCaps: print better error message for perm flag in MDS caps (pr#54945, Rishabh Dave)
mgr/(object_format && nfs/export): enhance nfs export update failure response (pr#55395, Dhairya Parmar, John Mulligan)
mgr/.dashboard: batch backport of cephfs snapshot schedule management (pr#55581, Ivo Almeida)
mgr/cephadm is not defining haproxy tcp healthchecks for Ganesha (pr#56101, avanthakkar)
mgr/cephadm: allow grafana and prometheus to only bind to specific network (pr#56302, Adam King)
mgr/cephadm: Allow idmap overrides in nfs-ganesha configuration (pr#56029, Teoman ONAY)
mgr/cephadm: catch CancelledError in asyncio timeout handler (pr#56103, Adam King)
mgr/cephadm: discovery service (port 8765) fails on ipv6 only clusters (pr#56093, Theofilos Mouratidis)
mgr/cephadm: fix placement with label and host pattern (pr#56107, Adam King)
mgr/cephadm: fix reweighting of OSD when OSD removal is stopped (pr#56094, Adam King)
mgr/cephadm: fixups for asyncio based timeout (pr#55555, Adam King)
mgr/cephadm: make jaeger-collector a dep for jaeger-agent (pr#56089, Adam King)
mgr/cephadm: refresh public_network for config checks before checking (pr#56325, Adam King)
mgr/cephadm: support for regex based host patterns (pr#56221, Adam King)
mgr/cephadm: support for removing host entry from crush map during host removal (pr#56092, Adam King)
mgr/cephadm: update timestamp on repeat daemon/service events (pr#56090, Adam King)
mgr/dashboard/frontend:Ceph dashboard supports multiple languages (pr#56359, TomNewChao)
mgr/dashboard: Add advanced fieldset component (pr#56692, Afreen)
mgr/dashboard: add frontend unit tests for rgw multisite sync status card (pr#55222, Aashish Sharma)
mgr/dashboard: add snap schedule M, Y frequencies (pr#56059, Ivo Almeida)
mgr/dashboard: add support for editing and deleting rgw roles (pr#55541, Nizamudeen A)
mgr/dashboard: add system users to rgw user form (pr#56471, Pedro Gonzalez Gomez)
mgr/dashboard: add Table Schema to grafonnet (pr#56736, Aashish Sharma)
mgr/dashboard: Allow the user to add the access/secret key on zone edit and not on zone creation (pr#56472, Aashish Sharma)
mgr/dashboard: ceph authenticate user from fs (pr#56254, Pedro Gonzalez Gomez)
mgr/dashboard: change deprecated grafana URL in daemon logs (pr#55544, Nizamudeen A)
mgr/dashboard: chartjs and ng2-charts version upgrade (pr#55224, Pedro Gonzalez Gomez)
mgr/dashboard: Consider null values as zero in grafana panels (pr#54541, Aashish Sharma)
mgr/dashboard: create cephfs snapshot clone (pr#55489, Nizamudeen A)
mgr/dashboard: Create realm sets to default (pr#55221, Aashish Sharma)
mgr/dashboard: Create subvol of same name in different group (pr#55369, Afreen)
mgr/dashboard: dashboard area chart unit test (pr#55517, Pedro Gonzalez Gomez)
mgr/dashboard: debugging make check failure (pr#56127, Nizamudeen A)
mgr/dashboard: disable applitools e2e (pr#56215, Nizamudeen A)
mgr/dashboard: fix cephfs name validation (pr#56501, Nizamudeen A)
mgr/dashboard: fix clone unique validator for name validation (pr#56550, Nizamudeen A)
mgr/dashboard: fix e2e failure related to landing page (pr#55124, Pedro Gonzalez Gomez)
mgr/dashboard: fix empty tags (pr#56439, Pedro Gonzalez Gomez)
mgr/dashboard: fix error while accessing roles tab when policy attached (pr#55515, Afreen)
mgr/dashboard: Fix inconsistency in capitalisation of \\"Multi-site\\" (pr#55311, Afreen)
mgr/dashboard: fix M retention frequency display (pr#56363, Ivo Almeida)
mgr/dashboard: fix retention add for subvolume (pr#56370, Ivo Almeida)
mgr/dashboard: fix rgw display name validation (pr#56548, Nizamudeen A)
mgr/dashboard: fix roles page for roles without policies (pr#55827, Nizamudeen A)
mgr/dashboard: fix snap schedule date format (pr#55815, Ivo Almeida)
mgr/dashboard: fix snap schedule list toggle cols (pr#56115, Ivo Almeida)
mgr/dashboard: fix snap schedule time format (pr#56154, Ivo Almeida)
mgr/dashboard: fix subvolume group edit (pr#55811, Ivo Almeida)
mgr/dashboard: fix subvolume group edit size (pr#56385, Ivo Almeida)
mgr/dashboard: fix the jsonschema issue in install-deps (pr#55542, Nizamudeen A)
mgr/dashboard: fix volume creation with multiple hosts (pr#55786, Pedro Gonzalez Gomez)
mgr/dashboard: fixed cephfs mount command (pr#55993, Ivo Almeida)
mgr/dashboard: fixed nfs attach command (pr#56387, Ivo Almeida)
mgr/dashboard: Fixes multisite topology page breadcrumb (pr#55212, Afreen Misbah)
mgr/dashboard: get object bucket policies for a bucket (pr#55361, Nizamudeen A)
mgr/dashboard: get rgw port from ssl_endpoint (pr#54764, Nizamudeen A)
mgr/dashboard: Handle errors for /api/osd/settings (pr#55704, Afreen)
mgr/dashboard: increase the number of plottable graphs in charts (pr#55571, Afreen, Aashish Sharma)
mgr/dashboard: Locking improvements in bucket create form (pr#56560, Afreen)
mgr/dashboard: make ceph logo redirect to dashboard (pr#56557, Afreen)
mgr/dashboard: Mark placement targets as non-required (pr#56621, Afreen)
mgr/dashboard: replace deprecated table panel in grafana with a newer table panel (pr#56682, Aashish Sharma)
mgr/dashboard: replace piechart plugin charts with native pie chart panel (pr#56654, Aashish Sharma)
mgr/dashboard: rgw bucket features (pr#55575, Pedro Gonzalez Gomez)
mgr/dashboard: rm warning/error threshold for cpu usage (pr#56443, Nizamudeen A)
mgr/dashboard: s/active_mds/active_nfs in fs attach form (pr#56546, Nizamudeen A)
mgr/dashboard: sanitize dashboard user creation (pr#56452, Pedro Gonzalez Gomez)
mgr/dashboard: Show the OSDs Out and Down panels as red whenever an OSD is in Out or Down state in Ceph Cluster grafana dashboard (pr#54538, Aashish Sharma)
mgr/dashboard: Simplify authentication protocol (pr#55689, Daniel Persson)
mgr/dashboard: subvolume snapshot management (pr#55186, Nizamudeen A)
mgr/dashboard: update fedora link for dashboard-cephadm-e2e test (pr#54718, Adam King)
mgr/dashboard: upgrade from old \'graph\' type panels to the new \'timeseries\' panel (pr#56652, Aashish Sharma)
mgr/dashboard:Update encryption and tags in bucket form (pr#56707, Afreen)
mgr/dashboard:Use advanced fieldset for rbd image (pr#56710, Afreen)
mgr/nfs: include pseudo in JSON output when nfs export apply -i fails (pr#55394, Dhairya Parmar)
mgr/node-proxy: handle \'None\' statuses returned by RedFish (pr#55999, Guillaume Abrioux)
mgr/pg_autoscaler: add check for norecover flag (pr#55078, Aishwarya Mathuria)
mgr/snap_schedule: add support for monthly snapshots (pr#55208, Milind Changire)
mgr/snap_schedule: exceptions management and subvol support (pr#52751, Milind Changire)
mgr/volumes: fix subvolume group rm error message (pr#54207, neeraj pratap singh, Neeraj Pratap Singh)
mgr/volumes: support to reject CephFS clones if cloner threads are not available (pr#55692, Rishabh Dave, Venky Shankar, Neeraj Pratap Singh)
mgr: pin pytest to version 7.4.4 (pr#55362, Laura Flores)
mon, doc: overriding ec profile requires --yes-i-really-mean-it (pr#56435, Radoslaw Zarzynski)
mon, osd, *: expose upmap-primary in OSDMap::get_features() (pr#57794, rzarzynski)
mon/ConfigMonitor: Show localized name in \\"config dump --format json\\" output (pr#53888, Sridhar Seshasayee)
mon/ConnectionTracker.cc: disregard connection scores from mon_rank = -1 (pr#55167, Kamoltat)
mon/OSDMonitor: fix get_min_last_epoch_clean() (pr#55867, Matan Breizman)
mon: fix health store size growing infinitely (pr#55548, Wei Wang)
mon: fix mds metadata lost in one case (pr#54316, shimin)
msg: update MOSDOp() to use ceph_tid_t instead of long (pr#55424, Lucian Petrut)
node-proxy: fix RedFishClient.logout() method (pr#56252, Guillaume Abrioux)
node-proxy: refactor entrypoint (backport) (pr#55454, Guillaume Abrioux)
orch: implement hardware monitoring (pr#55405, Guillaume Abrioux, Adam King, Redouane Kachach)
orchestrator: Add summary line to orch device ls output (pr#56098, Paul Cuzner)
orchestrator: Fix representation of CPU threads in host ls --detail command (pr#56097, Paul Cuzner)
os/bluestore: add bluestore fragmentation micros to prometheus (pr#54258, Yite Gu)
os/bluestore: fix free space update after bdev-expand in NCB mode (pr#55777, Igor Fedotov)
os/bluestore: get rid off resulting lba alignment in allocators (pr#54772, Igor Fedotov)
os/kv_test: Fix estimate functions (pr#56197, Adam Kupczyk)
osd/OSD: introduce reset_purged_snaps_last (pr#53972, Matan Breizman)
osd/scrub: increasing max_osd_scrubs to 3 (pr#55173, Ronen Friedman)
osd: Apply randomly selected scheduler type across all OSD shards (pr#54981, Sridhar Seshasayee)
osd: don\'t require RWEXCL lock for stat+write ops (pr#54595, Alice Zhao)
osd: fix Incremental decode for new/old_pg_upmap_primary (pr#55046, Laura Flores)
osd: improve OSD robustness (pr#54783, Igor Fedotov)
osd: log the number of extents for sparse read (pr#54606, Xiubo Li)
osd: Tune snap trim item cost to reflect a PGs\' average object size for mClock scheduler (pr#55040, Sridhar Seshasayee)
pybind/mgr/devicehealth: replace SMART data if exists for same DATETIME (pr#54879, Patrick Donnelly)
pybind/mgr/devicehealth: skip legacy objects that cannot be loaded (pr#56479, Patrick Donnelly)
pybind/mgr/mirroring: drop mon_host from peer_list (pr#55237, Jos Collin)
pybind/rbd: fix compilation with cython3 (pr#54807, Mykola Golub)
python-common/drive_selection: fix limit with existing devices (pr#56096, Adam King)
python-common: fix osdspec_affinity check (pr#56095, Guillaume Abrioux)
qa/cephadm: testing for extra daemon/container features (pr#55957, Adam King)
qa/cephfs: improvements for name generators in test_volumes.py (pr#54729, Rishabh Dave)
qa/distros: remove centos 8 from supported distros (pr#57932, Guillaume Abrioux, Casey Bodley, Adam King, Laura Flores)
qa/suites/fs/nfs: use standard health ignorelist (pr#56392, Patrick Donnelly)
qa/suites/fs/workload: enable snap_schedule early (pr#56424, Patrick Donnelly)
qa/tasks/cephfs/test_misc: switch duration to timeout (pr#55746, Xiubo Li)
qa/tests: added the initial reef-p2p suite (pr#55714, Yuri Weinstein)
qa/workunits/rbd/cli_generic.sh: narrow race window when checking that rbd_support module command fails after blocklisting the module\'s client (pr#54769, Ramana Raja)
qa: fs volume rename requires fs fail and refuse_client_session set (issue#64174, pr#56171, Venky Shankar)
qa: Add benign cluster warning from ec-inconsistent-hinfo test to ignorelist (pr#56151, Sridhar Seshasayee)
qa: add centos_latest (9.stream) and ubuntu_20.04 yamls to supported-all-distro (pr#54677, Venky Shankar)
qa: add diff-continuous and compare-mirror-image tests to rbd and krbd suites respectively (pr#55928, Ramana Raja)
qa: Add tests to validate synced images on rbd-mirror (pr#55762, Ilya Dryomov, Ramana Raja)
qa: bump up scrub status command timeout (pr#55915, Milind Changire)
qa: change log-whitelist to log-ignorelist (pr#56396, Patrick Donnelly)
qa: correct usage of DEBUGFS_META_DIR in dedent (pr#56167, Venky Shankar)
qa: do upgrades from quincy and older reef minor releases (pr#55590, Patrick Donnelly)
qa: enhance labeled perf counters test for cephfs-mirror (pr#56211, Jos Collin)
qa: Fix fs/full suite (pr#55829, Kotresh HR)
qa: fix incorrectly using the wait_for_health() helper (issue#57985, pr#54237, Venky Shankar)
qa: fix rank_asok() to handle errors from asok commands (pr#55302, Neeraj Pratap Singh)
qa: ignore container checkpoint/restore related selinux denials for centos9 (issue#64616, pr#56019, Venky Shankar)
qa: remove error string checks and check w/ return value (pr#55943, Venky Shankar)
qa: remove vstart runner from radosgw_admin task (pr#55097, Ali Maredia)
qa: run kernel_untar_build with newer tarball (pr#54711, Milind Changire)
qa: set mds config with config set for a particular test (issue#57087, pr#56169, Venky Shankar)
qa: use correct imports to resolve fuse_mount and kernel_mount (pr#54714, Milind Changire)
qa: use exisitng ignorelist override list for fs:mirror[-ha] (issue#62482, pr#54766, Venky Shankar)
radosgw-admin: \'zone set\' won\'t overwrite existing default-placement (pr#55061, Casey Bodley)
rbd-nbd: fix resize of images mapped using netlink (pr#55316, Ramana Raja)
reef backport: rook e2e testing related PRs (pr#55375, Redouane Kachach)
RGW - Swift retarget needs bucket set on object (pr#56004, Daniel Gryniewicz)
rgw/auth: Fix the return code returned by AuthStrategy (pr#54794, Pritha Srivastava)
rgw/beast: Enable SSL session-id reuse speedup mechanism (pr#56120, Mark Kogan)
rgw/datalog: RGWDataChangesLog::add_entry() uses null_yield (pr#55655, Casey Bodley)
rgw/iam: admin/system users ignore iam policy parsing errors (pr#54843, Casey Bodley)
rgw/kafka/amqp: fix race conditionn in async completion handlers (pr#54736, Yuval Lifshitz)
rgw/lc: do not add datalog/bilog for some lc actions (pr#55289, Juan Zhu)
rgw/lua: fix CopyFrom crash (pr#54296, Yuval Lifshitz)
rgw/notification: Kafka persistent notifications not retried and removed even when the broker is down (pr#56140, kchheda3)
rgw/putobj: RadosWriter uses part head object for multipart parts (pr#55621, Casey Bodley)
rgw/rest: fix url decode of post params for iam/sts/sns (pr#55356, Casey Bodley)
rgw/S3select: remove assert from csv-parser, adding updates (pr#55969, Gal Salomon)
RGW/STS: when generating keys, take the trailing null character into account (pr#54127, Oguzhan Ozmen)
rgw: add headers to guide cache update in 304 response (pr#55094, Casey Bodley, Ilsoo Byun)
rgw: Add missing empty checks to the split string in is_string_in_set() (pr#56347, Matt Benjamin)
rgw: d3n: fix valgrind reported leak related to libaio worker threads (pr#54852, Mark Kogan)
rgw: do not copy olh attributes in versioning suspended bucket (pr#55606, Juan Zhu)
rgw: fix cloud-sync multi-tenancy scenario (pr#54328, Ionut Balutoiu)
rgw: object lock avoids 32-bit truncation of RetainUntilDate (pr#54674, Casey Bodley)
rgw: only buckets with reshardable layouts need to be considered for resharding (pr#54129, J. Eric Ivancich)
RGW: pubsub publish commit with etag populated (pr#56453, Ali Masarwa)
rgw: RGWSI_SysObj_Cache::remove() invalidates after successful delete (pr#55716, Casey Bodley)
rgw: SignatureDoesNotMatch for certain RGW Admin Ops endpoints w/v4 auth (pr#54791, David.Hall)
Snapshot schedule show subvolume path (pr#56419, Ivo Almeida)
src/common/options: Correct typo in rgw.yaml.in (pr#55445, Anthony D\'Atri)
src/mount: kernel mount command returning misleading error message (pr#55300, Neeraj Pratap Singh)
test/libcephfs: skip flaky timestamp assertion on Windows (pr#54614, Lucian Petrut)
test/rgw: increase timeouts in unittest_rgw_dmclock_scheduler (pr#55790, Casey Bodley)
test: explicitly link to ceph-common for some libcephfs tests (issue#57206, pr#53635, Venky Shankar)
tools/ceph_objectstore_tool: action_on_all_objects_in_pg to skip pgmeta (pr#54693, Matan Breizman)
Tools/rados: Improve Error Messaging for Object Name Resolution (pr#55112, Nitzan Mordechai)
tools/rbd: make \'children\' command support --image-id (pr#55617, Mykola Golub)
use raw_cluster_cmd instead of run_ceph_cmd (pr#55836, Venky Shankar)
win32_deps_build.sh: change Boost URL (pr#55084, Lucian Petrut)
From time to time, our friends over at the Pawsey Supercomputing Research Centre in Australia provide us with the opportunity to test Ceph on hardware that developers normally can’t access.
The Pawsey Supercomputing Research Centre provides integrated research solutions, expertise and computing infrastructure to Australian and international researchers across numerous scientific domains. Pawsey uses Ceph predominantly for object storage, with their Acacia service providing tens of petabytes of Ceph object storage. Acacia consists of two production Ceph clusters: an 11PB cluster supporting general research and a 27PB cluster dedicated to radio astronomy.
This time, one of the goals was to evaluate a different deployment architecture for OSD hosts: single socket 1U servers, connected to 60-drive external SAS enclosures.
We started off by defining some initial questions to help guide our testing efforts.
As you can see, these questions cover a wide variety of topics, so in the interests of avoiding reader fatigue, our observations will be split into 3 blog posts covering:
Short on time? Here’s a teaser of the environment and some key performance results.
Note that prior to 18.2.1, ceph-volume was not able to consume the multipath devices presented by the external enclosure. Check your Ceph version and your use of device-mapper-multipath if you're having issues!
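If you need to check, something like the following will confirm the Ceph release in use and whether the multipath devices are visible to the orchestrator (a quick sketch; device names will obviously differ on your hardware):

# Confirm the Ceph release running on every daemon
ceph versions

# List the multipath devices presented by the enclosure
sudo multipath -ll

# Check whether the orchestrator / ceph-volume can see them
ceph orch device ls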
Our first challenge was OSD configuration. The defaults for the block.db allocation are based on the capacity of the OSD - in our case 22TB. However, each OSD node only had 4 x 1.6TB NVMe drives - so the defaults couldn’t be used. Instead we used a spec file that explicitly defined the space for block.db and the ratio of OSDs to NVMe drives.
service_type: osd
service_id: aio-hdds
placement:
  host_pattern: storage-13-09012
spec:
  block_db_size: 60G
  db_slots: 15
  data_devices:
    rotational: 1
  db_devices:
    rotational: 0
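For reference, a spec like this is normally applied and verified through the orchestrator CLI; a minimal sketch (the file name is ours):

# Apply the OSD service spec
ceph orch apply -i osd-aio-hdds.yaml

# Watch the OSD service and daemons get created
ceph orch ls osd
ceph orch ps --daemon-type osd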
Problem solved, but then our first hiccup. We found that OSD deployment to dense nodes can hit timeouts in cephadm. In our case it only happened twice, but it can be annoying! There is a fix for this in Reef which enables the mgr/cephadm/default_cephadm_command_timeout setting to be user defined, but this isn’t planned to ‘land’ until 18.2.3. Bear this in mind if you’re planning to use dense enclosures.
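Once that change lands, the timeout should become a normal mgr config option; something along these lines ought to raise it for slow, dense OSD deployments (the 3600-second value is just an illustration):

# Raise the cephadm command timeout (in seconds)
ceph config set mgr mgr/cephadm/default_cephadm_command_timeout 3600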
With the cluster deployed, our focus switched to management of the cluster with the CLI and GUI. These were our focus areas:
The good news is that, in general, the UI coped well with this number of hosts and OSDs, and the LED controls worked fine! However there were a couple of stumbling blocks.
A number of UX enhancements were also raised:
| Tracker | Component | Description | Status |
|---|---|---|---|
| 64171 | UI | Includes expand/collapse in the crush map view | Backlog |
| 63864 | CLI | Show devices per node summary | Merged |
| 63865 | UI | CPU thread count is incorrectly reported | Merged |
As mentioned earlier, this hardware is based on 1U servers connected over SAS to a 60 Drive enclosure. Whilst SAS connectivity with Linux is not an issue, we did find some related issues that would be good to bear in mind.
Finally, during our data analysis we hit an issue where our monitoring was reporting far higher PUT ops than those reported by the warp tool, making it difficult to reconcile Ceph data with warp results. Tracker 65131 was raised to explore what was going on, and the culprit was identified: multipart upload. For large objects, the warp client uses multipart for parallelism, as you’d expect, and Ceph reports each ‘part upload’ as a separate PUT op, which it technically is. The problem is that there isn’t a corresponding counter from RGW that represents the overall completion of the PUT op, so you can’t easily reconcile the client and RGW views.
Since the OSD nodes are single socket machines, our starting strategy involved using virtual machines to host the RGW daemons, leaving as much CPU as possible for the 60 OSD daemons. Validating this design decision was our next ‘stop’.
Given that PUT workloads are generally more demanding, we used a simple 4MB object workload as a litmus test. The chart below shows our initial, disappointing results. Failing to even deliver 2 GiB/s of throughput shows that there was a major bottleneck in the design!
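For readers who want to reproduce this kind of litmus test, a warp PUT run along these lines is one way to do it (the endpoint, credentials and concurrency below are placeholders, and exact flag spellings may vary between warp versions):

warp put \
  --host rgw.example.internal:8080 \
  --access-key WARPUSER --secret-key WARPSECRET \
  --obj.size 4MiB \
  --concurrent 64 \
  --duration 5m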
The Ceph performance statistics revealed no issues at all, which was hardly surprising given the hardware. The culprit wasn't Ceph, but the CPU and network constraints on the hypervisors that were hosting the RGW daemons!
Looking at the OSD CPU usage, we found that even though there were 60 daemons per host, the OSD hosts were not under any CPU pressure at all. Next up: a prompt switch to an RGW collocation strategy, relocating the RGW daemons to the OSD hosts. We also took the opportunity to collocate the workload generators on the Ceph Index nodes. Now the same test delivered a much more appropriate starting point.
As you can see, adopting a collocation strategy delivered a 7x improvement in PUT throughput!
This was our first win, and will influence Pawsey’s production deployment strategy.
In our next post we may set a new record for the number of charts in a Ceph blog, as we dive deeper into the performance of the cluster across a variety of object sizes and Erasure Code (EC) profiles.
This is a hotfix release that resolves several flaws, including Prometheus crashes, and includes an encoder fix.
Auto-tiering Ceph Object Storage - PART 2
In this article we’re getting into the fun part of writing a simple Lua script to start automatically tiering (organizing) S3 objects into the right pools by dynamically changing the Storage Class setting on the fly as objects are being uploaded (S3 PUTs).
If you haven’t read PART 1, you’ll want to check that out over here, as it lays the groundwork for what we’re doing here in PART 2.
Why Auto-Tiering Matters
Having different tiers (Storage Classes) is really important, not just for cost savings but also for performance. As we discussed in PART 1, if you’re uploading millions of small 1K objects it is generally a bad idea to write those objects into an erasure-coded data pool: it’ll be slow, you’ll waste a lot of space due to object padding, and you’ll have unhappy users. But the key is to have the objects assigned to a suitable storage class automatically, as users often won’t make the effort to categorize them themselves.
Revisiting our diagram from PART 1 we’ll use this example again in PART 2 as we write up a Lua script.
Before we jump into the Lua though, a big THANK YOU to Yuval Lifshitz and team for implementing the feature we’re discussing here today. That feature of course is the ability to inject Lua scripts into the CephRGW (CephRGW = Ceph Rados Gateway = S3 protocol gateway) so we can do fun stuff like auto-tiering.
I’d also like to highlight the Lua scripting work and talk by Anthony D’Atri and Curt Bruns, which you can find on YouTube here and which gave me the idea for this series. In that video you’ll see how they developed a Lua script to auto-tier between TLC- and QLC-based NVMe storage; highly recommended.
A Basic Lua Example
We’re going to start with a basic example and borrow from Curt & Anthony’s Lua script. In this script we’ll assign objects to three different Storage Classes we defined in the example from PART 1. Those Storage Classes are STANDARD (for our objects greater than 1MB), MEDIUM_OBJ (for objects between 16K and 1MB), and SMALL_OBJ for everything less than 16K.
-- Lua script to auto-tier S3 object PUT requests

-- exit script quickly if it is not a PUT request
if Request == nil or Request.RGWOp ~= "put_obj"
then
  return
end

-- apply StorageClass only if user hasn't already assigned a storage-class
if Request.HTTP.StorageClass == nil or Request.HTTP.StorageClass == '' then
  if Request.ContentLength < 16384 then
    Request.HTTP.StorageClass = "SMALL_OBJ"
  elseif Request.ContentLength < 1048576 then
    Request.HTTP.StorageClass = "MEDIUM_OBJ"
  else
    Request.HTTP.StorageClass = "STANDARD"
  end
  RGWDebugLog("applied '" .. Request.HTTP.StorageClass .. "' to object '" .. Request.Object.Name .. "'")
end
Installing Lua Scripts into CephRGW
Next you’ll save the above Lua script to a file like autotier.lua and then you can install it into the CephRGW gateways like so:
radosgw-admin script put --infile=./autotier.lua --context=preRequest
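You can confirm what’s installed, or remove it again later, with the matching subcommands:

# Show the currently installed preRequest script
radosgw-admin script get --context=preRequest

# Remove it again if needed
radosgw-admin script rm --context=preRequest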
There’s no need to restart your CephRGW instances; the script becomes active immediately for all RGW instances in the zone. Note, though, that if you’re doing more advanced scripting and are adding a new Lua package, the CephRGW instances will need to be restarted once on all the nodes in the zone, like so:
# sudo systemctl restart ceph-radosgw@radosgw.*
As illustrated by the diagram, information about each request, including our S3 PUT requests, is sent to our autotier.lua script. In there we’re able to dynamically update the value of Request.HTTP.StorageClass to get our objects into the optimal data pools. Now simply upload some objects to any bucket and you’ll see that they’re getting routed to different data pools based on the dynamically assigned StorageClass value applied by the autotier.lua script.
Debugging
To debug the script and see what’s going on via the RGWDebugLog messages, we’ll want to enable debug mode in the CephRGW. That’s done by editing ceph.conf, adding ‘debug rgw = 20’ to the RGW section, and then restarting your CephRGW. Here’s what my radosgw section looks like; you can ignore all of it except for the debug option added to the end.
[client.radosgw.smu-80]
admin_socket = /var/run/ceph/ceph-client.radosgw.smu-80.asok
host = smu-80
keyring = /etc/ceph/ceph.client.radosgw.keyring
log file = /var/log/radosgw/client.radosgw.smu-80.log
rgw dns name = smu-80.osnexus.net
rgw frontends = beast endpoint=10.0.8.80:7480
rgw print continue = false
rgw zone = us-east-1
debug rgw = 20
Now we restart the CephRGW to apply the debug settings we added to ceph.conf. There are ways to dynamically enable debug mode without changing ceph.conf (for example, ceph config set client.radosgw.smu-80 debug_rgw 20), but it’s generally best to update ceph.conf so your debug mode setting is saved between restarts.
# sudo systemctl restart ceph-radosgw@radosgw.*
Last, let’s look at the log. On RHEL you’ll find the log under /var/log/ceph/FSID/, but I’ve got my log file set to /var/log/radosgw/client.radosgw.smu-80.log, so I use this to view the Lua debug messages:
# tail -f /var/log/radosgw/client.radosgw.smu-80.log | grep Lua
Now I can see all the messages on how the objects are getting tagged as I upload objects into the object store.
2024-02-14T06:08:11.257+0000 7f16a8dc1700 20 Lua INFO: applied 'MEDIUM_OBJ' to object 'security_features_2023.pdf'
2024-02-14T06:08:12.345+0000 7f171569a700 20 Lua INFO: applied 'MEDIUM_OBJ' to object 'security_features_2023.pdf'
2024-02-14T06:08:13.389+0000 7f16a55ba700 20 Lua INFO: applied 'MEDIUM_OBJ' to object 'security_features_2023.pdf'
2024-02-14T06:08:14.465+0000 7f1742ef5700 20 Lua INFO: applied 'MEDIUM_OBJ' to object 'security_features_2023.pdf'
2024-02-14T06:08:42.928+0000 7f16d7e1f700 20 Lua INFO: applied 'SMALL_OBJ' to object 'whitepaper.docx'
2024-02-14T06:08:44.012+0000 7f1680570700 20 Lua INFO: applied 'SMALL_OBJ' to object 'whitepaper.docx'
2024-02-14T06:08:45.056+0000 7f1674558700 20 Lua INFO: applied 'SMALL_OBJ' to object 'whitepaper.docx'
2024-02-14T06:08:46.092+0000 7f1664d39700 20 Lua INFO: applied 'SMALL_OBJ' to object 'whitepaper.docx'
Testing
To test this out you’ll want to upload some objects of various sizes, and you’ll see the storage-class tag get applied to them dynamically. Note that if you assign a tag like “PERFORMANCE” to a PUT request but you haven’t configured that storage class, your data will just get routed into the pool associated with the “STANDARD” storage class, typically default.rgw.buckets.data if you have a default config.
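One easy way to check is to upload something and then look at it from the RGW side; a minimal sketch with made-up bucket, object and endpoint names (assuming your S3 credentials are already configured):

# Upload a small file; the Lua script should assign SMALL_OBJ automatically
aws --endpoint-url http://rgw.example.internal:8080 \
    s3api put-object --bucket testbucket --key notes.txt --body ./notes.txt

# Inspect the object's manifest - the assigned storage class and placement show up here
radosgw-admin object stat --bucket=testbucket --object=notes.txt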
Summary
Hope you enjoyed this tutorial on auto-tiering Ceph object storage with Lua. In the last part, PART 3, we’re going to deep-dive into setting up a Ceph object cluster with three Storage Classes from scratch using QuantaStor 6. We’ll have a companion video on YouTube where we’ll go through setting everything up, and then we’ll get into more CephRGW Lua scripting where we’ll organize objects not just by size but by matching specific file name extensions. Last, thanks to Anthony D’Atri and Yuval Lifshitz for their help in reviewing and proofreading these articles.
S3-compatible object storage systems generally have the ability to store objects in different tiers with different characteristics, so you can get the best combination of cost and performance to match the needs of any given application workload. Storage tiers are referred to as ‘Storage Classes’ in S3 parlance, with example storage classes at AWS including “STANDARD” for general purpose use and lower-cost storage classes like “DEEP_ARCHIVE” and “GLACIER” for backups and archive use cases.
Ceph’s S3-compatible storage capabilities also include the ability to create your own Storage Classes, and by default a single storage class called “STANDARD” is created automatically to match the default tier offered by AWS.
In this 3-part blog post we’re going to dive into auto-tiering object storage with Ceph and explore some basic Lua scripting along the way, which I think you’ll find approachable even if you’ve never used or even heard of Lua before:
Ceph Object Storage Basics
Ceph object storage clusters consist of two primary storage pools, one for metadata and one for data.
The metadata pool stores the index of all the objects for every bucket and contains “rgw.buckets.index” in the name. Essentially, the bucket index pool is a collection of databases, one for each bucket, which contains the list of every object in that bucket and information on the location of each chunk of data (RADOS object) that makes up each S3 object.
Data pools typically contain “rgw.buckets.data” in their name and they store all the actual data blocks (RADOS objects) that make up each S3 object in your cluster.
The metadata in the bucket index pool needs to be on fast storage that’s great for small reads and writes (IOPS), as it is essentially a collection of databases. As such (and for various technical reasons beyond this article) this pool must be configured with a replica layout and ideally should be stored on all-flash storage media. Flash storage for the bucket index pool is also important because buckets must periodically resize their bucket index databases (RocksDB based) to make room for more object metadata as a bucket grows. This process is called “resharding” and it all happens automatically behind the scenes, but resharding can greatly impact cluster performance if the bucket index pool is on HDD media rather than flash media.
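If you want to keep an eye on that resharding activity, the RGW admin tooling exposes it directly; for example (the bucket name here is just an example):

# Show objects-per-shard fill levels for each bucket
radosgw-admin bucket limit check

# See which buckets are queued for (or undergoing) dynamic resharding
radosgw-admin reshard list

# Reshard a specific bucket by hand if you need to
radosgw-admin bucket reshard --bucket=mybucket --num-shards=101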
In contrast, the data pool (eg default.rgw.buckets.data) is typically storing large chunks of data that can be written efficiently to HDDs. This is where erasure coding layouts shine and provide one with a big boost in usable capacity (usually 66% or more vs 33% usable with replica=3). Erasure coding also has great write performance when you’re working with objects that are large enough (generally anything 4MB and larger but ideally 64MB and larger) as there’s much less network write amplification when using erasure coding (~125% of client N/S traffic) vs a replica based layout (300% of client N/S traffic).
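To make those percentages concrete, the arithmetic for a k+m erasure-code profile versus replica=3 works out as follows (the 8+2 profile below is just an example that lines up with the ~125% figure above):

\[
\text{usable fraction} = \frac{k}{k+m}, \qquad \text{write traffic} \approx \frac{k+m}{k} \times \text{client traffic}
\]

Plugging in: EC 4+2 gives 4/6 ≈ 67% usable and 6/4 = 150% write traffic, EC 8+2 gives 80% usable and 125% write traffic, while replica=3 gives 1/3 ≈ 33% usable and 300% write traffic.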
Object Storage Zones and Zone Groups
A zone contains a complete copy of all of your S3 objects, and those can be mirrored in whole or in part to other zones. When you want to mirror everything to another zone, you put the zones you want mirrored together into a Zone Group. Typically the name of the Zone Group is also the name of the S3 realm, like “us” or “eu”, and the zone will have a name like “us-east-1” or “us-west-1”, to borrow from some common AWS zone names. When setting up your Ceph object storage for use with products like Veritas NetBackup we recommend using an AWS zone name like “us-east-1” for compatibility, as some products specifically look for known AWS zone and realm names. A cluster setup with a zone “us-east-1” and zone group (and realm) of “us” will look like this.
Object Storage Classes
Storage classes provide us with a way to tag objects to go into the data pool of our choosing. When you set up a cluster with a single data pool like you see above you’ll have a single storage class mapped to it called “STANDARD” and your cluster will look like this.
Auto-tiering via Multiple Storage Classes
So now we get to the heart of this article. What if all your data isn’t composed of large objects? What if you have millions or billions of small objects mixed in with large objects? You want to use erasure coding for the large objects, but that’ll be wasteful and expensive for the small objects (eg. 1K to 64K). And if you use replica=3 as the layout for the data pool you’ll only get 33% usable capacity, so you’ll run out of space and need three times more storage. This is where multiple data pools come to the rescue. Without buying any additional storage we can share the underlying media (OSDs) with the existing pools and make new data pools to give ourselves additional layout options. Here’s an example where we add two additional data pools and associated storage classes: SMALL_OBJ for objects < 16K and MEDIUM_OBJ for everything from 16K to 1MB:
So now we have a storage class “SMALL_OBJ” that’s only going to use a few gigabytes for every million small objects and will be able to read and write those efficiently. We also have a HDD based “MEDIUM_OBJ” storage class that is also using a replica=3 layout like “SMALL_OBJ” but this pool is on HDD media so it’s less costly and allows us to reasonably store roughly one million 1MB objects in just 1TB of space. For everything else we’ll route it to our erasure-coded default “STANDARD” storage class. Note also that some applications written for AWS S3 won’t accept custom Storage Class names like “SMALL_OBJ” so if you run into compatibility issues, try choosing from pre-defined Storage Class names used by AWS.
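For reference, extra storage classes like these are registered on the zonegroup’s placement target and then mapped to a data pool in the zone; a hedged sketch using the zone names from this series (the pool name is ours):

# Add the storage class to the zonegroup's default placement target
radosgw-admin zonegroup placement add --rgw-zonegroup us \
    --placement-id default-placement --storage-class SMALL_OBJ

# Map it to a replica-3, flash-backed data pool in the zone
radosgw-admin zone placement add --rgw-zone us-east-1 \
    --placement-id default-placement --storage-class SMALL_OBJ \
    --data-pool us-east-1.rgw.buckets.data.small

# Commit the period so the gateways pick up the change
radosgw-admin period update --commit

After the commit, uploads that request SMALL_OBJ land in that pool.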
Users Don’t Aim
Ok, so you’ve done all the above, you’ve got an optimally configured object storage cluster but now your users are calling and saying it’s slow. So you look into it and you find that your users are not making any effort to categorize their objects into the right Storage Class categories (i.e. by setting the S3 X-Amz-Storage-Class header) when they upload objects. It’s like trying to get everyone to organize and separate their recycling. But in this case we have a secret weapon, that’s Lua, and in the next article we’re going to use a few lines of scripting to put our objects into the right Storage Class every time so that users won’t need to do a thing.
(Following the recycle bin analogy, we’ll be able to just throw the object in the general direction of the bins and it will always land in the correct bin. No aiming required, just like Shane from Stuff Made Here! 😀 )
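For comparison, here’s what a ‘well-aimed’ upload looks like when the client does set the storage class explicitly; a minimal sketch with made-up bucket and endpoint names (note that some S3 clients validate the storage class against the AWS-defined list, so a custom name like MEDIUM_OBJ may need a client that passes the header through verbatim):

# Explicitly requesting a storage class on upload - the thing users rarely bother to do
aws --endpoint-url http://rgw.example.internal:8080 \
    s3api put-object --bucket testbucket --key report.pdf --body ./report.pdf \
    --storage-class MEDIUM_OBJ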
This is the fifteenth, and expected to be the last, backport release in the Pacific series.
ceph config dump --format <json|xml> output will display the localized option names instead of their normalized version. For example, "mgr/prometheus/x/server_port" will be displayed instead of "mgr/prometheus/server_port". This matches the output of the non pretty-print formatted version of the command.
CephFS: MDS evicts clients who are not advancing their request tids, which causes a large buildup of session metadata, resulting in the MDS going read-only due to the RADOS operation exceeding the size threshold. The mds_session_metadata_threshold config controls the maximum size that an (encoded) session metadata can grow.
RADOS: The get_pool_is_selfmanaged_snaps_mode C++ API has been deprecated due to its susceptibility to false negative results. Its safer replacement is pool_is_in_selfmanaged_snaps_mode.
RBD: When diffing against the beginning of time (fromsnapname == NULL) in fast-diff mode (whole_object == true with fast-diff image feature enabled and valid), diff-iterate is now guaranteed to execute locally if exclusive lock is available. This brings a dramatic performance improvement for QEMU live disk synchronization and backup use cases.
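For the curious, the CLI equivalent of this code path is a whole-object diff against the beginning of time; a minimal sketch (the pool/image name is a placeholder):

# Full diff since image creation (no --from-snap), computed in whole-object mode;
# with fast-diff enabled this now runs locally when the client holds the exclusive lock
rbd diff --whole-object rbd/vm-disk-1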
[CVE-2023-43040] rgw: Fix bucket validation against POST policies (pr#53758, Joshua Baergen)
admin/doc-requirements: bump Sphinx to 5.0.2 (pr#55258, Nizamudeen A)
blk/kernel: Add O_EXCL for block devices (pr#53567, Adam Kupczyk)
Bluestore: fix bluestore collection_list latency perf counter (pr#52949, Wangwenjuan)
bluestore: Fix problem with volume selector (pr#53587, Adam Kupczyk)
ceph-volume,python-common: Data allocate fraction (pr#53581, Jonas Pfefferle)
ceph-volume: add --osd-id option to raw prepare (pr#52928, Guillaume Abrioux)
ceph-volume: fix a bug in _check_generic_reject_reasons (pr#54707, Kim Minjong, Guillaume Abrioux, Michael English)
ceph-volume: fix raw list for lvm devices (pr#52981, Guillaume Abrioux)
ceph-volume: fix zap_partitions() in devices.lvm.zap (pr#55658, Guillaume Abrioux)
ceph-volume: fix zap_partitions() in devices.lvm.zap (pr#55481, Guillaume Abrioux)
ceph-volume: fixes fallback to stat in is_device and is_partition (pr#54709, Guillaume Abrioux, Teoman ONAY)
ceph: allow xlock state to be LOCK_PREXLOCK when putting it (pr#53662, Xiubo Li)
cephadm: add tcmu-runner to logrotate config (pr#53975, Adam King)
cephadm: Adding support to configure public_network cfg section (pr#52411, Redouane Kachach)
cephadm: allow ports to be opened in firewall during adoption, reconfig, redeploy (pr#52083, Adam King)
cephadm: make custom_configs work for tcmu-runner container (pr#53469, Adam King)
cephadm: run tcmu-runner through script to do restart on failure (pr#53977, Adam King, Raimund Sacherer)
cephfs-journal-tool: disambiguate usage of all keyword (in tool help) (pr#53645, Manish M Yathnalli)
cephfs-mirror: do not run concurrent C_RestartMirroring context (issue#62072, pr#53640, Venky Shankar)
cephfs-top: include the missing fields in --dump output (pr#53453, Jos Collin)
cephfs: upgrade cephfs-shell\'s path wherever necessary (pr#54144, Rishabh Dave)
cephfs_mirror: correctly set top level dir permissions (pr#53270, Milind Changire)
client: always refresh mds feature bits on session open (issue#63188, pr#54245, Venky Shankar)
client: fix sync fs to force flush mdlog for all sessions (pr#53981, Xiubo Li)
client: issue a cap release immediately if no cap exists (pr#52852, Xiubo Li)
client: queue a delay cap flushing if there are ditry caps/snapcaps (pr#54472, Xiubo Li)
cmake/modules/BuildRocksDB.cmake: inherit parent\'s CMAKE_CXX_FLAGS (pr#55500, Kefu Chai)
common/weighted_shuffle: don\'t feed std::discrete_distribution with all-zero weights (pr#55155, Radosław Zarzyński)
common: intrusive_lru destructor add (pr#54558, Ali Maredia)
doc/cephfs: note regarding start time time zone (pr#53576, Milind Changire)
doc/cephfs: write cephfs commands fully in docs (pr#53403, Rishabh Dave)
doc/rados/configuration/bluestore-config-ref: Fix lowcase typo (pr#54696, Adam Kupczyk)
doc/rados: update config for autoscaler (pr#55440, Zac Dover)
doc: clarify use of rados rm command (pr#51260, J. Eric Ivancich)
doc: discuss the standard multi-tenant CephFS security model (pr#53560, Greg Farnum)
Fixing example of BlueStore resharding (pr#54474, Adam Kupczyk)
isa-l: incorporate fix for aarch64 text relocation (pr#51314, luo rixin)
libcephsqlite: fill 0s in unread portion of buffer (pr#53103, Patrick Donnelly)
librados: make querying pools for selfmanaged snaps reliable (pr#55024, Ilya Dryomov)
librbd: Append one journal event per image request (pr#54820, Joshua Baergen)
librbd: don\'t report HOLE_UPDATED when diffing against a hole (pr#54949, Ilya Dryomov)
librbd: fix regressions in ObjectListSnapsRequest (pr#54860, Ilya Dryomov)
librbd: improve rbd_diff_iterate2() performance in fast-diff mode (pr#55256, Ilya Dryomov)
librbd: kick ExclusiveLock state machine on client being blocklisted when waiting for lock (pr#53295, Ramana Raja)
librbd: make CreatePrimaryRequest remove any unlinked mirror snapshots (pr#53274, Ilya Dryomov)
log: fix the formatting when dumping thread IDs (pr#53465, Radoslaw Zarzynski)
log: Make log_max_recent have an effect again (pr#48311, Joshua Baergen)
make-dist: don\'t use --continue option for wget (pr#55090, Casey Bodley)
make-dist: download liburing from kernel.io instead of github (pr#53197, Laura Flores)
MClientRequest: properly handle ceph_mds_request_head_legacy for ext_num_retry, ext_num_fwd, owner_uid, owner_gid (pr#54410, Alexander Mikhalitsyn)
mds,qa: some balancer debug messages (<=5) not printed when debug_mds is >=5 (pr#53552, Patrick Donnelly)
mds/Server: mark a cap acquisition throttle event in the request (pr#53169, Leonid Usov)
mds: acquire inode snaplock in open (pr#53185, Patrick Donnelly)
mds: add event for batching getattr/lookup (pr#53556, Patrick Donnelly)
mds: adjust pre_segments_size for MDLog when trimming segments for st… (issue#59833, pr#54033, Venky Shankar)
mds: blocklist clients with \\"bloated\\" session metadata (issue#61947, issue#62873, pr#53634, Venky Shankar)
mds: drop locks and retry when lock set changes (pr#53243, Patrick Donnelly)
mds: ensure next replay is queued on req drop (pr#54314, Patrick Donnelly)
mds: fix deadlock between unlinking and linkmerge (pr#53495, Xiubo Li)
mds: fix issuing redundant reintegrate/migrate_stray requests (pr#54517, Xiubo Li)
mds: log message when exiting due to asok command (pr#53550, Patrick Donnelly)
mds: replacing bootstrap session only if handle client session message (pr#53362, Mer Xuanyi)
mds: report clients laggy due laggy OSDs only after checking any OSD is laggy (pr#54120, Dhairya Parmar)
mds: set the loner to true for LOCK_EXCL_XSYN (pr#54912, Xiubo Li)
mds: use variable g_ceph_context directly in MDSAuthCaps (pr#52821, Rishabh Dave)
mgr/BaseMgrModule: Optimize CPython Call in Finish Function (pr#55109, Nitzan Mordechai)
mgr/cephadm: Add \\"networks\\" parameter to orch apply rgw (pr#53974, Teoman ONAY)
mgr/cephadm: ceph orch add fails when ipv6 address is surrounded by square brackets (pr#53978, Teoman ONAY)
mgr/dashboard: add \'omit_usage\' query param to dashboard api \'get rbd\' endpoint (pr#54192, Cory Snyder)
mgr/dashboard: allow tls 1.2 with a config option (pr#53781, Nizamudeen A)
mgr/dashboard: Consider null values as zero in grafana panels (pr#54542, Aashish Sharma)
mgr/dashboard: fix CephPGImbalance alert (pr#49478, Aashish Sharma)
mgr/dashboard: Fix CephPoolGrowthWarning alert (pr#49477, Aashish Sharma)
mgr/dashboard: fix constraints.txt (pr#54652, Ernesto Puerta)
mgr/dashboard: fix rgw page issues when hostname not resolvable (pr#53215, Nizamudeen A)
mgr/dashboard: set CORS header for unauthorized access (pr#53202, Nizamudeen A)
mgr/prometheus: avoid duplicates and deleted entries for rbd_stats_pools (pr#48524, Avan Thakkar)
mgr/prometheus: change pg_repaired_objects name to pool_repaired_objects (pr#48439, Pere Diaz Bou)
mgr/prometheus: fix pool_objects_repaired and daemon_health_metrics format (pr#51692, banuchka)
mgr/rbd_support: fix recursive locking on CreateSnapshotRequests lock (pr#54293, Ramana Raja)
mgr/snap-schedule: use the right way to check the result returned by… (pr#53355, Mer Xuanyi)
mgr/snap_schedule: allow retention spec \'n\' to be user defined (pr#52750, Milind Changire, Jakob Haufe)
mgr/volumes: Fix pending_subvolume_deletions in volume info (pr#53574, Kotresh HR)
mgr: Add one finisher thread per module (pr#51045, Kotresh HR, Patrick Donnelly)
mgr: add throttle policy for DaemonServer (pr#54013, ericqzhao)
mgr: don\'t dump global config holding gil (pr#50194, Mykola Golub)
mgr: fix a race condition in DaemonServer::handle_report() (pr#52993, Radoslaw Zarzynski)
mgr: register OSDs in ms_handle_accept (pr#53189, Patrick Donnelly)
mgr: remove out&down osd from mgr daemons (pr#54553, shimin)
mon/ConfigMonitor: Show localized name in "config dump --format json" output (pr#53984, Sridhar Seshasayee)
mon/MonClient: resurrect original client_mount_timeout handling (pr#52533, Ilya Dryomov)
mon/Monitor.cc: exit function if !osdmon()->is_writeable() && mon/OSDMonitor: Added extra check before mon.go_recovery_stretch_mode() (pr#51414, Kamoltat)
mon/Monitor: during shutdown don't accept new authentication and crea… (pr#55113, Nitzan Mordechai)
mon: add exception handling to ceph health mute (pr#55118, Daniel Radjenovic)
mon: add proxy to cache tier options (pr#50552, tan changzhi)
mon: fix health store size growing infinitely (pr#55472, Wei Wang)
mon: fix iterator mishandling in PGMap::apply_incremental (pr#52555, Oliver Schmidt)
mon: fix mds metadata lost in one case (pr#54318, shimin)
msg/async: initialize worker in RDMAStack::create_worker() and drop Stack::num_workers (pr#55443, Kefu Chai)
msg/AsyncMessenger: re-evaluate the stop condition when woken up in 'wait()' (pr#53716, Leonid Usov)
nofail option in fstab not supported (pr#52987, Leonid Usov)
os/bluestore: don't require bluestore_db_block_size when attaching new (pr#52948, Igor Fedotov)
os/bluestore: get rid off resulting lba alignment in allocators (pr#54434, Igor Fedotov)
osd,bluestore: gracefully handle a failure during meta collection load (pr#53135, Igor Fedotov)
osd/OpRequest: Add detail description for delayed op in osd log file (pr#53693, Yite Gu)
osd/OSD: introduce reset_purged_snaps_last (pr#53970, Matan Breizman)
osd/OSDMap: Check for uneven weights & != 2 buckets post stretch mode (pr#52459, Kamoltat)
osd/scrub: Fix scrub starts messages spamming the cluster log (pr#53430, Prashant D)
osd: don't require RWEXCL lock for stat+write ops (pr#54593, Alice Zhao)
osd: ensure async recovery does not drop a pg below min_size (pr#54548, Samuel Just)
osd: fix shard-threads cannot wakeup bug (pr#51262, Jianwei Zhang)
osd: fix use-after-move in build_incremental_map_msg() (pr#54268, Ronen Friedman)
osd: log the number of extents for sparse read (pr#54604, Xiubo Li)
pacifc: Revert "mgr/dashboard: unselect rows in datatables" (pr#55415, Nizamudeen A)
pybind/mgr/autoscaler: Donot show NEW PG_NUM value if autoscaler is not on (pr#53464, Prashant D)
pybind/mgr/mgr_util: fix to_pretty_timedelta() (pr#51243, Sage Weil)
pybind/mgr/volumes: log mutex locks to help debug deadlocks (pr#53916, Kotresh HR)
pybind/mgr: ceph osd status crash with ZeroDivisionError (pr#46696, Nitzan Mordechai, Kefu Chai)
pybind/rados: don't close watch in dealloc if already closed (pr#51259, Tim Serong)
pybind/rados: fix missed changes for PEP484 style type annotations (pr#54361, Igor Fedotov)
pybind/rbd: don't produce info on errors in aio_mirror_image_get_info() (pr#54053, Ilya Dryomov)
python-common/drive_group: handle fields outside of 'spec' even when 'spec' is provided (pr#52413, Adam King)
python-common/drive_selection: lower log level of limit policy message (pr#52412, Adam King)
qa/distros: backport update from rhel 8.4 -> 8.6 (pr#54901, Casey Bodley, David Galloway)
qa/suites/krbd: stress test for recovering from watch errors (pr#53784, Ilya Dryomov)
qa/suites/orch: whitelist warnings that are expected in test environments (pr#55523, Laura Flores)
qa/suites/rbd: add test to check rbd_support module recovery (pr#54294, Ramana Raja)
qa/suites/upgrade/pacific-p2p: run librbd python API tests from pacific tip (pr#55418, Yuri Weinstein)
qa/suites/upgrade/pacific-p2p: skip TestClsRbd.mirror_snapshot test (pr#53204, Ilya Dryomov)
qa/suites: added more whitelisting + fix typo (pr#55717, Kamoltat)
qa/tasks/cephadm: enable mon_cluster_log_to_file (pr#55429, Dan van der Ster)
qa/upgrade: disable a failing ceph_test_cls_cmpomap test case (pr#55519, Casey Bodley)
qa/upgrade: use ragweed branch for starting ceph release (pr#55382, Casey Bodley)
qa/workunits/rbd/cli_generic.sh: narrow race window when checking that rbd_support module command fails after blocklisting the module's client (pr#54771, Ramana Raja)
qa: assign file system affinity for replaced MDS (issue#61764, pr#54039, Venky Shankar)
qa: ignore expected cluster warning from damage tests (pr#53486, Patrick Donnelly)
qa: lengthen shutdown timeout for thrashed MDS (pr#53555, Patrick Donnelly)
qa: pass arg as list to fix test case failure (pr#52763, Dhairya Parmar)
qa: remove duplicate import (pr#53447, Patrick Donnelly)
qa: run kernel_untar_build with newer tarball (pr#54713, Milind Changire)
qa: wait for file to have correct size (pr#52744, Patrick Donnelly)
rados: build minimally when "WITH_MGR" is off (pr#51250, J. Eric Ivancich)
rados: increase osd_max_write_op_reply_len default to 64 bytes (pr#53470, Matt Benjamin)
RadosGW API: incorrect bucket quota in response to HEAD /{bucket}/?usage (pr#53439, shreyanshjain7174)
radosgw-admin: allow 'bi purge' to delete index if entrypoint doesn't exist (pr#54010, Casey Bodley)
radosgw-admin: don't crash on --placement-id without --storage-class (pr#53474, Casey Bodley)
radosgw-admin: fix segfault on pipe modify without source/dest zone specified (pr#51256, caisan)
rbd-nbd: fix stuck with disable request (pr#54256, Prasanna Kumar Kalever)
rgw - Fix NoSuchTagSet error (pr#50533, Daniel Gryniewicz)
rgw/auth: ignoring signatures for HTTP OPTIONS calls (pr#55550, Tobias Urdin)
rgw/beast: add max_header_size option with 16k default, up from 4k (pr#52113, Casey Bodley)
rgw/keystone: EC2Engine uses reject() for ERR_SIGNATURE_NO_MATCH (pr#53764, Casey Bodley)
rgw/notification: remove non x-amz-meta-* attributes from bucket notifications (pr#53376, Juan Zhu)
rgw/putobj: RadosWriter uses part head object for multipart parts (pr#55586, Casey Bodley)
rgw/s3: ListObjectsV2 returns correct object owners (pr#54160, Casey Bodley)
rgw/sts: AssumeRole no longer writes to user metadata (pr#52051, Casey Bodley)
rgw/sts: code for returning an error when an IAM policy (pr#44462, Pritha Srivastava)
rgw/sts: code to fetch certs using .well-known/openid-configuration URL (pr#44464, Pritha Srivastava)
rgw/sts: createbucket op should take session_policies into account (pr#44476, Pritha Srivastava)
rgw/sts: fix read_obj_policy permission evaluation (pr#44471, Pritha Srivastava)
rgw/sts: fixes getsessiontoken authenticated with LDAP (pr#44463, Pritha Srivastava)
rgw/swift: check position of first slash in slo manifest files (pr#51600, Marcio Roberto Starke)
rgw/sync-policy: Correct "sync status" & "sync group" commands (pr#53410, Soumya Koduri)
rgw: 'bucket check' deletes index of multipart meta when its pending_map is nonempty (pr#54016, Huber-ming)
rgw: add radosgw-admin bucket check olh/unlinked commands (pr#53808, Cory Snyder)
rgw: Avoid segfault when OPA authz is enabled (pr#46106, Benoît Knecht)
rgw: beast frontend checks for local_endpoint() errors (pr#54167, Casey Bodley)
rgw: Drain async_processor request queue during shutdown (pr#53472, Soumya Koduri)
rgw: fix 2 null versionID after convert_plain_entry_to_versioned (pr#53400, rui ma, zhuo li)
rgw: Fix Browser POST content-length-range min value (pr#52936, Robin H. Johnson)
rgw: fix FP error when calculating enteries per bi shard (pr#53593, J. Eric Ivancich)
rgw: fix rgw cache invalidation after unregister_watch() error (pr#54014, lichaochao)
rgw: fix SignatureDoesNotMatch when extra headers start with 'x-amz' (pr#53772, rui ma)
rgw: Fix truncated ListBuckets response (pr#49526, Joshua Baergen)
rgw: fix unwatch crash at radosgw startup (pr#53759, lichaochao)
rgw: fix UploadPartCopy error code when src object not exist and src bucket not exist (pr#53356, yuliyang)
rgw: handle http options CORS with v4 auth (pr#53416, Tobias Urdin)
rgw: improve buffer list utilization in the chunkupload scenario (pr#53775, liubingrun)
rgw: multisite data log flag not used (pr#52055, J. Eric Ivancich)
rgw: pick http_date in case of http_x_amz_date absence (pr#53443, Seena Fallah, Mohamed Awnallah)
rgw: prevent spurious/lost notifications in the index completion thread (pr#49093, Casey Bodley, Yuval Lifshitz)
rgw: retry metadata cache notifications with INVALIDATE_OBJ (pr#52797, Casey Bodley)
rgw: s3 object lock avoids overflow in retention date (pr#52605, Casey Bodley)
rgw: s3website doesn't prefetch for web_dir() check (pr#53769, Casey Bodley)
rgw: set keys from from master zone on admin api user create (pr#51602, Ali Maredia)
rgw: Solving the issue of not populating etag in Multipart upload result (pr#51445, Ali Masarwa)
rgw: swift : check for valid key in POST forms (pr#52729, Abhishek Lekshmanan)
rgw: Update "CEPH_RGW_DIR_SUGGEST_LOG_OP" for remove entries (pr#50540, Soumya Koduri)
rgw: use unique_ptr for flat_map emplace in BucketTrimWatche (pr#52996, Vedansh Bhartia)
rgwlc: prevent lc for one bucket from exceeding time budget (pr#53562, Matt Benjamin)
test/lazy-omap-stats: Various enhancements (pr#50518, Brad Hubbard)
test/librbd: avoid config-related crashes in DiscardWithPruneWriteOverlap (pr#54859, Ilya Dryomov)
test/store_test: adjust physical extents to inject error against (pr#54782, Igor Fedotov)
tools/ceph_objectstore_tool: action_on_all_objects_in_pg to skip pgmeta (pr#54691, Matan Breizman)
tools/ceph_objectstore_tool: Support get/set/superblock (pr#55013, Matan Breizman)
tools/osdmaptool: fix possible segfaults when there are down osds (pr#52203, Mykola Golub)
Tools/rados: Improve Error Messaging for Object Name Resolution (pr#55111, Nitzan Mordechai)
vstart_runner: maintain log level when --debug is passed (pr#52977, Rishabh Dave)
vstart_runner: use FileNotFoundError when os.stat() fails (pr#52978, Rishabh Dave)
win32_deps_build.sh: change Boost URL (pr#55086, Lucian Petrut)
There is no silver bullet regarding RocksDB performance. Now that I have your attention: you might be "lucky" if you are using upstream Ceph Ubuntu packages.
Mark Nelson found out that, before the relevant pull request (PR) was merged, the build process did not properly propagate the CMAKE_BUILD_TYPE option to external projects built by Ceph - in this case, RocksDB. As a result, packages were not built with the RelWithDebInfo setting needed to produce a "performance" release package. While it has not been verified, it is possible that upstream Ceph Ubuntu packages have suffered from this since Luminous.
Thanks to Mark Nelson for finding and fixing this issue. Thanks to Kefu Chai for providing a fix for the build system. Thanks to Casey Bodley for taking care of creating the backport trackers. Thanks to my employer BIT for having me work on Ceph, and to Els de Jong and Anthony D'Atri for editing.
RocksDB performance is sub-optimal when built without RelWithDebInfo. This can be mitigated by installing "performance" package builds. The actual performance increase depends on the cluster, but RocksDB compaction time is reduced by a factor of three. In some cases random 4K write performance is doubled. See these links 1 and 2.
Install a version where this problem is resolved for the release you are running: Pacific, Quincy, or Reef.
If you are running an EOL version of Ceph, you can build it yourself. See the documentation, or follow the short version below:
git clone https://github.com/ceph/ceph.git
cd ceph
git checkout vYour_Release_Version
# add "extraopts += -DCMAKE_BUILD_TYPE=RelWithDebInfo" to the debian/rules file
./do_cmake.sh -DCMAKE_BUILD_TYPE=RelWithDebInfo
dpkg-buildpackage -us -uc -j$DOUBLE_NUMBER_OF_CORES_BUILD_HOST 2>&1 | tee ../dpkg-buildpackage.log
Note: add the "-b" option to dpkg-buildpackage if you only want binary packages and no source packages. Make sure you have enough file space available, and enough memory, especially when building with a lot of threads. I used a VM with 256 GB of RAM, 64 cores, and 300 GB of disk space; a full build including source packages took around 1 hour and 7 minutes.
Make sure you check dpkg-buildpackage.log for DCMAKE_BUILD_TYPE=RelWithDebInfo, like below:
cd /home/stefan/ceph/obj-x86_64-linux-gnu/src/rocksdb && /usr/bin/cmake -DCMAKE_POSITION_INDEPENDENT_CODE=ON -DWITH_GFLAGS=OFF -DCMAKE_PREFIX_PATH= -DCMAKE_CXX_COMPILER=/usr/bin/c++ -DWITH_SNAPPY=TRUE -DWITH_LZ4=TRUE -Dlz4_INCLUDE_DIRS=/usr/include -Dlz4_LIBRARIES=/usr/lib/x86_64-linux-gnu/liblz4.so -DWITH_ZLIB=TRUE -DPORTABLE=ON -DCMAKE_AR=/usr/bin/ar -DCMAKE_BUILD_TYPE=RelWithDebInfo -DFAIL_ON_WARNINGS=OFF -DUSE_RTTI=1 "-GUnix Makefiles" -DCMAKE_C_FLAGS=-Wno-stringop-truncation "-DCMAKE_CXX_FLAGS='-Wno-deprecated-copy -Wno-pessimizing-move'" "-GUnix Makefiles" /home/stefan/ceph/src/rocksdb
We installed the rebuilt packages (of which ceph-osd is the most important) on our storage nodes, and the results are striking:
And to zoom in on one specific OSD:
If you ever find yourself needing to compact your OSDs, you are in for a pleasant surprise. Compacting OSDs brings about three major benefits.
Some graphs from server metrics illustrate this. Our procedure for compacting OSDs involves a staggered shutdown of the OSDs. Once all OSDs are shut down, compaction is performed in parallel (df | grep "/var/lib/ceph/osd" | awk '{print $6}' | cut -d '-' -f 2 | sort -n | xargs -n 1 -P 10 -I OSD ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-OSD compact).
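For readers who want to try this on a single OSD first, here is a minimal sketch of the same offline compaction, assuming a systemd-managed OSD with id 0 and the default data path; adjust the id and path for your deployment.

# Stop one OSD, compact its BlueStore/RocksDB metadata offline, then start it again.
systemctl stop ceph-osd@0
ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-0 compact
systemctl start ceph-osd@0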
Debug package disk IOPS (before)
Performance package disk IOPS (after)
This is a node with SATA SSDs. A node with NVMe disks is even faster. Compaction time is between 3-5 minutes.
Debug package disk IOPS (before)
Performance package disk IOPS (after)
Please note the increasing difference in OSD compaction time between SATA SSDs and NVMe drives. Previously, this gap was not as big due to the performance issues with RocksDB. However, with the introduction of faster and lower-latency disks, this difference has become more pronounced. This suggests that the most significant performance improvements can be seen in clusters equipped with faster disks.
In this specific cluster, the performance packages have reduced the time needed to compact all OSDs by approximately one-third. Previously taking nine and a half hours, the process now completes in six hours.
While we hoped to see performance double, this was not the case. However, we still saw some significant improvements. To detect gray failure, we have virtual machines running on our cloud that continuously (at 5-minute intervals) monitor performance as experienced from within the virtual machine. One of these tests is a FIO 4K random write test (single threaded, queue depth of 1).
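For reference, a minimal sketch of such a test, assuming fio is installed in the guest and /mnt/test is a mounted filesystem backed by the Ceph cluster (the path, file size, and runtime are illustrative):

# Single-threaded 4K random write test at queue depth 1, run against a Ceph-backed mount.
fio --name=4k-randwrite-qd1 --filename=/mnt/test/fio.dat --size=1G \
    --rw=randwrite --bs=4k --iodepth=1 --numjobs=1 \
    --ioengine=libaio --direct=1 --time_based --runtime=60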
4K random write on CephFS (before)
4K random write on CephFS (after)
We have also performed other fio benchmarks and the main benefits are that standard deviation is lower, and tail latencies have decreased.
Okay, that might be way too much in practice, but to catch issues like this it's good to have a couple of different tests performed.
Over time, conducting these tests can lead to more standardized metrics, such as an "IO/Watt" or "throughput/Watt" ratio, which would allow for easier comparison across different tests. Perhaps we could develop Ceph-specific tests for use with tools like the Phoronix test-suite?
Although not an issue in this particular case, it is worth mentioning that there could be performance regressions related to specific CPU architectures. For example, having both an ARM64 and an x86-64 performance cluster could reveal discrepancies tied to specific build options. This approach helps catch such regressions early on.
Thank you for reading, and if you have any questions or would like to talk more about Ceph, please feel free to reach out.
I can't believe they figured it out first. That was the thought going through my head back in mid-December after several weeks of 12-hour days debugging why this cluster was slow. This was probably the most intense performance analysis I'd done since Inktank. Half-forgotten superstitions from the 90s about appeasing SCSI gods flitted through my consciousness. The 90s? Man, I'm getting old. We were about two-thirds of the way through the work that would let us start over at the beginning. Speaking of which, I'll start over at the beginning.
Back in 2023 (I almost said earlier this year until I remembered we're in 2024), Clyso was approached by a fairly hip and cutting edge company that wanted to transition their HDD backed Ceph cluster to a 10 petabyte NVMe deployment. They were immediately interesting. They had no specific need for RBD, RGW, or CephFS. They had put together their own hardware design, but to my delight approached us for feedback before actually purchasing anything. They had slightly unusual requirements. The cluster had to be spread across 17 racks with 4U of space available in each. Power, cooling, density, and vendor preference were all factors. The new nodes needed to be migrated into the existing cluster with no service interruption. The network however was already built, and it's a beast. It's one of the fastest Ethernet setups I've ever seen. I knew from the beginning that I wanted to help them build this cluster. I also knew we'd need to do a pre-production burn-in and that it would be the perfect opportunity to showcase what Ceph can do on a system like this. What follows is the story of how we built and tested that cluster and how far we were able to push it.
I would first like to thank our amazing customer who made all of this possible. You were a pleasure to work with! Thank you as well for allowing us here at Clyso to share this experience with the Ceph community. It is through this sharing of knowledge that we make the world a better place. Thank you to IBM/Red Hat and Samsung for providing the Ceph community with the hardware used for comparison testing. It was invaluable to be able to evaluate the numbers we were getting against previous tests from the lab. Thank you to all of the Ceph contributors who have worked tirelessly to make Ceph great! Finally, thank you especially to Anthony D'Atri and Lee-Ann Pullar for their amazing copyediting skills!
When the customer first approached Clyso, they proposed a configuration utilizing 34 dual-socket 2U nodes spread across 17 racks. We provided a couple of alternative configurations from multiple vendors with a focus on smaller nodes. Ultimately they decided to go with a Dell architecture we designed, which quoted at roughly 13% cheaper than the original configuration despite having several key advantages. The new configuration has less memory per OSD (still comfortably 12GiB each), but faster memory throughput. It also provides more aggregate CPU resources, significantly more aggregate network throughput, a simpler single-socket configuration, and utilizes the newest generation of AMD processors and DDR5 RAM. By employing smaller nodes, we halved the impact of a node failure on cluster recovery.
The customer indicated they would like to limit the added per-rack power consumption to around 1000-1500 watts. With 4 of these nodes per rack, the aggregate TDP is estimated to be at least 1120 Watts plus base power usage, CPU overage peaks, and power supply inefficiency. IE it's likely we're pushing it a bit under load, but we don't expect significant deviation beyond the acceptable range. If worse came to worst, we estimated we could shave off roughly 100 watts per rack by lowering the processor cTDP.
Specs for the system are shown below:
Nodes | 68 x Dell PowerEdge R6615 |
---|---|
CPU | 1 x AMD EPYC 9454P 48C/96T |
Memory | 192GiB DDR5 |
Network | 2 x 100GbE Mellanox ConnectX-6 |
NVMe | 10 x Dell 15.36TB Enterprise NVMe Read Intensive AG |
OS Version | Ubuntu 20.04.6 (Focal) |
Ceph Version | Quincy v17.2.7 (Upstream Deb Packages) |
An additional benefit of utilizing 1U Dell servers is that they are essentially a newer refresh of the systems David Galloway and I designed for the upstream Ceph performance lab. These systems have been tested in a variety of articles over the past couple of years. It turns out that there was a major performance-impacting issue that came out during testing that did not affect the previous generation of hardware in the upstream lab but did affect this new hardware. We'll talk about that more later.
Without getting into too many details, I will reiterate that the customer's network configuration is very well-designed and quite fast. It easily has enough aggregate throughput across all 17 racks to let a cluster of this scale really stretch its legs.
To do the burn-in testing, ephemeral Ceph clusters were deployed and FIO tests were launched using CBT. CBT was configured to deploy Ceph with several modified settings. OSDs were assigned an 8GB osd_memory_target. In production, a higher osd_memory_target should be acceptable. The customer had no need to test block or S3 workloads, so one might assume that RADOS bench would be the natural benchmark choice. In my experience, testing at a large scale with RADOS bench is tricky. It's tough to determine how many instances are needed to saturate the cluster at given thread counts. I've run into issues in the past where multiple concurrent pools were needed to scale performance. I also didn't have any preexisting RADOS bench tests handy to compare against. Instead, we opted to do burn-in testing using the same librbd backed FIO testing we've used in the upstream lab. This allowed us to partition the cluster into smaller chunks and compare results with previously published results. FIO is also very well known and well-trusted.
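Expressed as a plain Ceph command rather than CBT configuration, the memory target override described above corresponds to roughly the following sketch (CBT injects this through its own config; the value is simply 8 GiB in bytes):

ceph config set osd osd_memory_target 8589934592   # 8 GiB per OSD for the burn-in; production would typically use more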
A major benefit of the librbd engine in FIO (versus utilizing FIO with kernel RBD) is that there are no issues with stale mount points potentially requiring system reboots. We did not have IPMI access to this cluster and we were under a tight deadline to complete tests. For that reason, we ultimately skipped kernel RBD tests. Based on previous testing, however, we expect the aggregate performance to be roughly similar given sufficient clients. We were able, however, to test both 3X replication and 6+2 erasure coding. We also tested msgr V2 in both unencrypted and secure mode using the following Ceph options:
ms_client_mode = secure
ms_cluster_mode = secure
ms_service_mode = secure
ms_mon_client_mode = secure
ms_mon_cluster_mode = secure
ms_mon_service_mode = secure
OSDs were allowed to use all cores on the nodes. FIO was configured to first pre-fill RBD volume(s) with large writes, followed by 4MB and 4KB IO tests for 300 seconds each (60 seconds during debugging runs). Certain background processes, such as scrub, deep scrub, PG autoscaling, and PG balancing were disabled.
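A hedged sketch of what one of these librbd-backed FIO jobs and the quiesced background work can look like from the command line (the pool and image names are hypothetical; CBT generates its own job files and settings):

# Quiesce background work for the duration of the benchmark, as described above.
ceph osd set noscrub
ceph osd set nodeep-scrub
ceph balancer off
ceph osd pool set rbdbench pg_autoscale_mode off   # hypothetical pool name

# One 4MB random-read job against a pre-created RBD image via librbd (no kernel mount needed).
fio --name=4m-randread --ioengine=rbd --clientname=admin \
    --pool=rbdbench --rbdname=fio_image_0 \
    --rw=randread --bs=4M --iodepth=128 --numjobs=1 \
    --direct=1 --time_based --runtime=300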
Later in this article, you'll see some eye-popping PG counts being tested. This is intentional. We know from previous upstream lab testing that the PG count can have a dramatic effect on performance. Some of this is due to clumpiness in random distributions at low sample (PG) counts. This potentially can be mitigated in part through additional balancing. Less commonly discussed is PG lock contention inside the OSD. We've observed that on very fast clusters, PG lock contention can play a significant role in overall performance. This unfortunately is less easily mitigated without increasing PG counts. How much does PG count actually matter?
With just 60 OSDs, random read performance scales all the way up to 16384 PGs on an RBD pool using 3X replication. Writes top out much earlier, but still benefit from up to 2048 PGs.
Let me be clear: You shouldn't go out and blindly configure a production Ceph cluster to use PG counts as high as we are testing here. That's especially true given some of the other defaults in Ceph for things like PG log lengths and PG stat updates. I do, however, want to encourage the community to start thinking about whether the conventional wisdom of 100 PGs per OSD continues to make sense. I would like us to rethink what we need to do to achieve higher PG counts per OSD while keeping overhead and memory usage in check. I dream about a future where 1000 PGs per OSD isn't out of the ordinary, PG logs are auto-scaled on a per-pool basis, and PG autoscaling is a far more seldom-used operation.
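Purely as an illustration of the kind of configuration being benchmarked (not a recommendation), creating a replicated RBD test pool with an explicit, unusually high PG count might look like the following sketch; the pool name and values are hypothetical, and mon_max_pg_per_osd may need to be raised first:

ceph config set global mon_max_pg_per_osd 1000        # allow far more PGs per OSD than the default
ceph osd pool create rbdtest 16384 16384 replicated   # explicit pg_num / pgp_num
ceph osd pool set rbdtest pg_autoscale_mode off       # keep the autoscaler from undoing it
ceph osd pool application enable rbdtest rbd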
We were first able to log into the new hardware the week after Thanksgiving in the US. The plan was to spend a week or two doing burn-in validation tests and then integrate the new hardware into the existing cluster. We hoped to finish the migration in time for the new year if everything went to plan. Sadly, we ran into trouble right at the start. The initial low-level performance tests looked good. Iperf network testing showed us hitting just under 200Gb/s per node. Random sampling of a couple of the nodes showed reasonable baseline performance from the NVMe drives. One issue we immediately observed was that the operating system on all 68 nodes was accidentally deployed on 2 of the OSD drives instead of the internal Dell BOSS m.2 boot drives. We had planned to compare results for a 30 OSD configuration (3 nodes, 10 OSDs per node) against the results from the upstream lab (5 nodes, 6 OSDs per node). Instead, we ended up testing 8 NVMe drives per node. The first Ceph results were far lower than what we had hoped to see, even given the reduced OSD count.
The only result that was even close to being tolerable was for random reads, and that still wasn't great. Clearly, something was going on. We stopped running 3-node tests and started looking at single-node, and even single OSD configurations.
That's when things started to get weird.
As we ran different combinations of 8-OSD and 1-OSD tests on individual nodes in the cluster, we saw wildly different behavior, but it took several days of testing to really understand the pattern of what we were seeing. Systems that initially performed well in single-OSD tests stopped performing well after multi-OSD tests, only to start working well again hours later. 8-OSD tests would occasionally show signs of performing well, but then perform terribly for all subsequent tests until the system was rebooted. We were eventually able to discern a pattern on fresh boot that we could roughly repeat across different nodes in the cluster:
Step | OSDS | 4MB Randread (MB/s) | 4MB Randwrite (MB/s) |
---|---|---|---|
Boot | |||
1 | 1 OSD | 5716 | 3998 |
2 | 8 OSDs | 3190 | 2494 |
3 | 1 OSD | 523 | 3794 |
4 | 8 OSDs | 2319 | 2931 |
5 | 1 OSD | 551 | 3796 |
20-30 minute pause | |||
6 | 1 OSD | 637 | 3724 |
20-30 minute pause | |||
7 | 1 OSD | 609 | 3860 |
20-30 minute pause | |||
8 | 1 OSD | 362 | 3972 |
20-30 minute pause | |||
9 | 1 OSD | 6581 | 3998 |
20-30 minute pause | |||
10 | 1 OSD | 6350 | 3999 |
20-30 minute pause | |||
11 | 1 OSD | 6536 | 4001 |
The initial single-OSD test looked fantastic for large reads and writes and showed nearly the same throughput we saw when running FIO tests directly against the drives. As soon as we ran the 8-OSD test, however, we observed a performance drop. Subsequent single-OSD tests continued to perform poorly until several hours later when they recovered. So long as a multi-OSD test was not introduced, performance remained high.
Confusingly, we were unable to invoke the same behavior when running FIO tests directly against the drives. Just as confusing, we saw that during the 8 OSD test, a single OSD would use significantly more CPU than the others:
    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 511067 root      20   0 9360000   7.2g  33792 S  1180   3.8  15:24.32 ceph-osd
 515664 root      20   0 9357488   7.2g  34560 S 523.6   3.8  13:43.86 ceph-osd
 513323 root      20   0 9145820   6.4g  34560 S 460.0   3.4  13:01.12 ceph-osd
 514147 root      20   0 9026592   6.6g  33792 S 378.7   3.5   9:56.59 ceph-osd
 516488 root      20   0 9188244   6.8g  34560 S 378.4   3.6  10:29.23 ceph-osd
 518236 root      20   0 9390772   6.9g  33792 S 361.0   3.7   9:45.85 ceph-osd
 511779 root      20   0 8329696   6.1g  33024 S 331.1   3.3  10:07.18 ceph-osd
 516974 root      20   0 8984584   6.7g  34560 S 301.6   3.6   9:26.60 ceph-osd
A wallclock profile of the OSD under load showed significant time spent in io_submit, which is what we typically see when the kernel starts blocking because a drive's queue becomes full.
+ 31.00% BlueStore::readv(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, g...
 + 31.00% BlueStore::_do_readv(BlueStore::Collection*, boost::intrusive_ptr<Blu...
  + 24.00% KernelDevice::aio_submit(IOContext*)
  |+ 24.00% aio_queue_t::submit_batch(std::_List_iterator<aio_t>, std::_List_it...
  | + 24.00% io_submit
  |  + 24.00% syscall
Why would running an 8 OSD test cause the kernel to start blocking in io_submit during future single OSD tests? It didn't make very much sense. Initially, we suspected throttling. We saw that with the default cooling profile in the bios, several of the core complexes on the CPU were reaching up to 96 degrees Celsius. We theorized that perhaps we were hitting thermal limits on either the CPU or the NVMe drives during the 8-OSD tests. Perhaps that left the system in a degraded state for a period of time before recovering. Unfortunately, that theory didn't pan out. AMD/Dell confirmed that we shouldn't be hitting throttling even at those temperatures, and we were able to disprove the theory by running the systems with the fans running at 100% and a lower cTDP for the processor. Those changes kept them consistently around 70 degrees Celsius under load without fixing the problem.
For over a week, we looked at everything from bios settings, NVMe multipath, low-level NVMe debugging, changing kernel/Ubuntu versions, and checking every single kernel, OS, and Ceph setting we could think of. None of these things fully resolved the issue.
We even performed blktrace and iowatcher analysis during "good" and "bad" single OSD tests, and could directly observe the slow IO completion behavior:
Timestamp (good) | Offset+Length (good) | Timestamp (bad) | Offset+Length (bad) |
---|---|---|---|
10.00002043 | 1067699792 + 256 [0] | 10.0013855 | 1206277696 + 512 [0] |
10.00002109 | 1153233168 + 136 [0] | 10.00138801 | 1033429056 + 1896 [0] |
10.00016955 | 984818880 + 8 [0] | 10.00209283 | 1031056448 + 1536 [0] |
10.00018827 | 1164427968 + 1936 [0] | 10.00327372 | 1220466752 + 2048 [0] |
10.0003024 | 1084064456 + 1928 [0] | 10.00328869 | 1060912704 + 2048 [0] |
10.00044238 | 1067699280 + 512 [0] | 10.01285746 | 1003849920 + 2048 [0] |
10.00046659 | 1040160848 + 128 [0] | 10.0128617 | 1096765888 + 768 [0] |
10.00053302 | 1153233312 + 1712 [0] | 10.01286317 | 1060914752 + 720 [0] |
10.00056482 | 1153229312 + 2000 [0] | 10.01287147 | 1188736704 + 512 [0] |
10.00058707 | 1067694160 + 64 [0] | 10.01287216 | 1220468800 + 1152 [0] |
10.00080624 | 1067698000 + 336 [0] | 10.01287812 | 1188735936 + 128 [0] |
10.00111046 | 1145660112 + 2048 [0] | 10.01287894 | 1188735168 + 256 [0] |
10.00118455 | 1067698344 + 424 [0] | 10.0128807 | 1188737984 + 256 [0] |
10.00121413 | 984815728 + 208 [0] | 10.01288286 | 1217374144 + 1152 [0] |
At this point, we started getting the hardware vendors involved. Ultimately it turned out to be unnecessary. There were one minor and two major fixes that got things back on track.
The first fix was an easy one, but only got us a modest 10-20% performance gain. Many years ago it was discovered (either by Nick Fisk or Stephen Blinick, if I recall) that Ceph is incredibly sensitive to latency introduced by CPU c-state transitions. A quick check of the bios on these nodes showed that they weren't running in maximum performance mode, which disables c-states. This was a nice win but not enough to get the results where we wanted them.
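The fix in this case was a BIOS profile change, but for completeness, here is a hedged sketch of how C-states can be inspected and limited from the OS with standard tools (package names and exact behavior vary by distribution):

cpupower idle-info                      # show which idle (C-) states the CPUs may enter
cpupower idle-set -D 0                  # disable idle states with exit latency above 0us
tuned-adm profile latency-performance   # or apply a tuned profile that keeps CPUs out of deep C-states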
By the time I was digging into the blktrace results shown above, I was about 95% sure that we were either seeing an issue with the NVMe drives or something related to the PCIe root complex since these systems don't have PCIe switches in them. I was busy digging into technical manuals and trying to find ways to debug/profile the hardware. A very clever engineer working for the customer offered to help out. I set up a test environment for him so he could repeat some of the same testing on an alternate set of nodes and he hit a home run.
While I had focused primarily on wallclock profiles and was now digging into trying to debug the hardware, he wanted to understand if there was anything interesting happening kernel side (which in retrospect was the obvious next move!). He ran a perf profile during a bad run and made a very astute discovery:
77.37% tp_osd_tp [kernel.kallsyms] [k] native_queued_spin_lock_slowpath
       |
       ---native_queued_spin_lock_slowpath
          |
          --77.36%--_raw_spin_lock_irqsave
                    |
                    |--61.10%--alloc_iova
                    |          alloc_iova_fast
                    |          iommu_dma_alloc_iova.isra.0
                    |          iommu_dma_map_sg
                    |          __dma_map_sg_attrs
                    |          dma_map_sg_attrs
                    |          nvme_map_data
                    |          nvme_queue_rq
                    |          __blk_mq_try_issue_directly
                    |          blk_mq_request_issue_directly
                    |          blk_mq_try_issue_list_directly
                    |          blk_mq_sched_insert_requests
                    |          blk_mq_flush_plug_list
                    |          blk_flush_plug_list
                    |          |
                    |          |--56.54%--blk_mq_submit_bio
A huge amount of time is spent in the kernel contending on a spin lock while updating the IOMMU mappings. He disabled IOMMU in the kernel and immediately saw a huge increase in performance during the 8-node tests. We repeated those tests multiple times and repeatedly saw much better 4MB read/write performance. Score one for the customer. There was however still an issue with 4KB random writes.
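For reference, a hedged sketch of what disabling the IOMMU via kernel boot parameters can look like on an Ubuntu system with AMD CPUs; the exact parameter, and whether disabling the IOMMU is acceptable at all, depend on your platform and security requirements:

# In /etc/default/grub, append the parameter to the kernel command line, e.g.:
#   GRUB_CMDLINE_LINUX_DEFAULT="... amd_iommu=off"
sudo update-grub
sudo reboot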
After being beaten to the punch by the customer on the IOMMU issue, I was almost grateful that we had an additional problem to solve. 4K random write performance had improved with the first two fixes but was still significantly worse than the upstream lab (even given the reduced node/drive counts). I also noticed that compaction was far slower than expected in RocksDB. There have previously been two significant cases that presented similarly and appeared to be relevant.
Historically this customer used the upstream Ceph Ubuntu packages and we were still using them here (rather than self-compiling or using cephadm with containers). I verified that TCMalloc was compiled in. That ruled out the first issue. Next, I dug out the upstream build logs for the 17.2.7 Ubuntu packages. That's when I noticed that we were not, in fact, building RocksDB with the correct compile flags. It's not clear how long that's been going on, but we've had general build performance issues going back as far as 2018.
It turns out that Canonical fixed this for their own builds as did Gentoo after seeing the note I wrote in do_cmake.sh over 6 years ago. It's quite unfortunate that our upstream Deb builds have suffered with this for as long as they have, however, it at least doesn't appear to affect anyone using cephadm on Debian/Ubuntu with the upstream containers. With the issue understood, we built custom 17.2.7 packages with a fix in place. Compaction time dropped by around 3X and 4K random write performance doubled (Though it's a bit tough to make out in the graph):
4KB random write performance was still lower than I wanted it to be, but at least now we were in roughly the right ballpark given that we had fewer OSDs, only 3/5 the number of nodes, and fewer (though faster) cores per OSD. At this point, we were nearing winter break. The customer wanted to redeploy the OS to the correct boot drives and update the deployment with all of the fixes and tunings we had discovered. The plan was to take the holiday break off and then spend the first week of the new year finishing the burn-in tests. Hopefully, we could start migrating the cluster the following week.
On the morning of January 2nd, I logged into Slack and was greeted by a scene I'll describe as moderately controlled chaos. A completely different cluster we are involved in was having a major outage. Without getting too into the details, it took 3 days to pull that cluster back from the brink and get it into a stable and relatively healthy state. It wasn't until Friday that I was able to get back to performance testing. I was able to secure an extra day for testing on Monday, but this meant I was under a huge time crunch to showcase that the cluster could perform well under load before we started the data migration process.
I worked all day on Friday to re-deploy CBT and recreate the tests we ran previously. This time I was able to use all 10 of the drives in each node. I also bumped up the number of clients to maintain an average of roughly 1 FIO client with an io_depth of 128 per OSD. The first 3 node test looked good. With 10 OSDs per node, we were achieving roughly proportional (IE higher) performance relative to the previous tests. I knew I wasn't going to have much time to do proper scaling tests, so I immediately bumped up from 3 nodes to 10 nodes. I also scaled the PG count at the same time and used CBT to deploy a new cluster. At 3 nodes I saw 63GiB/s for 4MB random reads. At 10 nodes, I saw 213.5GiB/s. That's almost linear scaling at 98.4%. It was at this point that I knew that things were finally taking a turn for the better. Of the 68 nodes for this cluster, only 63 were up at that time. The rest were down for maintenance to fix various issues. I split the cluster roughly in half, with 32 nodes (320 OSDs) in one half, and 31 client nodes running 10 FIO processes each in the other half. I watched as CBT built the cluster over roughly a 7-8 minute period. The initial write prefill looked really good. My heart soared. We were reading data at 635 GiB/s. We broke 15 million 4k random read IOPS. While this may not seem impressive compared to the individual NVMe drives, these were the highest numbers I had ever seen for a ~300 OSD Ceph cluster.
I also plotted both average and tail latency for the scaling tests. Both looked consistent. This was likely due to scaling the PG count and the FIO client count at the same time as OSDs. These tests are very IO-heavy however. We have so much client traffic that we are likely well into the inflection point where performance doesn't increase while latency continues to grow as more IO is added.
I showed these results to my colleague Dan van der Ster who previously had built the Ceph infrastructure at CERN. He bet me a beer (Better be a good one Dan!) if I could hit 1 TiB/s. I told him that had been my plan since the beginning.
I had no additional client nodes to test the cluster with at full capacity, so the only real option was to co-locate FIO processes on the same nodes as the OSDs. On one hand, this provides a very slight network advantage. Clients will be able to communicate with local OSDs 1/63rd of the time. On the other hand, we know from previous testing that co-locating FIO clients on OSD nodes isn't free. There's often a performance hit, and it wasn't remotely clear to me how much of a hit a cluster of this scale would take.
I built a new CBT configuration targeting the 63 nodes I had available. Deploying the cluster with CBT took about 15 minutes to stand up all 630 OSDs and build the pool. I waited with bated breath and watched the results as they came in.
Around 950GiB/s. So very very close. It was late on Friday night at this point, so I wrapped up and turned in for the night. On Saturday morning I logged in and threw a couple of tuning options at the cluster: lowering OSD shards and async messenger threads while also applying the Reef RocksDB tunings. As you can see, we actually hurt read performance a little while helping write performance. In fact, random write performance improved by nearly 20%. After further testing, it looked like the Reef tunings were benign, though only helping a little bit in the write tests. The bigger effect seemed to be coming from the shard/thread changes. At this point, I had to take a break and wasn't able to get back to working on the cluster again until Sunday night. I tried to go to bed, but I knew that I was down to the last 24 hours before we needed to wrap this up. At around midnight I gave up on sleep and got back to work.
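The knobs being varied in these runs are, roughly, the OSD shard/thread counts and the async messenger thread count. A hedged sketch of how they are set is shown below; the values are Ceph's defaults for SSD-backed OSDs (as noted later in the text), not the exact values tried in each run, and changing them typically requires restarting the OSDs:

ceph config set osd osd_op_num_shards 8              # OSD op queue shards (SSD default)
ceph config set osd osd_op_num_threads_per_shard 2   # worker threads per shard
ceph config set global ms_async_op_threads 3         # async messenger worker threads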
I mentioned earlier that we know that the PG count can affect performance. I decided to keep the "tuned" configuration from earlier but doubled the number of PGs. In the first set of tests, I had dropped the ratio of clients to OSDs down given that we were co-locating them on the OSD nodes. Now I tried scaling them up again. 4MB random read performance improved slightly as the number of clients grew, while small random read IOPS degraded. Once we hit 8 FIO processes per node (504 total), sequential write performance dropped through the floor.
To understand what happened, I reran the write test and watched “ceph -s” output:
  services:
    mon: 3 daemons, quorum a,b,c (age 42m)
    mgr: a(active, since 42m)
    osd: 630 osds: 630 up (since 24m), 630 in (since 25m)
         flags noscrub,nodeep-scrub

  data:
    pools:   2 pools, 131073 pgs
    objects: 4.13M objects, 16 TiB
    usage:   48 TiB used, 8.2 PiB / 8.2 PiB avail
    pgs:     129422 active+clean
             1651   active+clean+laggy

  io:
    client:   0 B/s rd, 1.7 GiB/s wr, 1 op/s rd, 446 op/s wr
As soon as I threw 504 FIO processes doing 4MB writes at the cluster, some of the PGs started going active+clean+laggy. Performance tanked and the cluster didn't recover from that state until the workload was completed. What's worse, more PGs went laggy over time even though the throughput was only a small fraction of what the cluster was capable of. Since then, we've found a couple of reports of laggy PGs on the mailing list along with a couple of suggestions that might fix them. It's not clear if those ideas would have helped here. We do know that IO will temporarily be paused when PGs go into a laggy state and that this happens because a replica hasn't acknowledged new leases from the primary in time. After discussing the issue with other Ceph developers, we think this could possibly be an issue with locking in the OSD or having lease messages competing with work in the same async msgr threads.
Despite being distracted by the laggy PG issue, I wanted to refocus on hitting 1.0TiB/s. Lack of sleep was finally catching up with me. At some point I had doubled the PG count again to 256K, just to see if it had any effect at all on the laggy PG issue. That put us solidly toward the upper end of the curve we showed earlier, though frankly, I don't think it actually mattered much. I decided to switch back to the default OSD shard counts and continue testing with 504 FIO client processes. I did however scale the number of async messenger threads. There were two big takeaways. The first is that dropping down to 1 async messenger allowed us to avoid PGs going laggy and achieve “OK” write throughput with 504 clients. It also dramatically hurt the performance of 4MB reads. The second: Ceph's defaults were actually ideal for 4MB reads. With 8 shards, 2 threads per shard, and 3 msgr threads, we finally broke 1TiB/s. Here's the view I had at around 4 AM Monday morning as the final set of tests for the night ran:
  services:
    mon: 3 daemons, quorum a,b,c (age 30m)
    mgr: a(active, since 30m)
    osd: 630 osds: 630 up (since 12m), 630 in (since 12m)
         flags noscrub,nodeep-scrub

  data:
    pools:   2 pools, 262145 pgs
    objects: 4.13M objects, 16 TiB
    usage:   48 TiB used, 8.2 PiB / 8.2 PiB avail
    pgs:     262145 active+clean

  io:
    client:   1.0 TiB/s rd, 6.1 KiB/s wr, 266.15k op/s rd, 6 op/s wr
and the graphs from the FIO results:
After finally seeing the magical "1.0 TiB/s" screen I had been waiting weeks to see, I finally went to sleep. Nevertheless, I got up several hours later. There was still work to be done. All of the testing we had done so far was with 3X replication, but the customer would be migrating this hardware into an existing cluster deployed with 6+2 erasure coding. We needed to get some idea of what this cluster was capable of in the configuration they would be using.
I reconfigured the cluster again and ran through new tests. I picked PG/shard/client values from the earlier tests that appeared to work well. Performance was good, but I saw that the async messenger threads were working very hard. I decided to try increasing them beyond the defaults to see if they might help given the added network traffic.
We could achieve well over 500GiB/s for reads and nearly 400GiB/s for writes with 4-5 async msgr threads. But why are the read results so much slower with EC than with replication? With replication, the primary OSD for a PG only has to read local data and send it to the client. The network overhead is essentially 1X. With 6+2 erasure coding, the primary must read 5 of the 6 chunks from replicas before it can then send the constructed object to the client. The overall network overhead for the request is roughly (1 + 5/6)X*. That's why we see slightly better than half the performance of 3X replication for reads. We have the opposite situation for writes. With 3X replication, the client sends the object to the primary, which then further sends copies over the network to two secondaries. This results in an aggregate network overhead of 3X. In the EC case, we only need to send 7/8 chunks to the secondaries (almost, but not quite, the same as the read case). For large writes, performance is actually faster.
* Originally this article stated that 7/8 chunks had to be fetched for reads. The correct value is 5/6 chunks, unless fast reads are enabled. In that case it would be 7/6 chunks. Thanks to Joshua Baergen for catching this!
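A quick back-of-the-envelope check of that read-overhead argument, using awk for the arithmetic; the measured figures are taken from the summary table near the end of this article:

awk 'BEGIN {
  repl_factor = 1.0;              # 3X replication: primary reads locally, ~1X on the wire
  ec_factor   = 1.0 + 5.0/6.0;    # 6+2 EC: primary also fetches 5 of 6 chunks, ~1.83X
  printf "expected EC : replication read ratio  %.2f\n", repl_factor / ec_factor;   # ~0.55
  printf "measured ratio (547 / 1025 GiB/s)     %.2f\n", 547.0 / 1025.0;            # ~0.53
}'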
IOPS, however, are another story. For very small reads and writes, Ceph will contact all participating OSDs in a PG for that object even when the data they store isn't relevant for the operation. For instance, if you are doing 4K reads and the data you are interested in is entirely stored in a single chunk on one of the OSDs, Ceph will still fetch data from all OSDs participating in the stripe. In the summer of 2023, Clyso resurrected a PR from Xiaofei Cui that implements partial stripe reads for erasure coding to avoid this extra work. The effect is dramatic:
It's not clear yet if we will be able to get this merged for Squid, though Radoslaw Zarzynski, core lead for the Ceph project, has offered to help try to get this over the finish line.
Finally, we wanted to provide the customer with a rough idea of how much msgr-level encryption would impact their cluster if they decided to use it. The adrenaline of the previous night had long faded and I was dead tired at this point. I managed to run through both 3X replication and 6+2 erasure coding tests with msgr v2 encryption enabled and compared it against our previous test results.
The biggest hit is to large reads. They drop from ~1 TiB/s to around 750 GiB/s. Everything else sees a more modest, though consistent hit. At this point, I had to stop. I really wanted to do PG scaling tests and even kernel RBD tests. It was time, though, to hand the systems back to the customer for re-imaging and then to one of my excellent colleagues at Clyso for integration.
So what's happened with this cluster since the end of the testing? All hardware was re-imaged and the new OSDs were deployed into the customer's existing HDD cluster. Dan's upmap-remapped script is being used to control the migration process and we've migrated around 80% of the existing data to the NVMe backed OSDs. By next week, the cluster should be fully migrated to the new NVMe based nodes. We've opted not to employ all of the tuning we've done here, at least not at first. Initially, we'll make sure the cluster behaves well under the existing, mostly default, configuration. We now have a mountain of data we can use to tune the system further if the customer hits any performance issues.
Since there was a ton of data and charts here, I want to recap some of the highlights. Here's an outline of the best numbers we were able to achieve on this cluster:
30 OSDs (3x) | 100 OSDs (3x) | 320 OSDs (3x) | 630 OSDs (3x) | 630 OSDs (EC62) | |
---|---|---|---|---|---|
Co-Located Fio | No | No | No | Yes | Yes |
4MB Read | 63 GiB/s | 214 GiB/s | 635 GiB/s | 1025 GiB/s | 547 GiB/s |
4MB Write | 15 GiB/s | 46 GiB/s | 133 GiB/s | 270 GiB/s | 387 GiB/s |
4KB Rand Read | 1.9M IOPS | 5.8M IOPS | 16.6M IOPS | 25.5M IOPS | 3.4M IOPS |
4KB Rand Write | 248K IOPS | 745K IOPS | 2.4M IOPS | 4.9M IOPS | 936K IOPS |
What's next? We need to figure out how to fix the laggy PG issue during writes. We can't have Ceph falling apart when the write workload scales up. Beyond that, we learned through this exercise that Ceph is perfectly capable of saturating 2x 100GbE NICs. To push the throughput envelope further we will need 200GbE+ when using 10 NVMe drives per node or more. IOPS is more nuanced. We know that PG count can have a big effect. We also know that the general OSD threading model is playing a big role. We consistently hit a wall at around 400-600K random read IOPS per node and we've seen it in multiple deployments. Part of this may be how the async msgr interfaces with the kernel and part of this may be how OSD threads wake up when new work is put into the shard queues. I've modified the OSD code in the past to achieve better results under heavy load, but at the expense of low-load latency. Ultimately, I suspect improving IOPS will take a multi-pronged approach and a rewrite of some of the OSD threading code.
To my knowledge, these are the fastest single-cluster Ceph results ever published and the first time a Ceph cluster has achieved 1 TiB/s. I think Ceph is capable of quite a bit more. If you have a faster cluster out there, I encourage you to publish your results! Thank you for reading, and if you have any questions or would like to talk more about Ceph performance, please feel free to reach out.
This is the first backport release in the Reef series, and the first with Debian packages, for Debian Bookworm. We recommend all users update to this release.
RGW: S3 multipart uploads using Server-Side Encryption now replicate correctly in a multi-site deployment. Previously, the replicas of such objects were corrupted on decryption. A new command, radosgw-admin bucket resync encrypted multipart, can be used to identify these original multipart uploads. The LastModified timestamp of any identified object is incremented by 1ns to cause peer zones to replicate it again. For multi-site deployments that make any use of Server-Side Encryption, we recommend running this command against every bucket in every zone after all zones have upgraded.
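A hedged usage sketch of the command named above (the bucket name is a placeholder); it is run per bucket, per zone:

radosgw-admin bucket resync encrypted multipart --bucket=mybucket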
CEPHFS: MDS now evicts clients which are not advancing their request tids (transaction IDs), which causes a large buildup of session metadata, resulting in the MDS going read-only due to the RADOS operation exceeding the size threshold. The mds_session_metadata_threshold config controls the maximum size that the (encoded) session metadata can grow to.
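The threshold can be adjusted like any other MDS config option; the value below is purely illustrative, not the default:

ceph config set mds mds_session_metadata_threshold 16777216   # example: 16 MiB of encoded session metadata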
RGW: New tools have been added to radosgw-admin for identifying and correcting issues with versioned bucket indexes. Historical bugs with the versioned bucket index transaction workflow made it possible for the index to accumulate extraneous "book-keeping" olh (object logical head) entries and plain placeholder entries. In some specific scenarios where clients made concurrent requests referencing the same object key, it was likely that a lot of extra index entries would accumulate. When a significant number of these entries are present in a single bucket index shard, they can cause high bucket listing latencies and lifecycle processing failures. To check whether a versioned bucket has unnecessary olh entries, users can now run radosgw-admin bucket check olh. If the --fix flag is used, the extra entries will be safely removed. A distinct issue from the one described thus far is that some versioned buckets may be maintaining extra unlinked objects that are not listable from the S3/Swift APIs. These extra objects are typically a result of PUT requests that exited abnormally, in the middle of a bucket index transaction, so the client would not have received a successful response. Bugs in prior releases made these unlinked objects easy to reproduce with any PUT request that was made on a bucket that was actively resharding. Besides the extra space that these hidden, unlinked objects consume, there can be another side effect in certain scenarios, caused by the nature of the failure mode that produced them, where a client of a bucket that was a victim of this bug may find the object associated with the key to be in an inconsistent state. To check whether a versioned bucket has unlinked entries, users can now run radosgw-admin bucket check unlinked. If the --fix flag is used, the unlinked objects will be safely removed. Finally, a third issue made it possible for versioned bucket index stats to be accounted inaccurately. The tooling for recalculating versioned bucket stats also had a bug, and was not previously capable of fixing these inaccuracies. This release resolves those issues, and users can now expect that the existing radosgw-admin bucket check command will produce correct results. We recommend that users with versioned buckets, especially those that existed on prior releases, use these new tools to check whether their buckets are affected and to clean them up accordingly.
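Hedged usage sketches of the commands mentioned above (the bucket name is a placeholder); running without --fix first reports problems without changing anything:

radosgw-admin bucket check olh --bucket=mybucket             # report extraneous olh entries
radosgw-admin bucket check olh --bucket=mybucket --fix       # remove them
radosgw-admin bucket check unlinked --bucket=mybucket --fix  # remove leftover unlinked objects
radosgw-admin bucket check --bucket=mybucket                 # recheck versioned bucket index stats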
mgr/snap-schedule: For clusters with multiple CephFS file systems, all the snap-schedule commands now expect the '--fs' argument.
RADOS: A POOL_APP_NOT_ENABLED health warning will now be reported if the application is not enabled for the pool, irrespective of whether the pool is in use or not. Always tag a pool with an application using the ceph osd pool application enable command to avoid the POOL_APP_NOT_ENABLED health warning being reported for that pool. The user might temporarily mute this warning using ceph health mute POOL_APP_NOT_ENABLED.
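The two commands referenced above, with placeholder pool and application names:

ceph osd pool application enable mypool rbd     # tag the pool with the application that uses it
ceph health mute POOL_APP_NOT_ENABLED           # or temporarily silence the warning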
Dashboard: An overview page for RGW to show the overall status of RGW components.
Dashboard: Added management support for RGW Multi-site and CephFS Subvolumes and groups.
Dashboard: Fixed a few bugs and issues around the new dashboard page, including the broken layout and some metrics giving wrong values, and introduced a popover to display details when there are HEALTH_WARN or HEALTH_ERR alerts.
Dashboard: Fixed several issues in Ceph dashboard on Rook-backed clusters, and improved the user experience on the Rook environment.
.github: Clarify checklist details (pr#54130, Anthony D'Atri)
[CVE-2023-43040] rgw: Fix bucket validation against POST policies (pr#53756, Joshua Baergen)
Adding rollback mechanism to handle bootstrap failures (pr#53864, Adam King, Redouane Kachach)
backport of rook orchestrator fixes and e2e automated testing (pr#54224, Redouane Kachach)
Bluestore: fix bluestore collection_list latency perf counter (pr#52950, Wangwenjuan)
build: Remove ceph-libboost* packages in install-deps (pr#52769, Adam Emerson)
ceph-volume/cephadm: support lv devices in inventory (pr#53286, Guillaume Abrioux)
ceph-volume: add --osd-id option to raw prepare (pr#52927, Guillaume Abrioux)
ceph-volume: fix a regression in raw list (pr#54521, Guillaume Abrioux)
ceph-volume: fix mpath device support (pr#53539, Guillaume Abrioux)
ceph-volume: fix raw list for lvm devices (pr#52619, Guillaume Abrioux)
ceph-volume: fix raw list for lvm devices (pr#52980, Guillaume Abrioux)
ceph-volume: Revert \\"ceph-volume: fix raw list for lvm devices\\" (pr#54429, Matthew Booth, Guillaume Abrioux)
ceph: allow xlock state to be LOCK_PREXLOCK when putting it (pr#53661, Xiubo Li)
ceph_fs.h: add separate owner_{u,g}id fields (pr#53138, Alexander Mikhalitsyn)
ceph_volume: support encrypted volumes for lvm new-db/new-wal/migrate commands (pr#52875, Igor Fedotov)
cephadm batch backport Aug 23 (pr#53124, Adam King, Luis Domingues, John Mulligan, Redouane Kachach)
cephadm: add a --dry-run option to cephadm shell (pr#54220, John Mulligan)
cephadm: add tcmu-runner to logrotate config (pr#53122, Adam King)
cephadm: Adding support to configure public_network cfg section (pr#53110, Redouane Kachach)
cephadm: delete /tmp/cephadm-when removing the cluster (pr#53109, Redouane Kachach)
cephadm: Fix extra_container_args for iSCSI (pr#53010, Raimund Sacherer)
cephadm: fix haproxy version with certain containers (pr#53751, Adam King)
cephadm: make custom_configs work for tcmu-runner container (pr#53404, Adam King)
cephadm: run tcmu-runner through script to do restart on failure (pr#53866, Adam King)
cephadm: support for CA signed keys (pr#53121, Adam King)
cephfs-journal-tool: disambiguate usage of all keyword (in tool help) (pr#53646, Manish M Yathnalli)
cephfs-mirror: do not run concurrent C_RestartMirroring context (issue#62072, pr#53638, Venky Shankar)
cephfs: implement snapdiff (pr#53229, Igor Fedotov, Lucian Petrut, Denis Barahtanov)
cephfs_mirror: correctly set top level dir permissions (pr#53271, Milind Changire)
client: always refresh mds feature bits on session open (issue#63188, pr#54146, Venky Shankar)
client: correct quota check in Client::_rename() (pr#52578, Rishabh Dave)
client: do not send metrics until the MDS rank is ready (pr#52501, Xiubo Li)
client: force sending cap revoke ack always (pr#52507, Xiubo Li)
client: issue a cap release immediately if no cap exists (pr#52850, Xiubo Li)
client: move the Inode to new auth mds session when changing auth cap (pr#53666, Xiubo Li)
client: trigger to flush the buffer when making snapshot (pr#52497, Xiubo Li)
client: wait rename to finish (pr#52504, Xiubo Li)
cmake: ensure fmtlib is at least 8.1.1 (pr#52970, Abhishek Lekshmanan)
Consider setting "bulk" autoscale pool flag when automatically creating a data pool for CephFS (pr#52899, Leonid Usov)
crimson/admin/admin_socket: remove path file if it exists (pr#53964, Matan Breizman)
crimson/ertr: assert on invocability of func provided to safe_then() (pr#53958, Radosław Zarzyński)
crimson/mgr: Fix config show command (pr#53954, Aishwarya Mathuria)
crimson/net: consolidate messenger implementations and enable multi-shard UTs (pr#54095, Yingxin Cheng)
crimson/net: set TCP_NODELAY according to ms_tcp_nodelay (pr#54063, Xuehan Xu)
crimson/net: support connections in multiple shards (pr#53949, Yingxin Cheng)
crimson/os/object_data_handler: splitting right side doesn't mean splitting only one extent (pr#54061, Xuehan Xu)
crimson/os/seastore/backref_manager: scan backref entries by journal seq (pr#53939, Zhang Song)
crimson/os/seastore/btree: should add left's size when merging levels… (pr#53946, Xuehan Xu)
crimson/os/seastore/cache: don't add EXIST_CLEAN extents to lru (pr#54098, Xuehan Xu)
crimson/os/seastore/cached_extent: add prepare_commit interface (pr#53941, Xuehan Xu)
crimson/os/seastore/cbj: fix a potential overflow bug on segment_seq (pr#53968, Myoungwon Oh)
crimson/os/seastore/collection_manager: fill CollectionNode::decoded on clean reads (pr#53956, Xuehan Xu)
crimson/os/seastore/journal/cbj: generalize scan_valid_records() (pr#53961, Myoungwon Oh, Yingxin Cheng)
crimson/os/seastore/omap_manager: correct editor settings (pr#53947, Zhang Song)
crimson/os/seastore/omap_manager: fix the entry leak issue in BtreeOMapManager::omap_list() (pr#53962, Xuehan Xu)
crimson/os/seastore/onode_manager: populate value recorders of onodes to be erased (pr#53966, Xuehan Xu)
crimson/os/seastore/rbm: make rbm support multiple shards (pr#53952, Myoungwon Oh)
crimson/os/seastore/transaction_manager: data loss issues (pr#53955, Xuehan Xu)
crimson/os/seastore/transaction_manager: move intermediate_key by "remap_offset" when remapping the "back" half of the original pin (pr#54140, Xuehan Xu)
crimson/os/seastore/zbd: zbdsegmentmanager write path fixes (pr#54062, Aravind Ramesh)
crimson/os/seastore: add metrics about total invalidated transactions (pr#53953, Zhang Song)
crimson/os/seastore: create page aligned bufferptr in copy ctor of CachedExtent (pr#54097, Zhang Song)
crimson/os/seastore: enable SMR HDD (pr#53935, Aravind Ramesh)
crimson/os/seastore: fix ceph_assert in segment_manager.h (pr#53938, Aravind Ramesh)
crimson/os/seastore: fix dangling reference of oid in SeaStore::Shard::stat() (pr#53960, Xuehan Xu)
crimson/os/seastore: fix in check_node (pr#53945, Xinyu Huang)
crimson/os/seastore: OP_CLONE in seastore (pr#54092, xuxuehan, Xuehan Xu)
crimson/os/seastore: realize lazy read in split overwrite with overwrite refactor (pr#53951, Xinyu Huang)
crimson/os/seastore: retire_extent_addr clean up (pr#53959, Xinyu Huang)
crimson/osd/heartbeat: Improve maybe_share_osdmap behavior (pr#53940, Samuel Just)
crimson/osd/lsan_suppressions.cc: Add MallocExtension::Initialize() (pr#54057, Mark Nelson, Matan Breizman)
crimson/osd/lsan_suppressions: add MallocExtension::Register (pr#54139, Matan Breizman)
crimson/osd/object_context: consider clones found as long as they're in SnapSet::clones (pr#53965, Xuehan Xu)
crimson/osd/osd_operations: add pipeline to LogMissingRequest to sync it (pr#53957, Xuehan Xu)
crimson/osd/osd_operations: consistent naming to pipeline users (pr#54060, Matan Breizman)
crimson/osd/pg: check if backfill_state exists when judging objects' (pr#53963, Xuehan Xu)
crimson/osd/watch: Add logs around Watch/Notify (pr#53950, Matan Breizman)
crimson/osd: add embedded suppression ruleset for LSan (pr#53937, Radoslaw Zarzynski)
crimson/osd: cleanup and drop OSD::ShardDispatcher (pr#54138, Yingxin Cheng)
Crimson/osd: Disable concurrent MOSDMap handling (pr#53944, Matan Breizman)
crimson/osd: don't ignore start_pg_operation returned future (pr#53948, Matan Breizman)
crimson/osd: fix ENOENT on accessing RadosGW user's index of buckets (pr#53942, Radoslaw Zarzynski)
crimson/osd: fix Notify life-time mismanagement in Watch::notify_ack (pr#53943, Radoslaw Zarzynski)
crimson/osd: fixes and cleanups around multi-core OSD (pr#54091, Yingxin Cheng)
Crimson/osd: support multicore osd (pr#54058, chunmei)
crimson/tools/perf_crimson_msgr: integrate multi-core msgr with various improvements (pr#54059, Yingxin Cheng)
crimson/tools/perf_crimson_msgr: randomize client nonce (pr#54093, Yingxin Cheng)
crimson/tools/perf_staged_fltree: fix compile error (pr#54096, Myoungwon Oh)
crimson/vstart: default seastore_device_size will be out of space f… (pr#53969, chunmei)
crimson: Enable tcmalloc when using seastar (pr#54105, Mark Nelson, Matan Breizman)
debian/control: add docker-ce as recommends for cephadm package (pr#52908, Adam King)
Debian: update to dh compat 12, fix more serious packaging errors, correct copyright syntax (pr#53654, Matthew Vernon)
doc/architecture.rst - edit a sentence (pr#53372, Zac Dover)
doc/architecture.rst - edit up to "Cluster Map" (pr#53366, Zac Dover)
doc/architecture: "Edit HA Auth" (pr#53619, Zac Dover)
doc/architecture: "Edit HA Auth" (one of several) (pr#53585, Zac Dover)
doc/architecture: "Edit HA Auth" (one of several) (pr#53491, Zac Dover)
doc/architecture: edit "Calculating PG IDs" (pr#53748, Zac Dover)
doc/architecture: edit "Cluster Map" (pr#53434, Zac Dover)
doc/architecture: edit "Data Scrubbing" (pr#53730, Zac Dover)
doc/architecture: Edit "HA Auth" (pr#53488, Zac Dover)
doc/architecture: edit "HA Authentication" (pr#53632, Zac Dover)
doc/architecture: edit "High Avail. Monitors" (pr#53451, Zac Dover)
doc/architecture: edit "OSD Membership and Status" (pr#53727, Zac Dover)
doc/architecture: edit "OSDs service clients directly" (pr#53686, Zac Dover)
doc/architecture: edit "Peering and Sets" (pr#53871, Zac Dover)
doc/architecture: edit "Replication" (pr#53738, Zac Dover)
doc/architecture: edit "SDEH" (pr#53659, Zac Dover)
doc/architecture: edit several sections (pr#53742, Zac Dover)
doc/architecture: repair RBD sentence (pr#53877, Zac Dover)
doc/ceph-volume: explain idempotence (pr#54233, Zac Dover)
doc/ceph-volume: improve front matter (pr#54235, Zac Dover)
doc/cephadm/services: remove excess rendered indentation in osd.rst (pr#54323, Ville Ojamo)
doc/cephadm: add ssh note to install.rst (pr#53199, Zac Dover)
doc/cephadm: edit \\"Adding Hosts\\" in install.rst (pr#53224, Zac Dover)
doc/cephadm: edit sentence in mgr.rst (pr#53164, Zac Dover)
doc/cephadm: edit troubleshooting.rst (1 of x) (pr#54283, Zac Dover)
doc/cephadm: edit troubleshooting.rst (2 of x) (pr#54320, Zac Dover)
doc/cephadm: fix typo in cephadm initial crush location section (pr#52887, John Mulligan)
doc/cephadm: fix typo in set ssh key command (pr#54388, Piotr Parczewski)
doc/cephadm: update cephadm reef version (pr#53162, Rongqi Sun)
doc/cephfs: edit mount-using-fuse.rst (pr#54353, Jaanus Torp)
doc/cephfs: write cephfs commands fully in docs (pr#53402, Rishabh Dave)
doc/config: edit "ceph-conf.rst" (pr#54463, Zac Dover)
doc/configuration: edit "bg" in mon-config-ref.rst (pr#53347, Zac Dover)
doc/dev/release-checklist: check telemetry validation (pr#52805, Yaarit Hatuka)
doc/dev: Fix typos in files cephfs-mirroring.rst and deduplication.rst (pr#53519, Daniel Parkes)
doc/dev: remove cache-pool (pr#54007, Zac Dover)
doc/glossary: add "primary affinity" to glossary (pr#53427, Zac Dover)
doc/glossary: add "Quorum" to glossary (pr#54509, Zac Dover)
doc/glossary: improve "BlueStore" entry (pr#54265, Zac Dover)
doc/man/8/ceph-monstore-tool: add documentation (pr#52872, Matan Breizman)
doc/man/8: improve radosgw-admin.rst (pr#53267, Anthony D'Atri)
doc/man: edit ceph-monstore-tool.rst (pr#53476, Zac Dover)
doc/man: radosgw-admin.rst typo (pr#53315, Zac Dover)
doc/man: remove docs about support for unix domain sockets (pr#53312, Zac Dover)
doc/man: s/kvstore-tool/monstore-tool/ (pr#53536, Zac Dover)
doc/rados/configuration: Avoid repeating "support" in msgr2.rst (pr#52998, Ville Ojamo)
doc/rados: add bulk flag to pools.rst (pr#53317, Zac Dover)
doc/rados: edit \\"troubleshooting-mon\\" (pr#54502, Zac Dover)
doc/rados: edit memory-profiling.rst (pr#53932, Zac Dover)
doc/rados: edit operations/add-or-rm-mons (1 of x) (pr#52889, Zac Dover)
doc/rados: edit operations/add-or-rm-mons (2 of x) (pr#52825, Zac Dover)
doc/rados: edit ops/control.rst (1 of x) (pr#53811, zdover23, Zac Dover)
doc/rados: edit ops/control.rst (2 of x) (pr#53815, Zac Dover)
doc/rados: edit t-mon \\"common issues\\" (1 of x) (pr#54418, Zac Dover)
doc/rados: edit t-mon \\"common issues\\" (2 of x) (pr#54421, Zac Dover)
doc/rados: edit t-mon \\"common issues\\" (3 of x) (pr#54438, Zac Dover)
doc/rados: edit t-mon \\"common issues\\" (4 of x) (pr#54443, Zac Dover)
doc/rados: edit t-mon \\"common issues\\" (5 of x) (pr#54455, Zac Dover)
doc/rados: edit t-mon.rst text (pr#54349, Zac Dover)
doc/rados: edit t-shooting-mon.rst (pr#54427, Zac Dover)
doc/rados: edit troubleshooting-mon.rst (2 of x) (pr#52839, Zac Dover)
doc/rados: edit troubleshooting-mon.rst (3 of x) (pr#53879, Zac Dover)
doc/rados: edit troubleshooting-mon.rst (4 of x) (pr#53897, Zac Dover)
doc/rados: edit troubleshooting-osd (1 of x) (pr#53982, Zac Dover)
doc/rados: Edit troubleshooting-osd (2 of x) (pr#54000, Zac Dover)
doc/rados: Edit troubleshooting-osd (3 of x) (pr#54026, Zac Dover)
doc/rados: edit troubleshooting-pg (2 of x) (pr#54114, Zac Dover)
doc/rados: edit troubleshooting-pg.rst (pr#54228, Zac Dover)
doc/rados: edit troubleshooting-pg.rst (1 of x) (pr#54073, Zac Dover)
doc/rados: edit troubleshooting.rst (pr#53837, Zac Dover)
doc/rados: edit troubleshooting/community.rst (pr#53881, Zac Dover)
doc/rados: format \\"initial troubleshooting\\" (pr#54477, Zac Dover)
doc/rados: format Q&A list in t-mon.rst (pr#54345, Zac Dover)
doc/rados: format Q&A list in tshooting-mon.rst (pr#54366, Zac Dover)
doc/rados: improve \\"scrubbing\\" explanation (pr#54270, Zac Dover)
doc/rados: parallelize t-mon headings (pr#54461, Zac Dover)
doc/rados: remove cache-tiering-related keys (pr#54227, Zac Dover)
doc/rados: remove FileStore material (in Reef) (pr#54008, Zac Dover)
doc/rados: remove HitSet-related key information (pr#54217, Zac Dover)
doc/rados: update monitoring-osd-pg.rst (pr#52958, Zac Dover)
doc/radosgw: Improve dynamicresharding.rst (pr#54368, Anthony D'Atri)
doc/radosgw: Improve language and formatting in config-ref.rst (pr#52835, Ville Ojamo)
doc/radosgw: multisite - edit \\"migrating a single-site\\" (pr#53261, Qi Tao)
doc/radosgw: update rate limit management (pr#52910, Zac Dover)
doc/README.md - edit "Building Ceph" (pr#53057, Zac Dover)
doc/README.md - improve "Running a test cluster" (pr#53258, Zac Dover)
doc/rgw: correct statement about default zone features (pr#52833, Casey Bodley)
doc/rgw: pubsub capabilities reference was removed from docs (pr#54137, Yuval Lifshitz)
doc/rgw: several response headers are supported (pr#52803, Casey Bodley)
doc/start: correct ABC test chart (pr#53256, Dmitry Kvashnin)
doc/start: edit os-recommendations.rst (pr#53179, Zac Dover)
doc/start: fix typo in hardware-recommendations.rst (pr#54480, Anthony D'Atri)
doc/start: Modernize and clarify hardware-recommendations.rst (pr#54071, Anthony D'Atri)
doc/start: refactor ABC test chart (pr#53094, Zac Dover)
doc/start: update \\"platforms\\" table (pr#53075, Zac Dover)
doc/start: update linking conventions (pr#52912, Zac Dover)
doc/start: update linking conventions (pr#52841, Zac Dover)
doc/troubleshooting: edit cpu-profiling.rst (pr#53059, Zac Dover)
doc: Add a note on possible deadlock on volume deletion (pr#52946, Kotresh HR)
doc: add note for removing (automatic) partitioning policy (pr#53569, Venky Shankar)
doc: Add Reef 18.2.0 release notes (pr#52905, Zac Dover)
doc: Add warning on manual CRUSH rule removal (pr#53420, Alvin Owyong)
doc: clarify upmap balancer documentation (pr#53004, Laura Flores)
doc: correct option name (pr#53128, Patrick Donnelly)
doc: do not recommend pulling cephadm from git (pr#52997, John Mulligan)
doc: Documentation about main Ceph metrics (pr#54111, Juan Miguel Olmo Martínez)
doc: edit README.md - contributing code (pr#53049, Zac Dover)
doc: expand and consolidate mds placement (pr#53146, Patrick Donnelly)
doc: Fix doc for mds cap acquisition throttle (pr#53024, Kotresh HR)
doc: improve submodule update command - README.md (pr#53000, Zac Dover)
doc: make instructions to get an updated cephadm common (pr#53260, John Mulligan)
doc: remove egg fragment from dev/developer_guide/running-tests-locally (pr#53853, Dhairya Parmar)
doc: Update dynamicresharding.rst (pr#54329, Aliaksei Makarau)
doc: Update mClock QOS documentation to discard osd_mclock_cost_per_* (pr#54079, tanchangzhi)
doc: update rados.cc (pr#52967, Zac Dover)
doc: update test cluster commands in README.md (pr#53349, Zac Dover)
exporter: add ceph_daemon labels to labeled counters as well (pr#53695, avanthakkar)
exposed the open api and telemetry links in details card (pr#53142, cloudbehl, dpandit)
libcephsqlite: fill 0s in unread portion of buffer (pr#53101, Patrick Donnelly)
librbd: kick ExclusiveLock state machine on client being blocklisted when waiting for lock (pr#53293, Ramana Raja)
librbd: kick ExclusiveLock state machine stalled waiting for lock from reacquire_lock() (pr#53919, Ramana Raja)
librbd: make CreatePrimaryRequest remove any unlinked mirror snapshots (pr#53276, Ilya Dryomov)
MClientRequest: properly handle ceph_mds_request_head_legacy for ext_num_retry, ext_num_fwd, owner_uid, owner_gid (pr#54407, Alexander Mikhalitsyn)
MDS imported_inodes metric is not updated (pr#51698, Yongseok Oh)
mds/FSMap: allow upgrades if no up mds (pr#53851, Patrick Donnelly)
mds/Server: mark a cap acquisition throttle event in the request (pr#53168, Leonid Usov)
mds: acquire inode snaplock in open (pr#53183, Patrick Donnelly)
mds: add event for batching getattr/lookup (pr#53558, Patrick Donnelly)
mds: adjust pre_segments_size for MDLog when trimming segments for st… (issue#59833, pr#54035, Venky Shankar)
mds: blocklist clients with "bloated" session metadata (issue#62873, issue#61947, pr#53329, Venky Shankar)
mds: do not send split_realms for CEPH_SNAP_OP_UPDATE msg (pr#52847, Xiubo Li)
mds: drop locks and retry when lock set changes (pr#53241, Patrick Donnelly)
mds: dump locks when printing mutation ops (pr#52975, Patrick Donnelly)
mds: fix deadlock between unlinking and linkmerge (pr#53497, Xiubo Li)
mds: fix stray evaluation using scrub and introduce new option (pr#50813, Dhairya Parmar)
mds: Fix the linkmerge assert check (pr#52724, Kotresh HR)
mds: log message when exiting due to asok command (pr#53548, Patrick Donnelly)
mds: MDLog::_recovery_thread: handle the errors gracefully (pr#52512, Jos Collin)
mds: session ls command appears twice in command listing (pr#52515, Neeraj Pratap Singh)
mds: skip forwarding request if the session were removed (pr#52846, Xiubo Li)
mds: update mdlog perf counters during replay (pr#52681, Patrick Donnelly)
mds: use variable g_ceph_context directly in MDSAuthCaps (pr#52819, Rishabh Dave)
mgr/cephadm: Add \\"networks\\" parameter to orch apply rgw (pr#53120, Teoman ONAY)
mgr/cephadm: add ability to zap OSDs\' devices while draining host (pr#53869, Adam King)
mgr/cephadm: add is_host\\\\_functions to HostCache (pr#53118, Adam King)
mgr/cephadm: Adding sort-by support for ceph orch ps (pr#53867, Redouane Kachach)
mgr/cephadm: allow draining host without removing conf/keyring files (pr#53123, Adam King)
mgr/cephadm: also don't write client files/tuned profiles to maintenance hosts (pr#53111, Adam King)
mgr/cephadm: ceph orch add fails when ipv6 address is surrounded by square brackets (pr#53870, Teoman ONAY)
mgr/cephadm: don't use image tag in orch upgrade ls (pr#53865, Adam King)
mgr/cephadm: fix default image base in reef (pr#53922, Adam King)
mgr/cephadm: fix REFRESHED column of orch ps being unpopulated (pr#53741, Adam King)
mgr/cephadm: fix upgrades with nvmeof (pr#53924, Adam King)
mgr/cephadm: removing double quotes from the generated nvmeof config (pr#53868, Redouane Kachach)
mgr/cephadm: show meaningful messages when failing to execute cmds (pr#53106, Redouane Kachach)
mgr/cephadm: storing prometheus/alertmanager credentials in monstore (pr#53119, Redouane Kachach)
mgr/cephadm: validate host label before removing (pr#53112, Redouane Kachach)
mgr/dashboard: add e2e tests for cephfs management (pr#53190, Nizamudeen A)
mgr/dashboard: Add more decimals in latency graph (pr#52727, Pedro Gonzalez Gomez)
mgr/dashboard: add port and zone endpoints to import realm token form in rgw multisite (pr#54118, Aashish Sharma)
mgr/dashboard: add validator for size field in the forms (pr#53378, Nizamudeen A)
mgr/dashboard: align charts of landing page (pr#53543, Pedro Gonzalez Gomez)
mgr/dashboard: allow PUT in CORS (pr#52705, Nizamudeen A)
mgr/dashboard: allow tls 1.2 with a config option (pr#53780, Nizamudeen A)
mgr/dashboard: Block Ui fails in angular with target es2022 (pr#54260, Aashish Sharma)
mgr/dashboard: cephfs volume and subvolume management (pr#53017, Pedro Gonzalez Gomez, Nizamudeen A, Pere Diaz Bou)
mgr/dashboard: cephfs volume rm and rename (pr#53026, avanthakkar)
mgr/dashboard: cleanup rbd-mirror process in dashboard e2e (pr#53220, Nizamudeen A)
mgr/dashboard: cluster upgrade management (batch backport) (pr#53016, avanthakkar, Nizamudeen A)
mgr/dashboard: Dashboard RGW multisite configuration (pr#52922, Aashish Sharma, Pedro Gonzalez Gomez, Avan Thakkar, avanthakkar)
mgr/dashboard: disable hosts field while editing the filesystem (pr#54069, Nizamudeen A)
mgr/dashboard: disable promote on mirroring not enabled (pr#52536, Pedro Gonzalez Gomez)
mgr/dashboard: disable protect if layering is not enabled on the image (pr#53173, avanthakkar)
mgr/dashboard: display the groups in cephfs subvolume tab (pr#53394, Pedro Gonzalez Gomez)
mgr/dashboard: empty grafana panels for performance of daemons (pr#52774, Avan Thakkar, avanthakkar)
mgr/dashboard: enable protect option if layering enabled (pr#53795, avanthakkar)
mgr/dashboard: fix cephfs create form validator (pr#53219, Nizamudeen A)
mgr/dashboard: fix cephfs form validator (pr#53778, Nizamudeen A)
mgr/dashboard: fix cephfs forms validations (pr#53831, Nizamudeen A)
mgr/dashboard: fix image columns naming (pr#53254, Pedro Gonzalez Gomez)
mgr/dashboard: fix progress bar color visibility (pr#53209, Nizamudeen A)
mgr/dashboard: fix prometheus queries subscriptions (pr#53669, Pedro Gonzalez Gomez)
mgr/dashboard: fix rgw multi-site import form helper (pr#54395, Aashish Sharma)
mgr/dashboard: fix rgw multisite error when no rgw entity is present (pr#54261, Aashish Sharma)
mgr/dashboard: fix rgw page issues when hostname not resolvable (pr#53214, Nizamudeen A)
mgr/dashboard: fix rgw port manipulation error in dashboard (pr#53392, Nizamudeen A)
mgr/dashboard: fix the landing page layout issues (issue#62961, pr#53835, Nizamudeen A)
mgr/dashboard: Fix user/bucket count in rgw overview dashboard (pr#53818, Aashish Sharma)
mgr/dashboard: fixed edit user quota form error (pr#54223, Ivo Almeida)
mgr/dashboard: images -> edit -> disable checkboxes for layering and deep-flatten (pr#53388, avanthakkar)
mgr/dashboard: minor usability improvements (pr#53143, cloudbehl)
mgr/dashboard: n/a entries behind primary snapshot mode (pr#53223, Pere Diaz Bou)
mgr/dashboard: Object gateway inventory card incorrect Buckets and user count (pr#53382, Aashish Sharma)
mgr/dashboard: Object gateway sync status cards keeps loading when multisite is not configured (pr#53381, Aashish Sharma)
mgr/dashboard: paginate hosts (pr#52918, Pere Diaz Bou)
mgr/dashboard: rbd image hide usage bar when disk usage is not provided (pr#53810, Pedro Gonzalez Gomez)
mgr/dashboard: remove empty popover when there are no health warns (pr#53652, Nizamudeen A)
mgr/dashboard: remove green tick on old password field (pr#53386, Nizamudeen A)
mgr/dashboard: remove unnecessary failing hosts e2e (pr#53458, Pedro Gonzalez Gomez)
mgr/dashboard: remove used and total used columns in favor of usage bar (pr#53304, Pedro Gonzalez Gomez)
mgr/dashboard: replace sync progress bar with last synced timestamp in rgw multisite sync status card (pr#53379, Aashish Sharma)
mgr/dashboard: RGW Details card cleanup (pr#53020, Nizamudeen A, cloudbehl)
mgr/dashboard: Rgw Multi-site naming improvements (pr#53806, Aashish Sharma)
mgr/dashboard: rgw multisite topology view shows blank table for multisite entities (pr#53380, Aashish Sharma)
mgr/dashboard: set CORS header for unauthorized access (pr#53201, Nizamudeen A)
mgr/dashboard: show a message to restart the rgw daemons after moving from single-site to multi-site (pr#53805, Aashish Sharma)
mgr/dashboard: subvolume rm with snapshots (pr#53233, Pedro Gonzalez Gomez)
mgr/dashboard: update rgw multisite import form helper info (pr#54253, Aashish Sharma)
mgr/dashboard: upgrade angular v14 and v15 (pr#52662, Nizamudeen A)
mgr/rbd_support: fix recursive locking on CreateSnapshotRequests lock (pr#54289, Ramana Raja)
mgr/snap_schedule: allow retention spec 'n' to be user defined (pr#52748, Milind Changire, Jakob Haufe)
mgr/snap_schedule: make fs argument mandatory if more than one filesystem exists (pr#54094, Milind Changire)
mgr/volumes: Fix pending_subvolume_deletions in volume info (pr#53572, Kotresh HR)
mgr: register OSDs in ms_handle_accept (pr#53187, Patrick Donnelly)
mon, qa: issue pool application warning even if pool is empty (pr#53041, Prashant D)
mon/ConfigMonitor: update crush_location from osd entity (pr#52466, Didier Gazen)
mon/MDSMonitor: plug paxos when maybe manipulating osdmap (pr#52246, Patrick Donnelly)
mon/MonClient: resurrect original client_mount_timeout handling (pr#52535, Ilya Dryomov)
mon/OSDMonitor: do not propose on error in prepare_update (pr#53186, Patrick Donnelly)
mon: fix iterator mishandling in PGMap::apply_incremental (pr#52554, Oliver Schmidt)
msgr: AsyncMessenger add faulted connections metrics (pr#53033, Pere Diaz Bou)
os/bluestore: don't require bluestore_db_block_size when attaching new (pr#52942, Igor Fedotov)
os/bluestore: get rid of resulting lba alignment in allocators (pr#54772, Igor Fedotov)
osd/OpRequest: Add detail description for delayed op in osd log file (pr#53688, Yite Gu)
osd/OSDMap: Check for uneven weights & != 2 buckets post stretch mode (pr#52457, Kamoltat)
osd/scheduler/mClockScheduler: Use same profile and client ids for all clients to ensure allocated QoS limit consumption (pr#53093, Sridhar Seshasayee)
osd: fix logic in check_pg_upmaps (pr#54276, Laura Flores)
osd: fix read balancer logic to avoid redundant primary assignment (pr#53820, Laura Flores)
osd: fix use-after-move in build_incremental_map_msg() (pr#54267, Ronen Friedman)
osd: fix: slow scheduling when item_cost is large (pr#53861, Jrchyang Yu)
Overview graph improvements (pr#53090, cloudbehl)
pybind/mgr/devicehealth: do not crash if db not ready (pr#52213, Patrick Donnelly)
pybind/mgr/pg_autoscaler: Cut back osdmap.get_pools calls (pr#52767, Kamoltat)
pybind/mgr/pg_autoscaler: fix warn when not too few pgs (pr#53674, Kamoltat)
pybind/mgr/pg_autoscaler: noautoscale flag retains individual pool configs (pr#53658, Kamoltat)
pybind/mgr/pg_autoscaler: Reordered if statement for the func: _maybe_adjust (pr#53429, Kamoltat)
pybind/mgr/pg_autoscaler: Use bytes_used for actual_raw_used (pr#53534, Kamoltat)
pybind/mgr/volumes: log mutex locks to help debug deadlocks (pr#53918, Kotresh HR)
pybind/mgr: reopen database handle on blocklist (pr#52460, Patrick Donnelly)
pybind/rbd: don't produce info on errors in aio_mirror_image_get_info() (pr#54055, Ilya Dryomov)
python-common/drive_group: handle fields outside of 'spec' even when 'spec' is provided (pr#53115, Adam King)
python-common/drive_selection: lower log level of limit policy message (pr#53114, Adam King)
python-common: drive_selection: fix KeyError when osdspec_affinity is not set (pr#53159, Guillaume Abrioux)
qa/cephfs: fix build failure for mdtest project (pr#53827, Rishabh Dave)
qa/cephfs: fix ior project build failure (pr#53825, Rishabh Dave)
qa/cephfs: switch to python3 for centos stream 9 (pr#53624, Xiubo Li)
qa/rgw: add new POOL_APP_NOT_ENABLED failures to log-ignorelist (pr#53896, Casey Bodley)
qa/smoke,orch,perf-basic: add POOL_APP_NOT_ENABLED to ignorelist (pr#54376, Prashant D)
qa/standalone/osd/divergent-prior.sh: Divergent test 3 with pg_autoscale_mode on pick divergent osd (pr#52721, Nitzan Mordechai)
qa/suites/crimson-rados: add centos9 to supported distros (pr#54020, Matan Breizman)
qa/suites/crimson-rados: bring backfill testing (pr#54021, Radoslaw Zarzynski, Matan Breizman)
qa/suites/crimson-rados: Use centos8 for testing (pr#54019, Matan Breizman)
qa/suites/krbd: stress test for recovering from watch errors (pr#53786, Ilya Dryomov)
qa/suites/rbd: add test to check rbd_support module recovery (pr#54291, Ramana Raja)
qa/suites/rbd: drop cache tiering workload tests (pr#53996, Ilya Dryomov)
qa/suites/upgrade: enable default RBD image features (pr#53352, Ilya Dryomov)
qa/suites/upgrade: fix env indentation in stress-split upgrade tests (pr#53921, Laura Flores)
qa/suites/{rbd,krbd}: disable POOL_APP_NOT_ENABLED health check (pr#53599, Ilya Dryomov)
qa/tests: added - (POOL_APP_NOT_ENABLED) to the ignore list (pr#54436, Yuri Weinstein)
qa: add centos_latest (9.stream) and ubuntu_20.04 yamls to supported-all-distro (pr#54677, Venky Shankar)
qa: add POOL_APP_NOT_ENABLED to ignorelist for cephfs tests (issue#62482, issue#62508, pr#54380, Venky Shankar, Patrick Donnelly)
qa: assign file system affinity for replaced MDS (issue#61764, pr#54037, Venky Shankar)
qa: decrease pgbench scale factor to 32 for postgresql database test (pr#53627, Xiubo Li)
qa: fix cephfs-mirror unwinding and 'fs volume create/rm' order (pr#52656, Jos Collin)
qa: fix keystone in rgw/crypt/barbican.yaml (pr#53412, Ali Maredia)
qa: ignore expected cluster warning from damage tests (pr#53484, Patrick Donnelly)
qa: lengthen shutdown timeout for thrashed MDS (pr#53553, Patrick Donnelly)
qa: move nfs (mgr/nfs) related tests to fs suite (pr#53906, Dhairya Parmar, Venky Shankar)
qa: wait for file to have correct size (pr#52742, Patrick Donnelly)
qa: wait for MDSMonitor tick to replace daemons (pr#52235, Patrick Donnelly)
RadosGW API: incorrect bucket quota in response to HEAD /{bucket}/?usage (pr#53437, shreyanshjain7174)
rbd-mirror: fix image replayer shut down description on force promote (pr#52880, Prasanna Kumar Kalever)
rbd-mirror: fix race preventing local image deletion (pr#52627, N Balachandran)
rbd-nbd: fix stuck with disable request (pr#54254, Prasanna Kumar Kalever)
read balancer documentation (pr#52777, Laura Flores)
Rgw overview dashboard backport (pr#53065, Aashish Sharma)
rgw/amqp: remove possible race conditions with the amqp connections (pr#53516, Yuval Lifshitz)
rgw/amqp: skip idleness tests since it needs to sleep longer than 30s (pr#53506, Yuval Lifshitz)
rgw/crypt: apply rgw_crypt_default_encryption_key by default (pr#52796, Casey Bodley)
rgw/crypt: don't deref null manifest_bl (pr#53590, Casey Bodley)
rgw/kafka: failed to reconnect to broker after idle timeout (pr#53513, Yuval Lifshitz)
rgw/kafka: make sure that destroy is called after connection is removed (pr#53515, Yuval Lifshitz)
rgw/keystone: EC2Engine uses reject() for ERR_SIGNATURE_NO_MATCH (pr#53762, Casey Bodley)
rgw/multisite[archive zone]: fix storing of bucket instance info in the new bucket entrypoint (pr#53466, Shilpa Jagannath)
rgw/notification: pass in bytes_transferred to populate object_size in sync notification (pr#53377, Juan Zhu)
rgw/notification: remove non x-amz-meta-* attributes from bucket notifications (pr#53375, Juan Zhu)
rgw/notifications: allow cross tenant notification management (pr#53510, Yuval Lifshitz)
rgw/s3: ListObjectsV2 returns correct object owners (pr#54161, Casey Bodley)
rgw/s3select: fix per QE defect (pr#54163, galsalomon66)
rgw/s3select: s3select fixes related to Trino/TPCDS benchmark and QE tests (pr#53034, galsalomon66)
rgw/sal: get_placement_target_names() returns void (pr#53584, Casey Bodley)
rgw/sync-policy: Correct "sync status" & "sync group" commands (pr#53395, Soumya Koduri)
rgw/upgrade: point upgrade suites to ragweed ceph-reef branch (pr#53797, Shilpa Jagannath)
RGW: add admin interfaces to get and delete notifications by bucket (pr#53509, Ali Masarwa)
rgw: add radosgw-admin bucket check olh/unlinked commands (pr#53823, Cory Snyder)
rgw: add versioning info to radosgw-admin bucket stats output (pr#54191, Cory Snyder)
RGW: bucket notification - hide auto generated topics when listing topics (pr#53507, Ali Masarwa)
rgw: don't dereference nullopt in DeleteMultiObj (pr#54124, Casey Bodley)
rgw: fetch_remote_obj() preserves original part lengths for BlockDecrypt (pr#52816, Casey Bodley)
rgw: fetch_remote_obj() uses uncompressed size for encrypted objects (pr#54371, Casey Bodley)
rgw: fix 2 null versionID after convert_plain_entry_to_versioned (pr#53398, rui ma, zhuo li)
rgw: fix multipart upload object leaks due to re-upload (pr#52615, J. Eric Ivancich)
rgw: fix rgw rate limiting RGWRateLimitInfo class decode_json max_rea… (pr#53765, xiangrui meng)
rgw: fix SignatureDoesNotMatch when extra headers start with 'x-amz' (pr#53770, rui ma)
rgw: fix unwatch crash at radosgw startup (pr#53760, lichaochao)
rgw: handle http options CORS with v4 auth (pr#53413, Tobias Urdin)
rgw: improve buffer list utilization in the chunkupload scenario (pr#53773, liubingrun)
rgw: pick http_date in case of http_x_amz_date absence (pr#53440, Seena Fallah, Mohamed Awnallah)
rgw: retry metadata cache notifications with INVALIDATE_OBJ (pr#52798, Casey Bodley)
rgw: s3 object lock avoids overflow in retention date (pr#52604, Casey Bodley)
rgw: s3website doesn't prefetch for web_dir() check (pr#53767, Casey Bodley)
RGW: Solving the issue of not populating etag in Multipart upload result (pr#51447, Ali Masarwa)
RGW:notifications: persistent topics are not deleted via radosgw-admin (pr#53514, Ali Masarwa)
src/mon/Monitor: Fix set_elector_disallowed_leaders (pr#54003, Kamoltat)
test/crimson/seastore/rbm: add sub-tests regarding RBM to the existing tests (pr#53967, Myoungwon Oh)
test/TestOSDMap: don't use the deprecated std::random_shuffle method (pr#52737, Leonid Usov)
valgrind: UninitCondition under __run_exit_handlers suppression (pr#53681, Mark Kogan)
xfstests_dev: install extra packages from powertools repo for xfsprogs (pr#52843, Xiubo Li)
Hello Ceph community! Over the past year or so we've been hearing from more and more people who are interested in using encryption with Ceph but don't know what kind of performance impact to expect. Today we'll look at both Ceph's on-disk and over-the-wire encryption performance under a couple of different workloads. For our readers who may not be familiar with what these terms mean, let's review:
On-Disk encryption: This is also sometimes called encryption-at-rest. Data is encrypted when it is written to persistent storage. In Ceph, this is done using LUKS and dm-crypt to fully encrypt the underlying block device(s) that BlueStore uses to store data. This fully encrypts all data stored in Ceph regardless of whether it's block, object, or file data. (A brief deployment example follows these definitions.)
Over-the-wire encryption: Data is encrypted when it is sent over the network. In Ceph, this is done by optionally enabling the "secure" ms mode for messenger version 2 clients. As of Ceph Reef v18.2.0, ms secure mode utilizes 128-bit AES encryption.
Encryption can also be performed at higher levels. For instance, RGW can encrypt objects itself before sending them to the OSDs. For the purposes of this article, however, we'll be focusing on RBD block performance and will utilize the two options listed above.
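As a brief aside for readers who want to reproduce the on-disk configuration: OSDs typically end up on LUKS/dm-crypt either by passing --dmcrypt to ceph-volume or by setting encrypted: true in a cephadm OSD service spec. A minimal sketch of the manual route, assuming an illustrative device path:

# Prepare and activate an encrypted OSD on a single device
sudo ceph-volume lvm create --data /dev/nvme0n1 --dmcrypt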
First, thank you to Clyso for funding this work to benefit the Ceph community. Thank you as well to IBM/Red Hat and Samsung for providing the upstream Ceph community with the hardware used for this testing. Thank you as well to all of the Ceph developers who have worked tirelessly to make Ceph great! Finally, a special thank you to Lee-Ann Pullar for reviewing this article!
Nodes | 10 x Dell PowerEdge R6515 |
---|---|
CPU | 1 x AMD EPYC 7742 64C/128T |
Memory | 128GiB DDR4 |
Network | 1 x 100GbE Mellanox ConnectX-6 |
NVMe | 6 x 4TB Samsung PM983 |
OS Version | CentOS Stream release 8 |
Ceph Version | Reef v18.2.0 (built from source) |
Five of the nodes were configured to host OSDs and the other 5 were configured as client nodes. All nodes are located on the same Juniper QFX5200 switch and connected with a single 100GbE QSFP28 link. Ceph was deployed and FIO tests were launched using CBT. An important OS-level optimization on Intel systems is setting the TuneD profile to either "latency-performance" or "network-latency". This primarily helps by avoiding latency spikes associated with CPU C/P state transitions. AMD Rome-based systems do not appear to be as sensitive in this regard, and I have not confirmed that TuneD actually restricts C/P state transitions on AMD processors. The TuneD profile was nevertheless set to "network-latency" for these tests.
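Applying and verifying the profile is a one-liner per node (a sketch, assuming the tuned package is installed and the daemon is running):

# Switch to the network-latency profile and confirm it took effect
sudo tuned-adm profile network-latency
tuned-adm active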
CBT was configured to deploy Ceph with several modified settings. OSDs were assigned a 16GB osd_memory_target to guarantee a high onode hit rate and eliminate confounding performance impacts. RBD cache was disabled as it can hurt rather than help performance with OSDs backed by fast NVMe drives. Secure mode for msgr V2 was enabled for associated tests using the following configuration options:
ms_client_mode = secure
ms_cluster_mode = secure
ms_service_mode = secure
ms_mon_client_mode = secure
ms_mon_cluster_mode = secure
ms_mon_service_mode = secure
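The other overrides described above were applied through the CBT cluster YAML; on a running cluster the rough equivalent would be the following (the 16GB memory target is shown in bytes):

# Large OSD memory target to keep the onode hit rate high
ceph config set osd osd_memory_target 17179869184

# Disable the client-side RBD cache
ceph config set client rbd_cache false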
Three sets of tests were performed as follows with the cluster rebuilt between each set of tests:
Test Set | Client Processes | io_depth |
---|---|---|
Multi-client | 10 fio processes per node, 5 client nodes | 128 per process |
Single-client | 1 fio process on 1 client node | 128 |
Single-client sync | 1 fio process on 1 client node | 1 |
OSDs were allowed to use all cores on the nodes. FIO was configured to first pre-fill the RBD volume(s) with large writes, followed by 4MB and 4KB IO tests for 300 seconds each. Certain background processes, such as scrub, deep scrub, PG autoscaling, and PG balancing, were disabled. Finally, an RBD pool with a static 16384 PGs (higher than typically recommended) and 3x replication was used.
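To make the client workload concrete, a roughly equivalent standalone fio invocation using the librbd ioengine might look like the following; CBT generates its own job files, and the client, pool, and image names here are illustrative placeholders:

fio --name=rbd-4m-randwrite --ioengine=rbd --clientname=admin --pool=rbdpool \
    --rbdname=cbt-rbd-0 --rw=randwrite --bs=4M --iodepth=128 \
    --time_based --runtime=300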
Ceph utilizes LUKS (via dm-crypt) to encrypt the block device(s) that BlueStore writes data to. There are several tuning options available that may help improve performance. Before diving into the full set of tests, let's look at a couple of those options; an example invocation follows the table below.
Option | Description |
---|---|
--perf-submit_from_crypt_cpus | Disable offloading writes to a separate thread after encryption. There are some situations where offloading write bios from the encryption threads to a single thread degrades performance significantly. The default is to offload write bios to the same thread. This option is only relevant for open action. NOTE: This option is available only for low-level dm-crypt performance tuning, use only if you need a change to default dm-crypt behavior. Needs kernel 4.0 or later. |
--sector-size <bytes> | Set sector size for use with disk encryption. It must be a power of two and in the range 512 - 4096 bytes. The default is 512-byte sectors. This option is available only in LUKS2 mode. |
--perf-no_read_workqueue --perf-no_write_workqueue | Bypass dm-crypt internal workqueue and process read or write requests synchronously. This option is only relevant for open action. NOTE: These options are available only for low-level dm-crypt performance tuning, use only if you need a change to default dm-crypt behavior. Needs kernel 5.9 or later. |
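For illustration only (ceph-volume normally formats and opens the LUKS device itself), here is roughly how these flags are passed to cryptsetup by hand, against an illustrative device path:

# Format with LUKS2 and 4 KiB sectors
sudo cryptsetup luksFormat --type luks2 --sector-size 4096 /dev/nvme0n1

# Open the device without offloading write bios to a separate thread
sudo cryptsetup open --perf-submit_from_crypt_cpus /dev/nvme0n1 osd-block-crypt

# On kernel 5.9 or later the internal workqueues can also be bypassed
sudo cryptsetup open --perf-no_read_workqueue --perf-no_write_workqueue /dev/nvme0n1 osd-block-crypt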
Unfortunately, the kernel available in CentOS Stream 8 does not support disabling the read and write workqueues, so we were not able to test those options. We were, however, able to test the others and did so with an initial focus on the multi-client test configuration.
IO Size | Configuration | Random Reads | Random Writes |
---|---|---|---|
4MB | Unencrypted | 100.00% | 100.00% |
4MB | LUKS | 98.31% | 54.72% |
4MB | LUKS, 4k sectors | 98.56% | 52.67% |
4MB | LUKS, submit_from_crypt_cpus | 98.87% | 69.26% |
Right out of the gate, we see that deploying OSDs on LUKS has only a minor impact on 4MB read performance but a major impact on write performance. This may be slightly misleading, since we are network-limited in the read tests at ~11GB/s per node. For write tests, however, the performance impact is significant. The good news is that using the --perf-submit_from_crypt_cpus option mitigates some of the performance loss.
While we couldn't test the workqueue-related performance options in these tests, there is already a PR in Ceph to enable those options that has been tested by Josh Baergen.
One of the observations that Josh made is that the workqueue options may help improve CPU usage. While we don't have those numbers, let's look at CPU usage in the tests we were able to run.
During reads, system CPU usage increases dramatically when LUKS is utilized (up to 2X CPU usage). In the write tests the overall CPU consumption appears similar or even lower, but once the lower performance is taken into account, the CPU usage per IO is actually higher overall. In both the read and write tests, utilizing a 4K sector size for LUKS appears to result in a slight CPU usage drop.
IO Size | Configuration | Random Reads | Random Writes |
---|---|---|---|
4KB | Unencrypted | 100.00% | 100.00% |
4KB | LUKS | 89.63% | 83.19% |
4KB | LUKS, 4k sectors | 89.73% | 84.14% |
4KB | LUKS, submit_from_crypt_cpus | 89.35% | 83.04% |
The performance impact for 4KB random IOs was lower than for larger 4MB IOs. 4KB reads suffered roughly an 11-12% hit, while writes took closer to a 20% hit.
Aggregate CPU usage was fairly close; however, system usage was slightly higher while user usage was lower (correlated with the lower performance when LUKS is enabled).
The general takeaway from these tests is that the --perf-submit_from_crypt_cpus option can improve LUKS large-write throughput and is likely worth using; we'll be using it for the remainder of the tests in this article. 4K sectors may also be worth enabling for devices that support them natively and may help slightly improve CPU usage in some cases.
Now that we've done some base-level testing for LUKS, let's see what effect the msgr V2 secure mode has with the same high-concurrency workload.
IO Size | Configuration | Random Reads | Random Writes |
---|---|---|---|
4MB | Unencrypted | 100.00% | 100.00% |
4MB | LUKS | 98.87% | 69.26% |
4MB | LUKS, 4k sectors | 94.90% | 87.43% |
4MB | LUKS, submit_from_crypt_cpus | 91.54% | 64.63% |
There is an additive overhead associated with enabling msgr v2 secure mode; however, the bigger effect in these tests is from LUKS.
LUKS again results in significant CPU usage overhead. Interestingly, enabling the secure messenger appears to decrease system CPU consumption when using unencrypted block volumes (i.e., no LUKS). It's not entirely clear why this is the case, though a slight reduction in the IO workload hitting the block layer might explain it.
IO Size | Configuration | Random Reads | Random Writes |
---|---|---|---|
4KB | Unencrypted | 100.00% | 100.00% |
4KB | LUKS | 89.35% | 83.04% |
4KB | LUKS, 4k sectors | 100.25% | 94.96% |
4KB | LUKS, submit_from_crypt_cpus | 89.18% | 81.82% |
The biggest effect is again from LUKS.
The biggest CPU usage effects also come from LUKS, with slightly more system CPU and slightly less user CPU (in association with lower performance).
One of the effects of throwing so much IO at this cluster is that we see high-latency events. In this case, we are actually pushing things a little harder on the read side and seeing worst-case latencies as high as ~900ms. The good news is that neither LUKS nor secure mode has a significant effect on tail latency in this saturation workload. We also know from our previous article, Ceph Reef - 1 or 2 OSDs per NVMe?, that putting multiple OSDs on a single NVMe drive can significantly reduce tail latency at the expense of higher resource consumption.
Last year we looked at QEMU/KVM performance with Msgr V2 AES Encryption and saw that encryption did have a notable effect on single-client performance. We ran several single-client tests here to verify if that is still the case in Reef.
IO Size | Configuration | Random Reads | Random Writes |
---|---|---|---|
4MB | Unencrypted | 100.00% | 100.00% |
4MB | LUKS | 98.80% | 96.65% |
4MB | LUKS, 4k sectors | 87.51% | 73.84% |
4MB | LUKS, submit_from_crypt_cpus | 87.36% | 72.01% |
Earlier, in the multi-client tests, we observed that LUKS generally had the biggest impact: it greatly increased CPU consumption for large IOs and degraded the performance of large writes and small random IO. In these single-client large IO workloads, LUKS has very little impact. Enabling messenger encryption, however, has a significant impact on single-client performance, even though it had a much smaller impact on overall cluster performance in the multi-client tests.
IO Size | Configuration | Random Reads | Random Writes |
---|---|---|---|
4KB | Unencrypted | 100.00% | 100.00% |
4KB | LUKS | 94.95% | 98.63% |
4KB | LUKS, 4k sectors | 97.68% | 99.92% |
4KB | LUKS, submit_from_crypt_cpus | 96.54% | 99.49% |
The impact of both LUKS and messenger-level encryption on single-client small IO, however, is minimal: usually only a couple of percent. What about latency?
The previous multi-client tests showed significant latency spikes due to how hard we were pushing the OSDs. In these single-client tests, latency is far more even. Typical latency for reads hovers right around 0.9ms, with spikes not exceeding 1.15ms. The write case is more interesting. Latency was still significantly lower than in the multi-client tests, typically around 1.5ms. Spikes were typically below 2.5ms, though in the case where both on-disk and messenger-level encryption were used they grew closer to 3.5ms. The spikes appear to be cyclical over time, and that pattern repeats across multiple tests on completely rebuilt clusters. This effect warrants additional investigation.
In the previous single-client tests we still utilized a high io_depth to allow many IOs to stay in flight. This allows multiple OSDs to service IOs concurrently and improve performance. Some applications however require that IOs be handled sequentially. One IO must complete before the next one can be written or read. The etcd journal is a good example of this kind of workload and is typically entirely latency-bound. Each IO must complete at least one round-trip network transfer along with whatever other overhead is required for servicing it from the disk.
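A sketch of the corresponding sync-style fio job, reusing the same illustrative names as the earlier example; the key difference is iodepth=1, which forces each IO to complete before the next is issued:

fio --name=rbd-4k-syncwrite --ioengine=rbd --clientname=admin --pool=rbdpool \
    --rbdname=cbt-rbd-0 --rw=randwrite --bs=4K --iodepth=1 \
    --time_based --runtime=300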
IO Size | Configuration | Random Reads | Random Writes |
---|---|---|---|
4KB | Unencrypted | 100.00% | 100.00% |
4KB | LUKS | 95.87% | 87.88% |
4KB | LUKS, 4k sectors | 100.89% | 94.64% |
4KB | LUKS, submit_from_crypt_cpus | 95.29% | 86.31% |
The biggest effect in this case came from LUKS. 4KB sync reads were slightly slower, while 4KB sync writes showed larger degradation.
Latency followed a similar pattern, with the unencrypted results showing slightly faster response times than the encrypted ones. Read latencies hovered around 0.13ms, while write latencies hovered around 0.4ms with occasional spikes up to around 0.5ms.
In this article we looked at Ceph performance with both on-disk and over-the-wire encryption in a variety of different RBD test scenarios. The results of those tests showcased several nuanced conclusions.
On-disk Encryption (LUKS) | Over-the-wire Encryption (secure msgr) |
---|---|
* 2x OSD CPU Usage for large IOs | * Low effect on CPU consumption |
* High large write cluster impact | * Low-Moderate cluster impact |
* Moderate small IO cluster impact | * High single-client large IO impact |
* Low single-client impact (high io_depth) | * Low small sync IO impact |
* Moderate small sync write impact | |
* Partially mitigated with tuning | |
In general, we expect users to see increased CPU consumption when using on-disk encryption. The greatest impact is expected to be on large IOs. Thankfully, Ceph typically uses more CPU during small IO workloads, so customers who have designed their OSD CPU specifications around IOPS requirements are unlikely to suffer a serious performance impact due to a lack of CPU. Large writes do see a significant performance impact with on-disk encryption; however, this can be partially mitigated, and there is ongoing work that may mitigate it further. On-disk encryption also has a moderate effect on small synchronous write performance. Over-the-wire encryption's biggest performance impact is on maximum single-client throughput. It does have a small performance impact in other cases as well, though the effect is usually minor. As always, I encourage you to test for yourself and see if your findings match what we saw here. Thank you for reading, and if you have any questions or would like to talk more about Ceph performance, please feel free to reach out.