Feature #42286

Introduction of tier local mode

Added by Honggang Yang over 4 years ago. Updated over 4 years ago.

Status: New
Priority: Normal
% Done: 0%

Description

Introduction

Based on kiizawa's patch (#18211), we implemented a new cache tier mode, local mode. In this mode, an OSD is configured to manage two data devices: one fast device and one slow device. Hot objects are promoted from the slow device to the fast device, and demoted back to the slow device when they become cold.

This work is based on Ceph v12.2.5. I'm glad to port it to the master branch if needed.

https://github.com/yanghonggang/ceph/commits/wip-tier-new

Advantages of tier local mode

The local mode tier has the following advantages:
- Object migration can be accomplished inside the OSD without network traffic overhead.
- There is no need to create an extra cache pool, as is required with the pool tier.
- There is only one copy of each object, whether on the fast device or on the slow device, so the total capacity of an OSD is the sum of the fast device's size and the slow device's size.
- Fast devices can be used to accelerate all pools built upon the fast + slow devices.
- The user/caller can use a hint request to indicate the placement of an object.

Introduction to related modules

A. Object access statistics
We can reuse the existing HitSet mechanism.
B. Demote agent
We can modify the existing pool tier's demote agent to fit our purpose.
C. Migration
The PrimaryLogPG layer triggers a demotion by issuing a set_alloc_hint request to the ObjectStore layer, and do_op can trigger a promotion by issuing a set_alloc_hint request to the ObjectStore layer:
- promote: set_alloc_hint(..., fast_flag)
- demote: set_alloc_hint(..., flags_with_fast_flag_cleared)

The migration itself is carried out by the BlueStore/FileStore layer. For now, only BlueStore migration is supported. This part is based on kiizawa's patch (#18211), with some serious problems fixed.
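
To make the interface concrete, here is a minimal, self-contained C++ sketch (not actual Ceph code: MockStore, HINT_FLAG_FAST and the simplified set_alloc_hint signature are invented for illustration) of how a fast-placement flag carried by a set_alloc_hint-style call could drive migration inside the object store:

// Illustrative mock only; in the real patch the flag travels with the
// ObjectStore transaction's alloc-hint flags and BlueStore moves the data.
#include <cstdint>
#include <iostream>
#include <string>
#include <unordered_map>

constexpr uint32_t HINT_FLAG_FAST = 1u << 0;  // hypothetical "place on fast device" flag

class MockStore {
 public:
  void set_alloc_hint(const std::string& oid, uint32_t flags) {
    bool want_fast = flags & HINT_FLAG_FAST;
    bool& cur_fast = placement_[oid];  // new objects start on the slow tier
    if (want_fast != cur_fast) {
      // This is where the store would migrate the object's data
      // between the slow and fast block devices.
      std::cout << oid << ": " << (cur_fast ? "fast" : "slow")
                << " -> " << (want_fast ? "fast" : "slow") << "\n";
      cur_fast = want_fast;
    }
  }

 private:
  std::unordered_map<std::string, bool> placement_;  // oid -> on fast tier?
};

int main() {
  MockStore store;
  store.set_alloc_hint("myobj", HINT_FLAG_FAST);  // promote: fast flag set
  store.set_alloc_hint("myobj", 0);               // demote: fast flag cleared
}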

D. Fast device usage statistics
int64_t num_bytes_fast;    // objects in bytes on fast tier
int64_t num_objects_fast;  // number of objects on fast tier
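
As a rough sketch of how these counters could feed the demote agent described in B, consider the self-contained example below; FastTierStats, fast_capacity and target_ratio are hypothetical names, and the target ratio only mirrors the idea behind cache_target_dirty_ratio used in the evaluation further down:

// Sketch only: demote cold objects until fast-tier usage falls back
// under a target fraction of the fast device's capacity.
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

struct FastTierStats {
  int64_t num_bytes_fast;    // objects in bytes on fast tier
  int64_t num_objects_fast;  // number of objects on fast tier
};

struct ObjInfo {
  std::string oid;
  int64_t bytes;
  bool hot;  // in reality derived from HitSet recency, not a stored flag
};

void demote_agent_tick(const std::vector<ObjInfo>& fast_objs,
                       FastTierStats& stats,
                       int64_t fast_capacity, double target_ratio) {
  for (const auto& o : fast_objs) {
    if (stats.num_bytes_fast <= fast_capacity * target_ratio)
      break;              // usage is back under the target, stop demoting
    if (o.hot)
      continue;           // keep hot objects on the fast device
    // A real demotion would clear the fast flag via set_alloc_hint;
    // here we only update the statistics.
    stats.num_bytes_fast -= o.bytes;
    stats.num_objects_fast -= 1;
    std::cout << "demote " << o.oid << "\n";
  }
}

int main() {
  FastTierStats stats{900, 2};
  std::vector<ObjInfo> objs = {{"a", 500, false}, {"b", 400, true}};
  demote_agent_tick(objs, stats, /*fast_capacity=*/1000, /*target_ratio=*/0.7);
}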

Apart from the work above, we also need:
E. rados tool support
- rados put: add a --fast parameter to place the object on the fast device
- rados ls: add a --more parameter to list each object's placement
- cache-demote-all: demote all objects to the slow device
F. Deploy tool support
I added a --block.fast option to ceph-disk to specify the fast device:

# ceph-disk prepare --osd-id 1 --block.db /dev/nvme0n1 --block.wal /dev/nvme0n1 --block.fast /dev/nvme0n1 /dev/sdi

sdi             8:128  0 558.4G  0 disk 
|-sdi1          8:129  0   100M  0 part /var/lib/ceph/osd/ceph-1
`-sdi2          8:130  0 558.3G  0 part 
sdj             8:144  0 558.4G  0 disk 
|-sdj1          8:145  0   100M  0 part /var/lib/ceph/osd/ceph-0
`-sdj2          8:146  0 558.3G  0 part 
sdk             8:160  0 558.4G  0 disk 
|-sdk1          8:161  0   100M  0 part 
`-sdk2          8:162  0 558.3G  0 part 
nvme0n1       259:0    0 349.3G  0 disk 
|-nvme0n1p1   259:1    0     8G  0 part <------<<< db
|-nvme0n1p2   259:2    0   576M  0 part <------<<< wal
`-nvme0n1p3   259:3    0     1G  0 part <-------<<< fast

How to use the local mode tier

Set up a Ceph cluster with vstart.sh:

$ CEPH_NUM_MON=1 CEPH_NUM_OSD=1 CEPH_NUM_MDS=0 CEPH_NUM_MGR=1  CEPH_NUM_RGW=0 ../src/vstart.sh  -X -l -b -n --create_fast_dev
$ ls dev/osd0/ -l
total 360
-rw-r--r-- 1 ubuntu ubuntu 10737418240 Sep 15 21:47 block
lrwxrwxrwx 1 ubuntu ubuntu          54 Sep 15 21:47 block.db -> /home/ubuntu/work/my-tier/build/dev/osd0/block.db.file
-rw-r--r-- 1 ubuntu ubuntu    67108864 Sep 15 21:47 block.db.file
-rw-r--r-- 1 ubuntu ubuntu  1073741824 Sep 15 21:47 block.fast
lrwxrwxrwx 1 ubuntu ubuntu          55 Sep 15 21:47 block.wal -> /home/ubuntu/work/my-tier/build/dev/osd0/block.wal.file
-rw-r--r-- 1 ubuntu ubuntu  1048576000 Sep 15 21:47 block.wal.file
-rw-r--r-- 1 ubuntu ubuntu           2 Sep 15 21:47 bluefs

Create a pool:

$ ceph osd pool create testpool 8 8

Enable tier local mode:

$ ceph osd tier cache-mode testpool local --yes-i-really-mean-it
set cache-mode for pool 'testpool' to local

Hitset settings:

$ ceph osd pool set testpool hit_set_type bloom
$ ceph osd pool set testpool hit_set_count 4
$ ceph osd pool set testpool hit_set_period 10
$ ceph osd pool set testpool min_read_recency_for_promote 3
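
For reference, the effect of min_read_recency_for_promote can be pictured with the simplified, self-contained sketch below; it follows the recency check of the existing pool tier (the object must be present in the N most recent HitSets before a read triggers a promotion), with plain string sets standing in for the real HitSet objects:

// Simplified recency check; with the settings above an object has to be
// read in 3 consecutive HitSet periods before it is promoted.
#include <deque>
#include <iostream>
#include <string>
#include <unordered_set>

using HitSet = std::unordered_set<std::string>;

bool should_promote(const std::string& oid,
                    const std::deque<HitSet>& recent_hitsets,  // newest first
                    unsigned min_recency) {
  unsigned in_hit_set = 0;
  for (const auto& hs : recent_hitsets) {
    if (!hs.count(oid))
      break;                         // must appear in consecutive recent sets
    if (++in_hit_set >= min_recency)
      return true;
  }
  return false;
}

int main() {
  // Object seen in the 3 newest of 4 retained HitSets -> promote.
  std::deque<HitSet> hitsets = {{"myobj"}, {"myobj"}, {"myobj"}, {}};
  std::cout << std::boolalpha
            << should_promote("myobj", hitsets, /*min_recency=*/3) << "\n";
}

This matches the transcript below, where myobj only moves to the fast device after being read repeatedly across several HitSet periods.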

Put an object and trigger a promotion:

$ rados -p testpool put myobj Makefile
$ rados -p testpool ls --more
myobj    slow
$ for i in {0..2}; do rados -p testpool stat myobj; rados -p testpool ls --more; sleep 8; done 2>/dev/null
testpool/myobj mtime 2019-09-15 22:23:05.000000, size 251749 on_fast 0
myobj    slow
testpool/myobj mtime 2019-09-15 22:23:05.000000, size 251749 on_fast 0
myobj    slow
testpool/myobj mtime 2019-09-15 22:23:49.000000, size 251749 on_fast 1
myobj    fast

Check pool usage info:

$ rados df 2>/dev/null
POOL_NAME USED OBJECTS CLONES COPIES MISSING_ON_PRIMARY UNFOUND DEGRADED RD_OPS RD WR_OPS WR    TUSED TOBJECTS
testpool  247k      17      0     17                  0       0        0     25  0     17 1722k  245k        1

total_objects    17
total_used       1027M
total_avail      9212M
total_space      10240M

Performance evaluation

In order to evaluate the performance of the tier local mode, I set up a MySQL database on an RBD volume and used sysbench to test its performance.

local mode tier:

block: 560G hdd
db: 20G ssd
fast: 30G ssd
cache_target_dirty_ratio 0.7

default:

block: 560G hdd
db: 50GB ssd

bcache (writeback mode):

writeback_percent 40 (I wanted to set it to 70, but its maximum value is 40 :( )
block: bcache0 (560G hdd + 30G ssd)
db: 20G

Bench script:

# cat rw-bench.sh
sysbench /usr/share/sysbench/oltp_read_write.lua  \
--threads=20 \
--mysql_storage_engine=innodb \
--mysql_host=localhost \
--mysql_db=test \
--mysql_user=root \
--mysql_password= --db_driver=mysql \
--tables=200 \
--table_size=1000000 \
--time=7200 \
$1

Attachments (Honggang Yang): arch.png (282 KB), bluestore.png (96.5 KB), tb.png (53.6 KB), cas-vs-local.jpeg (218 KB)

History

#2 Updated by Honggang Yang over 4 years ago

After the sysbench prepare operation completed, about 48883 MB of database data had been generated, so eviction was taking place during the sysbench run stage.

#3 Updated by Honggang Yang over 4 years ago

I also compared the local mode tier with Intel CAS:

- default: no tier
- tiering: local tier mode
- CAS: Intel CAS

#4 Updated by Honggang Yang over 4 years ago

Honggang Yang wrote:

I also compared the local mode tier with Intel CAS:

- default: no tier
- tiering: local tier mode
- CAS: Intel CAS

random_distribution=zipf:1.1
