Configuring cachebench parameters
Command line parameters
Cachebench takes command line parameters that control its behavior. Their semantics are as follows:
JSON test configuration
--json_test_config is the most important command line parameter; it specifies the workload and cache configuration for cachebench. See the section below on the JSON config for more details.
Watching progress
While the benchmark runs, you can monitor its progress. The progress update interval can be configured by passing --progress with a duration in seconds.
Recording periodic stats
While the benchmark runs, you can have cachebench output a snapshot of its internal stats to a file periodically. To do this, pass a suitable file location to --progress_stats_file. Cachebench will append stats to this file every --progress interval.
Stopping after a certain duration
If you would like cachebench to terminate after running for a certain duration, use --timeout_seconds to pass a suitable timeout in seconds.
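Putting these flags together, a typical invocation might look like the following (the binary and file paths are illustrative):

  ./cachebench --json_test_config ./test_config.json --progress 60 --progress_stats_file /tmp/cachebench_stats.log --timeout_seconds 3600

This runs for at most an hour, printing progress and appending a stats snapshot every 60 seconds.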
Sample JSON test config
Cachebench takes a JSON config file that provides the workload and cache configuration. The following is a sample JSON config file:
{
  "cache_config" : {
    "cacheSizeMB" : 512,
    "poolRebalanceIntervalSec" : 1,
    "moveOnSlabRelease" : false,
    "numPools" : 2,
    "poolSizes" : [0.3, 0.7]
  },
  "test_config" : {
    "numOps" : 100000,
    "numThreads" : 32,
    "numKeys" : 1000000,
    "distribution" : "range",
    "opDelayBatch" : 1,
    "opDelayNs" : 200,
    "keySizeRange" : [1, 8, 64],
    "keySizeRangeProbability" : [0.3, 0.7],
    "valSizeRange" : [1, 32, 10240, 409200],
    "valSizeRangeProbability" : [0.1, 0.2, 0.7],
    "getRatio" : 0.15,
    "setRatio" : 0.8,
    "delRatio" : 0.05,
    "keyPoolDistribution": [0.4, 0.6],
    "opPoolDistribution" : [0.5, 0.5]
  }
}
This config file controls the parameters for the cache and the generated synthetic workload in two separate sections.
Tuning workload parameters
You can tune the workload parameters by modifying the test_config portion of the JSON file. The workload generator operates over a key space and the keys' associated sizes, and generates the cachebench operations to be executed for those keys.
Duration of replay
To run cachebench operations for longer, increase numOps appropriately in the config file.
Number of benchmark threads
You can adjust numThreads to run the benchmark with more threads. Running with more threads should increase throughput until you run out of CPU or hit other bottlenecks from resource contention. For in-memory workloads, it is not recommended to set this beyond the hardware concurrency supported on your machine.
Number of keys in cache
To adjust the working set size of the cache, you can increase or decrease the numKeys that the workload picks from.
Operation ratios
Cachebench picks operation types according to their specified ratios. The supported operation types are:
- getRatio: Generates a get request, resulting in a find API call.
- setRatio: Generates a set request, overriding any previous version of the key if it exists. This results in a call to the allocate() API, followed by a call to the insertOrReplace() API.
- delRatio: Generates a remove request to remove a key from the cache.
- addChainedRatio: Generates operations that allocate a chained allocation and add it to an existing key. If the key is not present, it is created.
- loneGetRatio: Generates a get request for a key that is definitely not present in the cache, to simulate one-hit-wonders or churn.
In conjunction with these operations, enableLookaside emulates a behavior where missing keys are set in the cache. When this is used, setRatio is usually not configured.
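For example, a get-heavy lookaside-style workload could be sketched in the test_config section as follows (the ratios are illustrative and, like the sample config above, sum to 1):

  "getRatio" : 0.95,
  "delRatio" : 0.05,
  "enableLookaside" : true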
Workload generator
You can configure one of three types of workload generators through the generator parameter by specifying the corresponding identifier string (an example follows the list):
- workload: Generates keys and popularity ahead of time. This is the generator with the lowest run time overhead and hence is useful for measuring maximum throughput. However, it keeps the keys and generated cache operations in memory, so it is not suitable when your memory footprint needs to be contained.
- online: Generates keys and popularity online. Has very low memory overhead, but consumes marginal CPU to generate the synthetic workload.
- replay: Replays a trace file passed in. The trace file should contain lines of comma-separated key, size, and number of accesses.
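For example, to switch to the online generator for a long-running, memory-constrained run, you would set (a minimal sketch):

  "generator" : "online"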
Popularity and size distribution
Cachebench supports generating synthetic workloads using a few techniques. The technique is selected through the distribution parameter, and depending on the selected technique, additional parameters can be configured. The supported techniques are:
- default: Generates popularity of keys through a discrete distribution specified by the popularityBuckets and popularityWeights parameters. Discrete sizes are generated through a discrete distribution specified through valSizeRange and valSizeRangeProbability. The value size configuration can be provided inline as an array or through a valSizeDistFile in JSON format.
- normal: Uses a normal distribution for key popularity, as opposed to discrete popularity buckets. For value sizes, it supports both discrete and continuous value size distributions. To use a discrete value size distribution, valSizeRangeProbability should have the same number of values as the valSizeRange array. When valSizeRangeProbability contains one fewer member than valSizeRange, each probability is interpreted as corresponding to an interval in valSizeRange and a piecewise_constant_distribution is used.
In all of the above setups, if valSizeDistFile is present, cachebench overrides the valSizeRange and valSizeRangeProbability from the inline JSON arrays.
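As an illustration of the piecewise interpretation, the following hypothetical normal-distribution config supplies one fewer probability than there are valSizeRange entries, so value sizes are drawn from the intervals [64, 512) and [512, 4096) with weights 0.2 and 0.8:

  "distribution" : "normal",
  "valSizeRange" : [64, 512, 4096],
  "valSizeRangeProbability" : [0.2, 0.8]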
Throttling the benchmark
To measure the performance of hardware at a certain throughput, cachebench can be artificially throttled by specifying a non-zero opDelayNs, which is applied after every opDelayBatch worth of operations per thread. To run un-throttled, set opDelayNs to zero.
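For example, with "opDelayBatch" : 1 and "opDelayNs" : 200 as in the sample config above, each thread pauses 200 ns per operation, which caps it at roughly 5 million operations per second before accounting for the work itself.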
Consistency checking
You can enable runtime consistency checking of the APIs through cachebench. In this mode, cachebench validates the correctness semantics of the APIs. This is useful when you make a change to CacheLib and want to catch any data races resulting in incorrect API semantics.
Populating items
You can enable populateItem to fill cache items with random bytes. When consistency mode is enabled, items are populated automatically with unique values for validation.
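A minimal sketch of these test_config knobs, assuming the consistency-checking mode is enabled via a flag named checkConsistency (verify the exact name against your cachebench build):

  "populateItem" : true,
  "checkConsistency" : true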
Tuning DRAM cache parameters
The cache_config section specifies knobs that control how the cache is configured. The following options configure the DRAM cache parameters, which come into play in hybrid cache mode as well as stand-alone DRAM cache mode.
DRAM cache size
You can set cacheSizeMB to specify the size of the DRAM cache.
Allocator type and its eviction parameters
CacheLib offers a choice between LruAllocator and Lru2QAllocator. You can select one by setting allocator to "LRU" or "LRU-2Q". Based on the type you choose, you can configure the corresponding DRAM eviction properties listed below; a combined sketch follows the lists.
Common options for LruAllocator and Lru2QAllocator:
- lruRefreshSec: Seconds since the last access after which an access initiates an LRU bump.
- lruRefreshRatio: LRU refresh time specified as a ratio of the eviction age.
- lruUpdateOnRead: Controls whether read accesses update the LRU position.
- lruUpdateOnWrite: Controls whether write accesses update the LRU position.
- tryLockUpdate: Skips updating the LRU position on lock contention.
Options for LruAllocator:
- lruIpSpec: Insertion point expressed as a power of two.
Options for Lru2QAllocator:
- lru2qHotPct: Percentage of the LRU dedicated to hot items.
- lru2qColdPct: Percentage of the LRU dedicated to cold items.
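A hypothetical Lru2QAllocator configuration combining these knobs (the values are illustrative, not tuned recommendations):

  "allocator" : "LRU-2Q",
  "lruUpdateOnRead" : true,
  "lruUpdateOnWrite" : false,
  "tryLockUpdate" : true,
  "lru2qHotPct" : 20,
  "lru2qColdPct" : 20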
For more details on the semantics of these parameters, see the Eviction Policy guide.
Pools
The DRAM cache can be split into multiple pools. To create the pools, set numPools to the required number of pools and set the poolSizes array to the relative sizes of the pools.
When using pools, you can tune the workload to generate a different workload per pool. To split the keys and operations across pools, specify the following (see the sketch after this list):
- Breakdown of keys: through the keyPoolDistribution array, where each value represents the relative share of numKeys assigned to that pool.
- Breakdown of operations: through the opPoolDistribution array, where each value represents the relative share of numOps directed at that pool.
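For example, the sample config above splits the cache into two pools sized 30/70 and could skew keys and operations per pool like so (illustrative values):

  "cache_config" : {
    "numPools" : 2,
    "poolSizes" : [0.3, 0.7]
  },
  "test_config" : {
    "keyPoolDistribution" : [0.4, 0.6],
    "opPoolDistribution" : [0.5, 0.5]
  }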
You can specify a separate array of workload configs that describes the key, size, and popularity distribution per pool through poolDistributionConfig. If not specified, the global configuration is applied across all the pools.
Allocation sizes
You can specify custom allocation sizes by passing in an allocSizes array. If allocSizes is not present, default allocation sizes are used, growing by a factor of 1.5 from 64 bytes to 1 MB. To control the allocation sizes through the growth factor instead, specify allocFactor as a double and set minAllocSize and maxAllocSize.
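For example, to spell out something like the default growth curve explicitly (assuming minAllocSize and maxAllocSize are specified in bytes):

  "allocFactor" : 1.5,
  "minAllocSize" : 64,
  "maxAllocSize" : 1048576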
Access config parameters
CacheLib uses a hash table to index keys, and the hash table's configuration can have a big impact on throughput. htBucketPower controls the number of hash table buckets and htLockPower configures the number of locks. Usually, these should be configured in conjunction with the observed number of items in DRAM once the cache warms up.
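As a rough sizing sketch: with about one million items resident in DRAM, you might pick an htBucketPower of 20 (2^20, roughly one million buckets, or about one item per bucket) and a smaller htLockPower, since locks are shared across buckets. These values are illustrative, not tuned recommendations:

  "htBucketPower" : 20,
  "htLockPower" : 10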
Pool rebalancing
To enable cachelib pool rebalancing, set poolRebalanceIntervalSec. The default strategy is to randomly release a slab, to test for correctness. You can configure this to your preference by setting rebalanceStrategy to "tail-age" or "hits". You can also specify rebalanceMinSlabs and rebalanceDiffRatio to configure this further, per the Pool rebalancing guide.
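A sketch enabling hits-based rebalancing every 30 seconds (the values are illustrative):

  "poolRebalanceIntervalSec" : 30,
  "rebalanceStrategy" : "hits",
  "rebalanceMinSlabs" : 2,
  "rebalanceDiffRatio" : 0.1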
Hybrid cache parameters
Hybrid cache parameters are configured under the cache_config section. To enable hybrid cache in cachebench, specify a non-zero value for the nvmCacheSizeMB parameter.
Storage file/device/directory path info
You can configure hybrid cache in multiple modes. By default, if you set only nvmCacheSizeMB and nothing else, cachebench uses an in-memory file device for simplicity. This is often used to quickly test correctness. To use an actual non-volatile medium, configure nvmCachePaths, which takes an array of strings.
If nvmCachePaths is a single-element array referring to a directory, cachebench will create a suitable file inside that path and clean it up upon exit. If instead it is a single-element array referring to a file or a raw device, cachebench will use it as is and leave it in place upon exit. If the specified file is a regular file and is not of the specified size, CacheLib will try to fallocate it to the necessary size. If more than one path is specified, CacheLib will use software RAID-0 across them and treat each file as being of size nvmCacheSizeMB. By default, CacheLib uses direct IO.
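For example, to back a 10 GB hybrid cache with a file created inside a directory on an SSD mount (the path is illustrative; because it is a directory, cachebench creates and cleans up the file):

  "nvmCacheSizeMB" : 10240,
  "nvmCachePaths" : ["/mnt/ssd0/cachebench"]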
Monitoring write amplification
Cachebench can monitor the write amplification of supported underlying devices if you specify them through writeAmpDeviceList as an array of device paths. If a device is unsupported, an exception is logged, but the test proceeds. If the list is empty, no monitoring is performed.
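For instance, to watch a single NVMe device (the device path is illustrative):

  "writeAmpDeviceList" : ["/dev/nvme0n1"]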
Storage engine parameters
Set the following parameters to control the performance of the hybrid cache storage engine; a sketch follows the list. See Hybrid Cache for more details.
- navyReaderThreads and navyWriterThreads: Control the sizes of the reader and writer thread pools.
- navyAdmissionWriteRateMB: Throttle limit on the logical write rate, to maintain the device endurance limit.
- navyMaxConcurrentInserts: Throttle limit on in-flight hybrid cache writes.
- navyParcelMemoryMB: Throttle limit on the memory footprint of in-flight writes.
- navyDataChecksum: Enables check-summing the data in addition to the headers.
- navyEncryption: Enables transparent device-level encryption.
- navyReqOrderShardsPower: Number of shards used for request ordering, expressed as a power of two. The default is 21, corresponding to about 2 million shards. More shards mean fewer false positives and better concurrency, but the benefit plateaus beyond a certain point.
- truncateItemToOriginalAllocSizeInNvm: Truncates items to their allocated size to optimize write performance.
- deviceMaxWriteSize: The largest IO size written to the device. Any IO above this size is split into multiple IOs.
- deviceEnableFDP: Enables the use of FDP in Navy, which segregates BigHash and BlockCache writes on the SSD.
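A hypothetical starting point for these knobs (the values are illustrative; tune them against your device and workload):

  "navyReaderThreads" : 32,
  "navyWriterThreads" : 16,
  "navyParcelMemoryMB" : 1024,
  "navyDataChecksum" : true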
Small item engine parameters
Use the following options to tune the performance of the small item engine (BigHash), with a sketch after the list. BigHash operates as a FIFO cache on SSD and is optimized for caching small objects.
- navySmallItemMaxSize: Object size threshold for the small item engine.
- navyBigHashSizePct: When non-zero, enables the small item engine and sets its relative size.
- navyBigHashBucketSize: Bucket size for the small item engine.
- navyBloomFilterPerBucketSize: Size in bytes of the bloom filter per bucket.
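For example, to dedicate half of the flash cache to BigHash for objects up to 640 bytes (the values are illustrative):

  "navyBigHashSizePct" : 50,
  "navySmallItemMaxSize" : 640,
  "navyBigHashBucketSize" : 4096,
  "navyBloomFilterPerBucketSize" : 8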
Large item engine parameters
Use the following options to tune the large item engine (BlockCache), with a sketch after the list. BlockCache is designed for caching objects that are around or larger than the device block size. It supports a variety of eviction policies (FIFO, LRU, and segmented FIFO) and can operate in stacked mode or with size classes.
- navyBlockSize: Underlying device block size for IO alignment.
- navySegmentedFifoSegmentRatio: By default, Navy uses coarse-grained LRU. To use FIFO, set this parameter to an array with a single value. To use segmented FIFO, configure the number of segments by specifying their ratios.
- navyHitsReinsertionThreshold: Controls the threshold for reinserting items based on their number of hits.
- navyProbabilityReinsertionThreshold: Controls probability-based reinsertion of items.
- navyNumInmemBuffers: Number of in-memory buffers used to optimize write performance.
- navyCleanRegions: When un-buffered, the size of the clean regions pool.
- navyRegionSizeMB: Controls the region size used by BlockCache. If not specified, 16 MB is used. See Configure HybridCache for more details.
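A sketch of a segmented-FIFO BlockCache setup (the values are illustrative; the two-element ratio array yields two segments):

  "navyBlockSize" : 4096,
  "navyRegionSizeMB" : 16,
  "navySegmentedFifoSegmentRatio" : [1, 1],
  "navyHitsReinsertionThreshold" : 1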