Configuring cachebench parameters
Command line parameters
Cachebench takes command line parameters to control its behavior. The following are the semantics of the command line parameters:
JSON test configuration
--json_test_config is the most important command line parameter that is needed for specifying the workload and cache configuration for cachebench. See the section below on JSON config for more details.
Watching progress
While the benchmark runs, you can monitor its progress. The interval for progress updates can be configured by passing --progress a duration in seconds.
Recording periodic stats
While the benchmark runs, you can have cachebench output a snapshot of its internal stats to a file periodically. To do this, pass a suitable file location to --progress_stats_file. cachebench will append stats to this file every --progress interval.
Stopping after a certain duration
If you would like cachebench to terminate after running for a fixed duration, pass a suitable timeout in seconds through --timeout_seconds.
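Putting these together, a typical invocation might look like the following (the binary and file paths are illustrative):

./cachebench --json_test_config=test_config.json --progress=60 --progress_stats_file=/tmp/cachebench.log --timeout_seconds=3600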
Sample JSON test config
Cachebench takes in a JSON config file that provides the workload and cache configuration. The following is a sample JSON config file:
{
  "cache_config": {
    "cacheSizeMB": 512,
    "poolRebalanceIntervalSec": 1,
    "moveOnSlabRelease": false,
    "numPools": 2,
    "poolSizes": [0.3, 0.7]
  },
  "test_config": {
    "numOps": 100000,
    "numThreads": 32,
    "numKeys": 1000000,
    "distribution": "range",
    "opDelayBatch": 1,
    "opDelayNs": 200,
    "keySizeRange": [1, 8, 64],
    "keySizeRangeProbability": [0.3, 0.7],
    "valSizeRange": [1, 32, 10240, 409200],
    "valSizeRangeProbability": [0.1, 0.2, 0.7],
    "getRatio": 0.15,
    "setRatio": 0.8,
    "delRatio": 0.05,
    "keyPoolDistribution": [0.4, 0.6],
    "opPoolDistribution": [0.5, 0.5]
  }
}
This config file controls the parameters for the cache and the generated synthetic workload in two separate sections.
Tuning workload parameters
You can tune the workload parameters by modifying the test_config portion of the JSON file. The workload generator operates over a key space with associated key and value sizes, and generates the cachebench operations to be executed for those keys.
Duration of replay
To run the cachebench operations for longer, increase numOps appropriately in the config file.
Number of benchmark threads
You can adjust numThreads to run the benchmark with more threads. Running with more threads should increase throughput until you run out of CPU or hit other bottlenecks from resource contention. For in-memory workloads, it is not recommended to set this beyond the hardware concurrency supported on your machine.
Number of keys in cache
To adjust the working set size of the cache, you can increase or decrease the numKeys that the workload picks from.
Operation ratios
Cachebench picks operation types according to the specified popularity ratios. The following are the supported operation types:
- getRatio: Generates a get request, resulting in a find() API call.
- setRatio: Generates a set request, overriding any previous version of the key if it exists. This results in a call to the allocate() API, followed by a call to the insertOrReplace() API.
- delRatio: Generates a remove request to remove a key from the cache.
- addChainedRatio: Generates operations that allocate a chained allocation and add it to an existing key. If the key is not present, it is created.
- loneGetRatio: Generates a get request for a key that is definitely not present in the cache, to simulate one-hit-wonders or churn.
In conjunction with these operations, enableLookaside emulates a behavior where missing keys are set in the cache. When this is used, setRatio is usually not configured.
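For example, a lookaside-style, read-heavy workload could be sketched as follows (the ratios are illustrative):

{
  "test_config": {
    "enableLookaside": true,
    "getRatio": 0.95,
    "delRatio": 0.05
  }
}

Since misses from get requests are filled into the cache, no setRatio is configured.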
Workload generator
You can configure three types of workload generators through the generator parameter by specifying the corresponding identifier string.
- workload: Generates keys and popularity ahead of time. This is the generator with the lowest runtime overhead and hence is useful for measuring maximum throughput. However, it keeps the keys and generated cache operations in memory, and is not suitable when your memory footprint needs to be contained.
- online: Generates keys and popularity online. Has very low memory overhead, but consumes marginal CPU to generate the synthetic workload.
- replay: Replays a trace file passed in. The trace file should contain lines with comma-separated key, size, and number of accesses.
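For example, to pick the low-memory online generator, a minimal sketch:

{
  "test_config": {
    "generator": "online"
  }
}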
Popularity and size distribution
cachebench supports generating synthetic workloads using a few techniques. The technique is configured through the distribution argument. Based on the selected technique, there can be additional parameters to configure. The supported techniques are:
- default: Generates the popularity of keys through a discrete distribution specified by the popularityBuckets and popularityWeights parameters. Discrete sizes are generated through a discrete distribution specified through valSizeRange and valSizeRangeProbability. The value size configuration can be provided inline as an array or through a valSizeDistFile in JSON format.
- normal: Uses a normal distribution for the popularity of keys, as opposed to discrete popularity buckets. For value sizes, it supports both discrete and continuous value size distributions. To use a discrete value size distribution, valSizeRangeProbability should have the same number of values as the valSizeRange array. When valSizeRangeProbability contains one less member than valSizeRange, the probabilities are interpreted as corresponding to each interval in valSizeRange and a piecewise_constant_distribution is used.
In all of the above setups, cachebench overrides valSizeRange and valSizeRangeProbability from the inline JSON arrays if valSizeDistFile is present.
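To illustrate the two interpretations: the sample config above uses the interval form, where valSizeRange has four values and valSizeRangeProbability has three, so 10% of values fall in [1, 32), 20% in [32, 10240), and 70% in [10240, 409200). A discrete form with equal-length arrays would instead look like this sketch:

{
  "test_config": {
    "valSizeRange": [64, 512, 4096],
    "valSizeRangeProbability": [0.2, 0.5, 0.3]
  }
}

Here every value is exactly 64, 512, or 4096 bytes, with the given probabilities.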
Throttling the benchmark
To measure the performance of hardware at a certain throughput, cachebench can be artificially throttled by specifying a non-zero opDelayNs, which is applied every opDelayBatch worth of operations per thread. To run un-throttled, set opDelayNs to zero.
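As a concrete sketch, the following delays each thread by 2000ns after every 10 operations, capping per-thread throughput at roughly 5 million operations per second (10 ops per 2µs), before accounting for the cost of the operations themselves:

{
  "test_config": {
    "opDelayBatch": 10,
    "opDelayNs": 2000
  }
}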
Consistency checking
You can enable runtime consistency checking of the APIs through cachebench. In this mode, cachebench validates the correctness semantics of the APIs. This is useful when you make a change to CacheLib and want to validate that no data races result in incorrect API semantics.
Populating items
You can enable populateItem to fill cache items with random bytes. When consistency mode is enabled, cachebench populates the items automatically with unique values for validation.
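A minimal sketch combining the two (populateItem is described above; the checkConsistency key name is an assumption about the config schema):

{
  "test_config": {
    "checkConsistency": true,
    "populateItem": true
  }
}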
Tuning DRAM cache parameters
The cache_config section specifies knobs to control how the cache is configured. The following options are available to configure the DRAM cache parameters. DRAM cache parameters come into play both when using a hybrid cache and when running in stand-alone DRAM cache mode.
DRAM cache size
You can set cacheSizeMB to specify the size of the DRAM cache.
Allocator type and its eviction parameters
CacheLib supports LruAllocator and Lru2QAllocator to choose from. You can select one by setting allocator to "LRU" or "LRU-2Q". Based on the type you choose, you can configure the corresponding properties of DRAM eviction.
Common options for LruAllocator and Lru2QAllocator:
- lruRefreshSec: Seconds since last access that initiate a bump on access.
- lruRefreshRatio: LRU refresh time specified as a ratio of the eviction age.
- lruUpdateOnRead: Controls whether read accesses update the LRU position.
- lruUpdateOnWrite: Controls whether write accesses update the LRU position.
- tryLockUpdate: Skips updating the LRU position on lock contention.
Options for LruAllocator:
- lruIpSpec: Insertion point expressed as a power of two.
Options for Lru2QAllocator:
- lru2qHotPct: Percentage of the LRU dedicated to hot items.
- lru2qColdPct: Percentage of the LRU dedicated to cold items.
For more details on the semantics of these parameters, see the Eviction Policy guide.
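For example, a cache_config selecting the 2Q allocator and tuning its queues might look like this sketch (the values are illustrative):

{
  "cache_config": {
    "cacheSizeMB": 512,
    "allocator": "LRU-2Q",
    "lru2qHotPct": 20,
    "lru2qColdPct": 20,
    "lruUpdateOnWrite": false
  }
}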
Pools
The DRAM cache can be split into multiple pools. To create the pools, set numPools to the required number of pools and set the poolSizes array to represent the relative sizes of the pools.
When using pools, you can tune the workload to generate a different workload per pool. To split the keys and operations across pools, specify the following:
- Breakdown of keys through the keyPoolDistribution array, where each value represents the relative footprint of keys from numKeys.
- Breakdown of operations per pool through opPoolDistribution, where each value represents the relative footprint of numOps across pools.
You can specify a separate array of workload configs that describes the key, size, and popularity distribution per pool through poolDistributionConfig. If not specified, the global configuration is applied across all the pools.
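Tying these together, the sample config above carves the cache into two pools and splits the workload across them; a condensed sketch:

{
  "cache_config": {
    "numPools": 2,
    "poolSizes": [0.3, 0.7]
  },
  "test_config": {
    "keyPoolDistribution": [0.4, 0.6],
    "opPoolDistribution": [0.5, 0.5]
  }
}

Here 40% of the keys map to the first pool, while operations are split evenly across both.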
Allocation sizes
You can specify custom allocation sizes by passing in an allocSizes array. If allocSizes is not present, cachebench uses default allocation sizes growing by a factor of 1.5, from 64 bytes to 1MB. To control allocation sizes through the growth factor instead, specify allocFactor as a double and set minAllocSize and maxAllocSize.
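For instance, to generate allocation sizes growing by a factor of 2 between 256 bytes and 1MB, a sketch:

{
  "cache_config": {
    "allocFactor": 2.0,
    "minAllocSize": 256,
    "maxAllocSize": 1048576
  }
}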
Access config parameters
CacheLib uses a hash table to index keys, and its configuration can have a big impact on throughput. htBucketPower controls the number of hash table buckets and htLockPower configures the number of locks; both are expressed as powers of two. Usually, these should be configured in conjunction with the observed numItems in DRAM once the cache warms up.
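As a sizing sketch, the bucket count is 2^htBucketPower, so a cache expected to hold on the order of 100 million items might use (values illustrative):

{
  "cache_config": {
    "htBucketPower": 27,
    "htLockPower": 10
  }
}

This yields 2^27 (about 134 million) buckets.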
Pool rebalancing
To enable cachelib pool rebalancing techniques, set poolRebalanceIntervalSec. The default strategy is to randomly release a slab, to test for correctness. You can configure this to your preference by setting rebalanceStrategy to "tail-age" or "hits". You can also specify rebalanceMinSlabs and rebalanceDiffRatio to configure this further, per the Pool rebalancing guide.
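For example, to rebalance every 10 seconds using the hits-based strategy (values illustrative):

{
  "cache_config": {
    "poolRebalanceIntervalSec": 10,
    "rebalanceStrategy": "hits",
    "rebalanceMinSlabs": 2,
    "rebalanceDiffRatio": 0.25
  }
}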
Hybrid cache parameters
Hybrid cache parameters are configured under the cache_config section. To enable hybrid cache for cachebench, specify a non-zero value for the nvmCacheSizeMB parameter.
Storage file/device/directory path info
You can configure hybrid cache in multiple modes. By default, if you set only nvmCacheSizeMB and nothing else, cachebench will use an in-memory file device for simplicity. This is often used to test correctness quickly. To use an actual non-volatile medium, you can configure nvmCachePaths, which is taken as an array of strings.
If nvmCachePaths is set to a single-element array that is a directory, cachebench will create a suitable file inside that path and clean it up upon exit. If instead nvmCachePaths is a single-element array referring to a file or a raw device, cachebench will use it as is and leave it as is upon exit. If the specified file is a regular file and is not of the specified size, CacheLib will try to fallocate it to the necessary size. If more than one path is specified, CacheLib will use software RAID-0 across them and treat each file to be of size nvmCacheSizeMB. By default, CacheLib uses direct I/O.
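For example, a hybrid cache backed by a single raw device could be sketched as follows (the device path is illustrative):

{
  "cache_config": {
    "cacheSizeMB": 512,
    "nvmCacheSizeMB": 4096,
    "nvmCachePaths": ["/dev/nvme0n1"]
  }
}

Pointing nvmCachePaths at a directory instead lets cachebench create and clean up the backing file for you.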
Monitoring write amplification
CacheBench can monitor the write-amplification of supported underlying devices if you specify them through writeAmpDeviceList as an array of device paths. If the device is unsupported, an exception is logged, but the test proceeds. If this is empty, no monitoring is performed.
Storage engine parameters
Set the following parameters to control the performance of the hybrid cache storage engine. See Hybrid Cache for more details.
- navyReaderThreads and navyWriterThreads: Control the reader and writer thread pools.
- navyAdmissionWriteRateMB: Throttle limit for the logical write rate, to maintain the device endurance limit.
- navyMaxConcurrentInserts: Throttle limit for in-flight hybrid cache writes.
- navyParcelMemoryMB: Throttle limit for the memory footprint of in-flight writes.
- navyDataChecksum: Enables checksumming data in addition to the headers.
- navyEncryption: Enables transparent device-level encryption.
- navyReqOrderShardsPower: Number of shards used for request ordering, expressed as a power of two. The default is 21, corresponding to about 2 million shards. More shards mean fewer false positives and better concurrency, but the benefit plateaus beyond a certain number.
- truncateItemToOriginalAllocSizeInNvm: Truncates items to their allocated size to optimize write performance.
- deviceMaxWriteSize: Controls the largest IO size written to the device. Any IO above this size will be split into multiple IOs.
- deviceEnableFDP: Enables the use of FDP in Navy, which segregates BigHash and BlockCache writes to the SSD.
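As a sketch, the following sizes the Navy thread pools and throttles in-flight writes (values illustrative):

{
  "cache_config": {
    "navyReaderThreads": 32,
    "navyWriterThreads": 16,
    "navyMaxConcurrentInserts": 1000000,
    "navyParcelMemoryMB": 1024
  }
}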
Small item engine parameters
Use the following options to tune the performance of the small item engine (BigHash). BigHash operates as a FIFO cache on SSD and is optimized for caching small objects.
- navySmallItemMaxSize: Object size threshold for the small item engine.
- navyBigHashSizePct: When non-zero, enables the small item engine and sets its relative size.
- navyBigHashBucketSize: Bucket size for the small item engine.
- navyBloomFilterPerBucketSize: Size in bytes of the bloom filter per bucket.
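A sketch that dedicates half of the flash cache to BigHash for objects up to 640 bytes (values illustrative):

{
  "cache_config": {
    "navyBigHashSizePct": 50,
    "navySmallItemMaxSize": 640,
    "navyBigHashBucketSize": 4096,
    "navyBloomFilterPerBucketSize": 8
  }
}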
Large item engine parameters
Use the following options to tune the large item engine (BlockCache). BlockCache is designed for caching objects that are around or larger than the device block size. It supports a variety of eviction policies, including FIFO, LRU, and segmented FIFO, and can operate in stacked mode or with size classes.
- navyBlockSize: Underlying device block size for IO alignment.
- navySegmentedFifoSegmentRatio: By default, Navy uses a coarse-grained LRU. To use FIFO, set this parameter to an array with a single value. To use segmented FIFO, configure the number of segments by specifying their ratios.
- navyHitsReinsertionThreshold: Controls the threshold for reinserting items based on their number of hits.
- navyProbabilityReinsertionThreshold: Controls probability-based reinsertion of items.
- navyNumInmemBuffers: Number of in-memory buffers used to optimize write performance.
- navyCleanRegions: When un-buffered, the size of the clean regions pool.
- navyRegionSizeMB: Controls the region size used for BlockCache. If not specified, 16MB will be used. See Configure HybridCache for more details.
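For example, configuring BlockCache with 16MB regions and a two-segment segmented FIFO, where the second segment is three times the size of the first (values illustrative):

{
  "cache_config": {
    "navyRegionSizeMB": 16,
    "navySegmentedFifoSegmentRatio": [1, 3],
    "navyNumInmemBuffers": 30
  }
}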