Keys vs Values: what is swapped out?
Virtual Memory technical specificationThis document details the internals of the Redis Virtual Memory subsystem. The intended audience is not the final user but programmers willing to understand or modify the Virtual Memory implementation. Keys vs Values: what is swapped out?The goal of the VM subsystem is to free memory transferring Redis Objects from memory to disk. This is a very generic command,but specifically,Redis transfers only objects associated with?values. In order to understand better this concept we'll show,using the DEBUG command,how a key holding a value looks from the point of view of the Redis internals:
As you can see from the above output,the Redis top level hash table maps Redis Objects (keys) to other Redis Objects (values). The Virtual Memory is only able to swap?values?on disk,the objects associated to?keys?are always taken in memory: this trade off guarantees very good lookup performances,as one of the main design goals of the Redis VM is to have performances similar to Redis with VM disabled when the part of the dataset frequently used fits in RAM. How does a swapped value looks like internallyWhen an object is swapped out,this is what happens in the hash table entry:
So you may wonder where we store the information that a given value (associated to a given key) was swapped out. Just in the key object! This is how the Redis Object structure?robj?looks like:
As you can see there are a few fields about VM. The most important one is?storage,that can be one of this values:
If an object is swapped on disk (REDISVMSWAPPED or REDISVMLOADING),how do we know where it is stored,what type it is,and so forth? That's simple: the?vtype?field is set to the original type of the Redis object swapped,while the?vm?field (that is a?redisObjectVM?structure) holds information about the location of the object. This is the definition of this additional structure:
As you can see the structure contains the page at which the object is located in the swap file,the number of pages used,and the last access time of the object (this is very useful for the algorithm that select what object is a good candidate for swapping,as we want to transfer on disk objects that are rarely accessed). As you can see,while all the other fields are using unused bytes in the old Redis Object structure (we had some free bit due to natural memory alignment concerns),the?vm?field is new,and indeed uses additional memory. Should we pay such a memory cost even when VM is disabled? No! This is the code to create a new Redis Object:
As you can see if the VM system is not enabled we allocate just? The Swap FileThe next step in order to understand how the VM subsystem works is understanding how objects are stored inside the swap file. The good news is that's not some kind of special format,we just use the same format used to store the objects in .rdb files,that are the usual dump files produced by Redis using the SAVE command. The swap file is composed of a given number of pages,where every page size is a given number of bytes. This parameters can be changed in redis.conf,since different Redis instances may work better with different values: it depends on the actual data you store inside it. The following are the default values:
Redis takes a "bitmap" (an contiguous array of bits set to zero or one) in memory,every bit represent a page of the swap file on disk: if a given bit is set to 1,it represents a page that is already used (there is some Redis Object stored there),while if the corresponding bit is zero,the page is free. Taking this bitmap (that will call the page table) in memory is a huge win in terms of performances,and the memory used is small: we just need 1 bit for every page on disk. For instance in the example below 134217728 pages of 32 bytes each (4GB swap file) is using just 16 MB of RAM for the page table. Transferring objects from memory to swapIn order to transfer an object from memory to disk we need to perform the following steps (assuming non threaded VM,just a simple blocking approach):
As you can guess once the object was correctly written in the swap file,it is freed from memory,the storage field in the associated key is set to REDISVMSWAPPED,and the used pages are marked as used in the page table. Loading objects back in memoryLoading an object from swap to memory is simpler,as we already know where the object is located and how many pages it is using. We also know the type of the object (the loading functions are required to know this information,as there is no header or any other information about the object type on disk),but this is stored in the?vtype?field of the associated key as already seen above. Calling the function? The return value of the function is the loaded Redis Object itself,that we'll have to set again as value in the main hash table (instead of the NULL value we put in place of the object pointer when the value was originally swapped out). How blocking VM worksNow we have all the building blocks in order to describe how the blocking VM works. First of all,an important detail about configuration. In order to enable blocking VM in Redis? We also need to introduce another important VM parameter,? Blocking VM swappingSwapping of object from memory to disk happens in the cron function. This function used to be called every second,while in the recent Redis versions on git it is called every 100 milliseconds (that is,10 times per second). If this function detects we are out of memory,the memory used is greater than the vm-max-memory setting,it starts transferring objects from memory to disk in a loop calling the function? vmSwapOneObject acts performing the following steps:
The function is called again and again until one of the following happens: there is no way to swap more objects because either the swap file is full or nearly all the objects are already transferred on disk,or simply the memory usage is already under the vm-max-memory parameter. What values to swap when we are out of memory?Understanding what's a good candidate for swapping is not too hard. A few objects at random are sampled,and for each their?swappability?is commuted as:
The age is the number of seconds the key was not requested,while sizeinmemory is a fast estimation of the amount of memory (in bytes) used by the object in memory. So we try to swap out objects that are rarely accessed,and we try to swap bigger objects over smaller one,but the latter is a less important factor (because of the logarithmic function used). This is because we don't want bigger objects to be swapped out and in too often as the bigger the object the more I/O and CPU is required in order to transfer it. Blocking VM loadingWhat happens if an operation against a key associated with a swapped out object is requested? For instance Redis may just happen to process the following command:
If the value object of the? So this is what happens:
This is pretty straightforward,but things will get more?interesting?with the threads. From the point of view of the blocking VM the only real problem is the saving of the dataset using another process,handling BGSAVE and BGREWRITEAOF commands. Background saving when VM is activeThe default Redis way to persist on disk is to create .rdb files using a child process. Redis calls the fork() system call in order to create a child,that has the exact copy of the in memory dataset,since fork duplicates the whole program memory space (actually thanks to a technique called Copy on Write memory pages are shared between the parent and child process,so the fork() call will not require too much memory). In the child process we have a copy of the dataset in a given point in the time. Other commands issued by clients will just be served by the parent process and will not modify the child data. The child process will just store the whole dataset into the dump.rdb file and finally will exit. But what happens when the VM is active? Values can be swapped out so we don't have all the data in memory,and we need to access the swap file in order to retrieve the swapped values. While child process is saving the swap file is shared between the parent and child process,since:
In order to avoid problems while both the processes are accessing the same swap file we do a simple thing,not allowing values to be swapped out in the parent process while a background saving is in progress. This way both the processes will access the swap file in read only. This approach has the problem that while the child process is saving no new values can be transferred on the swap file even if Redis is using more memory than the max memory parameters dictates. This is usually not a problem as the background saving will terminate in a short amount of time and if still needed a percentage of values will be swapped on disk ASAP. An alternative to this scenario is to enable the Append Only File that will have this problem only when a log rewrite is performed using the BGREWRITEAOF command. The problem with the blocking VMThe problem of blocking VM is that... it's blocking :) This is not a problem when Redis is used in batch processing activities,but for real-time usage one of the good points of Redis is the low latency. The blocking VM will have bad latency behaviors as when a client is accessing a swapped out value,or when Redis needs to swap out values,no other clients will be served in the meantime. Swapping out keys should happen in background. Similarly when a client is accessing a swapped out value other clients accessing in memory values should be served mostly as fast as when VM is disabled. Only the clients dealing with swapped out keys should be delayed. All this limitations called for a non-blocking VM implementation. Threaded VMThere are basically three main ways to turn the blocking VM into a non blocking one. * 1: One way is obvious,and in my opionion,not a good idea at all,turning Redis itself into a theaded server: if every request is served by a different thread automatically other clients don't need to wait for blocked ones. Redis is fast,exports atomic operations,has no locks,and is just 10k lines of code,?because?it is single threaded,so this was not an option for me. * 2: Using non-blocking I/O against the swap file. After all you can think Redis already event-loop based,why don't just handle disk I/O in a non-blocking fashion? I also discarded this possiblity because of two main reasons. One is that non blocking file operations,unlike sockets,are an incompatibility nightmare. It's not just like calling select,you need to use OS-specific things. The other problem is that the I/O is just one part of the time consumed to handle VM,another big part is the CPU used in order to encode/decode data to/from the swap file. This is I picked option three,that is... * 3: Using I/O threads,a pool of threads handling the swap I/O operations. This is what the Redis VM is using,so let's detail how this works. I/O ThreadsThe threaded VM design goals where the following,in order of importance:
The above goals resulted in an implementation where the Redis main thread (the one serving actual clients) and the I/O threads communicate using a queue of jobs,with a single mutex. Basically when main thread requires some work done in the background by some I/O thread,it pushes an I/O job structure in the? This is how the?
There are just three type of jobs that an I/O thread can perform (the type is specified by the?
The main thread delegates just the above three tasks. All the rest is handled by the main thread itself,for instance finding a suitable range of free pages in the swap file page table (that is a fast operation),deciding what object to swap,altering the storage field of a Redis object to reflect the current state of a value. Non blocking VM as probabilistic enhancement of blocking VMSo now we have a way to request background jobs dealing with slow VM operations. How to add this to the mix of the rest of the work done by the main thread? While blocking VM was aware that an object was swapped out just when the object was looked up,this is too late for us: in C it is not trivial to start a background job in the middle of the command,leave the function,and re-enter in the same point the computation when the I/O thread finished what we requested (that is,no co-routines or continuations or alike). Fortunately there was a much,much simpler way to do this. And we love simple things: basically consider the VM implementation a blocking one,but add an optimization (using non the no blocking VM operations we are able to perform) to make the blocking?very?unlikely. This is what we do:
So you can think at this as a blocked VM that almost always happen to have the right keys in memory,since we pause clients that are going to issue commands about swapped out values until this values are loaded. If the function checking what argument is a key fails in some way,there is no problem: the lookup function will see that a given key is associated to a swapped out value and will block loading it. So our non blocking VM reverts to a blocking one when it is not possible to anticipate what keys are touched. For instance in the case of the SORT command used together with the GET or BY options,it is not trivial to know beforehand what keys will be requested,so at least in the first implementation,SORT BY/GET resorts to the blocking VM implementation. Blocking clients on swapped keysHow to block clients? To suspend a client in an event-loop based server is pretty trivial. All we do is canceling its read handler. Sometimes we do something different (for instance for BLPOP) that is just marking the client as blocked,but not processing new data (just accumulating the new data into input buffers). Aborting I/O jobsThere is something hard to solve about the interactions between our blocking and non blocking VM,what happens if a blocking operation starts about a key that is also "interested" by a non blocking operation at the same time? For instance while SORT BY is executed,a few keys are being loaded in a blocking manner by the sort command. At the same time,another client may request the same keys with a simple?GET key?command,that will trigger the creation of an I/O job to load the key in background. The only simple way to deal with this problem is to be able to kill I/O jobs in the main thread,so that if a key that we want to load or swap in a blocking way is in the REDISVMLOADING or REDISVMSWAPPING state (that is,there is an I/O job about this key),we can just kill the I/O job about this key,and go ahead with the blocking operation we want to perform. This is not as trivial as it is. In a given moment an I/O job can be in one of the following three queues:
Questions?This document is in no way complete,the only way to get the whole picture is reading the source code,but it should be a good introduction in order to make the code review / understanding a lot simpler. Something is not clear about this page? Please leave a comment and I'll try to address the issue possibly integrating the answer in this document. (编辑:李大同) 【声明】本站内容均来自网络,其相关言论仅代表作者个人观点,不代表本站立场。若无意侵犯到您的权利,请及时与联系站长删除相关内容! |