NBS System is the first Magento Enterprise hosting provider in France, which gives our system administrators the opportunity to explore this solution in detail, across many configurations.
One of them, Julien Khamis, discovered two weeks ago a bug affecting Magento platforms that use Redis and several Apache services.
Context and symptoms:
During a Magento cache flush on platforms using Redis for cache storage, we observe a sharp rise in load on the front-end servers. This load is caused by Apache processes, which mainly consume CPU time.
It should be noted that platforms with only one Apache service are not impacted by the bug.
When its cache is flushed, Magento sets a lock signalling that the cache is expected to be unavailable because it has just been flushed, and that its regeneration is already under way.
Usually, as on platforms using Memcache, this lock is set in MySQL. This action is cheap, instantaneous and, most importantly, independent of the cache storage. The lock is thus immediately taken into account by all Apache processes trying to serve a request: they simply wait for the cache to be created again, then send the requested pages to the website’s visitors, using the new cache.
Lock creation with Redis
When a platform uses Redis to store its cache, the process is different. After the cache flush, when the first request needing access to the cache reaches a front-end server, the Apache process serving this request goes through the following steps:
- Ask Redis if the cache exists. In our case, it does not, so the process goes on to step 2.
- Ask Redis if there is a lock. Since it is the first request after the cache flush, the lock does not exist yet; the process goes on to step 3.
- Have Redis create a lock.
- At the same time, the Apache process starts re-creating the cache. To do so, it reads all the .xml files in Magento’s configuration directory (app/etc/). This lets the process generate the configuration keys (vital to the display of all pages) and concatenate them into a single file in Magento’s cache.
- Once the cache is created, insert the cache into Redis, and finally…
- Ask Redis to withdraw the lock.
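The six steps above can be sketched as follows. This is only an illustration, not Magento's actual code: `FakeStore` is a hypothetical in-memory stand-in for Redis, and the key names `config_cache` and `config_lock` are assumptions made for the example. The point to notice is that "is there a lock?" and "create a lock" are two separate commands.

```python
class FakeStore:
    """Hypothetical in-memory stand-in for Redis (not the real client API)."""

    def __init__(self):
        self.data = {}
        self.log = []  # commands, in the order the store processed them

    def exists(self, key):
        self.log.append(f"EXISTS {key}")
        return key in self.data

    def set(self, key, value):
        self.log.append(f"SET {key}")
        self.data[key] = value

    def delete(self, key):
        self.log.append(f"DEL {key}")
        self.data.pop(key, None)


def build_config_cache():
    # Placeholder for reading and merging the .xml files under app/etc/.
    return "<config>merged</config>"


def handle_request(store):
    if store.exists("config_cache"):   # step 1: does the cache exist?
        return "served from cache"
    if store.exists("config_lock"):    # step 2: is there a lock?
        return "waiting for cache"
    store.set("config_lock", "1")      # step 3: create the lock
    cache = build_config_cache()       # step 4: re-create the cache
    store.set("config_cache", cache)   # step 5: insert the cache
    store.delete("config_lock")        # step 6: withdraw the lock
    return "rebuilt cache"


store = FakeStore()
print(handle_request(store))
print(store.log)
```

With a single process the sequence is harmless. Nothing in it, however, prevents another process from asking about the lock between steps 2 and 3.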
The problem stems from one of Redis’s characteristics: it is a single-threaded service, which means it can only process one command at a time. The requests it receives are therefore placed in a processing queue.
Thus, between steps 1 and 3, a certain amount of time passes; enough for many clients to send requests that need the cache to exist. To illustrate, let us imagine there are three requests. The Apache processes handling these requests follow the same sequence as the first one: they ask whether the cache exists, then whether a lock exists…
There lies the problem: the first request to create a lock has still not been processed by Redis and is still in the queue. A request loop has been created in Redis. With three requests, the queue could look like this:
- Cache 1?
- Cache 2?
- Lock 1?
- Cache 3?
- Lock 2?
- Create lock 1.
- Lock 3?
- Create lock 2.
Let us follow the second request: its question about the existence of a lock was asked before the first lock was created. Thus, even though that lock has since been created, the Apache process will still ask Redis to create a lock, which will eventually replace the first one.
It is only from request 3 onwards that Redis’ answer to the question “Lock?” will be “yes”: for all subsequent requests, the Apache processes will simply wait for the final cache. This also consumes CPU time.
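The three-request queue above can be replayed in a few lines to check which processes end up creating a lock. This is a simplified model, not a trace from a real server: the command names are invented for the example, and the "cache?" questions are ignored because the cache is missing for the whole sequence.

```python
def replay(queue):
    """Process commands one at a time, as a single-threaded server would."""
    lock_exists = False
    outcome = {}
    for command, request_id in queue:
        if command == "lock?":
            # The answer reflects the state at the moment the question is processed.
            outcome[request_id] = "waits" if lock_exists else "will create lock"
        elif command == "create_lock":
            lock_exists = True
        # "cache?" always gets the answer "no" here, so it changes nothing.
    return outcome


queue = [
    ("cache?", 1), ("cache?", 2), ("lock?", 1), ("cache?", 3),
    ("lock?", 2), ("create_lock", 1), ("lock?", 3), ("create_lock", 2),
]
result = replay(queue)
print(result)
# → {1: 'will create lock', 2: 'will create lock', 3: 'waits'}
```

Requests 1 and 2 both go on to create a lock and rebuild the cache; only request 3 sees the lock and waits.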
This is one of the causes of the rise in server load. On the one hand, the numerous Apache processes (one per request) have to wait for Redis’ answer before starting another action, and this response time depends on the number of requests Redis has to deal with before reaching theirs. On the other hand, the requests that do not have to create a lock are waiting for the cache. This wait (I/O wait) consumes CPU time.
The second major cause of the rise in server load is linked to the re-creation of the cache itself. When an Apache process manages to create a lock in Redis, it starts generating the new cache at the same time. To do that, as mentioned earlier, it reads the .xml files of Magento’s configuration directory. However, these files are only accessible by one process at a time. Not only does reading these files take time and resources, but the many processes that do not yet have access to them are waiting, which, once again, consumes resources. Since all the Apache processes of the loop described earlier are trying to read the .xml files at the same time, the rise in load is particularly sharp.
And it is not over yet…
Once an Apache process has gathered the information from the .xml files, and thus re-created the cache, it inserts it into Redis and asks the service to withdraw the lock. It is then able to answer the client’s initial request, and the process stops.
This does not, however, stop the thousands of other processes doing exactly the same thing… All the Apache processes that put a lock on Redis will also insert the cache they re-created, and ask Redis to withdraw the lock. Once again, these requests are queued, like so:
- Insert cache 1
- Insert cache 2
- Withdraw lock 1
- Withdraw lock 2
That means that even if a process managed to insert its cache into Redis, it will not stop and free its resources at the same moment, since Redis will have other requests to deal with before withdrawing its lock and thus “freeing” the process.
The icing on the cake is that all processes that received a “yes” to their question about the existence of a lock will also regularly send requests to Redis, asking it to serve them the cache. More steps in the processing queue…
It can take hours for all the Apache processes to finish. Of course, since the cache is inaccessible from the moment it is flushed until the last lock is withdrawn, the website is unavailable for all this time…
To stop this cache re-creation loop, the system administrator, with the client’s approval, can manually stop the Apache services on all the front-end servers except one. This limits the number of Apache processes running at the same time, so Redis reaches the last lock withdrawal request, the one that marks the availability of the cache, much faster. The load then goes down, and the other Apache services can be started again.
The only fix available to date is to place the lock outside Redis, as is the case with Memcache. It can be placed in a database, or in any other location accessible by all the front-end servers at the same time, such as the shared media storage. The lock will thus be visible to all Apache processes, which prevents a cache re-creation loop from starting.
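As an illustration of what a database-backed lock could look like, here is a minimal sketch. It is not the patch that was deployed: `sqlite3` stands in for the shared database, and the table name `cache_lock` and the function names are hypothetical. The lock is taken with a single INSERT on a primary key, so checking for the lock and creating it happen in one statement.

```python
import sqlite3


def make_lock_table(conn):
    # Hypothetical table: one row per held lock, the name being unique.
    conn.execute("CREATE TABLE IF NOT EXISTS cache_lock (name TEXT PRIMARY KEY)")
    conn.commit()


def try_acquire(conn, name):
    """Return True if this caller obtained the lock, False if it is already held."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO cache_lock (name) VALUES (?)", (name,))
        return True
    except sqlite3.IntegrityError:
        # The primary key already exists: another process holds the lock.
        return False


def release(conn, name):
    with conn:
        conn.execute("DELETE FROM cache_lock WHERE name = ?", (name,))


conn = sqlite3.connect(":memory:")
make_lock_table(conn)
print(try_acquire(conn, "config"))  # first caller gets the lock
print(try_acquire(conn, "config"))  # second caller is refused and should wait
release(conn, "config")
```

In this sketch, only the caller that receives `True` would rebuild the cache; the others would wait for it to appear.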
This modification can be made by the website developer (for instance the development agency), and has no negative impact on the website. A partner with a good knowledge of Magento put this fix in place for one of our clients in less than a day, and the website did not experience any slowdown following the change.
For more information, do not hesitate to contact us.
Source: Julien Khamis