Real-World Performance of Python LRU Caching
Within the functools module of the standard library, there is a treasure trove of easy-to-implement decorators to improve your functions. One of these decorators is the least-recently-used (LRU) cache. This handy function wrapper remembers the return values of the last x distinct function calls, which is handy when you have a time-expensive function that is regularly called with the same parameters.
In IO-bound scenarios this makes a lot of sense, such as retrieving data from a remote server. The limitation is that the function's return value must be time-invariant, as there is no built-in aging for the cache and entries last for the lifetime of your execution. In addition, the call arguments to the function must be hashable (and therefore effectively immutable), which in some unusual cases may be a problem to accommodate.
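As a minimal sketch of what that looks like (the remote-fetch function here is a made-up example, not part of my project), applying the cache is a one-line change:

```python
from functools import lru_cache
import urllib.request

@lru_cache(maxsize=128)  # remember the 128 most recent distinct calls
def fetch_page(url: str) -> bytes:
    # IO-bound and assumed time-invariant: the response for a given URL
    # is treated as constant for the lifetime of the process.
    with urllib.request.urlopen(url) as response:
        return response.read()
```

The url argument is a string, so it is hashable and the cache can key on it directly.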
But what if you don't have a niche time-invariant IO-bound scenario: how much extra performance can you get from caching the output of a simple algorithmic helper function? Well, that's what I am going to find out today.
I am currently developing a locally hosted web-app, where the back-end server communicates with micro-controllers using my UART OS (https://nulltek.xyz/wiki/doku.php?id=uart_aos). In this paradigm, distinct binary packets and checksums need to be computed at runtime based upon the operations the user is requesting. In most cases the cost of doing this is low, but it is extremely repetitive, because most applications of this system involve the same commands being executed frequently. Add in support for the automation-API layer, where commands are scheduled by a script, and perhaps we do have an appropriate use case for the LRU cache.
There are two major functions involved in producing a packet: get_npc_packet and get_npc_checksum, the former calling the latter as part of its execution. Now while a 1-byte longitudinal-redundancy checksum only has 256 possible output values, there is an infinite number of input conditions, so by itself it is not a good candidate for LRU caching. Instead I will cache the higher-level packet-generation function, which will implicitly cache the result of the LRC function.
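The real packet layout and checksum are specific to UART OS, so treat the following as a rough sketch of the arrangement rather than the actual implementation: the packet structure and the two's-complement LRC here are assumptions, and the important part is simply where the decorator sits.

```python
from functools import lru_cache

def get_npc_checksum(data: bytes) -> int:
    # Assumed 1-byte longitudinal redundancy check: two's complement of the
    # 8-bit sum of the payload. The real UART OS checksum may differ.
    return (-sum(data)) & 0xFF

@lru_cache(maxsize=None)  # cache every distinct packet requested this session
def get_npc_packet(address: int, command: bytes) -> bytes:
    # Assumed layout: address byte + command payload + checksum byte.
    # Both arguments are hashable, so the call signature can be cached.
    body = bytes([address]) + command
    return body + bytes([get_npc_checksum(body)])
```

Because the checksum is only ever computed inside get_npc_packet, caching the outer function means the LRC is never recomputed for a repeated packet.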
So we spin up 4 example packets, then iteratively call the get_npc_packet function with that sequence n times, averaging the result of 100 attempts to reduce noise. This should show the increasing benefit of caching as n increases.
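The timing harness is roughly the sketch below, reusing get_npc_packet from above; the example addresses and command bytes are placeholders rather than real UART OS commands.

```python
import time

# Placeholder (address, command) pairs standing in for 4 distinct packets.
EXAMPLE_CALLS = [
    (0x01, b"\x10\x00"),
    (0x01, b"\x11\x01"),
    (0x02, b"\x10\x00"),
    (0x03, b"\x20\xff"),
]

def benchmark(n: int, attempts: int = 100) -> float:
    """Mean time to build n packets, cycling through the example calls."""
    total = 0.0
    for _ in range(attempts):
        start = time.perf_counter()
        for i in range(n):
            address, command = EXAMPLE_CALLS[i % len(EXAMPLE_CALLS)]
            get_npc_packet(address, command)
        total += time.perf_counter() - start
    return total / attempts
```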
| Method | n=100 | n=1,000 | n=10,000 | n=100,000 | n=1,000,000 |
|---|---|---|---|---|---|
| No Caching | 0.0013s | 0.0094s | 0.0908s | 0.8513s | 8.4864s |
| LRU Caching | 0.0002s | 0.0014s | 0.0129s | 0.1201s | 1.2005s |
| Gain | 6.5 x | 6.7 x | 7 x | 7.1 x | 7.2 x |
These results were surprising to me. While there are significant performance improvements across the board, the gain didn't rise with n nearly as much as I would have expected. I believe the explanation for this is the relatively low time-cost of executing the internals of this function. I suspect it means the implicit work the Python interpreter does just to dispatch and execute the function call ends up dominating the runtime.
So 650%-720% performance increases sound great, until you realize the function call only runs on the order of 10⁻⁶ seconds, and those gains will never be noticed by a user. I'll likely still keep the implementation in place, as the trade-off in memory usage is not significant. In most cases users are unlikely to ever execute more than 100 distinct commands, and it's nice to know that behind the scenes LRU caching is saving the world one ALU cycle at a time.
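And if you want to see it at work, functools exposes hit and miss counters on the wrapped function:

```python
info = get_npc_packet.cache_info()   # CacheInfo(hits=..., misses=..., maxsize=..., currsize=...)
print(info)
get_npc_packet.cache_clear()         # reset the cache between benchmark configurations
```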
I'll sign off by acknowledging that such a low-cost function isn't the intended use case for this tooling, and much more significant performance improvements can be achieved on higher time-cost functions. It's really worth benchmarking the performance of chunks of your code, since you can often be surprised by the efficiency (or inefficiency) of different implementations.