[SPARK-56067][PYTHON] Lazy import psutil to improve import speed#54897
gaogaotiantian wants to merge 1 commit into apache:master
Conversation
```python
def get_used_memory():
    """Return the used memory in MiB"""
    try:
        import psutil
```
I believe this will result in the try...except block getting executed multiple times (the import runs each time the function is called).
We could use a strategy that lazily imports, but caches the import in a global variable instead of having Python's import system cache it.
```python
_psutil = None
_psutil_checked = False

def get_used_memory():
    global _psutil, _psutil_checked
    if not _psutil_checked:
        try:
            import psutil
            _psutil = psutil
        except ImportError:
            pass
        _psutil_checked = True
    if _psutil is not None:
        ...  # psutil path
    else:
        ...  # fallback path
```
The import cache check is basically a dict lookup in sys.modules, which is super fast compared to the rest of the function. I don't think it's worth making the code more complicated. Getting memory usage is expensive (it normally involves IO), so optimizing the rest of this function does not gain us much. I prefer to keep the code simple unless we have proof that the performance difference is observable.
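The cached-import behavior this reply relies on can be seen with a small micro-benchmark (a sketch using the stdlib `json` module as a stand-in for `psutil`): after the first execution, an `import` inside a function is just a `sys.modules` lookup plus a name binding.

```python
import sys
import timeit

def lazy_import():
    # After the first execution, this statement is a sys.modules
    # cache hit, not a fresh module load from disk.
    import json
    return json

lazy_import()  # warm the cache once

# Re-running the import 100k times only exercises the dict lookup.
elapsed = timeit.timeit(lazy_import, number=100_000)
print(f"100k cached imports: {elapsed:.4f}s")

# The function hands back the exact cached module object.
print(lazy_import() is sys.modules["json"])  # True
```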
What changes were proposed in this pull request?
Instead of `import psutil` at module level, we do it in the function to avoid always triggering the import when we `import pyspark`.

Why are the changes needed?
`psutil` itself is relatively fast to import, but it's also unnecessary. More importantly, it's the only outlier after `numpy` and `memory_profiler`. After getting rid of it, we can write a test to check if we import any 3rd-party library when we do `import pyspark`, to prevent similar issues in the future.

Does this PR introduce any user-facing change?
No.
How was this patch tested?
`test_shuffle` passed locally, the rest is on CI.

Was this patch authored or co-authored using generative AI tooling?
No.
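The guard test mentioned under "Why are the changes needed?" is not part of this diff; a sketch of the idea (module names and the subprocess approach are assumptions, and `json` stands in for `import pyspark` so the snippet runs anywhere) could look like:

```python
import subprocess
import sys

# Run the import in a fresh interpreter so this process's own imports
# don't pollute the check.
probe = """
import sys
import json  # stand-in for `import pyspark`
leaked = [m for m in ("psutil", "numpy", "memory_profiler") if m in sys.modules]
print(",".join(leaked) if leaked else "OK")
"""

result = subprocess.run(
    [sys.executable, "-c", probe], capture_output=True, text=True, check=True
)
print(result.stdout.strip())  # "OK" when no 3rd-party module was pulled in
```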