Article: Two Misconfigurations That Caused Spark OOM Failures on Kubernetes
InfoQ AI/ML • Generative AI
By Prana Bhasker

After migrating Spark pipelines to Azure Kubernetes Service, two infrastructure settings interacted destructively: spark.kubernetes.local.dirs.tmpfs=true backed shuffle spill with RAM instead of disk, and a hard podAffinity rule forced all executors onto one node. Together, they caused repeated OOM kills that were invisible to standard diagnostics.
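
As a rough illustration of the two settings named in the summary, the sketch below scans a Spark configuration and an executor pod spec for both of them. It is not the author's tooling: the function name `find_oom_risks`, the dictionary layout, and the sample values are assumptions made for this example. The only names taken from real systems are the Spark key `spark.kubernetes.local.dirs.tmpfs` and the Kubernetes field `requiredDuringSchedulingIgnoredDuringExecution`, which is how a hard affinity rule is expressed in a pod spec.

```python
def find_oom_risks(spark_conf, pod_spec):
    """Return warnings for the two settings described in the summary.

    Hypothetical helper: spark_conf is a plain dict of Spark properties,
    pod_spec is the 'spec' section of an executor pod template as a dict.
    """
    warnings = []

    # Spark properties are strings; "true" mounts local dirs as RAM-backed tmpfs.
    tmpfs = str(spark_conf.get("spark.kubernetes.local.dirs.tmpfs", "false"))
    if tmpfs.lower() == "true":
        warnings.append("tmpfs local dirs: shuffle spill is held in RAM, not on disk")

    # A hard (non-negotiable) affinity rule sits under the 'required...' field.
    pod_affinity = pod_spec.get("affinity", {}).get("podAffinity", {})
    if pod_affinity.get("requiredDuringSchedulingIgnoredDuringExecution"):
        warnings.append("hard podAffinity: matching pods must be co-located")

    return warnings


# Illustrative inputs only; the label selector and topology key are assumed values.
conf = {"spark.kubernetes.local.dirs.tmpfs": "true"}
pod_spec = {
    "affinity": {
        "podAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": [
                {
                    "labelSelector": {"matchLabels": {"spark-role": "executor"}},
                    "topologyKey": "kubernetes.io/hostname",
                }
            ]
        }
    }
}

for warning in find_oom_risks(conf, pod_spec):
    print(warning)
```

Either warning alone may be acceptable for a given workload. The summary's point is that the combination is the problem, so a check like this is most useful when it reports both together.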