If Spark data frames are processed 'in memory', isn't there a limit on the size of the frame relative to the cluster size and available RAM? That's fine if you have a cluster of many machines, but in most scenarios the cluster won't be that big. Is there some spilling to disk to handle this?
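To make the question concrete, here is a minimal PySpark sketch of the scenario I mean (the file name `events.parquet` and the column `user_id` are just placeholders, and the dataset is assumed to be much larger than the executors' combined memory). Is explicitly persisting with a disk-backed storage level like `MEMORY_AND_DISK` the intended way to handle this, or does Spark spill to disk on its own during shuffles and aggregations?

```python
from pyspark.sql import SparkSession
from pyspark import StorageLevel

spark = SparkSession.builder.appName("spill-question").getOrCreate()

# Hypothetical input, assumed to be far larger than available cluster RAM.
df = spark.read.parquet("events.parquet")

# The option I'm asking about: cache with a storage level that is allowed
# to fall back to disk when memory runs out.
df.persist(StorageLevel.MEMORY_AND_DISK)

# Does an aggregation like this still complete on an under-provisioned
# cluster, or does it fail with out-of-memory errors?
df.groupBy("user_id").count().show()
```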