Does anyone have an intuition for what would make model training fastest: splitting the model across cores to improve cache locality, or parallelizing the outer loops of individual operators?