Training throughput: model splitting vs threading

Does anyone have an intuition for which would make model training faster: splitting the model across cores to improve cache locality, or parallelizing the outer loops of individual operators so all cores cooperate on one operator at a time?
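
For concreteness, here's a minimal sketch of the second option in C with OpenMP. Everything in it is assumed for illustration: square dense layers of width `N = 1024`, `LAYERS = 4` sequential layers, and a plain row-wise matmul whose outer loop is split across cores.

```c
/* Minimal sketch of intra-op parallelism: all cores cooperate on the
 * outer loop of one layer's matmul at a time.
 * Build: cc -O2 -fopenmp sketch.c
 * N and LAYERS are hypothetical sizes chosen for illustration. */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N      1024   /* assumed layer width  */
#define LAYERS 4      /* assumed layer count  */

/* One dense layer: out = W * in. The outer (row) loop is divided
 * among all available cores. */
static void layer_forward(const float *W, const float *in, float *out) {
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < N; i++) {        /* outer loop split across cores */
        float acc = 0.0f;
        for (int j = 0; j < N; j++)
            acc += W[i * N + j] * in[j];
        out[i] = acc;
    }
}

int main(void) {
    float *W   = malloc(sizeof(float) * LAYERS * N * N);
    float *act = malloc(sizeof(float) * 2 * N);   /* ping-pong buffers */
    for (long i = 0; i < (long)LAYERS * N * N; i++) W[i] = 0.001f;
    for (int i = 0; i < N; i++) act[i] = 1.0f;

    double t0 = omp_get_wtime();
    for (int l = 0; l < LAYERS; l++)     /* layers run sequentially */
        layer_forward(W + (long)l * N * N,
                      act + (l % 2) * N,
                      act + ((l + 1) % 2) * N);
    double t1 = omp_get_wtime();

    printf("intra-op parallel forward pass: %.3f ms\n", (t1 - t0) * 1e3);
    free(W);
    free(act);
    return 0;
}
```

The model-splitting alternative would instead pin each layer's weights to one core, so they stay resident in that core's cache, and stream micro-batches through the layers pipeline-style. My rough intuition is that the tradeoff hinges on whether a layer's working set fits in a single core's cache, but I'd like to hear from people who have measured it.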