Training throughput: model splitting vs threading

Does anyone have an intuition for which would make model training faster: splitting the model across cores to improve cache locality, or parallelizing the outer loops of individual operators so all cores cooperate on one operator at a time?
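
For concreteness, here's a minimal sketch of the second option in C with OpenMP. Everything in it is assumed for illustration: square dense layers of width `N = 1024`, `LAYERS = 4` sequential layers, and a plain row-wise matmul whose outer loop is split across cores.

```c
/* Minimal sketch of intra-op parallelism: all cores cooperate on the
 * outer loop of one layer's matmul at a time.
 * Build: cc -O2 -fopenmp sketch.c
 * N and LAYERS are hypothetical sizes chosen for illustration. */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N      1024   /* assumed layer width  */
#define LAYERS 4      /* assumed layer count  */

/* One dense layer: out = W * in. The outer (row) loop is divided
 * among all available cores. */
static void layer_forward(const float *W, const float *in, float *out) {
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < N; i++) {        /* outer loop split across cores */
        float acc = 0.0f;
        for (int j = 0; j < N; j++)
            acc += W[i * N + j] * in[j];
        out[i] = acc;
    }
}

int main(void) {
    float *W   = malloc(sizeof(float) * LAYERS * N * N);
    float *act = malloc(sizeof(float) * 2 * N);   /* ping-pong buffers */
    for (long i = 0; i < (long)LAYERS * N * N; i++) W[i] = 0.001f;
    for (int i = 0; i < N; i++) act[i] = 1.0f;

    double t0 = omp_get_wtime();
    for (int l = 0; l < LAYERS; l++)     /* layers run sequentially */
        layer_forward(W + (long)l * N * N,
                      act + (l % 2) * N,
                      act + ((l + 1) % 2) * N);
    double t1 = omp_get_wtime();

    printf("intra-op parallel forward pass: %.3f ms\n", (t1 - t0) * 1e3);
    free(W);
    free(act);
    return 0;
}
```

The model-splitting alternative would instead pin each layer's weights to one core, so they stay resident in that core's cache, and stream micro-batches through the layers pipeline-style. My rough intuition is that the tradeoff hinges on whether a layer's working set fits in a single core's cache, but I'd like to hear from people who have measured it.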