I used PAPI to profile my models and compare the tuned and untuned versions, and found that the tuned models have a higher number of cache misses (roughly 3x to 5x) than the untuned ones. Could anyone explain the reason for this?
PS:
- The rest of the parameters, such as execution time, stalls, and instructions, show the expected output.
- I have tried this with multiple models, and all of them have higher cache misses when tuned.