Improve backward compatibility of TVM

I noticed that auto-scheduler tuning records we generated with an older version of TVM no longer work on recent TVM. After debugging the issue, I found that the tasks’ hash keys have changed because of changes in the tasks’ Relay representations, which are used to generate the hash keys. I have also previously seen models compiled with an old TVM fail to work with a new TVM.
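The failure mode can be illustrated with a toy sketch (not actual TVM code): if a task’s hash key is derived from its serialized IR, then any change in how the IR is printed changes the key, even when the computation itself is unchanged. The IR strings and the `workload_key` helper below are purely hypothetical, for illustration only.

```python
import hashlib

def workload_key(ir_text: str) -> str:
    # Toy stand-in for a task hash key: a digest of the serialized IR.
    # (In TVM, the key is derived from the task's Relay representation.)
    return hashlib.sha256(ir_text.encode("utf-8")).hexdigest()[:16]

# The same computation, serialized slightly differently by two hypothetical
# versions of the IR printer.
old_ir = "fn (%x: Tensor[(1, 3, 224, 224), float32]) { nn.conv2d(%x, meta[0]) }"
new_ir = "fn (%x: Tensor[(1, 3, 224, 224), float32]) { nn.conv2d(%x, meta[Constant][0]) }"

old_key = workload_key(old_ir)
new_key = workload_key(new_ir)

# A tuning-record lookup keyed on old_key will miss under the new version.
print(old_key == new_key)  # False: old tuning records no longer match
```

Under this model, even a cosmetic change to the text format silently invalidates every previously tuned record.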

I understand that TVM is evolving quickly. However, better backward compatibility can reduce the engineering effort for TVM-based products and can help the adoption of TVM in real products.

Could we take some steps to improve the backward compatibility of TVM? For example: creating a guideline for checking the backward compatibility of IR representations, APIs, etc.; adding backward compatibility to the code review guidelines; and requiring breaking changes to be documented in commit messages and announced to the community. These are just some initial thoughts; I hope to hear more discussion.

Hi @samwyi, my understanding is that while TVM remains a 0.x release, we aren’t requiring patches to consider backwards compatibility, although we do generally make an effort to avoid painful compatibility problems. The auto-scheduler folks could comment more on this specific issue.

Hi @samwyi! Thanks for your feedback and thank you @areusch for the clarifications.

Backward compatibility is no doubt a critical metric for any project, even one that is evolving quickly. We are trying our best to improve backward compatibility, and I’m sorry to hear that this has caused you trouble.

Yes, we recently found this as well. It is a fundamental design issue in TVM that the hashing method is inconsistent across different targets and different versions. We plan to resolve this bug if possible.

It’s a bit weird, since we have thousands of unit tests to ensure backward compatibility. Could you please share the model that failed?

Thanks for the suggestions; they all make sense to me. But I’d like to clarify a few points:

  • Check the backward compatibility of IR representations:

IR is one of the fundamental data structures of TVM. We try to keep it as stable as possible. However, some new features may require changes to the IR; in those cases we send out RFCs. The same applies to APIs.

  • adding backward compatibility to code review guidelines

This is one of the most important review metrics for the community. Also, backward compatibility is guarded by the test cases.

  • asking breaking changes to be written to commit message and announced to the community

All breaking changes are required to have an RFC, which serves as an official announcement to the community.


I see. Thanks for the clarification. Do you know if TVM 1.0 is on the roadmap, e.g., when 1.0 will be released and perhaps what its specification will include?

Thanks for the detailed explanation, @Hzfengsy.

As for the model that worked on old TVM but failed on new TVM: I can’t share the model, as it’s AMD’s proprietary model, but I can explain the problem in more detail.

It is a TFLite model. We tuned it for the Windows+Vulkan target, generated tuning records, and compiled it into a DLL for Windows using an older version of TVM in Oct 2022. In Jan 2023, we updated TVM and tried to run the same DLL on the new TVM. It hit the following error:

Check failed: (__e == VK_SUCCESS) is false: Vulkan Error, code=-13: Unknown Vulkan error code

This happened when the module.run() method was called. Any idea on what could be the cause, or how to debug it? Thanks.
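One low-cost mitigation for this class of breakage, regardless of the root cause, is to stamp every compiled artifact with the producing TVM version so a mismatch is caught at load time rather than deep inside a device call. The sidecar scheme below is purely hypothetical, not an existing TVM mechanism; the version strings are placeholders.

```python
import json

def stamp_artifact(tvm_version: str, target: str) -> str:
    # Serialize build metadata to ship alongside the compiled module
    # (a hypothetical convention, not a TVM feature).
    return json.dumps({"tvm_version": tvm_version, "target": target})

def check_artifact(meta_json: str, running_version: str) -> bool:
    # Refuse to run a module built by a different TVM version.
    return json.loads(meta_json)["tvm_version"] == running_version

# Example: a DLL built with one version, later loaded under another.
meta = stamp_artifact(tvm_version="0.10.0", target="vulkan")
print(check_artifact(meta, running_version="0.10.0"))  # True
print(check_artifact(meta, running_version="0.11.0"))  # False: fail fast
```

Failing fast with a clear version-mismatch message is much easier to act on than an opaque runtime error from the device API.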