I am running CNN models on CUDA (without cuDNN) on an AGX Xavier, and I am getting a strange error that occurs with dense layers.
I am unable to run ResNet18-CIFAR10 without getting a segmentation fault.
I’ve managed to narrow it down: the issue appears to be related to how the dense layer interacts with certain convolutional layers.
Starting from the PyTorch definition of the model and removing most of the layers, this is the minimal network that still crashes for me:
```python
def forward(self, x):
    out = F.relu(self.bn1(self.conv1(x)))
    out = F.relu(self.layer1(out))
    out = out.view(out.size(0), -1)
    out = self.linear(out)
    return out
```
If I remove `out = self.layer1(out)`, it works. If I remove `out = self.linear(out)`, it works.
I’ve put both the model and the TVM inference in this single script.
As noted, I am using CUDA without cuDNN. The problem does not seem to be related to things like bias, kernel size, or padding.
The two conv2d layers look pretty normal:
```python
self.conv1 = nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1, bias=False)
self.bn1 = nn.BatchNorm2d(64)
self.layer1 = nn.Conv2d(
    self.in_planes, 64, kernel_size=3, stride=1, padding=1, bias=False
)
```
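In case it helps, the shapes seem fine on paper. Both convolutions use `kernel_size=3, stride=1, padding=1`, so they preserve the spatial dimensions, and (assuming CIFAR10's 3x32x32 input) the `view` flatten hands the linear layer 64 * 32 * 32 = 65536 features. A quick sanity check of that arithmetic in plain Python:

```python
# Paper-check of the tensor shapes (plain Python, no PyTorch/TVM needed;
# assumes CIFAR10's 32x32 spatial input as in the full model).

def conv2d_out(size, kernel=3, stride=1, padding=1):
    # Standard conv output-size formula: (N + 2P - K) // S + 1
    return (size + 2 * padding - kernel) // stride + 1

h = w = 32                            # CIFAR10 spatial dims
h, w = conv2d_out(h), conv2d_out(w)   # conv1: 3x3, stride 1, pad 1 -> 32x32
h, w = conv2d_out(h), conv2d_out(w)   # layer1: same config -> 32x32

flat = 64 * h * w  # channels * H * W after out.view(out.size(0), -1)
print(flat)        # 65536: the in_features the linear layer expects
```

So the flatten/linear dimensions line up, which is part of why the segfault is so confusing.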
Does anyone have any suggestions for how to identify the root cause here? Again, the full code is here.
This is the case on several versions of TVM, including the latest one (4087e72b657eae484bb647cbd8ef86b9acf11748).