The global barrier on CUDA is implemented using shared memory and an atomic add.
On this line, each thread adds num_blocks to vid_global_barrier_expect_, a variable in shared memory. As a result, vid_global_barrier_expect_ == num_blocks * num_threads_per_block after all threads finish this line. Since the variable lives in shared memory and is updated concurrently, shouldn't this use atomicAdd or atomicAdd_block?
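To illustrate the concern, here is a minimal sketch (variable names mirror the ones above, but the kernel itself is hypothetical, not the actual code in this patch). A plain `+=` on a shared-memory location compiles to a read-modify-write, so concurrent threads can lose updates; `atomicAdd_block` should suffice here since only threads within the block touch the variable:

```cuda
__global__ void barrier_expect_sketch(int num_blocks) {
  __shared__ int vid_global_barrier_expect_;
  if (threadIdx.x == 0) vid_global_barrier_expect_ = 0;
  __syncthreads();

  // Racy: a plain read-modify-write; concurrent threads can lose updates.
  // vid_global_barrier_expect_ += num_blocks;

  // Safe: block-scoped atomic is enough for a shared-memory location,
  // because no thread outside this block can access it.
  atomicAdd_block(&vid_global_barrier_expect_, num_blocks);
  __syncthreads();

  // After the barrier, all threads observe
  // vid_global_barrier_expect_ == num_blocks * blockDim.x.
}
```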
Can you give an example test case that generates code through this path? I am not familiar with it, since most kernels I have seen do not use this synchronization mechanism.
As an aside, I am confused about the meaning of "global" in the global barrier here. Is it supposed to be a true global barrier, i.e. one that synchronizes all threads in all blocks?
If I understand correctly, the implementation strategy is to spin on a counter until all threads are accounted for. I thought CUDA provides no guarantee that this style of implementation does not deadlock, since nothing prevents the execution of blocks from being serialized. My knowledge of CUDA may be stale, though, so if someone has updated info that would be great.
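For concreteness, here is a hypothetical sketch of the pattern I am worried about, plus what I understand to be the supported alternative (the kernel names and the `g_arrived` counter are illustrative, not from this patch). If the grid has more blocks than the device can keep resident at once, the blocks that have not launched can never increment the counter, so the resident blocks spin forever:

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Hand-rolled counter barrier: thread 0 of each block increments a
// global counter, then every thread spins until all blocks have arrived.
__device__ volatile int g_arrived = 0;

__global__ void hand_rolled_barrier() {
  __syncthreads();
  if (threadIdx.x == 0) atomicAdd((int*)&g_arrived, 1);
  // Deadlocks if some blocks are never co-resident: the counter can
  // never reach gridDim.x while earlier blocks occupy the SMs.
  while (g_arrived < gridDim.x) { /* spin */ }
  __syncthreads();
}

// Supported alternative: a cooperative launch guarantees all blocks are
// co-resident, and grid.sync() is a true grid-wide barrier.
__global__ void cooperative_barrier() {
  cg::this_grid().sync();  // requires cudaLaunchCooperativeKernel
}
```

My understanding is that only the cooperative-launch path carries a forward-progress guarantee; the hand-rolled version happens to work only when all blocks fit on the device simultaneously.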