Hi all,
When I apply double_buffer to my local output buffer, it seems that none of the double_buffer scheduling takes effect any more. The IR in question looks like this:
```c++
for (int i1 = 0; i1 < 8; ++i1) {
  A_local[i1] = A[i1];
}
for (int i11 = 0; i11 < 8; ++i11) {
  B_local[i11] = B[i11];
}
for (int i12 = 0; i12 < 8; ++i12) {
  C_local[i12] = (int)B_local[i12] + (int)A_local[i12];
}
for (int i1_outer_outer = 0; i1_outer_outer < 7; ++i1_outer_outer) {
  for (int i13 = 0; i13 < 8; ++i13) {
    A_local[((((i1_outer_outer + 1) % 2) * 8) + i13)] = A[(((i1_outer_outer * 8) + i13) + 8)];
  }
  for (int i14 = 0; i14 < 8; ++i14) {
    B_local[((((i1_outer_outer + 1) % 2) * 8) + i14)] = B[(((i1_outer_outer * 8) + i14) + 8)];
  }
  for (int i15 = 0; i15 < 8; ++i15) {
    C_local[((((i1_outer_outer + 1) % 2) * 8) + i15)] = (int)B_local[((((i1_outer_outer + 1) % 2) * 8) + i15)] + (int)A_local[((((i1_outer_outer + 1) % 2) * 8) + i15)];
  }
  for (int i1_inner = 0; i1_inner < 8; ++i1_inner) {
    C[((i1_outer_outer * 8) + i1_inner)] = C_local[(((i1_outer_outer % 2) * 8) + i1_inner)];
  }
}
for (int i1_inner1 = 0; i1_inner1 < 8; ++i1_inner1) {
  C[(i1_inner1 + 56)] = C_local[(i1_inner1 + 8)];
}
```
Could you give me some advice on how to implement a double_buffer schedule for this case? Or do I have to add another pass to double buffer the output data?
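For reference, the way I apply the schedule is roughly the following. This is only a minimal sketch, assuming a 2019-era TVM Python API; the buffer names, dtypes, and tile size are made up to mirror the IR above rather than taken from my real kernel:

```python
import tvm

n = 64
A = tvm.placeholder((n,), name="A", dtype="int8")
B = tvm.placeholder((n,), name="B", dtype="int8")
C = tvm.compute((n,), lambda i: B[i].astype("int32") + A[i].astype("int32"), name="C")

s = tvm.create_schedule(C.op)

# Local staging buffers: one for the output, one per input.
C_local = s.cache_write(C, "local")
A_local = s.cache_read(A, "local", [C_local])
B_local = s.cache_read(B, "local", [C_local])

# Tile the output loop and attach every local stage under the outer loop.
xo, xi = s[C].split(C.op.axis[0], factor=8)
s[C_local].compute_at(s[C], xo)
s[A_local].compute_at(s[C], xo)
s[B_local].compute_at(s[C], xo)

# Double buffer the input stages...
s[A_local].double_buffer()
s[B_local].double_buffer()
# ...and also the local output stage -- this is the line that seems to
# break the double buffering.
s[C_local].double_buffer()

print(tvm.lower(s, [A, B, C], simple_mode=True))
```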
zfhn
April 2, 2019, 8:27am
The double buffer pass is probably only meant for input data pre-fetching on the GPU. Please refer to this link:
master ← tqchen:master, opened 12:57 AM, 01 Sep 17 UTC

This enables double buffering pre-fetching. Could be useful for shared memory pre-fetching. One advantage of double buffering is that the logic explicitly prefetches the next stage's input into the shared memory buffer.
## Source
```c++
for (i.outer, 0, 100) {
  allocate B[float32 * 4]
  for (i, 0, 4) {
    B[i] = A[((i.outer*4) + i)]
  }
  for (i, 0, 4) {
    A[i] = (B[i] + 1.000000f)
  }
}
```
## Target
```c++
allocate B[float32 * 2 * 4]
for (i, 0, 4) {
  B[i] = A[i]
}
for (i.outer, 0, 99) {
  // prefetch next iteration
  for (i, 0, 4) {
    B[((((i.outer + 1) % 2)*4) + i)] = A[(((i.outer*4) + i) + 4)]
  }
  for (i, 0, 4) {
    A[i] = (B[(((i.outer % 2)*4) + i)] + 1.000000f)
  }
}
for (i, 0, 4) {
  A[i] = (B[(i + 4)] + 1.000000f)
}
```
## Note
Usually when the GPU fetches memory, there is a large latency before the data arrives. There are two ways to hide this cost:
- Context switch to another GPU thread on the same block; this requires us to launch many GPU threads, which limits the resources (registers) available to each block.
- Do double buffering, to prefetch the data needed in the next iteration.
There is a trade-off here. Bigger tiles mean more resources (registers) and more reuse, but make it harder to hide the loading cost (because we launch fewer threads). Smaller tiles mean more threads and make it easier to hide the loading cost, but give less reuse.
Enabling double buffering allows us to get bigger tiles and more reuse while relying less on context switching.
So directly enabling it may not speed things up (because the old schedule is already tuned to contain enough threads to hide the latency). We might need to enable it and also increase the tile size to get a schedule with more reuse that still hides the loading cost.
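For example, the usual pattern on the GPU is to double buffer a shared-memory cache_read of the inputs under a serial loop. Below is only a minimal sketch, assuming a 2019-era TVM Python API; the shapes, tile sizes, and thread counts are made up for illustration:

```python
import tvm

n = 4096
A = tvm.placeholder((n,), name="A")
B = tvm.placeholder((n,), name="B")
C = tvm.compute((n,), lambda i: A[i] + B[i], name="C")

s = tvm.create_schedule(C.op)

# Stage the inputs in shared memory so there is something to prefetch.
AA = s.cache_read(A, "shared", [C])
BB = s.cache_read(B, "shared", [C])

block_x = tvm.thread_axis("blockIdx.x")
thread_x = tvm.thread_axis("threadIdx.x")

# block / serial / thread loop nest: the serial loop `xs` is the one the
# double buffer pass pipelines over.
bx, x = s[C].split(C.op.axis[0], factor=256)
xs, tx = s[C].split(x, factor=32)
s[C].bind(bx, block_x)
s[C].bind(tx, thread_x)

# Attach the shared loads under the serial loop and let the threads fetch
# cooperatively; double_buffer then overlaps the load for iteration i+1
# with the compute of iteration i.
for load in (AA, BB):
    s[load].compute_at(s[C], xs)
    s[load].bind(load.op.axis[0], thread_x)
    s[load].double_buffer()

print(tvm.lower(s, [A, B, C], simple_mode=True))
```

Whether this actually pays off then comes down to the tile size vs. thread count trade-off described in the note above.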