FP16 Mixed Precision Reduction with Result Cast

I have been attempting to write a single CUDA kernel reduction that takes FP16 inputs, casts them to FP32 for the reduction, and then stores the result back as FP16. Is this currently possible in a single kernel with Tensor Expressions? When I try to add a cast to the result, I get an error suggesting that the last operation has to be the reduction. I have also tried adding another compute line to the schedule (see the two-compute sketch at the end of this post), but I get the same error.

TVMError: Check failed: 0 == level_: Reductions are only allowed at the top level of compute. Please create another tensor for further composition.

Example:

import tvm

tgt_host = "llvm"
tgt      = "cuda"

toks   = tvm.var("tokens")
hidden = tvm.const(1024, "int32")
inputs = tvm.placeholder((toks, hidden), name='inputs', dtype='float16')
y      = tvm.reduce_axis((0, hidden), "y")
# Casting the tvm.sum result back to float16 here is what triggers the
# TVMError above, since the reduction is no longer the top-level expression.
outputs = tvm.compute((toks,), lambda x: tvm.sum(inputs[x, y].astype('float32'), axis=y).astype('float16'), name='outputs')
sched   = tvm.create_schedule([outputs.op])

# Bind the data axis to blocks and the reduce axis to threads for a
# cross-thread reduction on the GPU.
sched[outputs].bind(outputs.op.axis[0], tvm.thread_axis("blockIdx.x"))
sched[outputs].bind(outputs.op.reduce_axis[0], tvm.thread_axis("threadIdx.x"))
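
For reference, this is roughly what I mean by adding another compute line: reduce into an intermediate FP32 tensor, then cast to FP16 in a second compute. The names sum32 and outputs16 are ones I've made up for this post, and the compute_at/bind combination is just my attempt at keeping everything in one kernel:

# Two-stage variant: reduce in FP32 into an intermediate tensor,
# then cast to FP16 in a separate compute.
k         = tvm.reduce_axis((0, hidden), "k")
sum32     = tvm.compute((toks,), lambda x: tvm.sum(inputs[x, k].astype('float32'), axis=k), name='sum32')
outputs16 = tvm.compute((toks,), lambda x: sum32[x].astype('float16'), name='outputs16')

sched2 = tvm.create_schedule([outputs16.op])
# Inline the reduction under the cast stage so both land in one kernel.
sched2[sum32].compute_at(sched2[outputs16], outputs16.op.axis[0])
sched2[outputs16].bind(outputs16.op.axis[0], tvm.thread_axis("blockIdx.x"))
sched2[sum32].bind(sum32.op.reduce_axis[0], tvm.thread_axis("threadIdx.x"))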