It is quite possible that your program calls cudaMemcpy after every timestep to copy the results from the GPU to the CPU and then writes them to a file. A device-to-host cudaMemcpy blocks the host until all previously launched kernels have finished and the transfer is complete, so doing this every timestep can be very expensive.
If possible, you could keep each step's results in GPU memory in a separate buffer, and then after the N steps, copy everything back and write it to your file at once. The downside is that saving the results this way uses more GPU memory, but if you are not writing large amounts of data per timestep, it should be fine.
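Since I haven't seen your code, here is only a rough sketch of what I mean. The kernel step_kernel, the function run_and_collect and the "add 1 per step" update are all made up for illustration; your real update and data layout will differ. The idea is that each step writes its snapshot into its own slice of one large device buffer, and a single cudaMemcpy brings everything back at the end.

```cuda
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical per-timestep update (an assumption, not your actual kernel):
// advance the state by 1 and store a snapshot of it for this step.
__global__ void step_kernel(float* state, float* snapshot, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        state[i] += 1.0f;
        snapshot[i] = state[i];
    }
}

// Runs n_steps timesteps on n elements, keeping every snapshot on the GPU,
// then copies the whole history to the host with one cudaMemcpy.
// Returns an empty vector if any CUDA call fails.
std::vector<float> run_and_collect(int n, int n_steps) {
    const size_t state_bytes = static_cast<size_t>(n) * sizeof(float);
    const size_t history_bytes = state_bytes * n_steps;

    float* d_state = nullptr;
    float* d_history = nullptr;
    if (cudaMalloc(&d_state, state_bytes) != cudaSuccess) return {};
    if (cudaMalloc(&d_history, history_bytes) != cudaSuccess) {
        cudaFree(d_state);
        return {};
    }
    cudaMemset(d_state, 0, state_bytes);  // all-zero bytes == 0.0f

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    for (int t = 0; t < n_steps; ++t) {
        // Step t writes into its own slice of the history buffer.
        // No host copy happens inside the loop.
        step_kernel<<<blocks, threads>>>(
            d_state, d_history + static_cast<size_t>(t) * n, n);
    }

    // One blocking device-to-host copy after all steps have been queued.
    std::vector<float> h_history(static_cast<size_t>(n) * n_steps);
    cudaError_t err = cudaMemcpy(h_history.data(), d_history, history_bytes,
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_state);
    cudaFree(d_history);
    if (err != cudaSuccess) return {};
    return h_history;
}
```

You would call something like run_and_collect(n, N) once and then write the returned vector to your file in a single pass. Note that the history buffer needs n * N floats of GPU memory, which is the trade-off mentioned above.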
Let me know if this works. If not, feel free to share your code and I'll take a further look at it.