The main difference between using RL for control vs. parameter tuning is that in the first case the policy directly outputs the control input itself, e.g. a torque. In the latter case, the policy outputs parameter values: e.g., if you are tuning a PID controller, the policy would output three numbers, Kp, Ki, and Kd. Naturally, the observations/inputs to the policy, as well as the reward, would likely need to be different too.
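For concreteness, here is a minimal sketch of how the two action spaces might look, assuming a gymnasium-style environment (the bounds and shapes are made up purely for illustration):

```python
import numpy as np
from gymnasium import spaces

# Direct control: the policy's action IS the control input (e.g. a torque).
direct_action_space = spaces.Box(low=-5.0, high=5.0, shape=(1,), dtype=np.float32)

# Parameter tuning: the policy's action is a set of controller parameters,
# e.g. the three PID gains [Kp, Ki, Kd].
pid_gain_action_space = spaces.Box(low=0.0, high=10.0, shape=(3,), dtype=np.float32)
```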
To your question on how the latter could run in parallel with the controller, I can see two scenarios:
1) Using RL to find static gains. In this case you train, take the constant parameter values the policy converges to, and then discard the policy and set your controller gains to those numbers (see the first sketch after this list).
2) Using RL to find dynamic/observation-based parameters. This is in some sense similar to gain scheduling, and in this case you would run the policy in parallel with the controller. The idea is the same (i.e., the policy outputs parameter values), but it does so at every step, updating the controller parameters dynamically based on the observations (see the second sketch below).
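A minimal sketch of scenario 1, assuming you already have a trained policy (`trained_policy` and the nominal observation are just placeholders): you query it once, keep the gains, and throw the policy away.

```python
import numpy as np

def trained_policy(observation):
    # Stand-in for the policy obtained from training; returns [Kp, Ki, Kd].
    return np.array([2.0, 0.5, 0.1])

nominal_observation = np.zeros(2)  # whatever operating point you trained around
kp, ki, kd = trained_policy(nominal_observation)
# From here on the policy is no longer needed: hard-code kp, ki, kd in the controller.
```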
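And a rough sketch of scenario 2, where a (placeholder) policy re-tunes the PID gains at every control step. The plant model, sampling time, and the policy itself are stand-ins for illustration, not any particular library's API:

```python
import numpy as np

dt = 0.01  # sampling time (assumed)

class PID:
    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = 0.0

    def set_gains(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd

    def update(self, error):
        self.integral += error * dt
        derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

def policy(observation):
    # Stand-in for the trained RL policy: maps observation -> [Kp, Ki, Kd].
    return np.array([2.0, 0.5, 0.1])

pid = PID(1.0, 0.0, 0.0)
state, setpoint = 0.0, 1.0
for _ in range(1000):
    error = setpoint - state
    observation = np.array([state, error])
    kp, ki, kd = policy(observation)   # policy proposes gains at every step
    pid.set_gains(kp, ki, kd)          # controller parameters updated dynamically
    u = pid.update(error)              # PID still computes the actual control input
    state += dt * (-state + u)         # toy first-order plant, illustrative only
```

In practice the observation would include whatever signals the gains should depend on (operating point, tracking error, etc.), and `policy` would be whatever you obtained from training.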
Hope that helps.