Generalized on-policy distillation with reward extrapolation

3 points | by fzliu 2 days ago

No comments yet.