When using data parallel (DP, via fully_shard here) together with tensor parallel (TP), if TP is applied to only a subset of layers, so that some layers have only DP applied, the DP-only parameters for the same DP ...
I am trying to implement a handoff where a subgraph transfers the user to another node in the parent graph (a sibling node in this case) using a Command, but I keep ...