Existing diffusion models struggle with video generation: they cannot produce long videos and fail to keep the subject identity consistent over time.
To address this, the paper proposes an in-context LoRA finetuning strategy that injects subject appearance at the token level for identity preservation, while simultaneously conditioning on pose information at the channel level for fine-grained motion control.
Reference image injection - token dimension
The reference image latents (VAE latents) are concatenated with the noisy video latents (also in VAE latent space) to provide appearance context directly, as sketched below.
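A minimal sketch of the token-level concatenation, assuming a DiT-style backbone where latents are flattened into a `(batch, tokens, channels)` sequence; the channel count and latent grid sizes below are illustrative assumptions, not the model's actual configuration.

```python
import torch

# Assumed shapes for illustration only.
B, C = 1, 16                     # batch size, latent channels
T_vid, H, W = 21, 45, 80         # latent frames and spatial grid of the video
T_ref = 1                        # the reference image is a single latent frame

noisy_video_tokens = torch.randn(B, T_vid * H * W, C)  # tokens of the noisy video latents
ref_image_tokens   = torch.randn(B, T_ref * H * W, C)  # tokens of the clean reference image latents

# Token-level injection: reference tokens are appended to the video tokens,
# so self-attention can attend to the subject's appearance as in-context information.
tokens = torch.cat([noisy_video_tokens, ref_image_tokens], dim=1)
print(tokens.shape)  # (1, (T_vid + T_ref) * H * W, C)
```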
Motion control - channel dimension
Latents of the pose skeleton and **hand surface normal** maps are concatenated with the noisy latents along the channel dimension. The hand maps are introduced as auxiliary control signals because earlier results suffered from degraded quality in hand regions due to their high-frequency textures and rapid movements. A sketch follows.
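A minimal sketch of the channel-level conditioning, assuming the pose skeleton and hand normal renderings are encoded by the same VAE into latents of matching temporal and spatial size; widening the model's input projection to accept the extra channels is an assumption about how the conditioning is wired.

```python
import torch

# Assumed shapes for illustration only: (batch, channels, frames, height, width).
B, C, T, H, W = 1, 16, 21, 45, 80

noisy_latents     = torch.randn(B, C, T, H, W)  # noisy video latents
pose_skeleton_lat = torch.randn(B, C, T, H, W)  # VAE latents of rendered pose skeletons
hand_normal_lat   = torch.randn(B, C, T, H, W)  # VAE latents of hand surface normal maps

# Channel-level conditioning: control latents are stacked along the channel axis,
# so the input layer sees 3*C channels instead of C.
conditioned = torch.cat([noisy_latents, pose_skeleton_lat, hand_normal_lat], dim=1)
print(conditioned.shape)  # (1, 48, 21, 45, 80)
```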

Implementation details
We use Sapiens (Khirodkar et al. 2024) for pose/hand normal extraction and build upon Wan2.1-I2V-14B (Wan et al. 2025). We finetune the linear layers within the self-attention, cross-attention, and feed-forward modules of Wan2.1 using LoRA (rank=16) (Hu et al. 2022) on a self-collected 33-hour video dataset (7,005 videos), which is orders of magnitude smaller than prior works (Luo et al. 2025; Gan et al. 2025). The base LoRA was trained for 4,000 steps on 8 NVIDIA A100-SXM4-80GB GPUs with a batch size of 8 at 720p resolution. The video stitching LoRA was trained to retain the first and last 1/4 of the frames while keeping all other settings unchanged. Background KV-sharing is performed at the initial denoising steps and deep layers, following the configuration in DiTCtrl.
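A minimal sketch of how rank-16 LoRA adapters could be attached to the linear layers of the attention and feed-forward modules; the module-name keywords and the LoRA wrapper below are assumptions for illustration, not Wan2.1's actual layer names or the paper's training code.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen nn.Linear with a trainable low-rank (rank=16) update."""
    def __init__(self, base: nn.Linear, rank: int = 16, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                    # freeze the pretrained weight
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)                 # adapter starts as a zero update
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.up(self.down(x))

def add_lora(module: nn.Module, prefix: str = "",
             keywords=("self_attn", "cross_attn", "ffn")):
    # Hypothetical name keywords; the real module names in Wan2.1 may differ.
    for name, child in module.named_children():
        full_name = f"{prefix}.{name}" if prefix else name
        if isinstance(child, nn.Linear) and any(k in full_name for k in keywords):
            setattr(module, name, LoRALinear(child, rank=16))
        else:
            add_lora(child, full_name, keywords)
    return module
```

With this setup, only the `down`/`up` adapter weights receive gradients, which is consistent with finetuning on a comparatively small 33-hour dataset.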