Method Overview
Pipeline of the proposed HiFiVFS, covering both the training and inference phases. HiFiVFS is built on the Stable Video Diffusion (SVD) framework and uses multi-frame input and temporal attention to keep the generated videos stable. In the training phase, HiFiVFS introduces fine-grained attribute learning (FAL) and detailed identity learning (DIL). In FAL, attribute disentanglement and enhancement are achieved through identity desensitization and adversarial learning. DIL uses ID features better suited to face swapping to further boost identity similarity. In the inference phase, FAL retains only E_att for attribute extraction, which simplifies testing. Note that HiFiVFS is trained and tested in the latent space; for visualization purposes, all processes are illustrated in the original image space.
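
The difference between the two phases of FAL can be summarized in a small sketch. This is an illustrative bookkeeping example only, not the authors' implementation: the component labels (`E_att`, `identity_desensitization`, `adversarial_learning`) and the function names are hypothetical, chosen to mirror the terms used in the caption (the attribute encoder is named `E_att` in the code).

```python
# Illustrative sketch, NOT the authors' code. Component labels are
# hypothetical names that mirror the terms used in the caption.

# FAL parts described for the training phase.
FAL_TRAINING = ("E_att", "identity_desensitization", "adversarial_learning")

# Per the caption, only the attribute encoder is retained at inference.
FAL_INFERENCE = ("E_att",)


def fal_components(phase: str) -> tuple:
    """Return the FAL components that are active in the given phase."""
    if phase == "train":
        return FAL_TRAINING
    if phase == "inference":
        return FAL_INFERENCE
    raise ValueError(f"unknown phase: {phase!r}")


def dropped_at_inference() -> tuple:
    """Return the FAL components used in training but not at inference."""
    return tuple(c for c in FAL_TRAINING if c not in FAL_INFERENCE)


if __name__ == "__main__":
    print("train:    ", fal_components("train"))
    print("inference:", fal_components("inference"))
    print("dropped:  ", dropped_at_inference())
```

The sketch only records which parts the caption says are present in each phase; it makes no claim about how identity desensitization or adversarial learning are realized internally.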