Automatic medical report generation (MRG) is of great research value as it has the potential to relieve radiologists from the heavy burden of report writing. Despite recent advancements, accurate MRG remains challenging due to the need for precise clinical understanding and disease identification. Moreover, the imbalanced distribution of diseases makes the challenge even more pronounced, as rare diseases are underrepresented in training data, making their diagnosis unreliable.
To address these challenges, we propose diagnosis-driven prompts for medical report generation (PromptMRG), a novel framework that aims to improve the diagnostic accuracy of MRG with the guidance of diagnosis-aware prompts. Specifically, PromptMRG is based on an encoder-decoder architecture with an extra disease classification branch. When generating reports, the diagnostic results from the classification branch are converted into token prompts to explicitly guide the generation process. To further improve the diagnostic accuracy, we design cross-modal feature enhancement, which retrieves similar reports from the database to assist the diagnosis of a query image by leveraging the knowledge from a pre-trained CLIP. Moreover, the disease imbalance issue is addressed by applying an adaptive logit-adjusted loss to the classification branch based on the individual learning status of each disease, which overcomes the barrier of the text decoder’s inability to manipulate disease distributions.
Experiments on two MRG benchmarks show the effectiveness of the proposed method, where it obtains state-of-the-art clinical efficacy performance on both datasets.
Automated analysis of medical images involves a wide range of tasks, such as anomaly detection (Cai et al. 2022), disease classification (Luo et al. 2022, 2020), lesion detection (Luo et al. 2021), landmark detection (Jin, Che, and Chen 2023), etc. Among them, medical report generation (MRG) is the task of generating a free-text description of a medical image that provides a comprehensive summary of the image’s content. Due to its potential to relieve the heavy workload of radiologists, many works have been proposed for MRG in recent years.
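Note: here the first prediction (prediction 1) is bound to score well on BLEU, ROUGE, and similar metrics because its wording closely matches the reference, yet its content is completely wrong. Prediction 2, although its wording differs from the gold standard, gets the content right.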
However, it is challenging to generate an accurate medical report as it demands a comprehensive understanding of the given image, especially the ability to identify clinical findings. For example, Figure 1(a) shows two sample predictions of a chest X-ray alongside the ground-truth (GT). While the wording of the first prediction is highly similar to the GT, its diagnosis regarding opacity and pneumonia is incorrect. In contrast, the second prediction is preferred as it successfully identifies opacity and pneumonia, albeit with different wording. Therefore, an ideal MRG system should be able to identify abnormalities accurately, then convert the findings into text with both linguistic precision and clinical relevance.
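The gap between wording similarity and clinical correctness can be illustrated with a small sketch. It computes clipped unigram precision, the basic ingredient of BLEU (without the brevity penalty or higher-order n-grams). The three sentences are hypothetical and are not the ones shown in Figure 1(a); the function name `ngram_precision` is likewise only illustrative.

```python
from collections import Counter


def ngram_precision(candidate: str, reference: str, n: int = 1) -> float:
    """Clipped n-gram precision of candidate against a single reference."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    total = sum(cand_ngrams.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs in the reference.
    overlap = sum(min(count, ref_ngrams[gram]) for gram, count in cand_ngrams.items())
    return overlap / total


# Hypothetical sentences (assumption: made up for illustration only).
ground_truth = "there is a focal opacity in the right lower lobe consistent with pneumonia"
prediction_1 = "there is no focal opacity in the right lower lobe and no pneumonia"  # wrong diagnosis
prediction_2 = "right basilar consolidation is seen which likely reflects pneumonia"  # right diagnosis

p1 = ngram_precision(prediction_1, ground_truth)
p2 = ngram_precision(prediction_2, ground_truth)
print(f"prediction 1 unigram precision: {p1:.3f}")
print(f"prediction 2 unigram precision: {p2:.3f}")
```

In this toy case the clinically wrong sentence gets the higher overlap score, which is why an overlap metric alone cannot certify that a report is diagnostically correct.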
To obtain an MRG system with satisfactory performance, various methods have been proposed. For example, the knowledge graph is an effective technique to enhance feature learning and diagnostic ability by injecting domain knowledge into the model (Zhang et al. 2020; Liu et al. 2021a);
Knowledge: (grounded in medical knowledge)
Zhang, Y.; Wang, X.; Xu, Z.; Yu, Q.; Yuille, A.; and Xu, D. 2020. When Radiology Report Generation Meets Knowledge Graph. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, 12910–12917.
Liu, F.; Wu, X.; Ge, S.; Fan, W.; and Zou, Y. 2021a. Exploring and Distilling Posterior and Prior Knowledge for Radiology Report Generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 13753–13762.
Multi-task: training with multiple tasks to optimize and strengthen the model
multi-task learning has also been widely used for obtaining better feature representations, where extra auxiliary tasks are
simultaneously conducted (Jing, Xie, and Xing 2018; Wang et al. 2022; Yan and Pei 2022).
Jing, B.; Xie, P.; and Xing, E. 2018. On the Automatic Generation of Medical Imaging Reports. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2577–2586.
Wang, Z.; Liu, L.; Wang, L.; and Zhou, L. 2023. METransformer: Radiology Report Generation by Transformer with Multiple Learnable Expert Tokens. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 11558–11567.
Despite the success, state-of-the-art (SOTA) methods still lack the ability to generate diagnostically correct reports. As evidenced by our observation shown in Figure 1(b), a vanilla disease classification model significantly outperforms most SOTA MRG methods in terms of the F1 score of clinical efficacy (CE). In MRG, CE serves as a metric for assessing the diagnostic accuracy of generated reports. Thus, the figure indicates that existing MRG methods have not fully leveraged the diagnostic information in medical images, which is an obstacle to the application of MRG. Additionally, the biased distribution of diseases leads to imbalanced CE performance (see Figure 1(c)).
Yet, this issue has not been addressed in prior works, which further reduces the clinical value of current MRG models as their diagnoses of rare diseases are unreliable.
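As a rough illustration of how a CE-style F1 score can be computed, the sketch below compares binary disease labels of generated reports against those of the reference reports. It assumes the labels have already been extracted from the text by some automatic labeler, which is not shown; the label matrices and the helper name `micro_f1` are hypothetical, and the exact averaging used in the paper may differ.

```python
def micro_f1(pred_labels, gold_labels):
    """Micro-averaged F1 over all (report, disease) pairs of binary labels."""
    tp = fp = fn = 0
    for pred_row, gold_row in zip(pred_labels, gold_labels):
        for p, g in zip(pred_row, gold_row):
            if p == 1 and g == 1:
                tp += 1
            elif p == 1 and g == 0:
                fp += 1
            elif p == 0 and g == 1:
                fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels: rows are reports, columns are diseases (1 = present).
gold = [[1, 0, 1],
        [0, 0, 1]]
pred = [[1, 0, 0],
        [0, 1, 1]]

score = micro_f1(pred, gold)
print(f"CE micro-F1: {score:.3f}")
```

Because the score depends only on the extracted labels, a report can be fluent and still score poorly, and a rare disease contributes few positive pairs to the total.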
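Note: report generation suffers from the problem described above. The output can closely imitate the style of a real report while getting the content wrong. Yet although generation quality is poor, models perform very well on the classification task. The authors' idea is to rely on the mature classification task, whose results are relatively reliable, and feed those results into the language model as special tokens to help it generate better reports.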
Inspired by the above observations, we propose PromptMRG, an MRG framework with diagnosis-driven prompts (DDP), aiming to improve the CE performance of MRG with the guidance of diagnostic results. Specifically, based on the encoder-decoder architecture, PromptMRG is also equipped with a disease classification branch. When generating reports, the diagnostic results from the classification branch are converted into token prompts to explicitly guide the generation process.
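A minimal sketch of the idea of turning classifier outputs into token prompts follows. The disease list, the two prompt tokens, the 0.5 threshold, and the `[BOS]` marker are all assumptions made for illustration; this section does not specify the actual prompt vocabulary or how the decoder consumes it.

```python
# Hypothetical label set and prompt vocabulary (assumptions, not from the source).
DISEASES = ["opacity", "pneumonia", "effusion"]
STATE_TOKENS = ["[NEG]", "[POS]"]


def build_prompt(probabilities, threshold=0.5):
    """Map per-disease probabilities to one prompt token per disease, in fixed order."""
    if len(probabilities) != len(DISEASES):
        raise ValueError("expected one probability per disease")
    return [STATE_TOKENS[int(p >= threshold)] for p in probabilities]


def decoder_input(probabilities):
    """Prepend the prompt tokens to the start-of-sequence marker fed to the text decoder."""
    return build_prompt(probabilities) + ["[BOS]"]


# Example: the classifier is confident about opacity and pneumonia, not effusion.
tokens = decoder_input([0.91, 0.74, 0.08])
print(tokens)
```

The position of each token encodes which disease it refers to, so the decoder can condition every generated word on the full diagnostic result.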
To further improve the diagnostic accuracy, we design cross-modal feature enhancement (CFE), which retrieves similar reports from the database to assist the diagnosis of a query image by leveraging a pre-trained CLIP model.
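The retrieval step can be sketched as a nearest-neighbour search in a shared embedding space. In this sketch the vectors are hand-written stand-ins; in practice they would come from the image and text encoders of a pre-trained CLIP model. How the retrieved reports are then fused into the classification branch is not described in this section and is left out. The function names are illustrative.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve_top_k(query_embedding, report_embeddings, k=2):
    """Return indices of the k database reports most similar to the query image."""
    ranked = sorted(
        range(len(report_embeddings)),
        key=lambda i: cosine(query_embedding, report_embeddings[i]),
        reverse=True,
    )
    return ranked[:k]


# Toy 2-D embeddings (assumption: stand-ins for CLIP image/text features).
query_image = [1.0, 0.0]
report_database = [[0.9, 0.1], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]

top = retrieve_top_k(query_image, report_database, k=2)
print(top)
```

Since the image and the reports live in the same space, the query image can be matched directly against report text without a separate image database.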
Moreover, the disease imbalance issue is also explicitly addressed via self-adaptive disease-balanced learning (SDL), which adaptively adjusts the optimization objectives of different diseases based on their learning status.
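For intuition, the sketch below shows a plain (non-adaptive) logit-adjusted cross-entropy, in which each class logit is shifted by the log of its prior before the softmax. This is the standard static form and is an assumption about the starting point only; the adaptive rule that SDL uses to update the adjustment from each disease's learning status is not given in this section, so it is not implemented here. The priors and the parameter `tau` are illustrative.

```python
import math


def logit_adjusted_ce(logits, target, priors, tau=1.0):
    """Cross-entropy on logits shifted by tau * log(prior) for each class."""
    adjusted = [z + tau * math.log(p) for z, p in zip(logits, priors)]
    # Numerically stable log-sum-exp.
    m = max(adjusted)
    log_sum_exp = m + math.log(sum(math.exp(a - m) for a in adjusted))
    return log_sum_exp - adjusted[target]


# Toy case: two classes, the model is undecided (equal logits).
logits = [0.0, 0.0]
uniform = [0.5, 0.5]
skewed = [0.9, 0.1]   # hypothetical priors: class 1 is rare

loss_plain = logit_adjusted_ce(logits, target=1, priors=uniform)
loss_rare = logit_adjusted_ce(logits, target=1, priors=skewed)
print(f"uniform priors: {loss_plain:.3f}, skewed priors: {loss_rare:.3f}")
```

With skewed priors the same undecided prediction is penalized more when the true class is the rare one, which pushes the classifier to give rare classes a larger margin during training.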
Experiments on two MRG benchmarks show the effectiveness of the proposed method, where it obtains SOTA CE performance on both datasets. We summarize contributions as follows.
Medical Report Generation
Most MRG models adopted the encoder-decoder architecture from image captioning (Xu et al. 2015; Lu et al. 2017; Ji et al. 2021) due to the similarity of the two tasks. However, MRG is more challenging than captioning as reports are much longer than captions while clinical abnormalities are more difficult to identify than natural objects. Therefore, various methods have been proposed to tackle the above challenges. Chen et al. (2020) and Yang et al. (2023) proposed extra memory modules to record past similar patterns for providing informative content during the decoding process, such that the generation performance could be improved. The proposed CFE in this paper also retrieves similar records as extra information, but differs in that it utilizes this information to enhance the disease classification branch rather than the generation process.
Knowledge graphs have been widely used to incorporate domain knowledge to assist report generation. For example, Zhang et al. (2020) and Liu et al. (2021a) proposed to incorporate a pre-constructed graph denoting the relationship between diseases and organs via graph neural networks, which allows for dedicated feature learning of the abnormalities. Later, Li et al. (2023) developed a method to dynamically update the graph by injecting new knowledge on the fly. Huang, Zhang, and Zhang (2023) designed an injected knowledge distiller to fuse the knowledge from a symptom graph into the final decoding stage, which shares a similar spirit to our DDP. Nevertheless, DDP explicitly tackles the CE issue via a different guidance mechanism (i.e., prompts), and shows much stronger performance in CE.
Multi-task learning is another common technique to facilitate the representation learning of MRG. Among the auxiliary tasks, disease classification is the most popular one as it helps the model to learn discriminative features (Jing, Xie, and Xing 2018; Wang et al. 2022; Yan and Pei 2022). Similarly, weakly supervised contrastive learning was introduced by Yan et al. (2021) as an auxiliary task to learn a semantically meaningful space. Additionally, image-text matching was explored (Wang et al. 2022, 2021; Yan and Pei 2022) to learn aligned image-text representations in a fine-grained manner. Although this work also uses disease classification, we highlight the key difference as follows. Previous methods often treat classification as a parallel task and expect it to benefit report generation implicitly through learning discriminative features. In contrast, we make use of the diagnostic results from the classification via prompts to explicitly guide the generation process. RGRG (Tanida et al. 2023) is the work most related to ours, which leverages an object detector as region guidance for sentence-wise generation. However, their decoder only attends to the regional visual features, as most previous works do, while ours attends to both visual features and prompts, where the prompts enable the decoder to explicitly leverage the diagnostic information for generating clinically correct reports.
Prompt as Guidance
Prompting is originally a technique from natural language processing for improving the generalization of language models (Liu et al. 2023). Instead of training various tasks individually with supervised learning, prompting enables language models to unify and adapt to a wide range of tasks by modifying inputs into textual templates. Later, some works (Li and Liang 2021; Lester, Al-Rfou, and Constant 2021; Liu et al. 2021b) adopted this technique for efficient fine-tuning, where prompts act as trainable task-specific vectors. Due to its effectiveness and simplicity, prompt tuning was further introduced to vision (Jia et al. 2022) and vision-language models (Radford et al. 2021; Zhou et al. 2022; Tsimpoukelli et al. 2021; Alayrac et al. 2022). More recently, some works treat prompts as guidance for improving the performance of specific tasks. For example, Qin et al. (2023) developed an automatic generation method of medical prompts to improve the knowledge transferability of pre-trained vision-language models to medical object detection. Ge et al. (2022) proposed to embed domain information into prompts for unsupervised domain adaptation. In this paper, we convert diagnostic results into prompts to guide report generation. To the best of our knowledge, this is the first work that introduces prompts to the task of MRG.