This paper presents DetCLIPv2, an efficient and scalable training framework that incorporates large-scale image-text pairs to achieve open-vocabulary object detection (OVD). Unlike previous OVD frameworks that typically rely on a pre-trained vision-language model (e.g., CLIP) or exploit image-text pairs via a pseudo-labeling process, DetCLIPv2 directly learns fine-grained word-region alignment from massive image-text pairs in an end-to-end manner. To accomplish this, we employ a maximum word-region similarity between region proposals and textual words to guide the contrastive objective (a code sketch of this objective follows below). To enable the model to gain localization capability while learning broad concepts, DetCLIPv2 is trained with hybrid supervision from detection, grounding and image-text pair data under a unified data formulation.

Recent research on unsupervised person re-identification (reID) has demonstrated that pre-training on unlabeled person images achieves superior performance on downstream reID tasks compared with pre-training on ImageNet. However, those pre-training methods are specifically designed for reID and adapt poorly to other pedestrian analysis tasks. In this paper, we propose VAL-PAT, a novel framework that learns transferable representations to enhance various pedestrian analysis tasks with multimodal information. To train our framework, we introduce three learning objectives, i.e., self-supervised contrastive learning, image-text contrastive learning and multi-attribute classification (a combined-loss sketch also follows below). The self-supervised contrastive learning facilitates the learning of intrinsic pedestrian properties, while the image-text contrastive learning guides the model to focus on the appearance information of pedestrians. Meanwhile, multi-attribute classification encourages the model to recognize attributes in order to excavate fine-grained pedestrian information. We first perform pre-training on the LUPerson-TA dataset, where each image contains text and attribute annotations, and then transfer the learned representations to various downstream tasks, including person reID, person attribute recognition and text-based person search. Extensive experiments demonstrate that our framework facilitates the learning of general pedestrian representations and thus leads to promising results on various pedestrian analysis tasks.
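As a rough illustration of the DetCLIPv2 objective described above, the sketch below scores an image-text pair by letting every word pick its best-matching region proposal (a maximum word-region similarity) and feeds those scores into a symmetric InfoNCE loss. The function names, tensor shapes and aggregation details are assumptions made for illustration, not the paper's actual implementation.

```python
# Minimal, hypothetical sketch of a fine-grained contrastive objective in the
# spirit of DetCLIPv2's maximum word-region similarity. Shapes and names are
# illustrative assumptions, not the paper's code.
import torch
import torch.nn.functional as F

def word_region_alignment(region_feats, word_feats):
    """Score one image-text pair.

    region_feats: (num_regions, dim) L2-normalized region proposal embeddings.
    word_feats:   (num_words, dim)   L2-normalized word embeddings.
    Each word attends to its best-matching region; scores are averaged over words.
    """
    sim = word_feats @ region_feats.t()   # (num_words, num_regions)
    return sim.max(dim=1).values.mean()   # max over regions, mean over words

def contrastive_loss(batch_regions, batch_words, temperature=0.07):
    """Symmetric InfoNCE over a batch of image-text pairs (lists of tensors)."""
    b = len(batch_regions)
    logits = torch.stack([
        torch.stack([word_region_alignment(batch_regions[i], batch_words[j])
                     for j in range(b)])
        for i in range(b)
    ]) / temperature                      # (b, b) pairwise alignment scores
    targets = torch.arange(b)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```

Taking the max over regions means each word only needs one well-matched proposal to align with, which is plausibly how word-level supervision can emerge from image-level pairs without box annotations.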
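In the same spirit, the following sketch shows one plausible way to combine VAL-PAT's three objectives into a single training loss; the embedding inputs, the symmetric form of the image-text term and the loss weights are assumptions for illustration, not the authors' code.

```python
# Hypothetical combination of VAL-PAT's three objectives; encoders, heads and
# loss weights are assumed here, not taken from the paper.
import torch
import torch.nn.functional as F

def info_nce(q, k, temperature=0.1):
    """InfoNCE between matched rows of q and k (both (B, d), L2-normalized)."""
    logits = q @ k.t() / temperature
    targets = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, targets)

def val_pat_loss(img_v1, img_v2, txt_emb, attr_logits, attr_labels,
                 w_ssl=1.0, w_itc=1.0, w_attr=1.0):
    """img_v1/img_v2: embeddings of two augmented views of the same person image.
    txt_emb: embeddings of the paired textual description.
    attr_logits/attr_labels: (B, num_attrs) multi-label attribute predictions/targets.
    """
    loss_ssl = info_nce(img_v1, img_v2)             # self-supervised: intrinsic properties
    loss_itc = 0.5 * (info_nce(img_v1, txt_emb) +   # image-text: appearance cues
                      info_nce(txt_emb, img_v1))
    loss_attr = F.binary_cross_entropy_with_logits( # multi-attribute classification
        attr_logits, attr_labels.float())
    return w_ssl * loss_ssl + w_itc * loss_itc + w_attr * loss_attr
```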
Intravascular Ultrasound (IVUS) is a medical imaging modality widely used for the detection and treatment of coronary heart disease. The detection of vascular structures is extremely important for accurate treatment procedures. Manual detection of lumen and calcification is very time-consuming and requires technical experience. Ultrasound imaging also suffers from the generation of artifacts, which obstruct the clear delineation among structures. Considering the need to provide special attention to crucial areas, convolutional block attention modules (CBAM) are integrated into an encoder-decoder-based U-Net architecture along with Atrous Spatial Pyramid Pooling (ASPP) to detect the vessel components: lumen, calcification and shadow borders (a generic CBAM sketch follows below). The attention modules prove effective in dealing with areas of special attention by assigning additional weights to crucial channels and preserving spatial features. IVUS data from 12 patients undergoing treatment are used for this study. The novelty of the model design is that it is able to detect the lumen area in both the presence and absence of calcification and bifurcation artifacts. The model also efficiently detects the calcification area even in severely complex lesions with shadows behind them. The main contribution of the work is that IVUS images with degrees of calcification of up to 360° are also considered, which is usually neglected in previous studies. Experiments on 1097 IVUS images from 12 patients yield a mean IoU of 0.7894 ± 0.011, a Dice coefficient of 0.8763 ± 0.070, a precision of 0.8768 ± 0.069 and a recall of 0.8774 ± 0.071 for the proposed CADNet, demonstrating the model's effectiveness relative to other state-of-the-art methods.
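To make the attention mechanism concrete, below is CBAM as commonly defined (Woo et al., 2018): channel attention computed from globally pooled descriptors, followed by spatial attention over pooled channel maps. How exactly CADNet wires CBAM and ASPP into its U-Net encoder-decoder is not specified above, so this should be read as a generic sketch rather than the model's implementation.

```python
# Generic CBAM block: channel attention, then spatial attention.
import torch
import torch.nn as nn

class CBAM(nn.Module):
    def __init__(self, channels, reduction=16, spatial_kernel=7):
        super().__init__()
        # Channel attention: shared MLP over global avg- and max-pooled descriptors.
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )
        # Spatial attention: 7x7 conv over concatenated channel-wise avg/max maps.
        self.spatial = nn.Conv2d(2, 1, spatial_kernel,
                                 padding=spatial_kernel // 2, bias=False)

    def forward(self, x):
        # Re-weight crucial channels.
        avg = self.mlp(x.mean(dim=(2, 3), keepdim=True))
        mx = self.mlp(x.amax(dim=(2, 3), keepdim=True))
        x = x * torch.sigmoid(avg + mx)
        # Re-weight crucial spatial locations while preserving spatial features.
        s = torch.cat([x.mean(dim=1, keepdim=True),
                       x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(s))
```

Such a block is typically dropped in after a convolutional stage, so that its output re-weights the feature map before the next pooling or upsampling step.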
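The reported metrics are standard overlap measures between predicted and ground-truth masks; the sketch below shows how they are usually computed and is not CADNet's evaluation code.

```python
# Generic per-image segmentation metrics from binary masks.
import numpy as np

def segmentation_metrics(pred, gt, eps=1e-7):
    """pred, gt: arrays of the same shape (predicted and ground-truth masks)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    iou = tp / (tp + fp + fn + eps)            # intersection over union
    dice = 2 * tp / (2 * tp + fp + fn + eps)   # Dice coefficient
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    return dict(iou=iou, dice=dice, precision=precision, recall=recall)
```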