This is an experimental demo that combines ImageBind and SAM to generate masks from different input modalities.
The basic idea follows IEA: Image Editing Anything and CLIP-SAM, which generate the referring mask with the following steps:
- Step 1: Generate auto masks with `SamAutomaticMaskGenerator`
- Step 2: Crop the bounding-box region of every mask
- Step 3: Compute the similarity between the cropped images and the input from the other modality (text, audio, or image)
- Step 4: Merge the mask regions with the highest similarity
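Steps 2 to 4 can be sketched roughly as follows. This is a minimal illustration, not the code in `demo.py`: it assumes the masks use the `SamAutomaticMaskGenerator` output format (a list of dicts with a boolean `segmentation` array and a `bbox` in `[x, y, w, h]` order), and it stands in toy 2-D vectors for the ImageBind embeddings. The helper names `crop_mask_boxes`, `score_crops`, and `merge_masks` are hypothetical, and the softmax-then-threshold scoring is an assumption about how similarity is turned into a selection.

```python
import numpy as np


def crop_mask_boxes(image, masks):
    """Step 2: crop the bounding-box region of every mask.

    Each mask dict is assumed to carry "bbox" as [x, y, w, h].
    """
    crops = []
    for m in masks:
        x, y, w, h = (int(v) for v in m["bbox"])
        crops.append(image[y:y + h, x:x + w])
    return crops


def score_crops(crop_embeddings, query_embedding):
    """Step 3: softmax over crops of the dot-product similarity to the query."""
    logits = crop_embeddings @ query_embedding
    logits = logits - logits.max()  # subtract the max for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()


def merge_masks(masks, scores, threshold):
    """Step 4: union of all masks whose score exceeds the threshold."""
    merged = np.zeros_like(masks[0]["segmentation"], dtype=bool)
    for m, s in zip(masks, scores):
        if s > threshold:
            merged |= m["segmentation"]
    return merged


# Toy data standing in for SAM masks and ImageBind embeddings (hypothetical).
image = np.zeros((4, 6, 3), dtype=np.uint8)
seg_a = np.zeros((4, 6), dtype=bool)
seg_a[0:2, 0:3] = True
seg_b = np.zeros((4, 6), dtype=bool)
seg_b[2:4, 3:6] = True
masks = [
    {"segmentation": seg_a, "bbox": [0, 0, 3, 2]},
    {"segmentation": seg_b, "bbox": [3, 2, 3, 2]},
]

crops = crop_mask_boxes(image, masks)
crop_embeddings = np.array([[1.0, 0.0], [0.0, 1.0]])  # one row per crop
query_embedding = np.array([5.0, 0.0])                # e.g. text or audio
scores = score_crops(crop_embeddings, query_embedding)
merged = merge_masks(masks, scores, threshold=0.5)
```

In a real run, `crop_embeddings` and `query_embedding` would come from ImageBind, and `masks` from `SamAutomaticMaskGenerator.generate(image)`.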
- Installation
- ImageBind-SAM Demo
- Audio Referring Segment
- Text Referring Segment
- Image Referring Segment
- Download the pretrained checkpoints
cd playground/ImageBind_SAM
mkdir .checkpoints
cd .checkpoints
# download ImageBind weights
wget https://dl.fbaipublicfiles.com/imagebind/imagebind_huge.pth
# download SAM weights
wget https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth
- Install ImageBind by following the official installation guide.
- Install Grounded-SAM by following the Grounded-SAM installation instructions.
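Before running the demos, it may help to confirm that both checkpoint files are in place. The snippet below is an optional sanity check, not part of the repository. It assumes it is run from `playground/ImageBind_SAM` and that the files keep the names used in the `wget` commands above. The helper `missing_checkpoints` is a hypothetical name.

```python
from pathlib import Path

# File names taken from the download URLs above.
EXPECTED_CHECKPOINTS = ("imagebind_huge.pth", "sam_vit_h_4b8939.pth")


def missing_checkpoints(checkpoint_dir=".checkpoints", names=EXPECTED_CHECKPOINTS):
    """Return the expected checkpoint file names not found in checkpoint_dir."""
    root = Path(checkpoint_dir)
    return [name for name in names if not (root / name).is_file()]


missing = missing_checkpoints()
if missing:
    print("Missing checkpoints:", ", ".join(missing))
else:
    print("All checkpoints found.")
```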
python demo.py
This demo implements `Text Seg` and `Audio Seg`. The generated masks are saved as `text_sam_merged_mask.jpg` and `audio_sam_merged_mask.jpg`:
| Input Model | Modality | Generated Mask |
|---|---|---|
| | car audio | audio_sam_merged_mask.jpg |
| | "A car" | text_sam_merged_mask.jpg |
The similarity threshold can strongly affect the final result, so it is worth trying different values.
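To see why the threshold matters, consider how many candidate masks survive at different cut-offs. The scores below are made up for illustration, and `select_above_threshold` is a hypothetical helper. The demo's actual selection logic may differ.

```python
def select_above_threshold(scores, threshold):
    """Return the indices of masks whose similarity score exceeds the threshold."""
    return [i for i, s in enumerate(scores) if s > threshold]


# Hypothetical per-mask similarity scores (for example, a softmax over five crops).
scores = [0.55, 0.30, 0.08, 0.04, 0.03]

selected = {t: select_above_threshold(scores, t) for t in (0.02, 0.05, 0.2, 0.5)}
for threshold, indices in selected.items():
    print(f"threshold={threshold}: keeps masks {indices}")
```

A low threshold merges many regions, including weakly matching ones, while a high threshold may keep only a single mask or none at all.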
# download the referring image
cd .assets
wget https://github.com/IDEA-Research/detrex-storage/releases/download/grounded-sam-storage/referring_car_image.jpg
cd ..
python image_referring_seg_demo.py
python audio_referring_seg_demo.py
python text_referring_seg_demo.py