[Feature Matching: RoMa] Robust Feature Matching in case of different viewpoints between query image and candidate image (CVPR 2024)

스프링클 2024. 9. 18. 20:06

[Goal]

Perform robust feature matching between two images of the same space taken from very different viewpoints.

[RoMa Github page]

[ https://github.com/Parskatt/RoMa ]

[Dependencies]

python 3.8
pytorch 2.0.1
cuda 11.7
torchvision 0.15.2
einops 0.6.1
kornia 0.7.0
xformers 0.0.21
opencv-python
matplotlib
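
A quick way to confirm that an environment matches the pins above is to compare the installed package versions against them. The sketch below is not part of the RoMa repository: `EXPECTED` copies the list above, the helper names are hypothetical, and PyPI distribution names (`torch` rather than `pytorch`) are assumed.

```python
from importlib import metadata

# Version pins copied from the dependency list above (PyPI names assumed).
EXPECTED = {
    "torch": "2.0.1",
    "torchvision": "0.15.2",
    "einops": "0.6.1",
    "kornia": "0.7.0",
    "xformers": "0.0.21",
}

def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None

def find_mismatches(expected, lookup=installed_version):
    """Map each package whose installed version differs from its pin to (wanted, found)."""
    mismatches = {}
    for package, wanted in expected.items():
        found = lookup(package)
        # Local build tags such as "2.0.1+cu117" still count as a match.
        if found is None or found.split("+")[0] != wanted:
            mismatches[package] = (wanted, found)
    return mismatches

if __name__ == "__main__":
    for package, (wanted, found) in find_mismatches(EXPECTED).items():
        print(f"{package}: expected {wanted}, found {found}")
```

An empty result means every pinned package is installed at the listed version.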

[Model Configuration]

[ https://github.com/Parskatt/RoMa/blob/main/roma/models/model_zoo/roma_models.py ]
- Code

import warnings
import torch.nn as nn
from roma.models.matcher import *
from roma.models.transformer import Block, TransformerDecoder, MemEffAttention
from roma.models.encoders import *

def roma_model(resolution, upsample_preds, device = None, weights=None, dinov2_weights=None, **kwargs):
    # roma weights and dinov2 weights are loaded separately, as dinov2 weights are not parameters
    torch.backends.cuda.matmul.allow_tf32 = True # allow tf32 on matmul
    torch.backends.cudnn.allow_tf32 = True # allow tf32 on cudnn
    warnings.filterwarnings('ignore', category=UserWarning, message='TypedStorage is deprecated')
    gp_dim = 512
    feat_dim = 512
    decoder_dim = gp_dim + feat_dim
    cls_to_coord_res = 64
    coordinate_decoder = TransformerDecoder(
        nn.Sequential(*[Block(decoder_dim, 8, attn_class=MemEffAttention) for _ in range(5)]), 
        decoder_dim, 
        cls_to_coord_res**2 + 1,
        is_classifier=True,
        amp = True,
        pos_enc = False,)
    dw = True
    hidden_blocks = 8
    kernel_size = 5
    displacement_emb = "linear"
    disable_local_corr_grad = True

    conv_refiner = nn.ModuleDict(
        {
            "16": ConvRefiner(
                2 * 512+128+(2*7+1)**2,
                2 * 512+128+(2*7+1)**2,
                2 + 1,
                kernel_size=kernel_size,
                dw=dw,
                hidden_blocks=hidden_blocks,
                displacement_emb=displacement_emb,
                displacement_emb_dim=128,
                local_corr_radius = 7,
                corr_in_other = True,
                amp = True,
                disable_local_corr_grad = disable_local_corr_grad,
                bn_momentum = 0.01,
            ),
            "8": ConvRefiner(
                2 * 512+64+(2*3+1)**2,
                2 * 512+64+(2*3+1)**2,
                2 + 1,
                kernel_size=kernel_size,
                dw=dw,
                hidden_blocks=hidden_blocks,
                displacement_emb=displacement_emb,
                displacement_emb_dim=64,
                local_corr_radius = 3,
                corr_in_other = True,
                amp = True,
                disable_local_corr_grad = disable_local_corr_grad,
                bn_momentum = 0.01,
            ),
            "4": ConvRefiner(
                2 * 256+32+(2*2+1)**2,
                2 * 256+32+(2*2+1)**2,
                2 + 1,
                kernel_size=kernel_size,
                dw=dw,
                hidden_blocks=hidden_blocks,
                displacement_emb=displacement_emb,
                displacement_emb_dim=32,
                local_corr_radius = 2,
                corr_in_other = True,
                amp = True,
                disable_local_corr_grad = disable_local_corr_grad,
                bn_momentum = 0.01,
            ),
            "2": ConvRefiner(
                2 * 64+16,
                128+16,
                2 + 1,
                kernel_size=kernel_size,
                dw=dw,
                hidden_blocks=hidden_blocks,
                displacement_emb=displacement_emb,
                displacement_emb_dim=16,
                amp = True,
                disable_local_corr_grad = disable_local_corr_grad,
                bn_momentum = 0.01,
            ),
            "1": ConvRefiner(
                2 * 9 + 6,
                24,
                2 + 1,
                kernel_size=kernel_size,
                dw=dw,
                hidden_blocks = hidden_blocks,
                displacement_emb = displacement_emb,
                displacement_emb_dim = 6,
                amp = True,
                disable_local_corr_grad = disable_local_corr_grad,
                bn_momentum = 0.01,
            ),
        }
    )
    kernel_temperature = 0.2
    learn_temperature = False
    no_cov = True
    kernel = CosKernel
    only_attention = False
    basis = "fourier"
    gp16 = GP(
        kernel,
        T=kernel_temperature,
        learn_temperature=learn_temperature,
        only_attention=only_attention,
        gp_dim=gp_dim,
        basis=basis,
        no_cov=no_cov,
    )
    gps = nn.ModuleDict({"16": gp16})
    proj16 = nn.Sequential(nn.Conv2d(1024, 512, 1, 1), nn.BatchNorm2d(512))
    proj8 = nn.Sequential(nn.Conv2d(512, 512, 1, 1), nn.BatchNorm2d(512))
    proj4 = nn.Sequential(nn.Conv2d(256, 256, 1, 1), nn.BatchNorm2d(256))
    proj2 = nn.Sequential(nn.Conv2d(128, 64, 1, 1), nn.BatchNorm2d(64))
    proj1 = nn.Sequential(nn.Conv2d(64, 9, 1, 1), nn.BatchNorm2d(9))
    proj = nn.ModuleDict({
        "16": proj16,
        "8": proj8,
        "4": proj4,
        "2": proj2,
        "1": proj1,
        })
    displacement_dropout_p = 0.0
    gm_warp_dropout_p = 0.0
    decoder = Decoder(coordinate_decoder, 
                      gps, 
                      proj, 
                      conv_refiner, 
                      detach=True, 
                      scales=["16", "8", "4", "2", "1"], 
                      displacement_dropout_p = displacement_dropout_p,
                      gm_warp_dropout_p = gm_warp_dropout_p)

    encoder = CNNandDinov2(
        cnn_kwargs = dict(
            pretrained=False,
            amp = True),
        amp = True,
        use_vgg = True,
        dinov2_weights = dinov2_weights
    )
    h,w = resolution
    symmetric = True
    attenuate_cert = True
    matcher = RegressionMatcher(encoder, decoder, h=h, w=w, upsample_preds=upsample_preds, 
                                symmetric = symmetric, attenuate_cert=attenuate_cert, **kwargs).to(device)
    matcher.load_state_dict(weights)
    return matcher
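
The `ConvRefiner` channel expressions in the configuration above are easier to check when the arithmetic is spelled out. The helper below is a hypothetical sketch, not part of RoMa. Reading the terms as two projected feature maps, a displacement embedding, and a (2r+1)^2 local-correlation window is my interpretation of the expressions rather than something the source states.

```python
def refiner_in_channels(feat_dim, emb_dim, local_corr_radius=None):
    """Reproduce the channel arithmetic used for each ConvRefiner above."""
    channels = 2 * feat_dim + emb_dim
    if local_corr_radius is not None:
        # A window of radius r covers (2r + 1)^2 positions.
        channels += (2 * local_corr_radius + 1) ** 2
    return channels

# (feature dim, displacement_emb_dim, local_corr_radius) per scale, copied from the config.
SCALES = {
    "16": (512, 128, 7),
    "8": (512, 64, 3),
    "4": (256, 32, 2),
    "2": (64, 16, None),
    "1": (9, 6, None),
}

in_channels = {scale: refiner_in_channels(*cfg) for scale, cfg in SCALES.items()}
print(in_channels)

# The coarse classifier head is built with cls_to_coord_res**2 + 1 outputs.
num_classes = 64 ** 2 + 1
print(num_classes)
```

The feature dimensions per scale agree with the output channels of the `proj` layers (512, 512, 256, 64, 9).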

[Get Pre-trained Model]

The code downloads the weights automatically from the URLs below, but the models can also be downloaded manually and loaded from disk.
- roma/models/model_zoo/__init__.py
  - roma_indoor model: https://github.com/Parskatt/storage/releases/download/roma/roma_indoor.pth
  - roma_outdoor model: https://github.com/Parskatt/storage/releases/download/roma/roma_outdoor.pth
  - DINOv2 model: https://dl.fbaipublicfiles.com/dinov2/dinov2_vitl14/dinov2_vitl14_pretrain.pth
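
Manual download and loading could look roughly like the sketch below. The helper names and the cache directory are hypothetical. The `weights` and `dinov2_weights` keywords mirror the `roma_model` signature shown earlier; that `roma_indoor` accepts the same keywords is an assumption to verify against the installed version.

```python
from pathlib import Path
from urllib.request import urlretrieve

# URLs copied from the list above.
WEIGHT_URLS = {
    "roma_indoor": "https://github.com/Parskatt/storage/releases/download/roma/roma_indoor.pth",
    "roma_outdoor": "https://github.com/Parskatt/storage/releases/download/roma/roma_outdoor.pth",
    "dinov2": "https://dl.fbaipublicfiles.com/dinov2/dinov2_vitl14/dinov2_vitl14_pretrain.pth",
}

def local_weight_path(name, cache_dir):
    """Return where the checkpoint for `name` is stored inside `cache_dir`."""
    filename = WEIGHT_URLS[name].rsplit("/", 1)[-1]
    return Path(cache_dir) / filename

def fetch_weights(name, cache_dir):
    """Download the checkpoint once and reuse the local copy afterwards."""
    path = local_weight_path(name, cache_dir)
    if not path.exists():
        path.parent.mkdir(parents=True, exist_ok=True)
        urlretrieve(WEIGHT_URLS[name], path)
    return path

def build_indoor_model(cache_dir, device="cuda"):
    """Build roma_indoor from local checkpoints (keyword names are assumed)."""
    import torch  # imported lazily so the helpers above work without torch
    from roma import roma_indoor

    weights = torch.load(fetch_weights("roma_indoor", cache_dir), map_location=device)
    dinov2_weights = torch.load(fetch_weights("dinov2", cache_dir), map_location=device)
    return roma_indoor(device=device, weights=weights, dinov2_weights=dinov2_weights)
```

Calling `build_indoor_model("weights")` would then download each file at most once into a local `weights` directory.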

[Basic Code]

Code based on the examples in the demo folder

from roma import roma_indoor, roma_outdoor
from PIL import Image, ImageDraw  # importing ImageDraw makes PIL.ImageDraw usable below
import numpy as np
import PIL
import torch
import cv2

# ====================================================================================================== # 
def draw_keypoints_on_image(image, keypoints, color='blue', radius=2, use_normalized_coordinates=False):
  """Draws keypoints on an image (in place).

  Args:
    image: a PIL.Image object.
    keypoints: a numpy array with shape [num_keypoints, 2], each row (x, y).
    color: color to draw the keypoints with. Default is blue.
    radius: keypoint radius. Default value is 2.
    use_normalized_coordinates: if True, treat keypoint values as relative
      to the image. Otherwise (default) treat them as absolute pixels.
  """
  draw = PIL.ImageDraw.Draw(image)
  im_width, im_height = image.size
  # to_pixel_coordinates returns (x, y), so column 0 is x and column 1 is y.
  keypoints_x = [k[0] for k in keypoints]
  keypoints_y = [k[1] for k in keypoints]
  if use_normalized_coordinates:
    keypoints_x = tuple([im_width * x for x in keypoints_x])
    keypoints_y = tuple([im_height * y for y in keypoints_y])
  for keypoint_x, keypoint_y in zip(keypoints_x, keypoints_y):
    draw.ellipse([(keypoint_x - radius, keypoint_y - radius),
                  (keypoint_x + radius, keypoint_y + radius)],
                 outline=color, fill=color)
# ====================================================================================================== #
# ====================================================================================================== # 
def draw_line_on_image(image, kptsA, kptsB, color='red', size=0):
  draw = PIL.ImageDraw.Draw(image)
  length = kptsA.shape[0]
  for i in range(length):
    correspondence = [(kptsA[i][0], kptsA[i][1]), (kptsB[i][0], kptsB[i][1])]
    draw.line(correspondence, fill=color, width=size)
# ====================================================================================================== # 

# ============================================= MAIN =================================================== # 
# ================= Set image path and cuda device ================= #
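import os  # used only to expand "~", which Image.open does not do by itself
query_path = os.path.expanduser("~/src/RoMa/query.png")
cand_path = os.path.expanduser("~/src/RoMa/cand.png")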

device = "cuda"

# ====================== Create RoMa Model ====================== #
# Create Model 
roma_model = roma_indoor(device=device)

# ====================== Get Output Resolution ====================== #
# Output: 560 x 560  
H, W = roma_model.get_output_resolution() 

# ============ Change image size to fit output resolution ============ #
im1 = Image.open(query_path).resize((W, H))
im2 = Image.open(cand_path).resize((W, H))

# ============ Create image to show the correspondence pairs  ============ #
# Create a new output image that concatenates the two images together
output_img = Image.new("RGB", (im1.width + im2.width, im1.height))
output_img.paste(im1, (0, 0))
output_img.paste(im2, (im1.width, 0))

# ====================== Match Two Images ====================== #
# Get the dense warp and its certainty
# warp size: torch.Size([560, 1120, 4]) & type: <class 'torch.Tensor'>
# certainty: dense certainty map over the warp grid (it only has 10000 entries after sampling)
warp, certainty = roma_model.match(query_path, cand_path, device=device)

# ====================== Sample Matches ====================== #
# Sample matches for estimation
# matches size: torch.Size([10000, 4]) & type: <class 'torch.Tensor'>
# certainty size: torch.Size([10000]) & type: <class 'torch.Tensor'>
matches, certainty = roma_model.sample(warp, certainty)

# ====================== Get Feature Points ====================== #
# Convert to pixel coordinates (RoMa produces matches in [-1,1]x[-1,1])
# kptsA size: torch.Size([10000, 2]) & type: <class 'torch.Tensor'>
# kptsB size: torch.Size([10000, 2]) & type: <class 'torch.Tensor'>
kptsA, kptsB = roma_model.to_pixel_coordinates(matches, H, W, H, W)

# ====================== Estimate Fundamental Matrix ====================== #
# Find a fundamental matrix (or anything else of interest)
F, mask = cv2.findFundamentalMat(
    kptsA.cpu().numpy(), kptsB.cpu().numpy(), ransacReprojThreshold=0.2, method=cv2.USAC_MAGSAC, confidence=0.999999, maxIters=10000
)

# ================= Convert PIL image channel to RGB  ================= #
image1_pil = Image.fromarray(np.uint8(im1)).convert('RGB')
image2_pil = Image.fromarray(np.uint8(im2)).convert('RGB')

# ================= Select inlier points  ================= #
# We select only inlier points
kptsA = kptsA[mask.ravel()==1]
kptsB = kptsB[mask.ravel()==1]

# ================= Draw Feature Points  ================= #
draw_keypoints_on_image(image1_pil, np.array(kptsA.cpu()))
draw_keypoints_on_image(image2_pil, np.array(kptsB.cpu()))

# use numpy to convert the pil_image into a numpy array
numpy_image1 = np.array(image1_pil)  
numpy_image2 = np.array(image2_pil)  

# convert to an OpenCV image and convert from RGB to BGR format
cv_image1 = cv2.cvtColor(numpy_image1, cv2.COLOR_RGB2BGR)
cv_image2 = cv2.cvtColor(numpy_image2, cv2.COLOR_RGB2BGR)

# ================= Move feature points for plotting correspondence pair  ================= #
# Get each image row & column
rows1, cols1 = cv_image1.shape[:2]
rows2, cols2 = cv_image2.shape[:2]
kptsC = np.array(kptsB.cpu()) + [cols1, 0]

# ================= Draw Matching Pairs  ================= #
# Draw matching pair
draw_line_on_image(output_img, np.array(kptsA.cpu()), kptsC)
output_img.show()
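
The conversion from RoMa's normalized [-1, 1] coordinates to pixels used in the script can be sketched in NumPy. This only illustrates the mapping, assuming -1 and 1 correspond to the image borders and that each match row is ordered (xA, yA, xB, yB). It is not the library's implementation, and `to_pixels` is a hypothetical name.

```python
import numpy as np

def to_pixels(matches, h_a, w_a, h_b, w_b):
    """Map rows of (xA, yA, xB, yB) in [-1, 1] to pixel coordinates."""
    matches = np.asarray(matches, dtype=np.float64)
    # x scales with the width and y with the height of the respective image.
    kpts_a = np.stack([w_a / 2 * (matches[:, 0] + 1),
                       h_a / 2 * (matches[:, 1] + 1)], axis=-1)
    kpts_b = np.stack([w_b / 2 * (matches[:, 2] + 1),
                       h_b / 2 * (matches[:, 3] + 1)], axis=-1)
    return kpts_a, kpts_b

# Image centre in A and top-left corner in B, for 560 x 560 images.
kpts_a, kpts_b = to_pixels([[0.0, 0.0, -1.0, -1.0]], 560, 560, 560, 560)
print(kpts_a, kpts_b)
```

Because the points come out as (x, y), shifting the second image's points by `[cols1, 0]` is enough to place them on the right half of the side-by-side canvas.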

Reference Site
  [ https://github.com/datitran/object_detector_app/blob/master/object_detection/utils/visualization_utils.py ]
  [ http://www.gisdeveloper.co.kr/?p=6922 ]
  [ https://www.zinnunkebi.com/python-opencv-pil-convert/ ]
  [ https://www.geeksforgeeks.org/python-pil-imagedraw-draw-line/ ]
  [ https://note.nkmk.me/en/python-pillow-concat-images/ ]

[Experiments]

Three challenging image pairs with different viewpoints were prepared
- Test Pair 1

Query Image	Candidate Image

- Test Pair 2 (Challenging !)

Query Image	Candidate Image

- Test Pair 3

Query Image	Candidate Image

[Result]

The desktop used for testing has an NVIDIA RTX 3060, and running RoMa consumes about 5.5 GB of GPU memory (somewhat heavy).
The results below are plotted with the pre-trained model provided by RoMa (no additional training on this dataset).
Without Changing Output Resolution: feature matching is run at the original image size
- Keypoint Detection

- Inlier Keypoint using cv2.findFundamentalMat

- Feature Matching Result

Draw All inliers

Draw 50 inlier pairs

Draw 100 inlier pairs

Change Output Resolution: images are resized to the output resolution provided by RoMa before matching
- Keypoint Detection

- Inlier Keypoint using cv2.findFundamentalMat

- Feature Matching Result

Draw All inliers

Draw 50 inlier pairs

Draw 100 inlier pairs
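
The "Draw 50 / 100 inlier pairs" plots need a subset of the inliers. The post does not show how the subset was chosen, so the sketch below assumes uniform random sampling without replacement. `subsample_pairs` is a hypothetical helper.

```python
import numpy as np

def subsample_pairs(kpts_a, kpts_b, max_pairs, seed=0):
    """Pick at most `max_pairs` corresponding rows from both keypoint arrays."""
    kpts_a = np.asarray(kpts_a)
    kpts_b = np.asarray(kpts_b)
    if len(kpts_a) != len(kpts_b):
        raise ValueError("keypoint arrays must have the same length")
    if len(kpts_a) <= max_pairs:
        return kpts_a, kpts_b
    rng = np.random.default_rng(seed)
    # The same indices are used for both arrays so pairs stay aligned.
    idx = rng.choice(len(kpts_a), size=max_pairs, replace=False)
    return kpts_a[idx], kpts_b[idx]

# Toy data: 200 pairs where B is A shifted by 560 px in x.
demo_a = np.arange(400, dtype=float).reshape(200, 2)
demo_b = demo_a + [560.0, 0.0]
sub_a, sub_b = subsample_pairs(demo_a, demo_b, max_pairs=50)
print(sub_a.shape, sub_b.shape)
```

The two returned arrays can be passed to `draw_line_on_image` in place of the full inlier sets.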

[Ablation Study]

Image size is not changed.

Number of inliers per test pair for each ransacReprojThreshold:

Case	ransacReprojThreshold	Test Pair 1	Test Pair 2	Test Pair 3
1	0.2	2978	512	697
2	0.1	1596	312	323
3	0.05	855	158	175
4	0.01	146	31	25
5	0.005	89	15	19
6	0.001	18	0	0

In Case 6, test pairs 2 and 3 have no matching pairs at all.

Test pair 2 is very challenging, and as expected, feature matching is not easy even for this model.
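
The drop in inliers follows from what the threshold means: OpenCV documents `ransacReprojThreshold` as the maximum distance in pixels from a point to its epipolar line, so values far below one pixel are very strict. The sketch below illustrates the effect on synthetic data, using the Sampson distance as a stand-in error measure. The metric OpenCV applies internally for `USAC_MAGSAC` may differ, and the fundamental matrix and points here are made up for illustration.

```python
import numpy as np

def sampson_distance(F, pts_a, pts_b):
    """First-order geometric error of each correspondence with respect to F."""
    ones = np.ones((len(pts_a), 1))
    xa = np.hstack([pts_a, ones])   # homogeneous points in image A
    xb = np.hstack([pts_b, ones])   # homogeneous points in image B
    Fxa = xa @ F.T                  # rows are F @ xa_i (epipolar lines in B)
    Ftxb = xb @ F                   # rows are F.T @ xb_i (epipolar lines in A)
    algebraic = np.sum(xb * Fxa, axis=1)   # xb_i^T F xa_i
    denom = Fxa[:, 0] ** 2 + Fxa[:, 1] ** 2 + Ftxb[:, 0] ** 2 + Ftxb[:, 1] ** 2
    return np.sqrt(algebraic ** 2 / denom)

def count_inliers(F, pts_a, pts_b, thresholds):
    """Count correspondences whose error is below each threshold."""
    dist = sampson_distance(F, pts_a, pts_b)
    return {t: int(np.sum(dist < t)) for t in thresholds}

# Synthetic case: pure horizontal camera translation, so matches share the same row.
F_demo = np.array([[0.0, 0.0, 0.0],
                   [0.0, 0.0, -1.0],
                   [0.0, 1.0, 0.0]])
pts_a = np.array([[100.0, 50.0], [200.0, 80.0], [300.0, 120.0],
                  [400.0, 200.0], [450.0, 300.0]])
# Vertical offsets (in pixels) play the role of matching noise.
noise_y = np.array([0.0, 0.02, 0.1, 0.2, 1.0])
pts_b = pts_a + np.stack([np.full(5, 30.0), noise_y], axis=1)

counts = count_inliers(F_demo, pts_a, pts_b, thresholds=[0.2, 0.1, 0.05, 0.01])
print(counts)
```

Each tighter threshold drops the correspondences with the next-largest sub-pixel error, which is the same pattern seen in the table above.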

License: CC BY-NC-ND (Attribution, NonCommercial, NoDerivatives)