SAM2: Model Structure

The Segment Anything Model (SAM) has emerged as a fascinating and highly effective approach to large models in computer vision. To my surprise, the large-model era of computer vision starts with segmentation!

Here I am sharing my reading notes to demystify the SAM model structure, walking through its components: the image encoder, the prompt encoder, and the mask decoder. Let’s go!

SAM

  • Link: https://ai.meta.com/research/publications/segment-anything/
  • GitHub: https://github.com/facebookresearch/segment-anything

Model Structure

Input

The inputs are an image plus one or more prompts: points, boxes, and/or a coarse mask.

Step 1: Encode Image

SAM provides predictor.set_image to encode the input image. The image encoder (a ViT backbone) runs once per image, and the resulting embedding is cached so that many prompts can be answered against the same image cheaply.
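Under the hood, set_image resizes the image so its longest side matches the model’s input resolution (1024 for SAM), normalizes it with fixed pixel statistics, and zero-pads it to a square before the ViT runs. Below is a minimal numpy sketch of that preprocessing — the helper name is mine, and nearest-neighbor resize stands in for the bilinear resize the real code uses:

```python
import numpy as np

def preprocess(image: np.ndarray, target_length: int = 1024) -> np.ndarray:
    """Simplified sketch of the preprocessing inside predictor.set_image:
    resize longest side to target_length, normalize, pad to a square."""
    h, w = image.shape[:2]
    scale = target_length / max(h, w)
    new_h, new_w = int(round(h * scale)), int(round(w * scale))

    # Nearest-neighbor resize (placeholder for the real bilinear resize).
    rows = (np.arange(new_h) / scale).astype(int).clip(0, h - 1)
    cols = (np.arange(new_w) / scale).astype(int).clip(0, w - 1)
    resized = image[rows][:, cols].astype(np.float32)

    # Normalize with SAM's default pixel statistics (RGB).
    mean = np.array([123.675, 116.28, 103.53])
    std = np.array([58.395, 57.12, 57.375])
    normed = (resized - mean) / std

    # Zero-pad bottom/right to a square target_length x target_length.
    padded = np.zeros((target_length, target_length, 3), dtype=np.float32)
    padded[:new_h, :new_w] = normed
    return padded
```

Because this runs once per image, switching prompts afterwards is cheap — only the lightweight decoder has to re-run.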

Step 2: Predict

predictor.predict takes the prompts (points, boxes, or a mask), their labels, and the cached image features, and returns the predicted masks together with their IoU scores.

Step 2-1: Preprocess Prompt

predictor.predict first converts the numpy point/box prompts to torch tensors and maps their coordinates into the resized image frame.
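The key transform here is mapping prompt coordinates from the original image frame into the resized frame the encoder saw. A simplified numpy version of the repo’s ResizeLongestSide.apply_coords — my reimplementation, slightly simplified from the rounding the real code does:

```python
import numpy as np

def apply_coords(coords: np.ndarray, original_size: tuple,
                 target_length: int = 1024) -> np.ndarray:
    """Map prompt (x, y) coordinates from the original image frame to
    the resized frame whose longest side is target_length."""
    old_h, old_w = original_size
    scale = target_length / max(old_h, old_w)
    new_coords = coords.astype(np.float32).copy()
    new_coords[..., 0] *= scale  # x
    new_coords[..., 1] *= scale  # y
    return new_coords
```

Box prompts go through the same transform, treated as two corner points.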

Step 2-2: Predict on Torch Tensors

After prompt preprocessing, predictor.predict_torch passes the inputs through the rest of the model.
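The flow inside predict_torch can be sketched as three calls in sequence. The function names below are my own stand-ins for the real submodules (model.prompt_encoder and model.mask_decoder), and tiny random numpy tensors replace the real activations:

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_prompts(coords, labels, dim=8):
    # Stand-in for model.prompt_encoder: one sparse embedding per point.
    return rng.standard_normal((coords.shape[0], dim))

def decode_masks(image_embedding, sparse_embeddings, n_masks=3):
    # Stand-in for model.mask_decoder: low-res mask logits + IoU scores.
    return rng.standard_normal((n_masks, 64, 64)), rng.random(n_masks)

def predict_torch_sketch(image_embedding, point_coords, point_labels):
    sparse = encode_prompts(point_coords, point_labels)    # Step 2-2-1
    masks, ious = decode_masks(image_embedding, sparse)    # Step 2-2-2
    return masks > 0.0, ious                               # Step 2-2-3: threshold
```

The three sub-steps below look at each of these calls in turn.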

Step 2-2-1: Prompt Encoder

model.prompt_encoder is called to encode the prompts: points and boxes become sparse embeddings, while mask prompts become dense embeddings.
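For point prompts, the encoder is essentially a positional encoding plus a learned embedding per label type (foreground vs. background). A numpy sketch of the random-Fourier positional encoding (PositionEmbeddingRandom in the repo) — note the Gaussian matrix is fixed at model init in the real model, and dim=256 matches SAM’s embedding width:

```python
import numpy as np

def point_embedding(coords: np.ndarray, image_size: int = 1024,
                    dim: int = 256, seed: int = 0) -> np.ndarray:
    """Random-Fourier positional encoding for point prompts, sketched:
    normalize coords, project with a fixed Gaussian matrix, take sin/cos."""
    rng = np.random.default_rng(seed)
    gauss = rng.standard_normal((2, dim // 2))  # fixed at init in the real model
    x = coords / image_size        # normalize to [0, 1]
    x = 2.0 * x - 1.0              # shift to [-1, 1]
    proj = 2.0 * np.pi * (x @ gauss)
    return np.concatenate([np.sin(proj), np.cos(proj)], axis=-1)
```

On top of this positional code, the real encoder adds a learned per-label embedding, which is what lets the decoder distinguish “include this point” from “exclude this point.”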

Step 2-2-2: Mask Decoder

model.mask_decoder combines the encoded image and the prompt embeddings to decode the output masks, along with a predicted IoU score for each mask.
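The decoder’s final step is easy to sketch: after a small two-way transformer has updated a set of mask tokens against the image embedding, each token is dot-producted with the upscaled image embedding to produce one mask of logits, and a separate MLP on a dedicated IoU token scores each mask. The spatial part, in numpy (shapes are illustrative, not SAM’s real ones):

```python
import numpy as np

def decode(image_embedding: np.ndarray, mask_tokens: np.ndarray) -> np.ndarray:
    """Sketch of the last step of model.mask_decoder: each
    transformer-updated mask token acts as a per-mask filter applied
    to the image embedding.

    image_embedding: (C, H, W); mask_tokens: (n_masks, C).
    Returns mask logits of shape (n_masks, H, W).
    """
    c, h, w = image_embedding.shape
    # (n_masks, C) @ (C, H*W) -> (n_masks, H*W) -> (n_masks, H, W)
    return (mask_tokens @ image_embedding.reshape(c, h * w)).reshape(-1, h, w)
```

Predicting several masks at once (multimask_output=True) is how SAM handles ambiguous prompts — the IoU scores let the caller pick the best one.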

Step 2-2-3: Result Postprocess

Finally, the low-resolution mask logits are upscaled back to the original image size and thresholded into binary masks.
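A simplified sketch of this last step, assuming a single low-res mask of logits — the real code resizes bilinearly with torch and also crops away the padding added during preprocessing, and it thresholds at mask_threshold = 0.0; nearest-neighbor resize stands in here:

```python
import numpy as np

def postprocess_mask(low_res_mask: np.ndarray, original_size: tuple,
                     threshold: float = 0.0) -> np.ndarray:
    """Upscale low-res mask logits to the original image size and
    threshold them into a binary mask."""
    h, w = original_size
    lh, lw = low_res_mask.shape
    # Nearest-neighbor upscale (placeholder for bilinear interpolation).
    rows = (np.arange(h) * lh // h).clip(0, lh - 1)
    cols = (np.arange(w) * lw // w).clip(0, lw - 1)
    upscaled = low_res_mask[rows][:, cols]
    return upscaled > threshold
```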

That’s all. Thanks for reading!