Swin transformer handle window size smaller than input size #6222


Conversation

YosuaMichael
Contributor

This PR introduces logic in the swin transformer to shrink the window_size when it is bigger than the input size.
This behaviour follows the original implementation: https://github.com/microsoft/Swin-Transformer/blob/main/models/swin_transformer.py#L192-L195
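The clamping behaviour in the linked implementation is roughly the following (an illustrative sketch with hypothetical names, not the exact torchvision or Microsoft code): along each spatial dimension, if the input is not larger than the window, the window shrinks to the input and shifting is disabled.

```python
from typing import List, Tuple

def clamp_window(
    input_resolution: List[int],
    window_size: List[int],
    shift_size: List[int],
) -> Tuple[List[int], List[int]]:
    """Shrink window_size to input_resolution where needed and
    disable the shift along any clamped dimension."""
    window_size = list(window_size)
    shift_size = list(shift_size)
    for i in range(len(input_resolution)):
        if input_resolution[i] <= window_size[i]:
            window_size[i] = input_resolution[i]
            shift_size[i] = 0
    return window_size, shift_size

# A 6x6 feature map with a shifted 7x7 window falls back to a
# non-shifted 6x6 window:
print(clamp_window([6, 6], [7, 7], [3, 3]))  # ([6, 6], [0, 0])
```

With a sufficiently large input (e.g. the usual 56x56 first-stage feature map), the window and shift are left untouched.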

We introduced this change in a commit in #6088. However, since we wanted to cherry-pick that PR, and the main goal of #6088 was to adapt the components so they could be reused in the 3d swin transformer, we decided to revert this change to lower the risk during the release.

However, since this change can be useful, especially if we want to adapt the components to swin transformer v2 (as commented by @xiaohu2015 in #6088 (comment)), we reintroduce it in this PR.

We have tested this change by running the validation script:

python -u ~/script/run_with_submitit.py \
    --timeout 3000 --nodes 1 --ngpus 4 --batch-size=1 \
    --partition train --model <model_variant> \
    --data-path="/datasets01_ontap/imagenet_full_size/061417" \
    --weights="<weight_variant>_Weights.IMAGENET1K_V1" \
    --test-only

where model_variant can be swin_t, swin_s, or swin_b, and weight_variant can be Swin_T, Swin_S, or Swin_B.
Here are the validation results:

  • swin_t, Swin_T : Acc@1 81.472 Acc@5 95.780
  • swin_s, Swin_S: Acc@1 83.186 Acc@5 96.366
  • swin_b, Swin_B: Acc@1 83.584 Acc@5 96.636

Contributor

@datumbox datumbox left a comment


Thanks @YosuaMichael. I left a couple of comments, let me know what you think.

Edit: also the failing tests seem related.

@YosuaMichael YosuaMichael marked this pull request as draft July 3, 2022 14:32
@YosuaMichael
Contributor Author

Investigating the failing tests: the changes affect the result when the input size is 112 or smaller.
When I tried different resize_size and crop_size settings on ImageNet, I got the following results:

resize_size=256, crop_size=112
- [Original] Acc@1 66.318 Acc@5 85.934
- [This PR] Acc@1 64.236 Acc@5 84.412

resize_size=128, crop_size=112
- [Original] Acc@1 69.248 Acc@5 88.824
- [This PR] Acc@1 67.060 Acc@5 87.218

As we can see, this PR reduces the accuracy significantly.

Looking into the reason: it is because relative_position_index depends on the window_size. I think this is why the original implementation performs the window_size update in the constructor, at the cost of requiring input_resolution in the constructor as well (which means the model can only accept a fixed input size).
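The coupling can be seen from the size of the learned relative position bias table: for a Wh x Ww window it has (2*Wh - 1) * (2*Ww - 1) entries, and relative_position_index indexes into it. A small sketch (illustrative, not the torchvision code) shows why a window clamped at run time no longer matches pretrained weights:

```python
def bias_table_len(window_size):
    """Number of learned relative position bias entries for a
    (Wh, Ww) attention window: one per possible relative offset."""
    wh, ww = window_size
    return (2 * wh - 1) * (2 * ww - 1)

print(bias_table_len((7, 7)))  # 169: table size for the pretrained 7x7 window
print(bias_table_len((6, 6)))  # 121: table size after clamping to a 6x6 window
```

Since the pretrained table has 169 entries but a clamped 6x6 window needs indices for a 121-entry table, the bias lookup trained at window size 7 no longer lines up, which is consistent with the accuracy drop above.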

Given this, I think the change needs more discussion, and if we really want to implement it like the original implementation, it would be better to have a stronger reason, such as implementing Swin Transformer V2. Therefore, I will close this PR for now.

@xiaohu2015
Contributor

xiaohu2015 commented Jul 9, 2022

@YosuaMichael I think we can make a modification to the model:
window_size could be either List[int] or List[List[int]]. With List[int], the window size of all blocks is the same (e.g. window size 7 when the input size is 224). With List[List[int]], a different window size can be chosen for each block ([[7, 7], [7, 7], [7, 7], [6, 6]] for input size 192, [[12, 12], [12, 12], [12, 12], [6, 6]] for 384). I mean that the user should make a choice about what they want. The original code requires input_resolution; here we would require window_size. I think the two approaches can have the same effect.

cc @datumbox
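The suggested API could normalize both forms to a per-block list up front (a hypothetical helper sketching the idea, not an existing torchvision function):

```python
from typing import List, Union

def normalize_window_size(
    window_size: Union[List[int], List[List[int]]],
    num_blocks: int,
) -> List[List[int]]:
    """Expand window_size to one [Wh, Ww] entry per block."""
    if window_size and isinstance(window_size[0], int):
        # List[int]: the same window for every block, e.g. [7, 7]
        return [list(window_size) for _ in range(num_blocks)]
    # List[List[int]]: an explicit window per block,
    # e.g. [[7, 7], [7, 7], [7, 7], [6, 6]] for input size 192
    assert len(window_size) == num_blocks
    return [list(ws) for ws in window_size]

print(normalize_window_size([7, 7], 4))
# [[7, 7], [7, 7], [7, 7], [7, 7]]
```

This keeps the constructor free of input_resolution: the user picks windows that fit their intended input size, and the blocks never resize windows behind the user's back.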

@xiaohu2015 xiaohu2015 mentioned this pull request Jul 9, 2022
@YosuaMichael
Contributor Author

Hi @xiaohu2015, sorry for the late reply! I am trying to understand your suggestion of specifying window_size per layer. In this case, is it correct to say that the user needs to retrain the model in order to change the window_size?
For instance, if we train our model on input size 224, and a user then wants an input of size 384, they would need to change window_size, so the pretrained weights could not be loaded and we would need separate weights for input size 384?

@xiaohu2015
Contributor

@YosuaMichael yes.
