
FOMO Self-Attention

Computer Vision, Artificial Intelligence, Machine Learning

One approach to incorporate spatial information into FOMO is to include a minimal transformer block between the feature extractor and the classifier.

Mat Kelcey

May 31, 2023

FOMO is a deep learning architecture for running object detection on constrained devices, developed by Edge Impulse.

It combines an image feature extractor (by default, a truncated section of MobileNet-V2) with a fully convolutional classifier to frame object detection as object classification over a grid, run in parallel.

Read more about FOMO in the Edge Impulse documentation: https://docs.edgeimpulse.com/docs/edge-impulse-studio/learning-blocks/object-detection/fomo-object-detection-for-constrained-devices 
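To make the "classification over a grid" framing concrete: for a 160x160 input the network outputs a (20, 20, num_classes) map of per-cell class scores. The toy sketch below shows one way such a map could be read out; it is purely illustrative and is not Edge Impulse's actual post-processing (which works with object centroids).

# Toy illustration of reading a FOMO-style output grid; not the actual
# Edge Impulse post-processing. Assumes class 0 is "background".
import numpy as np

def cells_above_threshold(logits, threshold=0.5):
    """logits: (grid_h, grid_w, num_classes) array of raw per-cell scores."""
    # Softmax over the class dimension to get per-cell probabilities
    exp = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs = exp / exp.sum(axis=-1, keepdims=True)
    detections = []
    for y in range(probs.shape[0]):
        for x in range(probs.shape[1]):
            class_idx = int(probs[y, x].argmax())
            if class_idx != 0 and probs[y, x, class_idx] > threshold:
                detections.append((x, y, class_idx, float(probs[y, x, class_idx])))
    return detections  # list of (cell_x, cell_y, class, score)

# e.g. a 20x20 grid with 2 classes (background + one object class)
print(cells_above_threshold(np.random.randn(20, 20, 2)))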

Localized object detection

Consider an example of detecting different colored cubes on a conveyor belt: https://studio.edgeimpulse.com/public/230968/latest

FOMO does very well at this class of problem; it's a perfect fit when object detection can be framed as object classification over a grid, since all the information required for each classification is local to that section of the input.

Fun fact: This is the project FOMO was originally developed on!

Performance on validation data during training
Prediction example from held-out test data

Spatially aware object detection

Now consider the same set of input images, but instead of predicting based on color, we focus on just the green cube and try to predict whether it is to the left or to the right of the other cubes.

Example of left vs right labeling of the green cube

This is not something FOMO can do well. While it is able to identify background vs the green cube in general, it doesn't know whether the cube is to the left or right of the others, so it just labels them all as left.

Example of labeling everything as left

To make this work we need to add some spatial understanding to FOMO, but before we do let's review how FOMO is structured.

Architecture review

By default, the FOMO architecture consists of two fundamental pieces:

  • a MobileNet-based feature extractor (though any feature extractor can be used), and
  • a fully convolutional classifier.

For the sake of a concrete example, let's consider a specific configuration of:

  • an image input of size (160, 160, 3)
  • a MobileNet-V2 (truncated) backbone with alpha=0.35
  • the default configuration for the FOMO fully convolutional classifier (i.e. a dense connection with 32 units)
  • and a binary classifier output

In this scenario, we have the following intermediate shapes and learnable parameters.

 layer                     tensor shape         #parameters  comment
 input                     (None, 160, 160, 3)  0            input
 MobileNet-V2 backbone     (None, 20, 20, 96)   15,840       image feature extractor
 dense (Conv2D)            (None, 20, 20, 32)   3,104        fully convolutional classifier
 logits (Conv2D)           (None, 20, 20, 2)    66           fully convolutional classifier     
Performance metrics for standard FOMO model
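For reference, this default configuration can be sketched in Keras roughly as follows. This is a sketch only, written in the style of the Expert Mode snippet shown later in this post; the wrapper function name is illustrative and it assumes the truncated MobileNet-V2 backbone is available as mobile_net_v2.

# Sketch of the default FOMO head (no attention). The function name and the
# `mobile_net_v2` argument are illustrative; in Expert Mode the backbone is
# constructed for you, as in the snippet later in this post.
from tensorflow.keras.layers import Conv2D
from tensorflow.keras.models import Model

def build_baseline_fomo(mobile_net_v2, num_classes=2):
    # Cut MobileNet where it hits 1/8th input resolution; (160, 160, 3) -> (20, 20, 96)
    cut_point = mobile_net_v2.get_layer('block_6_expand_relu')
    x = cut_point.output

    # Fully convolutional classifier: 1x1 convolutions keep the 20x20 grid
    x = Conv2D(filters=32, kernel_size=1, strides=1,
               activation='relu', name='head')(x)         # 96*32 + 32 = 3,104 params
    logits = Conv2D(filters=num_classes, kernel_size=1, strides=1,
                    activation=None, name='logits')(x)    # 32*2 + 2 = 66 params
    return Model(inputs=mobile_net_v2.input, outputs=logits)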

Integrating spatial information with a micro transformer

One effect of acting fully convolutionally on the extracted feature map is that there is no integration of information over the entire input; a detection in an area on "one side" of the input is no different from one on the "other side".

For many problems this is the desired behavior, and it was a core FOMO design decision, but as we've just seen in the previous example it isn't always what we want. Sometimes we need to include some signal that incorporates where in the image the detection is occurring in relation to other objects.

One approach to incorporate this spatial information is to include a minimal transformer block between the feature extractor and the classifier. To support this we've added some custom layers to our libs, available in Expert Mode: WithSpatialPositionalEncodings and PatchAttention.

WithSpatialPositionalEncodings takes a feature map of shape (batch, height, width, channels) and adds a spatial form of positional encodings, as additional channels, across the entire feature map. These positional encodings are a 2D form of the usual sin/cosine encodings, so the model can use them to incorporate spatial information. Adding this kind of positional encoding is standard before self-attention, since self-attention would otherwise be permutation invariant.
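As an illustration of the idea, here is a sketch of standard 2D sinusoidal positional encodings for a feature map. It is not the WithSpatialPositionalEncodings implementation itself, which may differ in detail; the tables below show it as a zero-parameter, shape-preserving layer, so in this sketch the encodings are simply added to the existing channels.

# Sketch of 2D sinusoidal positional encodings for a (batch, H, W, C) feature map.
# Illustrative only; the actual WithSpatialPositionalEncodings layer may differ.
import numpy as np
import tensorflow as tf

def spatial_positional_encodings(height, width, channels):
    # Half the channels encode the row position, half the column position,
    # each with the usual interleaved sin/cos at geometrically spaced frequencies.
    def encode_1d(length, dim):
        positions = np.arange(length)[:, None]                          # (length, 1)
        freqs = np.exp(-np.log(10000.0) * np.arange(0, dim, 2) / dim)   # (dim/2,)
        angles = positions * freqs[None, :]                             # (length, dim/2)
        enc = np.zeros((length, dim), dtype=np.float32)
        enc[:, 0::2] = np.sin(angles)
        enc[:, 1::2] = np.cos(angles)
        return enc

    dim = channels // 2
    row_enc = encode_1d(height, dim)   # (H, dim)
    col_enc = encode_1d(width, dim)    # (W, dim)
    enc = np.concatenate([
        np.repeat(row_enc[:, None, :], width, axis=1),   # rows -> first half of channels
        np.repeat(col_enc[None, :, :], height, axis=0),  # cols -> second half of channels
    ], axis=-1)
    return tf.constant(enc[None, ...])  # (1, H, W, C), broadcastable over the batch

# e.g. added to a (batch, 20, 20, 32) feature map:
# features = features + spatial_positional_encodings(20, 20, 32)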

PatchAttention converts the 3D spatial array of (height, width, channels) into a 2D array of (height*width, channels) and then applies MultiHeadAttention.
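The same behaviour can be sketched with stock Keras layers. Again, this is an illustrative approximation rather than the PatchAttention implementation itself; a single attention head is assumed here.

# Sketch of PatchAttention-style self-attention over the grid cells.
# Illustrative only; the actual PatchAttention layer may differ in detail.
from tensorflow.keras.layers import MultiHeadAttention, Reshape

def patch_attention_sketch(feature_map, key_dim=8):
    # feature_map: (batch, H, W, C)
    _, h, w, c = feature_map.shape

    # Flatten the grid so every cell becomes one "token" of dimension C
    tokens = Reshape((h * w, c))(feature_map)                          # (batch, H*W, C)

    # Standard self-attention across all cells
    attended = MultiHeadAttention(num_heads=1, key_dim=key_dim)(tokens, tokens)

    # Restore the spatial layout for the fully convolutional classifier
    return Reshape((h, w, c))(attended)                                # (batch, H, W, C)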

These two layers can be incorporated in a number of ways, but the most straightforward is to add them to FOMO as a branch. Using a branch keeps the main flow from the feature extractor to the classifier intact, while allowing the optimization to mix in extra spatial information from the attention block as required. Note: we include an aggressive dropout at the start of this branch since self-attention can overfit.

    # Cut MobileNet where it hits 1/8th input resolution; i.e. (HW/8, HW/8, C)
    cut_point = mobile_net_v2.get_layer('block_6_expand_relu')
    model = cut_point.output

    # Now attach a small additional head on the MobileNet
    model = Conv2D(filters=32, kernel_size=1, strides=1,
                   activation='relu', name='head')(model)

    # Branch, attend (with positional encodings), 
    # and recombine as residual connection
    from ei_tensorflow.self_attention import WithSpatialPositionalEncodings
    from ei_tensorflow.self_attention import PatchAttention
    from tensorflow.keras.layers import Dropout, Add
    branch = Dropout(rate=0.5)(model)
    branch = WithSpatialPositionalEncodings()(branch)
    branch = PatchAttention(key_dim=8)(branch)
    model = Add()([model, branch])

    # And finally a classifier layer
    logits = Conv2D(filters=num_classes, kernel_size=1, strides=1,
                    activation=None, name='logits')(model)
    return Model(inputs=mobile_net_v2.input, outputs=logits)

This results in a model with the following intermediate shapes and learnable parameters.

 layer                               tensor shape         #parameters  comment
 input                               (None, 160, 160, 3)  0            input
 MobileNet-V2 backbone               (None, 20, 20, 96)   15,840       image feature extractor
 dense (Conv2D)                      (None, 20, 20, 32)   3,104        
 dropout                             (None, 20, 20, 32)   0            self attention branch
 with_spatial_positional_encodings   (None, 20, 20, 32)   0            self attention branch
 patch_attention                     (None, 20, 20, 32)   1,144        self attention branch
 logits (Conv2D)                     (None, 20, 20, 3)    99           fully convolutional classifier

This gives a bump in performance: the model now gets 4 of 5 "left" instances and 1 of 2 "right" instances in the validation data.

A public version of this project is available here: https://studio.edgeimpulse.com/public/230985/latest

Improved performance on validation data

It also now correctly identifies two examples in held-out test data.

Correctly classified test set held-out examples

The performance comes with a cost, though, primarily around the extra allocation required for the positional encodings.

Performance metrics for FOMO model with self-attention

If we wanted to tune further, one key option is to control the amount of information passed to the self-attention block with a projection. For example, projecting down to a lower dimensionality at the start of the branch, then back up to the original dimensionality at the end, results in an overall reduction in parameters, even though it introduces two extra layers for the projections. In some cases forcing the self-attention to work with a lower-dimensional representation acts as a better regulariser than dropout.

    # Branch, attend (with positional encodings), 
    # and recombine as residual connection
    from ei_tensorflow.self_attention import WithSpatialPositionalEncodings, PatchAttention
    from tensorflow.keras.layers import Add
    def projection(dim):
        return Conv2D(filters=dim, kernel_size=1, strides=1, 
                      kernel_initializer='orthogonal',
                      use_bias=False)
    branch = projection(dim=4)(model)
    branch = WithSpatialPositionalEncodings()(branch)
    branch = PatchAttention(key_dim=16)(branch)
    branch = projection(dim=32)(branch)
    model = Add()([model, branch])

 layer                               tensor shape         #parameters  comment
 input                               (None, 160, 160, 3)  0            input
 MobileNet-V2 backbone               (None, 20, 20, 96)   15,840       image feature extractor
 dense (Conv2D)                      (None, 20, 20, 32)   3,104        
 dense (Conv2D)                      (None, 20, 20, 4)    128          down projection
 with_spatial_positional_encodings   (None, 20, 20, 4)    0            self attention branch
 patch_attention                     (None, 20, 20, 4)    316          self attention branch
 dense (Conv2D)                      (None, 20, 20, 32)   128          up projection
 logits (Conv2D)                     (None, 20, 20, 3)    99           fully convolutional classifier

In this case, the two projections add 256 learnable parameters, but the attention parameter count drops from 1,144 to 316, so overall it's a net saving of 572 parameters.

Getting started

The easiest way to get started with this idea is to visit the public project linked above and use the Expert Mode code as a starting point. Let us know if there are cases where you see an interesting result!
