Normalizing input data is generally good practice, as it can improve the convergence rate and the overall performance of your model. That said, the Transformer architecture already applies layer normalization inside each TransformerEncoderLayer, which helps stabilize training even when the input data is not normalized.
However, it’s important to understand that layer normalization operates along the last dimension (d_model) of its input, normalizing each token’s feature vector independently, while input data normalization (using techniques such as min-max scaling or standardization) is computed per feature across the examples in the dataset. These two normalizations serve different purposes and are not interchangeable.
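For concreteness, here is a minimal sketch contrasting the axis each normalization works over (the shapes and variable names below are just illustrative assumptions, not anything from your model):

```python
import torch
import torch.nn as nn

# Illustrative shapes: batch of 32 sequences, 10 tokens each, d_model = 64.
x = torch.randn(32, 10, 64)

# Layer normalization: each token vector is normalized over its own
# d_model features, independently of every other example in the batch.
layer_norm = nn.LayerNorm(64)
per_token = layer_norm(x)                  # statistics over the last dim only

# Input (dataset) standardization: each feature is normalized using
# statistics computed across examples (and here across time steps too).
mean = x.mean(dim=(0, 1), keepdim=True)    # per-feature mean over the data
std = x.std(dim=(0, 1), keepdim=True)
per_feature = (x - mean) / (std + 1e-8)    # different axis, different purpose
```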
Input normalization gives the model features on a consistent scale and removes the bias introduced by differing feature ranges, which is particularly useful when the scales of the input features vary significantly.
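As a quick illustration (the feature values below are invented), standardization and min-max scaling both compute their statistics per feature, across training examples, and those statistics should then be reused for validation/test data:

```python
import torch

# Hypothetical raw features with very different ranges, e.g. a price in the
# thousands next to a ratio between 0 and 1.
train_x = torch.tensor([[1200.0, 0.03],
                        [ 950.0, 0.91],
                        [1480.0, 0.42]])

# Standardization: zero mean, unit variance per feature.
mean = train_x.mean(dim=0)
std = train_x.std(dim=0)
train_scaled = (train_x - mean) / std

# Min-max scaling: map each feature to [0, 1].
min_v, _ = train_x.min(dim=0)
max_v, _ = train_x.max(dim=0)
train_minmax = (train_x - min_v) / (max_v - min_v)
```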
On the other hand, layer normalization in the Transformer encoder ensures that the activations inside the model have a stable distribution, which helps improve training stability and convergence speed. It does not have the same effect as normalizing the input data.
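If you are using PyTorch’s nn.TransformerEncoderLayer, you can see this directly. The sketch below (d_model, nhead, and the tensor shapes are arbitrary choices for illustration) inspects the LayerNorm submodules and passes unnormalized activations through the layer:

```python
import torch
import torch.nn as nn

# A TransformerEncoderLayer contains LayerNorm modules (norm1 and norm2 in
# current PyTorch versions), applied around the self-attention and
# feed-forward sublayers.
layer = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
print(layer.norm1, layer.norm2)   # both are nn.LayerNorm over d_model

x = torch.randn(32, 10, 64)       # even unnormalized inputs get
out = layer(x)                    # layer-normalized inside the block
```

Note that this normalization happens on the activations inside the block; it does nothing about features that arrive on wildly different scales, which is exactly what input normalization addresses.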
In conclusion, while the Transformer architecture includes layer normalization, it is still recommended to normalize the input data to improve model performance and training stability.