UTF-8 Validation

Problem Link: https://leetcode.com/problems/utf-8-validation/

Problem Statement

Given an integer array data representing the data, return whether it is a valid UTF-8 encoding (i.e. it translates to a sequence of valid UTF-8 encoded characters).

A character in UTF-8 can be from 1 to 4 bytes long, subject to the following rules:

  1. For a 1-byte character, the first bit is a 0, followed by its Unicode code.
  2. For an n-byte character, the first n bits are all 1's, the (n + 1)-th bit is a 0, followed by n - 1 bytes whose 2 most significant bits are 10.

This is how the UTF-8 encoding would work:

     Number of Bytes   |        UTF-8 Octet Sequence
                       |              (binary)
   --------------------+-----------------------------------------
            1          |   0xxxxxxx
            2          |   110xxxxx 10xxxxxx
            3          |   1110xxxx 10xxxxxx 10xxxxxx
            4          |   11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

x denotes a bit in the binary form of a byte that may be either 0 or 1.

Note: The input is an array of integers. Only the least significant 8 bits of each integer are used to store the data. This means each integer represents only 1 byte of data.

Example 1:

Input: data = [197,130,1]

Output: true

Explanation: data represents the octet sequence: 11000101 10000010 00000001. It is a valid UTF-8 encoding for a 2-byte character followed by a 1-byte character.

Example 2:

Input: data = [235,140,4]

Output: false

Explanation: data represents the octet sequence: 11101011 10001100 00000100. The first 3 bits are all 1's and the 4th bit is a 0, which means it is a 3-byte character. The next byte is a continuation byte that starts with 10, which is correct. But the second continuation byte does not start with 10, so it is invalid.

Disclaimer: Before proceeding to the solution, it is recommended that you try the problem yourself first!

Intuition

As mentioned in the problem statement, every character in UTF-8 is 1 to 4 bytes long, and only the least significant 8 bits of each integer are used to store the data. Therefore, we can check the leading bits of the first integer to determine how many bytes the UTF-8 octet sequence uses, and then check the leading bits of the remaining integers to see if they follow the conditions given in the problem statement.

One important thing to note is that the integers can be larger than 255, even though the highest number that can be represented by 8 bits is 255. How do we handle such integers? According to the problem statement, only the 8 least significant bits of each integer store the data, so we simply mask each integer down to its lowest 8 bits before checking it.
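
For example, the value 453 does not fit in 8 bits, but masking it with 0xFF keeps only its lowest 8 bits, and right shifts then expose the leading bits we need to inspect. A minimal Python sketch (the values here are just for illustration):

num = 0b111000101           # 453: more than 8 bits
byte = num & 0xFF           # keep only the least significant 8 bits
print(bin(byte))            # 0b11000101 (197)
print(byte >> 5 == 0b110)   # True: leading bits 110, so this starts a 2-byte character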

Approach

  • There are five kinds of valid byte types: 0xxxxxxx, 10xxxxxx, 110xxxxx, 1110xxxx, and 11110xxx. Looking closely at them, the position of the first 0 bit determines the byte type.
  • Starting at index 0, we right-shift each integer so that only its leading bits remain, and compare the result against these patterns to determine its type (see the sketch after this list).
  • Based on the type of the current byte, we check that the following bytes are valid continuation bytes, and repeat until all the integers in the array are covered.
  • If every byte is valid, we return true; otherwise, we return false.
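
A minimal sketch of this classification step in Python (the helper name classify is just for illustration and is not part of the final solution):

def classify(byte):
    # Returns the total length of the character that starts with this byte,
    # 0 if it is a continuation byte (10xxxxxx), or -1 if it is invalid.
    if byte >> 7 == 0b0:
        return 1
    if byte >> 6 == 0b10:
        return 0
    if byte >> 5 == 0b110:
        return 2
    if byte >> 4 == 0b1110:
        return 3
    if byte >> 3 == 0b11110:
        return 4
    return -1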

Let us try our approach on an example:

Example 1

data = [197, 130, 1]


Octet sequence for the above array is:
11000101 10000010 00000001

According to the problem statement, for an n-byte UTF-8 character, the first n bits are 1's, followed by a 0 in the (n + 1)-th bit. Then, the next n - 1 bytes all have 10 as their 2 most significant bits.

[1 1 0] 0 0 1 0 1
 ↑   ↑

The 2 most significant bits of this byte are 1's and they are followed by a 0. Therefore it is a valid starting byte of a 2-byte UTF-8 character. Thus, the next byte (the next integer) in the sequence must follow the pattern 10xxxxxx.

Now, examining the first and next byte we get:

[1 1 0] 0 0 1 0 1    [1 0] 0 0 0 0 1 0
 ↑   ↑                ↑ ↑

As we can see, the second byte starts with 10, so the two bytes combine to form a valid 2-byte UTF-8 character. Since there are more elements left in the array, we move forward and check the remaining integers in the same way.

Let’s look at the binary representation for the next integer which is 1:

00000001

Since the most significant bit of this number is a 0, the only rule it can satisfy is the 1-byte UTF-8 character rule.

According to the rule for 1-byte characters, the first (most significant) bit is a 0, followed by its Unicode code.

[0] 0 0 0 0 0 0 1
 ↑

Thus, we can see that all the elements in the array form a valid byte sequence. Therefore, we return True.
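
These bit checks can be verified directly with right shifts (a quick Python sketch using the three bytes from the example above):

print(197 >> 5 == 0b110)   # True: 11000101 starts a 2-byte character
print(130 >> 6 == 0b10)    # True: 10000010 is a valid continuation byte
print(1 >> 7 == 0b0)       # True: 00000001 is a 1-byte character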

Code
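
One possible implementation of the approach above, written as a Python sketch (validUtf8 is the method name used by LeetCode; the counter name remaining is just for illustration):

class Solution:
    def validUtf8(self, data):
        # Number of continuation bytes (10xxxxxx) we still expect to see.
        remaining = 0
        for num in data:
            byte = num & 0xFF              # only the least significant 8 bits matter
            if remaining == 0:
                # This byte must start a new character; its leading bits give the length.
                if byte >> 7 == 0b0:
                    remaining = 0          # 1-byte character
                elif byte >> 5 == 0b110:
                    remaining = 1          # 2-byte character
                elif byte >> 4 == 0b1110:
                    remaining = 2          # 3-byte character
                elif byte >> 3 == 0b11110:
                    remaining = 3          # 4-byte character
                else:
                    return False           # stray continuation byte or 5+ leading 1's
            else:
                # Inside a multi-byte character: this byte must be 10xxxxxx.
                if byte >> 6 != 0b10:
                    return False
                remaining -= 1
        # Valid only if no character is left incomplete.
        return remaining == 0

On the examples above, validUtf8([197, 130, 1]) returns True and validUtf8([235, 140, 4]) returns False.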

Complexity

T(n): O(n) — each integer in the array is examined exactly once.

S(n): O(1) — only a constant amount of extra state (the count of remaining continuation bytes) is kept.

If you like the post, don’t forget to share it with your peers and give your reactions!
