Handling Unicode in algorithms is crucial for ensuring accurate text processing, especially when dealing with diverse character sets beyond ASCII. Challenges arise when algorithms naively count words, leading to incorrect results due to the presence of multibyte Unicode characters and different whitespace representations. The outlined approach emphasizes iterating through individual runes rather than bytes to accurately detect word boundaries while counting, thus enhancing both reliability and efficiency. By leveraging the unicode/utf8
package for decoding runes, the algorithm can manage varying buffer sizes and correctly process complex text files, ultimately refining the functionality to accommodate both standard ASCII and Unicode characters. This foundational work sets the stage for future enhancements and optimizations in text processing applications.
As we saw in the last lesson, our algorithm currently only really works with ASCII files, and starts to break down once Unicode is encountered. If you'll remember, we created the following words.txt file, which has the Cyrillic P in the middle of the third word. Additionally, I'm running with a buffer size of 2.
When I go ahead and run this code as follows, you can see it produces a word count of 6, when in fact there are only 5 words. Perhaps more egregious than this, however, is the following example. Here I have a file called utf8.txt, which contains 6 words. However, if I change my code to use this file, still keeping the buffer size at 2, and run it, you can see it produces the value of 1. This is because in between each of these words is a Unicode whitespace character, rather than the standard ASCII space (value 32) that you get when pressing the space bar.
However, these whitespace characters are still valid whitespace, and so we should be producing the value of 6. Of course, if I change the buffer size to 4096 and run this code again, this time it will produce the correct count. However, that's not exactly a good solution, as we will still encounter issues whenever a multibyte character crosses a buffer boundary. This lack of Unicode handling is another problem if we want to distribute our application, as we again won't have any control over the files that our application will be used on. Therefore, we need to go ahead and fix this.
We'll take a look at the easy way in the next lesson, but the reason I'm letting you know it exists is so you don't feel discouraged about the code we're about to write. For this lesson, we're going to take a look at the manual way, just as a point of reference, and to get a better understanding of how we can work with Unicode when it comes to data streams. Let's go ahead and begin with the words.txt
file that we've already created.
In order to fix this, we're going to need to make yet another change to our algorithm, specifically to how we count the number of words inside of our buffer. Instead of casting our buffer into a string and then using the Fields function from the strings package to count the number of words, we're going to iterate through each rune inside, checking to see whether the rune is a space or the start of a new word. The benefit of this approach is that not only will we be able to handle multibyte Unicode values whenever they cross a buffer boundary, but we'll also be improving the efficiency of our algorithm, as we won't have to first cast our data to a string, which isn't a free operation, before then iterating through it to pull out the individual words.
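As a point of reference, here is a simplified sketch of the counting step we're replacing, with the buffer contents hard-coded and the buffer-boundary bookkeeping from the earlier lessons left out:

```go
package main

import (
    "fmt"
    "strings"
)

func main() {
    // Stand-in for the bytes our read loop placed into the buffer.
    buffer := []byte("one two three")

    // The approach we're replacing: cast the bytes to a string, then let
    // strings.Fields split it into words before we count them. The cast
    // allocates, and Fields walks the data a second time.
    words := strings.Fields(string(buffer))
    fmt.Println(len(words)) // prints: 3
}
```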
Instead, we'll be doing everything in one single pass. In order to do that, let's begin by ensuring that our algorithm works for the simple use case, without considering any multibyte runes initially. Therefore, I'm going to go ahead and remove the Cyrillic P from the words.txt file. Our Unicode test file will be the utf8.txt file we've already created. Then, inside the actual code, I'm just going to reduce the size of the editor, and we'll begin by changing it up.
Let's go ahead and remove the initial code that we already have, and start with a sort of blank slate. The idea here is that we want to iterate over each rune inside of our buffer, rather than iterating over each byte. So, how can we do that? The first thing we want to do is to pull out an individual rune from our buffer, which is currently a slice of bytes. If you'll remember, a rune's UTF-8 encoding is of variable size, meaning it can be anywhere between 1 and 4 bytes.
So how do we go about pulling out the first rune from a slice of bytes? Fortunately that's where the unicode/utf8
package comes in, which provides a number of functions for working with UTF-8 encoded text. The function we're interested in is the DecodeRune function, which takes a slice of bytes as its parameter and returns a rune along with its size in bytes. Therefore, let's go ahead and just pull out the first rune to see how this works. To do so, we can capture the r value, which is the rune, and the rSize value, which is the size of the rune, from a call to utf8.DecodeRune, passing in our buffer constrained to the size we've read.
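To see the shape of that call, here's a minimal, self-contained sketch with the buffer contents hard-coded in place of the file read; the variable names follow the lesson, but the rest is an illustration:

```go
package main

import (
    "fmt"
    "unicode/utf8"
)

func main() {
    // Stand-in for the bytes our read loop placed into the buffer; in the
    // real code, size is the number of bytes the last Read call returned.
    buffer := []byte("one two three")
    size := len(buffer)

    // DecodeRune unpacks the first UTF-8 encoding in the slice and returns
    // the rune along with its width in bytes.
    r, rSize := utf8.DecodeRune(buffer[:size])
    fmt.Println(string(r), rSize) // prints: o 1
}
```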
Now let's go ahead and print out the actual rune value, and we'll print out the rSize value as well. Before running this code, let's also tidy up our imports, and since we don't need the isInsideWord value just yet, we're going to ignore it for the moment. Okay, great. Now if I go ahead and run this, the rune is printed as a number, which doesn't make much sense to us, so we need to cast it to a string.
Now, if I go ahead and run it again, you can see that for each buffer, because we're only reading two bytes at a time and only decoding the first rune, we're pulling out the O but missing the N, pulling out the E but missing the space, pulling out the T, and so on. So we are pulling out values from our actual file; basically, we're pulling out the first rune of the buffer for each read. If I go ahead and change the buffer size to 1, this should then pull out every single value, which it does, as each of these runes is only a single byte in length.
As you can see, by using a buffer size of 1, we're able to pull out every single rune from our actual file. So let's leave it as 1 for the moment and go ahead and perform our actual word count check. The logic is going to be very similar to what we had before, when we were checking to see if we were inside of a word, and if so, not increasing the count when another word character was detected. This is pretty much going to be the foundation of the current algorithm, which is basically going to look like this: if the rune is not a space and we're not inside a word, increment the word count.
Then isInsideWord is set to the opposite of whether the rune is a space. This is a bit of pseudocode, but it's pretty much the algorithm we actually want. All we need to do is use the unicode.IsSpace function in both places, and we're actually good to go. Let's ignore the rSize value for the moment; we'll need it in a minute when we start dealing with multibyte runes.
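Here's a small sketch of that check in isolation, simulating a buffer size of 1 so each decode sees exactly one byte; the wordCount and isInsideWord names follow the lesson, and the rest is an illustration:

```go
package main

import (
    "fmt"
    "unicode"
    "unicode/utf8"
)

func main() {
    wordCount := 0
    isInsideWord := false

    // Simulate the read loop with a buffer size of 1: each iteration decodes
    // the single rune we've "read" and applies the counting check.
    for _, b := range []byte("one two") {
        r, _ := utf8.DecodeRune([]byte{b})

        // A new word starts when we see a non-space rune while not already
        // inside a word.
        if !unicode.IsSpace(r) && !isInsideWord {
            wordCount++
        }
        isInsideWord = !unicode.IsSpace(r)
    }

    fmt.Println(wordCount) // prints: 2
}
```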
Now if I go ahead and run this code, you can see we're getting the value of 5. Pretty good. However, this only works because we've set our buffer size to 1, and in practice we'll likely want a much larger buffer size. For example, if we set it to 2 and run this code again, you can see it's only detecting two words. This is because we're still only decoding the first rune of each buffer, so we need to add another loop into our code in order to iterate over each rune found inside the buffer.
To do so is actually pretty simple - we can make use of another for loop. However, before we do that, I'm first going to add in our sub-buffer code, which we'll set to be the buffer constrained to the size; you'll see why we need this in a minute. Then let's go ahead and pass this subBuffer into the DecodeRune call, as follows. Again, if we set the buffer size to 1, everything should still work the same as it did before, and we can just go ahead and prove that, which it does.
Set this back to 2. Now we want to go ahead and add in a loop. However, rather than having an infinite loop that we break out of, like we saw on line 33, let's instead use the condition form of for, which is similar to a while loop in other languages: for len(subBuffer) > 0. This for loop, where we check the length of the subBuffer to know whether to continue, is the reason we're using a new variable rather than the initial buffer, as we always want the buffer to stay the same length as our actual buffer size.
A condition on the buffer itself would therefore never become false, and the loop would never end. In our case, we're going to iterate until our subBuffer no longer has any bytes left within it, which means we need to reduce the number of bytes inside it once we've read a rune. This is where the rSize value comes in, which tells us the size in bytes of the rune we just decoded. That means we can use it with the following expression to shrink our subBuffer by the size of the rune we've just consumed.
This is the same subslicing syntax we've already seen before. We can also get rid of the ignore value here. Now if I go ahead and run this code, you can see it works with our buffer size of 2, as we're now iterating over each rune found inside of the sub-buffer, reducing the size of the sub-buffer by that rune's size. This means we can change the buffer size to any value we want, and our code should just work.
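Pulled out of the surrounding read loop, the inner decoding loop looks roughly like this; I've wrapped it in a helper function purely so the sketch compiles on its own, and the names mirror the lesson rather than being the course's exact code:

```go
package main

import (
    "fmt"
    "unicode"
    "unicode/utf8"
)

// countRunes applies the per-rune word check to everything in the given
// buffer, shrinking the sub-buffer by each decoded rune's size.
func countRunes(subBuffer []byte, wordCount int, isInsideWord bool) (int, bool) {
    for len(subBuffer) > 0 {
        r, rSize := utf8.DecodeRune(subBuffer)
        if !unicode.IsSpace(r) && !isInsideWord {
            wordCount++
        }
        isInsideWord = !unicode.IsSpace(r)
        subBuffer = subBuffer[rSize:] // drop the bytes we've just decoded
    }
    return wordCount, isInsideWord
}

func main() {
    count, inWord := countRunes([]byte("one two three"), 0, false)
    fmt.Println(count, inWord) // prints: 3 true
}
```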
Great. So far, this is working for an ASCII file. We've managed to replace the functionality of the strings.Fields
function and, instead, perform all of our checking ourselves in a single pass. As you can see, this isn't too difficult. Additionally, if we go ahead and add the Cyrillic P back into our words.txt file, which beforehand was causing the count to fail when we had a buffer size of 2, and run this code again, you'll see it also produces the number 5. This gives the impression that our code is actually working.
However, it's actually not. To show what I mean, let's go ahead and print out the value of our rune as follows. We'll use a Printf here, cast the rune to a string, and add a space in between each rune, then make sure to print a new line at the end. I've added in this code so we can print each rune followed by a space, just so that we can see things a little more clearly.
It might actually be easier to put a vertical pipe in between instead, just so that we can see the boundaries between runes and won't get confused by the space values. Additionally, I'm also printing out a new line at the end, just so that we know when the word count is going to be printed. Now if I go ahead and run this code, you can see the runes being printed out one by one, but then we get to our Cyrillic P, which is currently being displayed as the question mark symbol. This is a bit of an issue.
First of all, there's only one character here, but we're printing something out for each of its two bytes, and each time we're actually printing out what's called a rune error. If we take a look at the documentation for the utf8.DecodeRune function, you can see it unpacks the first UTF-8 encoding in p and returns the rune and its width in bytes. This is something we've already been working with. If p is empty, it returns RuneError and 0. In our case, we have our len(subBuffer) > 0 check, so we're never passing an empty buffer to the DecodeRune function, which means this case doesn't apply to us.
However, if the encoding is otherwise invalid, it returns RuneError and a size of 1. This is what's happening for us. Currently, we're getting a RuneError and an rSize of 1, so we're still managing to shrink our buffer since the rSize is 1, but we're not handling the RuneError itself, which signals an invalid decoding. Because of this, if we try to run our UTF-8 text file again, you'll see it produces the value of 1, and in each of these cases we're getting rune errors for the 3 bytes that should be decoding to a single space character.
So whilst our code is working for this simple case, it's not working for the more advanced one. We now need to figure out a way to handle the RuneError. But why is the RuneError happening in the first place? The reason is that, because our buffer size is only 1, we're only able to read 1 byte in at a time. This is a problem when you have a rune of, say, 3 bytes here, or 2 bytes in this case, as it means you're not able to read in all of the bytes of the rune at once.
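You can see this directly by handing utf8.DecodeRune a multibyte character that has been cut short, which is exactly what a small buffer does to it; this is a standalone illustration rather than the lesson's code:

```go
package main

import (
    "fmt"
    "unicode/utf8"
)

func main() {
    // The Cyrillic P (U+0420) is two bytes in UTF-8: 0xD0 0xA0.
    full := []byte("\u0420")
    truncated := full[:1] // only the first byte, as if cut off at a buffer boundary

    r, size := utf8.DecodeRune(full)
    fmt.Printf("full:      %q size=%d\n", r, size) // full:      'Р' size=2

    r, size = utf8.DecodeRune(truncated)
    fmt.Printf("truncated: %q size=%d RuneError=%v\n", r, size, r == utf8.RuneError)
    // truncated: '�' size=1 RuneError=true
}
```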
For example, if we go ahead and change the buffer size to, say, 10 and run our code again, you can see here that the spaces are working in the majority of cases, but not where a character crosses a buffer boundary. This means that our DecodeRune function will work when there are enough bytes inside of the buffer to encapsulate the entire UTF-8 character. So the way we can solve this is, if there is a RuneError when we try to decode a rune, we need to store the bytes that we couldn't decode and use them again with the next buffer, in order to fully form our UTF-8 character.
If that's a little confusing to visualize, then don't worry, hopefully the code will explain it. In any case, the first thing we need to do is check to see if our rune is an error, which we can do with if r == utf8.RuneError, as follows. If it is, then we just want to break out of our inner loop. Now if I go ahead and change the buffer size back down to 1, and run this against our UTF-8 text file, you can see we're now just ignoring the actual spaces, which gives a better picture of why we're counting only one word.
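As a concrete illustration of that break, here's the inner loop run over a buffer whose final byte is the first half of a two-byte character; the byte values are made up for the example rather than taken from the course's files:

```go
package main

import (
    "fmt"
    "unicode"
    "unicode/utf8"
)

func main() {
    wordCount := 0
    isInsideWord := false

    // "on" plus the first byte of a two-byte rune: the kind of buffer we get
    // when a multibyte character is cut off at the boundary.
    subBuffer := []byte{'o', 'n', 0xD0}

    for len(subBuffer) > 0 {
        r, rSize := utf8.DecodeRune(subBuffer)
        if r == utf8.RuneError {
            // Invalid or incomplete encoding: stop processing this buffer.
            break
        }
        if !unicode.IsSpace(r) && !isInsideWord {
            wordCount++
        }
        isInsideWord = !unicode.IsSpace(r)
        subBuffer = subBuffer[rSize:]
    }

    fmt.Println(wordCount, len(subBuffer)) // prints: 1 1 (one word counted, one undecoded byte left)
}
```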
If I run this against words.txt
, by the way, changing the file names this way is kind of tedious; we'll solve that not in the next lesson, but the one after. As you can see here, whilst this produces the correct count, we're just ignoring the Cyrillic P value entirely. This means we're simply breaking whenever we see a UTF-8 rune error, and so we're still not handling multibyte Unicode values. The next thing to do is figure out how to handle them.
Well, in order to do so, we basically need to take whatever is remaining in our subBuffer once we finish this loop and prefix it onto the beginning of the next buffer. Therefore, let's go ahead and create a new slice of bytes called leftover, which will be used to store any leftover bytes that we weren't able to decode into runes. The next thing we need to do is populate leftover with any bytes that are remaining inside of our subBuffer. We can do this using the append function, which takes a slice as its first argument, in this case the leftover slice, and appends any of the other values to it, returning a new slice.
In this case, it would append 1, 2 and 3. In our case, we want to append whatever's remaining within the subBuffer, which we can do using the following syntax. With that, our leftover slice should now contain whatever was left of our subBuffer at the end of our rune decoding. This means we can then use this leftover the next time we instantiate the subBuffer, again using append with leftover and the buffer constrained to the size, spread using the ... syntax. Our subBuffer will then start with whatever was left over from the previous read, followed by all of the values from our sub-slice of buffer appended onto the leftover slice.
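To make the mechanics concrete, here's a toy example of the two append calls at work, with hard-coded byte values standing in for a Cyrillic P split across two reads:

```go
package main

import "fmt"

func main() {
    // Suppose the previous buffer ended with the first byte of the two-byte
    // Cyrillic P (0xD0 0xA0), so that byte was stored as leftover...
    leftover := []byte{0xD0}

    // ...and the next read brings in the second byte plus an 'e'.
    buffer := []byte{0xA0, 'e'}
    size := len(buffer) // in the real code, size comes from the Read call

    // append builds the sub-buffer as leftover + the bytes we've just read,
    // so the split rune is whole again before we try to decode it.
    subBuffer := append(leftover, buffer[:size]...)
    fmt.Printf("% x -> %q\n", subBuffer, subBuffer) // prints: d0 a0 65 -> "Рe"

    // At the end of the inner decode loop, whatever's still undecoded gets
    // stored for the next read in the same way:
    // leftover = append(leftover, subBuffer...)
}
```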
Now if I go ahead and run this against words.txt (and just to make things a little easier, let's also fmt.Println(fileName) so we know which file we're processing), you'll see we're actually failing. Once we get to the Cyrillic P, which we do now handle, we stop printing out any other words. So what's going on? Well, the issue is that we're not actually clearing out the leftover buffer. For example, let's go ahead and print out leftover, or better yet, the length of leftover.
So let's print out the leftover length. If we run this code as follows, you can see that once we hit the Cyrillic P, our leftover length is 1, and then we keep adding to it every single time; in fact, it keeps growing on every read. This is because we're continuously appending values onto the leftover buffer, which isn't what we want to do. Instead, before we add the subBuffer into the leftovers, we want to make sure we reduce leftover back down to nothing. We can do so in a couple of ways: we can either set leftover to nil, which should cause our code to work, as leftover no longer keeps getting appended to, or, my personal favorite, we can reuse the memory inside of leftover by sub-slicing it to a zero length.
If we go ahead and run this again, you can see it also works, and it should be a little more optimal. There is a caveat with this approach, however: if the leftover slice is ever re-extended into its existing capacity, any stale bytes still sitting in the backing array will reappear. So that's just something to keep in mind whenever you do sub-slicing in this way. If you're not too sure, then the safest approach is always going to be to set the slice to nil. However, I like to be a little more optimal.
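A quick standalone illustration of the two reset options and the caveat just mentioned (the byte values are arbitrary):

```go
package main

import "fmt"

func main() {
    leftover := []byte{0xD0, 0xA0, 0xE2}

    // Option 1: drop the slice entirely; the next append allocates fresh memory.
    // leftover = nil

    // Option 2 (used here): keep the backing array but reset the length to
    // zero, so later appends reuse the same memory instead of allocating.
    leftover = leftover[:0]
    fmt.Println(len(leftover), cap(leftover)) // prints: 0 3

    // The caveat: re-extending into the existing capacity exposes the old
    // bytes that are still sitting in the backing array.
    fmt.Printf("% x\n", leftover[:2]) // prints: d0 a0
}
```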
In any case, let's go ahead and remove the print line, and we can see that our code is now actually working: it's handling the Unicode character of the Cyrillic P and producing the correct word count. Therefore, let's get rid of the fmt.Print statements as follows. We can now run this code, and we should get the value of 5 for words.txt. Next, the real test is to make sure this is also working with our UTF-8 file, which, if we go ahead and run it, it is.
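Putting all of the pieces together, the whole loop ends up looking roughly like the following. This is my own reconstruction of what we've built in this lesson, so the variable names and the surrounding file handling are assumptions, and the course's actual code will differ in its details:

```go
package main

import (
    "fmt"
    "io"
    "os"
    "unicode"
    "unicode/utf8"
)

func main() {
    // The lesson uses words.txt or utf8.txt; any UTF-8 text file will do here.
    file, err := os.Open("words.txt")
    if err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }
    defer file.Close()

    const bufferSize = 2 // deliberately small, to force runes across boundaries
    buffer := make([]byte, bufferSize)

    var leftover []byte
    wordCount := 0
    isInsideWord := false

    for {
        size, err := file.Read(buffer)
        if err != nil {
            if err == io.EOF {
                break
            }
            fmt.Fprintln(os.Stderr, err)
            os.Exit(1)
        }

        // Prefix any bytes left undecoded by the previous read.
        subBuffer := append(leftover, buffer[:size]...)
        leftover = leftover[:0] // reset, reusing the backing memory

        for len(subBuffer) > 0 {
            r, rSize := utf8.DecodeRune(subBuffer)
            if r == utf8.RuneError {
                // Likely a multibyte rune split across the boundary: keep the
                // remaining bytes and finish decoding them on the next read.
                leftover = append(leftover, subBuffer...)
                break
            }
            if !unicode.IsSpace(r) && !isInsideWord {
                wordCount++
            }
            isInsideWord = !unicode.IsSpace(r)
            subBuffer = subBuffer[rSize:]
        }
    }

    fmt.Println(wordCount)
}
```

Run against words.txt and utf8.txt, this sketch should still produce 5 and 6 respectively, matching the counts we arrived at in the lesson, even with the deliberately tiny buffer size.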
With that, we've managed to update our algorithm to not only work with large files, but to also iterate through each rune, or each UTF-8 character, inside of those files and perform our word counting correctly. This is actually a pretty big achievement, and as you can tell, it's not trivial to handle; there's a lot of code to make this happen. However, don't be discouraged that it takes so much effort to pull individual characters out of a slice of bytes, because, as with most things when it comes to Go, the standard library provides a really simple approach for doing this, which we're going to take a look at in the next lesson.