The lesson explores the use of the bufio package in Go for reading and counting words from files, streamlining previous code written using byte buffers and UTF-8 decoding. By implementing the bufio.Reader and utilizing the ReadRune method, developers can efficiently process runes without the complexity of earlier approaches, significantly simplifying the code. The bufio.Scanner further enhances this functionality by providing a convenient interface for scanning newline-delimited lines or custom tokens like words. With this scanner, developers can iterate through the words seamlessly, optimizing the word-counting algorithm and eliminating the need for manual space checks. This approach not only improves code efficiency but also sets the stage for future enhancements in word counting functionality.
Now that we understand how to read from a file using a byte buffer, and how to decode those bytes into UTF-8 encoded runes, let's simplify our code by making use of another package from the standard library: the bufio package, which provides functionality for working with buffered I/O, pretty much what we've been doing these last few lessons.
One piece of functionality this package provides is the ability to pull runes out of a data stream, such as a file, in a simple and efficient way, by using the ReadRune method of the bufio.Reader type.
I can't speak for you, but for me this is music to my ears. To see this in action, let's reset our code a little. We're going to begin by removing pretty much all of the logic inside of our code, leaving only wordCounter set to 0 and isInsideWord set to false. This means deleting all of the buffer code, the buffer size, and the loops that load our data into each buffer and iterate over it. So goodbye.
Next, let's silence the unused-variable error for isInsideWord, because we're going to want to use it in a minute. With our code now reduced, let's remove the imports of the unicode and unicode/utf8 packages and replace them with an import of the bufio package. Additionally, I'm going to change the file name to words.txt just for the moment, although we'll probably test this on both files later on.
Next, we want to implement the exact same algorithm that we had before, where we were iterating over each rune and checking to see if it was a whitespace character. If it wasn't, and we weren't already inside of a word, then we increased our word count by one. Rather than writing the code to pull each rune out of our file using a byte buffer and the utf8.Decode functions, let's instead achieve this using the Reader type of the bufio package, which we can create using the bufio.NewReader function.
This function takes an io.Reader, an interface satisfied by any type that has a Read method, which is what we were using before with os.File in order to pull bytes into a buffer. This means the os.File type conforms to the io.Reader interface, so we can just pass our file in.
Now we want to be able to pull a rune out of this bufio.Reader, which we can do using the Reader.ReadRune method. It returns three values: the first is a rune, the second is the size of the rune in bytes, and the third is an error. As you can see, this is a lot easier than what we were doing before, as it provides a nice abstraction around all the code we had already written.
Let's capture the rune in a variable called r. In our case, we don't need to capture the size because we're no longer managing buffers ourselves, so there's no need for us to have this value. We do capture the error, which we're going to use to break out of a for loop. So let's create that loop, calling Reader.ReadRune on each iteration until we get a non-nil err, in which case we break out.
Next we can use this rune with the exact same logic we had before to detect whether we've started a new word. That is, if unicode.IsSpace returns false for the rune and isInsideWord is also false, then we increase our word count with wordCount++. Additionally, let's set isInsideWord = !unicode.IsSpace(r).
Now if I run the code, we should see that it works as expected, producing the value of 5. Additionally, if I change the code to point to the utf8.txt file, again this should work, and it will even work for our lots_of_words.txt text file, although this may take just a little bit more time. As you can see, it took 8 seconds and we got the total count of our words. So far, pretty decent.
To quickly explain why this code works: we're pulling each rune out of our bufio.Reader with ReadRune, and we exit the loop in the event we encounter an error. Most likely the error that we're going to encounter is io.EOF, which is the same error we encountered when we were reading from the file directly, which is why we then break out.
After this, we apply the exact same logic we saw before: we check that the rune is not a space and that we're not already inside of a word, and if so we increase our word count. Lastly, we update isInsideWord, setting it to !unicode.IsSpace(r). And that's all it takes.
As you can see, this code is a lot simpler than what we had before. This is because the Reader type of the bufio package abstracts away all of the complexity required to pull a rune out of a file.
However, the simplicity doesn't end there, as the bufio package provides yet another type that's going to be better suited to our algorithmic needs. This is the bufio.Scanner, which provides a convenient interface for reading data, such as a file of newline-delimited lines of text.
If we take a look at the example, you can see that the Scanner.Scan method is used as the condition of a for loop, meaning we won't have to check for an error in order to break out. Instead we can just use the Scan method. By default, the bufio.Scanner will scan across lines, so each iteration of this for loop will pull a single line out from the reader.
However, the SplitFunc of the bufio.Scanner can be customized to whatever you like. The bufio package provides a number of scan functions that we can use, such as the ability to scan bytes or the ability to scan runes. But that's not all, as the bufio package also provides the ability to scan words, which is pretty much perfect for our own word counting algorithm.
So let's replace our current bufio.Reader logic with the bufio.Scanner instead. To do so, let's delete pretty much all of our code, except for the actual word count. Then we can create a new scanner using the bufio.NewScanner function, passing in our file as the io.Reader. Then we can define what we want the scanner to split on, which is done by calling scanner.Split and passing in the bufio.ScanWords function.
Now with our scanner defined, we can iterate over each word using the scanner.Scan method, which advances the scanner to the next token, which will then be available through the scanner.Bytes or scanner.Text method. It returns false when there are no more tokens, either by reaching the end of the input or an error. Therefore, we can use this method to iterate over each token, which is going to be a word in our configured scanner.
However, we don't even need to pull out the text or the bytes. Instead, we can just increment our word count on each iteration. Now if we delete the unicode import, as we're no longer using it, and first change the file name to words.txt, when I run this code we get the correct count, and the same for the utf8.txt file, and also the same for the lots_of_words.txt file, which should take around 8 seconds to run. In fact, it only took 5, so it's even faster.
As you can see, we've managed to reduce our algorithm even further. This time we no longer need to keep track of whether or not we're inside of a word, or whether or not the current rune is a space. In fact, this algorithm works so well that we can consider replacing our countWords function with it.
But in order to do so, we're going to need to make some changes to both our function signatures and our test cases in order to make this work with both files and other types as well. We're going to take a look at how we can do that in the next lesson.