Create an account to watch this video

Full Course
Counting words in a file efficiently involves reading data in manageable chunks rather than loading the entire file into memory. By modifying a function to rename it from print file contents
to count words in file
, a variable is created to store the word count, incremented by leveraging the fields
function from the strings package. However, challenges arise when words are split across buffer boundaries, resulting in double counting, especially with smaller buffer sizes. A proposed solution involves tracking whether the reading is within a word by checking the leading and trailing buffer characters, adjusting the count accordingly. The solution is further refined by utilizing the unicode.isSpace
function to account for various whitespace characters beyond just spaces and newlines. This approach ensures accurate word counting while handling potential complications from Unicode characters. The necessity of developing a robust algorithm highlights the complexity of text processing in programming, paving the way for deeper discussion in subsequent lessons.
Now that we know how to read data from a file without loading the entire contents into memory, let's take a look at how we can use this new approach with our algorithm in order to count the number of words.
In order to do so, let's go ahead and make some modifications to our PrintFileContents
function. The first thing we're going to modify is the name, changing it from PrintFileContens
to CountWordsInFile
. Additionally, we're also going to want to change the return value from nothing to an integer. Then inside, let's go ahead and create a new variable called countWords
, which is what we're going to use to actually store the number of words that we detect inside of our file type, and we'll set this to be zero. Lastly, all we need to do is return this variable at the end.
Now let's head on over to our main function, and let's replace the call to PrintFileContents
with a call to CountWordsInFile
, capturing the return value in a variable called count
. And underneath this, let's go ahead and print our word count out as well, using the following fmt.Println
function. With that, let's go ahead and get rid of the commented out code that we added in the last lesson. Removing our call to the CountWords
function, which takes in our data slice, and the fmt.Println
of word count that we set up before.
With that, we're ready to begin making some modifications to our algorithm, in order to count the number of words in a file. The first thing we need to do is to be able to find a way to count the number of words inside of a buffer, which is a slice of bytes. There are a couple of ways we can actually do this, however we already have an approach in our code, which takes a slice of bytes, of which the buffer is, and returns an integer representing the number of words.
If you'll remember, we're achieving this using the Fields
function of the strings
package, so like all good software developers, let's go ahead and reuse this code, in order to count the number of words found inside of our buffer. To do so, let's go ahead and delete the following fmt.Print
function, and replace it with a call to the CountWords
function, passing in our buffer. As you'll remember, we need to constrain this to only the bytes that have been read in the last read operation, which we can do by subslicing up to our size.
Next, we then need to add the result of this value into our countWords
variable, which we can do using the following countWords +=
statement. This will add the return value of the CountWords
function to the countWords
. Let's go ahead and actually change this to wordCount
, I think, countWords
is not the best name, wordCount
.
Okay, let me try that again. Okay, with the better variable name in place, this line will take the return value of the countWords
function and both add and set it to the wordCount
value. Now, if I go ahead and check that we're pointing to our words.txt
file, which if I go ahead and cat
has five words inside, now when we run this code we should see it print out the number 5
. As you can see, it works. Sort of.
There is in fact a slight issue, which as you may tell is a common theme in this course. To show this issue in action, if I go ahead and replace our buffer size from 8192
to something a lot smaller, say 5
, then if I go ahead and run this code again, you'll see this time we get the number 9
. But as you can tell, there aren't nine words inside of our words.txt
. So what's going on?
Well, in order to debug this, let's go ahead and change some of the printing that we're doing inside of our algorithm. Let's begin by first capturing a bufferCount
, which is going to be the result of the countWords
function for our buffer up to size. Then we'll add, then we'll +=
this bufferCount
instead of calling the function directly. Next, let's go ahead and print to the console the bufferCount
, followed by a :
, followed by the actual buffer up to the size and we'll cast this to a string as well.
Now, if I go ahead and run this code, you can see what is actually happening. Here we're counting 2
, 2
, 2
, 2
and 1
, which adds up to 9
. However, we don't actually have nine words. The reason this is happening, as you can probably see, is that whenever a word is split across two buffers, it's being double counted. For example, the word two
is being counted once, on the first buffer read, and again on the second buffer read, when it should only be counted as once.
This is also happening for the word three
, the word four
, and the word five
. Therefore, we have four double reads, which is giving us our total of nine. So how do we go about solving this? Well, there are a number of different approaches we could take. One approach would be to have a separate buffer that we keep adding to until we find the next space. Whilst this could work, it actually could cause a bit of an issue. For example, let's say we had a file that only had a single word inside, and was again around one gigabyte in total. This would mean we'd end up loading the entire buffer into memory, and we'd have the same issue that we had in the last lesson.
Another approach we can take is to keep track of whether or not we're inside of a word whenever we move into a new buffer. For example, figuring out whether or not figuring out at this point between the first read and the second read, determining whether or not we're still inside of a word. Which in this case we are, due to the fact that there's no white space at the end of the last buffer, and no white space at the beginning of the current buffer. Then, if we detect that we're inside the middle of a word, all we need to do is just reduce the count by one.
This works for a number of different cases, such as if we happen to be inside of the word and then have another word start, as well as if the buffer itself is just an entire word. For example, if I go ahead and change the buffer size to three and run this code again, you can see in this instance and this instance the count would be zero, due to the fact that we would have already counted the word inside of the previous read.
So let's go ahead and implement this. In order to do so, we need to create a new variable called isInsideWord
, which is going to be a bool
that we'll initially set to false
. Then for our actual buffer count, if we detect that we're inside of a word, so if isInsideWord
, we can go ahead and subtract one from our buffer count as follows. Let's also get rid of this following fmt.Println
statement as we no longer need it. This handles the case of when we detect we're inside of a word, removing that word from our end buffer count.
Next, we then need to check the last character of our byte array to determine whether or not it's a space. To do so, we can go ahead and assign the value of isInsideWord
to be buffer size - 1
, which is the last value from the buffer that we read. And let's go ahead and compare that it's not equal to a space. As you know, we need to be checking for more characters than just the single space character, as we also need to check for new lines, tabs and other forms of white space.
However, for the moment, just to get this algorithm working, let's forget about that for the meantime. Now that we've got some code to check whether or not the last byte is an actual character, let's go ahead and add in some code in order to check if the current buffer also starts with a non-space character, which means we're still inside of the word. To do so, let's go ahead and add in the following line above where we pull our buffer count. Passing in the isInsideWord
is equal to buffer[0]
is not equal to space and isInsideWord
. This means we're now checking if the first byte of our current buffer is not a space, and that we were inside of a word at the end of our last buffer.
If both of these are true, that means that the buffer threshold took place in the middle of a word. Now if I go ahead and run this code, we should see the number 5
appear, which it does. Additionally, if we go ahead and change the buffer size to say 5
, which beforehand gave us the problem of nine, we should again also see this as five as well. Let's go ahead and also add in the print statement, just so that we can see that this is actually now working.
To show this a little clearer, let's go ahead and add back in the Println
statement that we saw before. bufferCount:
, and we'll do string(buffer[:size])
. Now if we go ahead and run this again, you can see that our counts are this time correct. Two for the first read, one and two. Only one new word for the second read, which is the start of three. Only one for the third read, which is the start of four. And one for the fifth read, which is the start of five. And on the fifth read, we don't see any new words come in.
Let's also go ahead and change this to be the value of three and run this again. First clearing this and then run this again. Again you can see one on the first read, two on the second read, one on the third read, nothing on the fourth read, one on the fifth read, nothing on the sixth read, and one on the seventh read. For a grand total of five. Of course while this is working, we're only considering spaces. For example, if I go ahead and copy the words.txt
file, and replace it with lines.txt
.
Let's also go ahead and change this. If I replace all of these spaces with new line values, we should again expect the value of five. Let's replace the words.txt
with lines.txt
. And we can go ahead and run this again. However, this time you'll see it doesn't work. We're only getting the value of three. The reason this isn't working is because we're only checking for the existence of a space. And we would also now need to check for the existence of a new line character \n
, which we can do as follows.
&& buffer[size - 1] != '\n``
And we'll add in the same check here.
&& buffer[0] != '\n'
And let's go ahead and get rid of the fmt.Println
statement so it's easy to see. Now if I run this code, you can see again it returns 5
. Of course, there are more than just two white space characters. I think in total there are around 25. In any case, we need to be able to handle all of these.
However, rather than writing each check explicitly, we can instead make use of a function provided by the standard library. If you'll remember back to when we implemented our countWords
function, we ended up using the Fields
function of the strings
package in order to separate all of these strings into individual words. This Fields
function makes use of another function in order to determine what is a white space character.
If we take a quick look at the documentation for this function, you can see it says that the Fields
function splits the string s
around each instance of one or more consecutive white space characters. If we go ahead and check the go documentation for this function, we can see that isSpace
reports whether the rune is a space character as defined by Unicode's white space property. In the LADM1 space, this is '\t'
which stands for tab, '\n'
which stands for newline, '\v'
which I don't actually know, '\r'
and the empty space character, as well as the two Unicode characters of Nell and NBSP.
Therefore, let's go ahead and use the unicode.IsSpace
function in order to make this a little easier. If we take a look at the documentation, you can see that the IsSpace
function accepts a rune instead of a byte. This will cause us a bit of an issue which we'll talk about in a moment. However, for the meantime, let's go ahead and replace the, let's go ahead and replace our logic where we're checking for both the space and newline character, with instead a check for the unicode
, importing the package in, .IsSpace
function. Passing in our last bytes, cast to a rune as follows, buffer[bufferSize-1]
, and then on line 38 doing the exact same thing, but instead doing it for our first byte in our buffer.
I actually made a mistake here, this needs to be !unicode.IsSpace
and the same for line 38 as well. Now if I go ahead and run this code, we see we get the correct values and we should do for our words.txt
file as well. With that, we've managed to implement simple word checking into our actual code, making use of the unicode.IsSpace
function in order to determine whether or not we're inside of a word during a buffer boundary.
As I mentioned before, however, we have yet another issue, which is caused by the fact that we're casting a single byte into a rune. If you're unaware, runes in Go are UTF-8 encoded characters, which is what the Unicode package helps us to work with. However, when it comes to Unicode and UTF-8, it's a variable width encoding, meaning that each character can consist of between 1 and 4 1-byte 8-bit code units. UTF-8 itself is compatible with ASCII, so when we're checking for things such as spaces, tabs, or newline characters, then this isn't too much of an issue, as these values themselves won't ever conflict with UTF-8.
This includes on the first byte, or for up to 4-byte characters on any of the subsequent bytes as well. However, where it can conflict is for Unicode characters. For example, the Nel character and the MBSP, which are 2-byte characters consisting of the hex value of 85 and the hex value of A0. To show why this is an issue, if I go ahead and find another Unicode character that ends in A0, which if I take a look at the Unicode site as follows, the Cyrillic capital letter 'er' does, as you can see it ends in D0A0.
Then if I go ahead and open up my words.txt
file, and let's say we'll paste this in between the middle of the three-thra-p, that wasn't intended, word, we should still only have 5 words here. And if I set the buffer size to be, say, 2, if I go ahead and run this code, you can see that we're now getting the value of 6, which isn't correct. We still only have 5 words, however, because the Cyrillic P, which is a 2-byte rune, can be mistaken for a space character if you take the second byte, then it causes our code to produce the wrong value.
Therefore, we're going to need to change this algorithm yet again, this time to consider Unicode characters when we pull bytes from our buffer. We'll take a look at how we can do that in the next lesson.