Developing a word counting algorithm is a useful first step in processing text data. This approach begins by loading content from a file into memory, then counts the words by iterating over the byte slice. Replacing a traditional C-style loop with Go’s range keyword makes the code more idiomatic and readable. The algorithm treats spaces as word delimiters, incrementing a word count each time one occurs. It also highlights a common pitfall: undercounting words because there is no space after the last word. The lesson also emphasizes using named constants instead of magic numbers, as part of writing clean and understandable code. Future lessons will improve and test the algorithm so that it handles various edge cases.
Now that the contents of our words.txt file are loaded into our application's memory and stored in a variable called data, which is a slice of bytes, in this lesson we're going to begin writing our first word counting algorithm to count the number of words inside our data slice. This algorithm will act as the starting point for the improvements we're going to make throughout this module and the next. Just be aware that this algorithm is going to have some bugs, but that's by design.
In any case, let's make some modifications to our code to support this word counting algorithm. The first thing we want to do is collect a count of words, which we then print at the end of our code. To do so, let's begin by creating a new variable called wordCount, which we'll use to keep track of the number of words we detect inside our data slice. Then, rather than printing our file's data to the console, let's replace that with a call to the fmt.Println function, this time printing out our wordCount. Because data is now unused, the compiler reports an error. To squash it, let's assign data to the blank identifier for now, just so that our compiler is happy.
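As a rough sketch of where the code stands at this point: the snippet below uses an in-memory byte slice in place of the file contents, so the sampleData helper and its contents are assumptions for illustration rather than the lesson's actual loading code.

```go
package main

import "fmt"

// sampleData is a hypothetical stand-in for the bytes loaded from words.txt.
// The contents (five words plus a trailing newline) are an assumption.
func sampleData() []byte {
	return []byte("one two three four five\n")
}

func main() {
	data := sampleData()

	// wordCount will hold the number of words we detect.
	wordCount := 0

	// data is not used yet, so assign it to the blank identifier
	// to avoid the "declared and not used" compile error.
	_ = data

	fmt.Println(wordCount) // prints 0
}
```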
Now, if we run this code using the go run command, you can see it prints out the number zero. So far, so good. Our code is now ready for an algorithm that changes this value. To figure out what that algorithm is, let's take a quick look at the words.txt file to gain some insight. Upon inspecting this file, we can see we have five words inside: one, two, three, four, five, and each word is separated by a space character. Therefore, the space character seems like a good candidate for us to use to count the number of words.
As I said, this algorithm does have some problems, but it's a good starting point. To implement it, we first need to iterate over each byte inside our data slice and check whether or not it's a space character. In Go, we can do so in a couple of different ways. The first is to use a typical C-style loop: assign a variable i the value zero, check that i is less than the length of data, and then increment the value of i. Inside the loop, we can access the actual byte in the data slice with index syntax, data[i], which pulls out the value at the index defined by i.
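The C-style loop described above might look like the following minimal sketch. The bytesByIndex helper and the sample input are hypothetical and exist only so the loop has something to return.

```go
package main

import "fmt"

// bytesByIndex (a hypothetical helper) walks data with a C-style loop
// and collects each byte it visits via data[i].
func bytesByIndex(data []byte) []byte {
	visited := make([]byte, 0, len(data))
	for i := 0; i < len(data); i++ {
		visited = append(visited, data[i])
	}
	return visited
}

func main() {
	// Assumed sample input, not the real file contents.
	data := []byte("hi")

	for i := 0; i < len(data); i++ {
		fmt.Println(i, data[i]) // prints "0 104", then "1 105"
	}
}
```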
Whilst this approach works, there's actually a better way to perform iteration in Go, which is to make use of the range keyword. It can be used with a number of different types in order to perform iteration. To replace our C-style loop with the range keyword, we assign i from range data, writing the loop header as for i := range data. Now we're iterating over each of the indexes of the data slice. In fact, if we add a fmt.Println, passing in the value of i, and run this code, we see it iterate through each index, from 0 to 23, as there are 24 bytes in the file.
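Here is a small sketch of ranging over the indexes of a byte slice. The indexes helper is hypothetical, and the sample contents (five words plus a trailing newline, 24 bytes in total) are an assumption about what words.txt holds.

```go
package main

import "fmt"

// indexes (a hypothetical helper) returns every index produced by
// ranging over data with a single loop variable.
func indexes(data []byte) []int {
	out := make([]int, 0, len(data))
	for i := range data {
		out = append(out, i)
	}
	return out
}

func main() {
	// Assumed file contents: 23 characters plus a trailing newline.
	data := []byte("one two three four five\n")

	idx := indexes(data)
	fmt.Println(idx[0], idx[len(idx)-1]) // prints "0 23"
}
```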
Whilst this approach works, we can simplify it even further. This is because the range keyword, when operating over a slice (and some other types as well), produces two values that we can capture. The first, as we've seen, is an index that we can use to pull out the current value. The second is the current value itself. This means that, rather than capturing i and using it to pull the value out of our slice, we can capture the value directly, as in for i, x := range data, and print it out using the fmt.Println function.
Additionally, let's ignore the i using the blank identifier, as in for _, x := range data. When I run this code, you can see it's printing out each of the bytes on its own line in its decimal representation. Okay, with that, we're able to iterate over each of the bytes inside our byte slice. The next thing we want to do is check whether or not each one is a space character. As you can see, the bytes are represented in decimal form, which means we can check our ASCII table from earlier and find the decimal value of the space character.
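A minimal sketch of ranging over the values while discarding the index. The decimalValues helper and the short sample input are assumptions made for illustration.

```go
package main

import "fmt"

// decimalValues (a hypothetical helper) returns each byte of data as an
// int, mirroring the decimal form fmt.Println shows for a byte.
func decimalValues(data []byte) []int {
	values := make([]int, 0, len(data))
	for _, x := range data {
		values = append(values, int(x))
	}
	return values
}

func main() {
	// Assumed sample input: 'a', a space, then 'b'.
	data := []byte("a b")

	// The blank identifier discards the index; x is the current byte.
	for _, x := range data {
		fmt.Println(x) // prints 97, then 32, then 98
	}

	fmt.Println(decimalValues(data)) // prints [97 32 98]
}
```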
The first one in the ASCII printable characters, which are character codes between 32 and 126, is the number 32 in decimal, which, if we check the description, is the space character. Therefore, to track the number of spaces, we can compare our value of x against the value of 32 and, if they match, print a message with fmt.Println("space detected"). Now, when I run this code again, you can see that space detected is being printed to the console.
if x == 32 {
	fmt.Println("space detected")
}
However, comparing a byte against a decimal value isn't exactly the most human-readable experience. It would be nicer if we could make it easier for ourselves, our future selves, and other people to see that we are checking against a space character.
When it comes to Go, there are a couple of ways we can do this. The first is to declare a constant called spaceChar, which would equal 32.
const spaceChar = 32
In many cases, this is going to be the preferred approach, as we can just compare x against spaceChar:
if x == spaceChar {
	fmt.Println("space detected")
}
When we're reading this code, we now know that this number relates to a space character. Just as an FYI, whenever you see a number that has been hard-coded into an expression or an equation without being assigned to a named constant or variable, this is known as a magic number, and magic numbers are generally seen as a bad practice.
There are some situations where magic numbers make sense, but for most of the code that we're going to be writing when building applications, especially web or CLI applications, magic numbers tend to be a bit of an anti-pattern. This is because they're ambiguous in nature. As you can see, the number 32 on its own doesn't really mean much to us. By assigning it to a named constant, such as const spaceChar, or even to a variable, we give some context to what this number represents: in this case, the space character. That makes it a little easier for us to understand what is going on.
So as we go through this course, remember, magic numbers tend to be bad. However, as I mentioned, whilst in most cases a named constant would be the right approach, in our case it actually isn't. This is because, in Go, a byte can be compared against more than just an integer literal. In fact, we can compare it against a rune literal. A rune is Go's version of a char. However, rather than being an ASCII value, or a byte, a rune in Go represents a Unicode code point, which we'll talk more about in a future lesson. Runes therefore support a far greater number of characters than ASCII does.
In any case, this means we can add the following if check, surrounding our rune in single quotes rather than double quotes, which are used for strings. For example, if we tried to compare x against a string, you can see we get mismatched types: byte and untyped string. Therefore, if you want to compare a byte against a rune, you use single quotes rather than double quotes, as follows.
if x == ' ' {
	fmt.Println("space detected")
}
Now, as you can see, we're comparing our byte x against a single space character, which, if I run it, you can see now works as expected. So far, so good.
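To make the byte and rune comparison concrete, here is a small sketch. The isSpace helper is a hypothetical name introduced only for this example.

```go
package main

import "fmt"

// isSpace (a hypothetical helper) reports whether x is the ASCII space
// character. The rune literal ' ' is an untyped constant, so it can be
// compared directly against a byte.
func isSpace(x byte) bool {
	return x == ' '
}

func main() {
	fmt.Println(' ' == 32)    // prints true: the space rune has the value 32
	fmt.Println(isSpace(32))  // prints true
	fmt.Println(isSpace('a')) // prints false

	// Writing x == " " instead would not compile:
	// mismatched types byte and untyped string.
}
```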
The next thing we want to do is, rather than printing to the console that we detected a space, add 1 to our wordCount value. We can do this by assigning wordCount to wordCount plus 1, or we can use one of the shorthand forms, which is wordCount += 1, or, if you want, wordCount++. Any of these approaches will work. Okay, with that, our code should now be incrementing wordCount whenever we detect a space, and then printing it to the console once we're done.
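Put together, the loop at this stage might look like the sketch below. The countSpaces helper and the in-memory sample are assumptions for illustration; the lesson's code works on the bytes loaded from words.txt.

```go
package main

import "fmt"

// countSpaces (a hypothetical helper) increments a counter each time it
// sees a space byte, which is all the algorithm does at this stage.
func countSpaces(data []byte) int {
	wordCount := 0
	for _, x := range data {
		if x == ' ' {
			wordCount++ // same effect as wordCount += 1
		}
	}
	return wordCount
}

func main() {
	// Assumed file contents: five words separated by single spaces.
	data := []byte("one two three four five\n")

	fmt.Println(countSpaces(data)) // prints 4
}
```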
Now, if we run this code, we're hoping to see the number 5 appear. However, when we do, we actually see the number 4. This is the first issue we have with this algorithm. But why is it producing 4 instead of 5? Well, if we take a look at the words.txt file, you can see we only have 4 spaces, despite having 5 words. This is because each word, apart from the last, has a single space after it, meaning there's always going to be one more word than there are spaces. To resolve this, let's add one more to our wordCount after the loop, which we can do as follows: wordCount++.
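A sketch of the adjusted algorithm, with the extra increment applied after the loop. As before, countWords and the sample inputs are hypothetical stand-ins for the lesson's code and file.

```go
package main

import "fmt"

// countWords (a hypothetical helper) counts spaces and then adds one,
// on the assumption that the last word has no space after it.
func countWords(data []byte) int {
	wordCount := 0
	for _, x := range data {
		if x == ' ' {
			wordCount++
		}
	}
	wordCount++ // account for the final word
	return wordCount
}

func main() {
	// Assumed file contents before and after adding a sixth word.
	fmt.Println(countWords([]byte("one two three four five\n")))     // prints 5
	fmt.Println(countWords([]byte("one two three four five six\n"))) // prints 6
}
```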
Now, if we run our code, you can see it's producing the number 5. Additionally, if we modify the words.txt file and add another word, say six, we can see the algorithm still works, producing 6 for the 6 words. Unfortunately, whilst this algorithm works for our current situation, where the file is well formed, it still has a number of bugs. For example, if I remove all of the words from the file and run it again using the go run command, you can see it produces the number 1, which isn't correct.
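The empty-file case can be reproduced with a sketch like the one below. The countWords helper repeats the spaces-plus-one logic under a hypothetical name. The double-space input is an extra illustration that is not taken from the lesson.

```go
package main

import "fmt"

// countWords (a hypothetical helper) is the naive algorithm:
// the number of spaces plus one.
func countWords(data []byte) int {
	wordCount := 0
	for _, x := range data {
		if x == ' ' {
			wordCount++
		}
	}
	wordCount++
	return wordCount
}

func main() {
	// An empty file still reports one word because of the final increment.
	fmt.Println(countWords([]byte(""))) // prints 1

	// Extra illustration (not from the lesson): two spaces between
	// two words are counted as two separators.
	fmt.Println(countWords([]byte("one  two"))) // prints 3
}
```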
Therefore, in the next lesson we're going to spend some time going through some of these bugs, and we'll do so by adding some tests. Before we do, however, as usual, add the changes to main.go to your git repository and commit them with a message similar to what I have here: simple word count algorithm. Once that's done, I'll see you in the next lesson.