Handling large files in applications can pose significant challenges, particularly when memory usage is a concern. A common pitfall is attempting to load an entire file into memory, which can lead to crashes on systems with limited resources. This lesson highlights the importance of designing code that can efficiently process files without exceeding memory limits. Key strategies discussed include using the os.Open function to read files in manageable chunks rather than all at once, implementing a loop to read bytes iteratively, and employing subslices to handle only the bytes that were most recently read. Additionally, the optimal size of the read buffer is addressed, emphasising the use of powers of two and multiples of the operating system's block size for more efficient file handling. This approach ensures that applications are robust and capable of running on a variety of devices, from powerful machines to resource-constrained environments.
Currently, our word counter application is working well, both with our unit tests, which live in the main_test.go file, and with our words.txt file, which contains the words one two three four five. If I go ahead and run my code, you can see it produces the number 5
, which is correct. Despite this, however, our application has yet another bug, one that's considered pretty serious.
To show this bug in action, here I have another words file called lotsofwords.txt, which, if I go ahead and print out the size of, you can see is just over 1,000 megabytes, just short of 1 gigabyte. If I go ahead and change my code to make use of this new file, lotsofwords.txt
, and run my code again, you can see after a short while, it should print out the number of words contained, which is just over 100 million. That's a lot of words. So, what's the issue?
Well, if I go ahead and run the following ulimit
command, which will set the available memory that this shell has to 1 gigabyte, then if I go ahead and run the code again, you'll see this time our code crashes with an error. This error is a runtime error letting us know that the application ran out of memory. Which makes sense, as we're loading the entire contents of our file into memory using the os.ReadFile
function, storing it in a byte slice called data
. Therefore, if the size of the file that we're trying to load in is greater than the available system memory, our code will crash. This is actually a pretty serious bug.
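Although the lesson's exact code isn't shown in the transcript, the whole-file approach described above might look something like the following sketch. The CountWords body here is an assumed stand-in for the lesson's own implementation, and the sample-file setup is added only so the snippet runs on its own:

```go
package main

import (
	"bytes"
	"fmt"
	"os"
)

// CountWords counts whitespace-separated words in data. This body is an
// assumed stand-in for the lesson's own implementation.
func CountWords(data []byte) int {
	return len(bytes.Fields(data))
}

func main() {
	// Sample file written here only so the sketch is self-contained.
	if err := os.WriteFile("words.txt", []byte("one two three four five\n"), 0o644); err != nil {
		panic(err)
	}

	// The problematic pattern: os.ReadFile loads the ENTIRE file into
	// memory at once, so a file larger than available memory crashes us.
	data, err := os.ReadFile("words.txt")
	if err != nil {
		panic(err)
	}
	fmt.Println(CountWords(data)) // prints 5
}
```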
But like my favourite portrayal of the Joker, you may be thinking: why so serious? How often are we going to want to count the number of words in a file that's larger than our available system memory? In my case, I have a system with around 128 gigs, and it's very unlikely I'll ever have a text file anywhere near that size.
Unfortunately, the problem is, whilst it may not affect our own use case, if we ever plan on distributing a CLI application, then we have to consider that we have no control over the available system memory on the machines our code will be run on, or even the size of the files that may be passed into our application. Whilst the total system memory on most people's machines is going to be enough for most files, our code may actually be executed on much smaller devices, such as mobile phones, Raspberry Pis, or even VPS instances, all of which not only have a smaller amount of system memory to work with, sometimes as low as 512 MB, but will often be sharing that memory with other processes and services, meaning not all of that memory is going to be available for our own application.
Therefore, in order to know how to create powerful and robust CLI applications, we need to know how to write code that will have a small footprint, and will work on most machines, with consideration to how our code will run on more resource-constrained systems. This means that our current approach of loading in all of the file's data into memory using the os.ReadFile
function is generally a bad practice, and instead we should be loading smaller subslices of the file's data at a time, iterating through each subpart.
So how can we do this? Well, to do so, we can make use of another function provided by the os
package, this time the Open
function. This function takes a filename similar to the ReadFile
function we're already using. However, rather than returning a slice of bytes with the file contents, it instead returns a pointer to an os.File
type, which we can use to read the file's data in fixed-size chunks.
To show how this works, let's go ahead and use it in our code. To begin, let's go ahead and replace the reference to os.ReadFile
with a reference to os.Open
, which again takes a file name as a parameter. Additionally, let's go ahead and change the data
variable to be one called file
, which better represents the return value of the Open
function.
Next, let's go ahead and try passing it in to our CountWords
function. However, upon doing so, we actually receive an error. This error is telling us that we can't use the file
variable of type *os.File
as a []byte
value in argument to CountWords
. This is because the CountWords
function is expecting a []byte
, called data
, of which the file
variable is not, as it's an os.File
.
So how can we use this file
variable with our CountWords
function? Well, we need to be able to read the bytes from it in order to pass them in. To do so in Go, we could use something like the ReadAll
function of the io
package, which if we take a look at the documentation, ReadAll
reads from r
, a io.Reader
, which in this case the file conforms to, until an error or end of file and returns the data it reads in the form of a slice of bytes. So if we go ahead and use it, passing in our file to collect the data
and then pass this data
variable to our CountWords
, you can see we no longer have an error.
However, this code won't actually work. If I go ahead and run the code again on the same shell that's constrained to 1 GB, yet again we get a crash. This is because the ReadAll function of the io
package is basically doing the same thing as the os.ReadFile
function we were using before, where it's reading all of the data from the given *os.File
and storing it inside of a byte slice called data
. Effectively, all we've done is turn the single step we had before into two discrete steps. Which means that, again, we can't just load all of the contents of the file into our data
, as it will cause another crash.
Therefore, in order to use the *os.File
type, we're going to need to think about how we access the data provided by our *os.File
. Before we do that, however, let's take a step back. And rather than trying to integrate this *os.File
type into our CountWords
algorithm, let's instead simplify and try and figure out how we can print all of the characters of the file to standard out. This will allow us to better understand how reading from a file works, after which we can then move on to adding it into our algorithm over the next couple of lessons.
Therefore, to begin, let's go ahead and create a new function called PrintFileContents
, of which we'll take a parameter of file
, which is a pointer to os.File
, as follows. Then we can go ahead and call the PrintFileContents
function inside of our main
function, passing in our file
, and let's go ahead and comment out the call to our existing algorithm.
As for the implementation for this function, the main goal is that we want to be able to iterate over each byte inside of a file, printing it to standard out. Therefore, the first step in doing that is to print out the first byte, which is a good first goal. In order to obtain bytes from the file, we need to use the Read
method. If we take a quick look at the documentation, we can see how this method works. This method accepts a slice of bytes called b
. And if we read the documentation, Read
reads up to the length of b
, so the length of the slice of bytes that we pass in.
This means that if the length of the slice of bytes is one, it'll read one byte from the file and store it within the slice. Effectively, you can think of b
as being a buffer, of which we'll be storing the bytes that are read from the file inside. Therefore, in order to read bytes from the file, let's go ahead and create a buffer which is going to be a slice of bytes.
In order to create a slice of bytes with a length, instead of defining the buffer this way, we can actually use the make
function, which is similar to the malloc
function when it comes to C, or the new
operator in C++.
This function takes a type as its first parameter, and a size, which is a variadic list of integers, meaning you can pass more than one in. Don't worry about that for the moment, however; we'll take a look at it later on in the course. In our case, the type we want to pass in is a slice of byte, and the size we want to pass in is one, which will end up creating a slice of bytes with a length of one.
For example, to show that this is the case, if I go ahead and call the fmt.Println function, passing in the length of the buffer using len(buffer), followed by running our code, you can see the length of our buffer is one. This means we can now pass this buffer into the file's Read
method, and we should then read one byte from the file, storing it in our buffer. For the moment, let's go ahead and ignore the two return values that come back from the Read
method.
We'll take a look at both of those later on in the lesson. In any case, now that we're reading from our file into our buffer, let's go ahead and print out "read from file: " followed by the buffer cast to a string. Whilst we're here, let's also go ahead and change the filename from lotsofwords.txt back to our words.txt, which is the file that contains the words one two three four five.
Now if we go ahead and run this code, you can see we read from file the letter o
, and if I print out the words.txt
, you can see that this is the first character inside of our words file. Great, we've managed to read a single byte from this file, but how can we go about reading the rest of the bytes?
Well, one simple way to do so would be to go ahead and copy these two lines over and over again. Then if we go ahead and run our code again, you can see this time we're reading each individual byte. However, this isn't a scalable solution for reading all of the contents of the file. Instead, we need to use a loop, such as the for loop, as follows. However, now we have an infinite loop. If I go ahead and run this code, you'll see that it runs forever, printing the same "read from file" line, with no actual byte to print out.
Instead, we need a way to exit this loop when we're done reading. This is where those two return values that the Read
function are going to come in handy. If we take a look at the documentation of the Read
method inside of the os
package, we can understand what these two return values actually are. The first of these is an int
, which represents the number of bytes read, and the second return value is an error
, which represents any error encountered.
Additionally, at the end of the file, Read
returns 0
and io.EOF
, meaning that the integer value is 0 and the error is an io.EOF
error. Therefore, we can use either of these two return values in order to exit our loop early. Let's go ahead and capture the size first and ignore the error to see how this would work. In this case, because we know at the end of a file, Read
returns 0, io.EOF
, we can then do an if check of if size == 0
, then we can just break
out of our loop.
Now, if we go ahead and run this code, you can see we're printing all of the contents of the file and then exiting. Let's go ahead and change from our Println
method to Printf
, and we'll get rid of this prefix of read from file, or read from file, depending on how you read it, and go ahead and print out the buffer cast to a string. Now, if we run this again, you can see we're printing out the exact contents of our words.txt
. Great.
Whilst using the if size == 0
is perfectly valid, personally it's not my favourite choice. Instead, I personally prefer to check the err
value to see whether or not an error was returned. The reason I prefer this err
value is because it not only returns an error at the end of the file, letting us know that the Read
function has finished, or letting us know that the file can no longer be read, but if there's any other error that may be encountered, this will also return true
as well, in which case we also probably want to stop reading from the file.
Therefore, let's go ahead and capture this error in the following variable, and we can use it to break out instead, using our error handling syntax of if err != nil. Then we'll go ahead and break. Now we have a compile error letting us know that the size variable is declared and not used. However, rather than removing it, let's go ahead and add the following blank-identifier assignment, _ = size, to squash the error, because we're going to make use of this size
variable in a minute. Now, if I run this code again, you should see it works the same as it did before. Great.
With that, we've managed to read in from a file byte by byte. However, we still actually have an issue. Whilst reading from a file byte by byte works, it's unfortunately really slow, as well as being very inefficient due to the way that operating systems work when it comes to reading from file systems. Instead, you'll typically want to set your buffer size to be a little larger, which, as you'll notice at the moment, is currently a magic number.
Let's go ahead and replace this magic number with a constant called bufferSize
, which we'll set to 1
. Perfect. As I mentioned, you're typically going to want to set your buffer to be a larger size, so that you can read a more optimal number of bytes each time from the file system. But unfortunately, our code doesn't actually handle this well. For example, if I go ahead and change the buffer size from 1
to the number 7
, which is a prime number, followed by running our code using the go run
command, you can see, whilst it does print out the contents of our file, one two three four five
, it also prints out these other characters at the end, which don't exist inside of our words.txt
.
So what's going on? Where are these additional characters coming from? Well, these values are actually bytes that are left over inside of our buffer from a previous read. As you can see, there are 4 bytes in total, which correlate with the r, the space, the f and the i from the words.txt. This is happening because the last read of our file data only read in 3 bytes, which would be the v, the e and the newline \n character, which is hard to actually see.
And therefore, the last 4 bytes of the buffer - of which our buffer size is 7
- weren't being overwritten. To better show this, let me go ahead and print out the full buffer on a separate line each time we read, as follows. Let's also go ahead and create an int read counter, incrementing it with i++, and print it out here as well. And to make the newline character, which is denoted as \n, a little easier to see, let's replace every occurrence of it in the buffer with the dot character for the moment.
If I go ahead and run this code, you can see what's happening. The first read we're reading one two
. The second read we're reading three
. The third we're reading four fi
. And the fourth read we're actually reading ve.
. And then the r fi
is from the read above. To see another example of this, let's go ahead and change our buffer size to be 11
, which is another prime number. The reason prime buffer sizes are susceptible to this issue is that they're less likely to divide the file's total byte size evenly.
Let's also comment out the debugging print that we've got. And we'll just do fmt.Print(string(buffer))
. Now if I go ahead and run this code, you can see that we have even more garbage values, although this time you can see the word four, a space, and the first three characters of five. If I go ahead and print this out with the debugging output so we can see it a little better, here you can see what is actually happening.
This time, nine bytes left over from the second read are being printed out on read number three, which should only be printing its first two bytes. So how do we go about solving this? Well, this is where the first return value of the Read
method comes in - the size
. As you'll remember, this is the number of bytes that have been read by the Read
function, which means it's the number of new bytes that have been placed into the byte buffer. This means we can use this value in order to know which bytes we should ignore.
So let's go ahead and fix our issue by using it. To do so, let's go ahead and get rid of this code that we set up in order to see this issue better. Then let's go ahead and create a new variable called subBuffer
. This variable, as the name implies, is going to store the subslice of the buffer that contains only the recently read bytes. We can derive a subslice in Go using the following syntax: where we pass in the start of the slice that we want, followed by a colon, and the end of it.
In this case we would be taking a subslice of the buffer, starting at index 1 and ending at index 3 [1:3]
, where the end index is exclusive - which means we would have a byte slice of length 2, where the first value would be the value at index 1 and the last would be the value at index 2, not 3. To better show this, here I have a simpler example, where we have a buffer with the values a, b, c, d and e. If I go ahead and run this code using the following go run
command, you can see it prints out the values of our subBuffer
, which at the moment is exactly the same as the buffer, containing bytes that are represented as a, b, c, d and e.
If we want to create a subBuffer which contains the first two values, we can do so as follows. First we pass in the start index we wish to take from, which in this case is going to be the zeroth index. Next we pass the separator, which is a colon, followed by the end index. In this case we want the first two values, which sit at the zeroth and first indices, so the end index we actually need to pass is the second index, which marks the boundary up to which we want to take. This can sometimes be a little confusing, which is why we're doing this example.
Now if I go ahead and run this code, you can see that our subBuffer
only contains a
and b
. Additionally, if we want to take the middle three values, we can go ahead and change our code again to take from the first index, which will be b, all the way up to and including the third index, which means we need to specify the fourth index as the end bound. Now if I go ahead and run this code again, you can see our slice is b, c, and d. Additionally, this syntax also comes with some syntactic sugar whenever we want to take from the start or end of a slice.
For example, if I go ahead and set this to be three and just remove the first value, then if we go ahead and run this you'll see that it takes from the start of our buffer, a, all the way up to index three, not including of course, which means we're taking a, b, and c. This also works for the end of the buffer as well. So if we pass in three followed by a colon and don't specify an upper to our subslice, if we go ahead and run this again, you'll see we get d and e, which is the third index and fourth index of our buffer
now placed inside of our subBuffer
.
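The subslice examples above can be sketched in one small program:

```go
package main

import "fmt"

func main() {
	buffer := []byte{'a', 'b', 'c', 'd', 'e'}

	// Both bounds given: start is inclusive, end is exclusive.
	fmt.Println(string(buffer[0:2])) // ab  (indices 0 and 1)
	fmt.Println(string(buffer[1:4])) // bcd (indices 1, 2 and 3)

	// Syntactic sugar: omit the start to take from the beginning,
	// or omit the end to take through to the end of the slice.
	fmt.Println(string(buffer[:3])) // abc
	fmt.Println(string(buffer[3:])) // de
}
```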
Hopefully that explains how we can take a subslice from a slice, therefore let's go ahead and use it in order to take a subslice of our buffer up to our size. Now if we go ahead and run this code, we should see that our erroneous values are no longer being printed and the exact same string as our words.txt
is printed out. Additionally if we go ahead and change this to be the value of seven and run our code again, again you can see we're no longer getting any garbage values printed out after our words.txt
string, letting us know that the problem has been resolved.
Additionally we can also get rid of this subBuffer
code and just subslice the buffer directly with buffer[:size]. And we can check that this is working again, which it is. With our PrintFileContents
function now working, let's go ahead and load in our lotsofwords.txt
file to make sure that we're no longer crashing. Before we do so, we're first going to want to set a larger number for our bufferSize
. If you'll remember, earlier I mentioned that reading in a single byte at a time from a file is rather inefficient.
And this will also be the case if we try and read only seven bytes in at a time. For example, if I go ahead and run this code, you can see that it won't finish in any short amount of time. There are a couple of reasons why this is going to take a long time. The first is the buffer size being suboptimal, but the second is that we're also printing to standard out, which, believe it or not, is actually quite a slow operation.
Therefore, in order for this to finish in a reasonable amount of time, let's remove the prints of our buffer
to standard out and instead keep track of the total size or total number of bytes that we've read. Adding in the following totalSize += size
. We'll then print this out using fmt.Println("total bytes read:", totalSize). As you can see, this code is still running (it's been going for two minutes and four seconds now), so let's go ahead and change the buffer size to be 1024, which is about a kilobyte in size.
This 1024
also isn't optimal, although it is more so than the actual seven bytes we were reading before. We'll talk about what the optimal buffer size is in a minute. For now, if we go ahead and run this code, you can see it's pretty quick. And to prove that the buffer size has an impact, let's go ahead and change this value to 8 and run this code again. And you can see this time it's a lot slower.
So what is the ideal buffer size? Well, as a start, you'll want to use a power of 2
. For instance, in this case we're using 8, which is a power of 2, and 1024 is also a power of 2. However, in addition to using a power of 2, we should also use a multiple of the operating system's pre-configured block size. For most operating systems, this is going to be 4096
. You can actually obtain this value from your operating system a number of different ways, depending on which file system that you're using.
In my case, because I'm using btrfs, I can use the following sudo btrfs inspect-internal dump-super -f
, pointing it to my actual partition, which isn't my nix root partition. Let's go ahead and set this to be nix home
. And then grepping for the sector size. If I run this, you can see my sector size is 4096
. Therefore, by using a multiple of this value, either 4096 or 8192, we ensure that we're not wasting any bytes the kernel has actually read from the file system itself.
Oh good, our actual 8-byte buffer finished in about 1 minute 23 seconds, considerably slower than the 1024. Let's go ahead and set our buffer size to be 8192. And to really test the speed, let's go ahead and build our code, and we'll use the time command in order to track how long the run takes. Here you can see it took 0.215 seconds in total. And if we go ahead and change this to be 1024, build our code again, and use the time command once more, you can see it took 0.868 seconds.
So considerably faster. In any case, that wraps up the end of this lesson, where we took a look at how we can actually handle large file sizes, which would previously crash our application. In the next lesson, we're going to build on top of the *os.File
type, using it in our algorithm in order to be able to count the words from files of any size.