File handling#

The data used in analyses, simulations or displays has to originate from somewhere. It can be live data retrieved from sensor measurements, downloaded from somewhere on the internet or taken from a database. But the most common way to work with data is probably to read it from and write it to files in the filesystem. Python supports this both in its standard library and through specialized third party packages. When it comes to working with different file formats, the question is mostly not whether you can work with the file in Python, but how. Almost always someone has had the same problem before and solved it by writing a Python package and publishing it for everyone to use freely.

But we should start with the basics and use plain standard library functionality to deal with files. In its standard library Python wraps functionality of the operating system itself to allow working with files. These low level builtin functions therefore work much the same as their counterparts in many other programming languages and are not specific to Python. However, the Python implementation makes some simplifications to give the API a more pythonic look.

To have some data to play with, we should create a text file with some contents first. And there is already the first special thing to take care of. A distinction is commonly made between text files (human readable) and binary files, and Python’s open function reflects it with separate text and binary modes. Text files can contain control sequences that give the file an inner structure. Control sequences exist, for example, for a new line or a carriage return, which make the text end somewhere on the line and continue on the line below. Adding a tabulator to indent the text in a line is possible as well. These originate from the corresponding actions on typewriters and are implemented as special unprintable characters placed in the text. Unprintable characters are those outside of the letter, digit and punctuation range in the ASCII character set. For example, a byte with the value 10 causes a line feed, whilst a byte with the value 13 causes a carriage return. Luckily for us, we do not have to create those bytes ourselves to insert them into our readable text, but can simply use escape sequences instead. A line feed can be inserted by adding the sequence \n to our text. Adding a \r causes a carriage return when the text is interpreted, whilst a \t indents the text to the right at that position. There are quite a few more control sequences around, but these are the most common ones.
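
To make the relation between escape sequences and byte values concrete, the following small sketch looks up the character codes with the builtin ord and prints a string through repr so the control characters stay visible. The variable names are chosen for illustration only.

```python
# Each escape sequence stands for exactly one character.
line_feed = "\n"
carriage_return = "\r"
tab = "\t"

# ord() returns the numeric code of a single character.
print(ord(line_feed))        # 10
print(ord(carriage_return))  # 13
print(ord(tab))              # 9

# repr() shows the escape sequences instead of interpreting them.
sample = "Hello\tWorld\n"
print(repr(sample))          # 'Hello\tWorld\n'
print(len(sample))           # 12: each escape sequence counts as one character
```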

Note

Please be aware that a proper new line works differently on Windows and Unix-like operating systems. Whilst a \n is enough on Unix systems to have the following text start at the beginning of the next line, the Windows convention uses two control characters, the sequence \r\n. When a file is opened in text mode, Python takes care of this for you: by default every \n you write is translated to the line ending of the platform, and the line endings found in the file are translated back to \n when reading. So writing \r\n yourself is only needed if you have disabled that translation, for example by opening the file in binary mode. Many external editors also support switching the new line sequence somewhere in their configuration or try to guess it correctly when reading a file. Within the Python environment a simple \n is therefore the right choice, as you can always re-read the file in the same way you wrote it.
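
The newline handling can be inspected directly. The sketch below is an illustration and not part of the original example: it uses the newline parameter of open to force Windows style line endings when writing, then reads the file back in binary mode to show the stored bytes. The file name newline_demo.txt and the temporary directory are assumptions made for the demonstration.

```python
import os
import tempfile

# Hypothetical file name inside a temporary directory, for illustration only.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "newline_demo.txt")

# newline="\r\n" asks Python to write every "\n" as "\r\n" (Windows style).
with open(path, "w", newline="\r\n") as f:
    f.write("Hello\nWorld\n")

# Binary mode shows the bytes exactly as they are stored in the file.
with open(path, "rb") as f:
    raw = f.read()
print(raw)  # b'Hello\r\nWorld\r\n'

# Text mode with the default newline handling converts "\r\n" back to "\n".
with open(path, "r") as f:
    text = f.read()
print(repr(text))  # 'Hello\nWorld\n'
```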

So let’s finally create our text file to play with and add some words to it, line by line.

We create a list of words, open the file named “words.txt” for writing in text mode using the parameter “w” and assign the resulting file handle to a variable called f. Next we loop over the elements of our list with a for loop, which assigns the next word in the list to the variable word on each iteration. The write method adds the text passed as its argument to the open file f. To end each word with a new line, we build an f-string that appends the new line control sequence \n to our variable. Finally we have to close our file.

words = ["Hello", "World", "February", "Eleven", "Python", "Data", "Cloud", "Wisdom", "Fun"]
f = open("words.txt", "w")
for word in words:
    f.write(f"{word}\n")
f.close()

Quite a few things happened in the background here. Although the code looks as though the file is filled word by word to create line after line, the writes have most likely been buffered to optimize performance. It’s very important to close the file after filling it to guarantee the data really is available as the contents of a file in the filesystem, and not just somewhere in the layers that abstract real file access away. If you don’t explicitly close the file yourself, chances are that the words will be written to disk only when the application ends. If you need to re-read the data at some point whilst the application writing it is still running, you cannot be sure the desired content is available.

There are ways to push the writes through (to some degree), for example by actively calling the flush method, or by disabling the underlying buffer by setting its size to zero bytes, which Python only permits for files opened in binary mode. But for this simple task, just taking care of properly closing the file is enough.
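
As a minimal sketch of forcing a write, the example below calls flush on a file that is still open and then reads the data back through a second file handle. The additional os.fsync call is an assumption about what one might want on top of flush: it asks the operating system to write its own buffers to the storage device. The file name flush_demo.txt is hypothetical.

```python
import os
import tempfile

tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "flush_demo.txt")

f = open(path, "w")
f.write("Hello\n")

# flush() hands Python's internal buffer over to the operating system.
f.flush()
# os.fsync() additionally asks the operating system to write to the storage device.
os.fsync(f.fileno())

# The file is still open for writing, but the line is already readable.
with open(path, "r") as reader:
    content_while_open = reader.read()
print(repr(content_while_open))  # 'Hello\n'

f.close()
```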

Actually Python offers a way to help you with this using a concept called Context Managers. A Context Manager frees allocated resources when the scope of a context is left. Context Managers are not available for everything out of the box, but we can implement them ourselves to keep resource handling clean in our own code. We will see this later when working with classes in object oriented designs.
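
To give a first impression of what such a self-made Context Manager could look like, here is a minimal sketch. The class ManagedResource and its attributes are hypothetical and only record whether the resource is currently in use. The two special methods \_\_enter\_\_ and \_\_exit\_\_ are what Python calls when a with block is entered and left.

```python
class ManagedResource:
    """Hypothetical resource that records when it is acquired and released."""

    def __init__(self, name):
        self.name = name
        self.is_open = False

    def __enter__(self):
        # Called when the with block is entered.
        self.is_open = True
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Called when the with block is left, even after an exception.
        self.is_open = False
        return False  # do not suppress exceptions


with ManagedResource("demo") as resource:
    print(resource.is_open)  # True

print(resource.is_open)  # False
```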

For files, however, a Context Manager already exists. It is used through the with statement, like this. We’ll simply re-create our file of words once more using a Context Manager.

with open("words.txt", "w") as f:
    for word in words:
        f.write(f"{word}\n")

That’s it. There is no need to take care of closing the file f any longer. Python closes it for you as soon as the with block is done.
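
The closing can be verified through the closed attribute of the file object. The sketch below also shows a try/finally construct that behaves roughly like the with block for this simple case. The file name closed_demo.txt and the variable g are assumptions made for the demonstration.

```python
import os
import tempfile

tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "closed_demo.txt")

with open(path, "w") as f:
    f.write("Hello\n")
    print(f.closed)  # False: still inside the with block

print(f.closed)  # True: the file was closed when the block ended

# Roughly what the with statement does for us behind the scenes.
g = open(path, "a")
try:
    g.write("World\n")
finally:
    g.close()  # runs even if write() raises an exception

print(g.closed)  # True
```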

Reading the words back works in much the same way. The algorithm goes like this:

  • Create a new empty list to store the words in.

  • Open the file, for reading this time, within a Context Manager.

  • Read the file line by line using the readline method.

  • Stop processing if no more lines can be read.

  • Remove the newline sequence at the end by stripping away white space at the end using rstrip.

  • Add the resulting word to the list of words.

One way to implement the algorithm could be the following.

words2 = []
with open("words.txt", "r") as f:
    while True:
        line = f.readline()
        if not line:
            break

        word = line.rstrip()
        print(word)
        words2.append(word)
Hello
World
February
Eleven
Python
Data
Cloud
Wisdom
Fun

Python 3.8 and later offer the assignment expression operator :=, nicknamed the walrus operator, which can be used to simplify reading the file further.

I think that’s quite an elegant approach, as it combines the loop condition, the exit check and the variable assignment in one readable and understandable expression.

words2 = []
with open("words.txt", "r") as f:
    # Assign the next line and test it in one step; an empty string ends the loop.
    while line := f.readline():
        word = line.rstrip()
        print(word)
        words2.append(word)
Hello
World
February
Eleven
Python
Data
Cloud
Wisdom
Fun

The read back list looks like this now.

words2
['Hello',
 'World',
 'February',
 'Eleven',
 'Python',
 'Data',
 'Cloud',
 'Wisdom',
 'Fun']

And it passes a comparison with the original list as well.

words == words2
True
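
As an aside, a file object opened in text mode can also be iterated over directly, which yields one line per iteration. The following sketch is an alternative to the readline loops and not the approach used in the text. It writes its own copy of the words to a hypothetical file words_demo.txt in a temporary directory, so that it can be run on its own.

```python
import os
import tempfile

words = ["Hello", "World", "February", "Eleven", "Python", "Data", "Cloud", "Wisdom", "Fun"]

tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "words_demo.txt")

with open(path, "w") as f:
    for word in words:
        f.write(f"{word}\n")

# Iterating over the file object yields one line per iteration.
with open(path, "r") as f:
    words3 = [line.rstrip("\n") for line in f]

print(words3 == words)  # True
```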

So we’ve correctly serialized and de-serialized our first data set here. Let’s move on to a more complex example then. What about saving measurement or simulation data?

The many ways to read and write files#

Handling file contents in one go.#

The example we’ve seen above was just one way of handling files. It suited our needs, as the words had to be written line by line, so a line based approach was the obvious match. The small amount of data we had to handle would also have allowed us to read the file in one go. We can use the read method for that.

with open("words.txt", "r") as f:
    text = f.read()
    print(text)
Hello
World
February
Eleven
Python
Data
Cloud
Wisdom
Fun

The output looks pretty much the same here, though on closer inspection it is not quite the same. This time we’ve printed the whole contents of the file at once, including the newline control sequences \n, which structure the text into line by line output. The result is still a single string, so to get the original input back, we need to process it a little more.

The string type offers a handy method for that called split. It takes a string as an argument, which is used as the separator to split the string into a list of strings, the words in our case. If we use the newline control sequence \n for splitting, we should be able to re-create our input. Let’s see.

words2 = text.split("\n")
words2
['Hello',
 'World',
 'February',
 'Eleven',
 'Python',
 'Data',
 'Cloud',
 'Wisdom',
 'Fun',
 '']
words2 == words
False

Okay, so we’re not quite there. There’s a trailing item, an empty string, which makes our result differ from the original input. This is because of the way we’ve written our words to the file, adding a newline character to the end of each word, including the last one. The split method always creates a result with items to the left and to the right of each separator, so we receive one more item because of the very last newline character. But we already know the answer to that. We just need to leave the last item out after splitting the text to get our input back.

words2 = text.split("\n")[:-1]
words2
['Hello',
 'World',
 'February',
 'Eleven',
 'Python',
 'Data',
 'Cloud',
 'Wisdom',
 'Fun']
words2 == words
True
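
Slicing off the last item is one option. Another one, shown here only as an alternative sketch, is the string method splitlines, which treats a newline as a line terminator rather than a separator and therefore produces no trailing empty string. Note that splitlines also splits on a few other line boundary characters, such as \r, so it is not an exact replacement for split("\n") on arbitrary text. The sample strings are made up for the demonstration.

```python
text_with_trailing_newline = "Hello\nWorld\nFun\n"

# split("\n") keeps an empty string after the final newline.
print(text_with_trailing_newline.split("\n"))   # ['Hello', 'World', 'Fun', '']

# splitlines() treats the newline as a line terminator instead.
print(text_with_trailing_newline.splitlines())  # ['Hello', 'World', 'Fun']

# Without a trailing newline both give the same result.
text_without = "Hello\nWorld\nFun"
print(text_without.split("\n") == text_without.splitlines())  # True
```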

There’s also the inverse operation to splitting a text into items, which is join. By using this we can create a symmetrical file handling operation again, very much like the initial approach, where we’ve written our words line by line, to read them back line by line afterwards. This time we’re about to join a list of words before writing it in one go, followed by reading it as a whole and splitting it into words again. It goes like this.

text = "\n".join(words)
print(text)

with open("words.txt", "w") as f:
    f.write(text)
Hello
World
February
Eleven
Python
Data
Cloud
Wisdom
Fun

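This joins the words in our list using the newline character and writes the result to the file. Because join places the separator only between the items, there is no trailing newline this time, and reading the text back works as expected now.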

with open("words.txt", "r") as f:
    text = f.read()
words2 = text.split("\n")
words2
['Hello',
 'World',
 'February',
 'Eleven',
 'Python',
 'Data',
 'Cloud',
 'Wisdom',
 'Fun']
words2 == words
True
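
For this kind of one go handling, the standard library module pathlib offers a compact alternative. The sketch below is not the approach used in the text: it uses the write_text and read_text methods of a Path object, which open the file, transfer the whole string and close the file again. The file name words_demo.txt in a temporary directory is an assumption.

```python
import tempfile
from pathlib import Path

words = ["Hello", "World", "February", "Eleven", "Python", "Data", "Cloud", "Wisdom", "Fun"]

# Hypothetical file location, for illustration only.
path = Path(tempfile.mkdtemp()) / "words_demo.txt"

# write_text() opens the file, writes the string and closes the file again.
path.write_text("\n".join(words))

# read_text() does the same for reading.
words4 = path.read_text().split("\n")
print(words4 == words)  # True
```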

Reading and Writing the file character by character.#

We could have read the file contents in smaller chunks of a defined size as well. This would be the approach to use if you know nothing about the inner structure of the file to be processed. Say you’re reading a very large file of words, a real dictionary maybe, without any structuring elements, so with no newline characters inside. Then the readline approach won’t help, as there are no lines in the file. The file might also be too big to be handled in one go, as it won’t fit into RAM. In that case we have to read the text in small chunks. We can again use the read method for that, as it takes an optional argument specifying the size of the chunk to read per call. In our simple case here, we will first re-create the text string holding the file contents and split the text afterwards.

text = ""
with open("words.txt", "r") as f:
    # Read at most four characters per call until read returns an empty string.
    while chunk := f.read(4):
        text += chunk
print(text)
Hello
World
February
Eleven
Python
Data
Cloud
Wisdom
Fun
words2 = text.split("\n")
words2
['Hello',
 'World',
 'February',
 'Eleven',
 'Python',
 'Data',
 'Cloud',
 'Wisdom',
 'Fun']
words2 == words
True

This example does not really fit the scenario just described. It works because the amount of data we handle is small, but it still collects the whole text in memory, so for a text that won’t fit into RAM it does not help. A simple working example for that case is slightly more complex.

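chunk_size = 4

words2 = []

with open("words.txt", "r") as f:
    word = ""
    while chunk := f.read(chunk_size):
        for char in chunk:
            if char == "\n":
                # A newline completes the current word.
                words2.append(word)
                word = ""
            else:
                word += char
    # Keep the last word, which is not followed by a newline.
    if word:
        words2.append(word)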
    
words2
['Hello',
 'World',
 'February',
 'Eleven',
 'Python',
 'Data',
 'Cloud',
 'Wisdom',
 'Fun']
words2 == words
True

Play with the chunk_size a little to convince yourself that the algorithm is correct.
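
One way to do this systematically is sketched below. The function iter_words is a hypothetical helper that wraps the same character by character logic in a generator, so that words are handed out one at a time instead of being collected in a list first. The loop then checks several chunk sizes against the original list. The file name words_demo.txt in a temporary directory is an assumption.

```python
import os
import tempfile


def iter_words(path, chunk_size=4):
    """Yield newline separated words from path, reading chunk_size characters at a time."""
    word = ""
    with open(path, "r") as f:
        while chunk := f.read(chunk_size):
            for char in chunk:
                if char == "\n":
                    yield word
                    word = ""
                else:
                    word += char
    # Yield whatever follows the last newline, if anything.
    if word:
        yield word


words = ["Hello", "World", "February", "Eleven", "Python", "Data", "Cloud", "Wisdom", "Fun"]

tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "words_demo.txt")
with open(path, "w") as f:
    f.write("\n".join(words))

# Every chunk size should produce the same list of words.
for size in (1, 2, 3, 4, 7, 100):
    assert list(iter_words(path, chunk_size=size)) == words
print("all chunk sizes agree")
```

Because the generator only keeps the current chunk and the current word in memory, this form is closer to what the large file scenario requires, as long as the caller processes the words one by one.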