Chp 5: Strings and Files

From Logic to Data

We've spent most of our time with Python's logic - variable assignment and reassignment, functions and function calls, iteration and selection. We did encounter a number of Python's data types (ints, floats, Booleans, strings and lists) but didn't have much to say about them. Let us begin to fix that. Our topic in this and the next two chapters will be the collection data types, i.e. strings, lists and dictionaries (and a few others). An object of the string, list or dictionary type is a collection of objects; it is many objects collected together.

In this chapter we study strings. (And files too! But those, as we will see, are really just strings.)

Make a String

A string is a sequence of characters. Any characters. Of any length. Even zero length. The string of zero length is the empty string.

How do we make a string? One way is to enclose characters in quotes, either single, double, or triple.  The first string below is the empty string. The second is a string whose only character is a space.

>>> ''

''

>>> ' '

' '

>>> "spam"

'spam'

>>> '''spam'''

'spam'

>>> type('spam')

<class 'str'>

Note that Python reports back with single quotes.

Be careful here. The quotation marks at the start and the end of 'spam' are not part of the string. Instead the first quotation mark says "The string starts with the character after me!" and the second says "The string stopped at the character before me!" Here's a fancy way to say that: quotation marks are the string formation operator.

This of course gives rise to a question: How do we include quotation marks in a string? I'll let you Google that. (This will also answer the question of why Python allows us to use single, double and triple quotes to create strings.)

Here's a second way to make a string: convert a non-string type object to a string with the str type-conversion function. str takes an object and converts it, if it can, into a string.

>>> str(2)

'2'

>>> str(2.0)

'2.0'

>>> str([2])

'[2]'

Strings and Numbers

I know, I know. A number is a sequence of characters too, in particular a sequence of digits. But that doesn't make a number a string, for the digits in a number have a significance that characters in a string do not. The string '123' is just the character '1' followed by the character '2' followed by the character '3'. That's it. In number 123, however, the '1' is more than just the first character. It signifies that we have 1 hundred; and the 2 and 3 signify that we have 2 tens and 3 ones respectively.

So remember: a string is nothing more than certain characters in a certain order.
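Here's a small sketch of that difference. The variable names are mine, chosen only for illustration; int is the type-conversion function that goes the other way from str.

```python
# The same three characters, once as a string and once as a number.
as_text = '123'
as_number = 123

# With strings, + joins characters end to end.
print(as_text + '1')          # 1231

# With numbers, + does arithmetic, because each digit has a place value.
print(as_number + 1)          # 124

# int converts a string of digits to a number; str converts back.
print(int(as_text) + 1)       # 124
print(str(as_number) + '1')   # 1231
```

Same characters, very different behavior. The type decides what + means.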

Traverse a String

Often we'll need to process a string. That means we'll step through it, character by character, and do something with what we find. Perhaps count characters of a particular kind. Or strip out certain characters. How will we accomplish that? A for loop of course. Consider the function below. It takes a string and a character, and returns the number of that character in that string.

1. def char_counter(a_str, char_to_count):

2.     # count and return the number of char_to_count in a_str

3.     count = 0

4.     for next_char in a_str:

5.         if next_char == char_to_count:

6.             count += 1

7.     return count

You read it, didn't you? Carefully, right? No? Then go do that.

Let's look at line 4: next_char is the loop variable and a_str is the sequence over which we iterate. So the first time through the loop, next_char is the first character in a_str. The second time through, it's the second character in a_str. Etc. We iterate once for each character in a_str.

Below are a few trial runs. Note that we do indeed count the number of occurrences of char_to_count in a_str.

>>> char_counter('spam and eggs', 's')

2

>>> char_counter('ham and jam and spam a lot', 'a')

6

>>> char_counter('ham and jam and spam a lot', 'z')

0
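The same traversal pattern handles the other task mentioned above, stripping out certain characters. Below is a sketch; remove_char is a name I made up for it, and it builds its result with the + operator, which joins two strings (more on that later in the chapter).

```python
def remove_char(a_str, char_to_remove):
    # build and return a copy of a_str without any char_to_remove
    result = ''
    for next_char in a_str:
        if next_char != char_to_remove:
            result = result + next_char
    return result


print(remove_char('spam and eggs', ' '))   # spamandeggs
print(remove_char('spam and eggs', 'z'))   # spam and eggs
```

Note the shape: start with an empty result, step through the string, and keep only what passes the test.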

Indices and Slices

Let's say we have a sequence of objects and wish to specify their positions. Outside CS, we say 1st, 2nd, 3rd etc. Inside CS, we begin with 0th and then go on to 1st, 2nd, etc. (Why? Amuse yourself with a Google search. I did.)  In CS, when we give position, we start with 0.

An index is a position in an ordered sequence. Consider the string "spam and eggs". The index of the "s" that begins the string is 0, the "p" next has index 1, etc.

We can pull out a character from a string by use of indices. We place an index in square brackets after the string. The value returned is the character at that index. Witness:

>>> mp_str = 'ham and jam and spam a lot'

>>> mp_str[0]

'h'

>>> mp_str[2]

'm'

>>> mp_str[8]

'j'

We can also pull out a sequence of characters. When we do, the sequence of characters we get (either sequential or not) is called a slice.

How do we take a slice? We place a pair of indices separated by a colon in square brackets after the string. If s is the first index and f the second, the string returned is the sequence of characters that begins at index s and ends at index f - 1. Behold:

>>> mp_str[0:3]

'ham'

>>> mp_str[3:7]

' and'

If we add a third value, it's the step value. (You've noticed, haven't you, the similarity to the range function?) Thus if we have [s:f:t], we first take the character at index s. Next we take the character at index s + t. After that comes the character at index s + 2t. Etc. As with range, we don't allow ourselves to reach or exceed f.

If the third value is 2, we take every second character. If 3, we take every third. Observe:

>>> mp_str[0:26:2]

'hmadjmadsa  o'

>>> mp_str[0:26:3]

'h daa aao'

>>> mp_str[0:26:4]

'hajas o'

We can omit one, or two, or all three of the values in a slice. If we omit the first, it defaults to 0.

>>> mp_str[:26:4]

'hajas o'

If we omit the second, it defaults to that value which will carry us through to the end of the string.

>>> mp_str[0::4]

'hajas o'

If we omit the third, it defaults to 1.

>>> mp_str[0:26:]

'ham and jam and spam a lot'
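Here's a short sketch that puts the defaults to work on the same string. Nothing new here, just the rules above in combination.

```python
mp_str = 'ham and jam and spam a lot'

# Omit the start: begin at index 0.
print(mp_str[:3])            # ham

# Omit the stop: run through to the end.
print(mp_str[16:])           # spam a lot

# Omit start and stop, keep the step.
print(mp_str[::4])           # hajas o

# Omit all three: a copy of the whole string.
print(mp_str[:] == mp_str)   # True
```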

Index Out of Range

Be careful with indices. You can easily crash out. Like this:

>>> a_str = 'spam a lot'

>>> a_str[10]

Traceback (most recent call last):

  File "<stdin>", line 1, in <module>

IndexError: string index out of range

The 't' at the end of 'spam a lot' has index 9. 10 takes us past the end of the string and so we crash.

Slices are more tolerant.  If you attempt to slice past the end of a string, you'll get the empty string back.

>>> a_str[1024:2048]

''

Or if your start index is indeed in the string but the stop index is not, we just slice through to the end of the string.

>>> mp_str[4:1024]

'and jam and spam a lot'
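If you'd rather not crash, check the index before you use it. The sketch below is one way to do that. The helper name char_at is hypothetical, it handles only non-negative indices, and it uses the len function, which returns the length of a string.

```python
def char_at(a_str, i):
    # return the character at index i, or '' if i is not a valid index
    # (hypothetical helper; non-negative indices only)
    if 0 <= i < len(a_str):
        return a_str[i]
    return ''


a_str = 'spam a lot'
print(char_at(a_str, 9))     # t
print(char_at(a_str, 10))    # prints an empty line

# A slice of width one is tolerant in the same way.
print(a_str[9:10])           # t
print(a_str[10:11] == '')    # True
```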

Negative Indices

We can also pull out characters from a string by use of negative indices. The idea is simple: imagine that we traverse a string from final character to first, that is from right to left, and that when we do so, we begin with the index -1. For instance:

>>> 'spam'[-1]

'm'

>>> 'spam'[-2]

'a'

>>> 'spam'[-3]

'p'

Slices with negative indices work as expected. Consider:

>>> mp_str[-1:-4:-1]

'tol'

We begin the slice at index -1 (the end), we increase by -1 at each step (that's the second -1), and we stop before we get to or pass -4. Note that the second -1 is necessary. mp_str[-1:-4] yields the empty string since the step value defaults to 1.
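A sketch that may help: index -k names the same character as index (length - k). The loop below checks that for every character of 'spam'. It uses the len function, which returns the length of a string.

```python
a_str = 'spam'
n = len(a_str)   # 4

# Print each negative index next to its non-negative twin.
for k in range(1, n + 1):
    print(-k, a_str[-k], n - k, a_str[n - k])

# A step of -1 with start and stop omitted walks the whole string backwards.
print(a_str[::-1])   # maps
```

The loop prints -1 m 3 m, then -2 a 2 a, then -3 p 1 p, then -4 s 0 s.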

Len

I'm so very sorry, but Python is about to defy your expectations. I told you that the index of the initial element of a string is 0. So I wouldn't blame you at all if you conclude that the length of 'a' is 0, the length of 'ab' is 1, the length of 'abc' is 2, etc. But that's wrong. Look:

>>> len('a')

1

>>> len('ab')

2

>>> len('abc')

3

(As should be obvious, len is Python's length function. It takes a string and returns the length of that string.) Why the apparent inconsistency? Why do lengths begin at 1 and indices at 0?

Answer: size begins at 0 too!

>>> len('')

0

The length of the empty string is 0!

Satisfied? Even if not, you have to accept the fact: the index of the initial element of a string is 0, but 0 is not the length of a string that contains 1 character.

Traverse by Index

We can make use of indices to traverse a string. The trick here is to use the range function to generate a sequence of indices. (Why traverse with indices? Why not simply use the for char in a_str construction from above? Versatility, as you will see.)

Let's first traverse a string from initial character to final by index.

p_str = 'ham and jam and spam a lot'

for i in range(0, len(p_str)):

    print(i, p_str[i])

Note the stop value in the range: the final index is the length of the string minus 1. (Don't get that? Pause and think. Perhaps reread the previous section.)

Output:

0 h

1 a

2 m

3

4 a

5 n

6 d

7

8 j

9 a

10 m

11

12 a

13 n

14 d

15

16 s

17 p

18 a

19 m

20

21 a

22

23 l

24 o

25 t

There it is: every character from the string tagged with its index.

Now let's say we wished to skip the first two characters in the string:

p_str = 'ham and jam and spam a lot'

for i in range(2, len(p_str)):  # start = 2

    print(i, p_str[i])

Now let's say we wished to skip both the first two and the last two:

p_str = 'ham and jam and spam a lot'

for i in range(2, len(p_str) - 2): # start = 2, stop = len - 2

    print(i, p_str[i])

Now let's say we wished to skip every other character:

p_str = 'ham and jam and spam a lot'

for i in range(0, len(p_str), 2):  # step = 2

    print(i, p_str[i])

Like I said, indices give us versatility.
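Here's one more thing indices make easy: we can look at a character and its neighbor at the same time. The sketch below counts the places where a character is immediately repeated. The function name count_doubles is my own invention.

```python
def count_doubles(a_str):
    # count the positions where a character equals the one right after it
    count = 0
    # stop one short of the final index so that i + 1 is always valid
    for i in range(0, len(a_str) - 1):
        if a_str[i] == a_str[i + 1]:
            count += 1
    return count


print(count_doubles('eggs'))   # 1
print(count_doubles('aaa'))    # 2
print(count_doubles(''))       # 0
```

A plain for char in a_str loop gives us one character at a time and no easy way to peek ahead. With an index, the neighbor is just i + 1.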

String Functions

A string function is a function that either takes a string as argument or returns a string or both. Python provides us with many. Some are prefix, some are infix, some are suffix.


Prefix

Prefix functions precede their arguments. Among the prefix string functions are str and  len.

str 

Convert an object to string type. If the object is already a string, no change is made.

Example: str(2) yields '2'.

len 

Return the length of a string. Unlike with indices, we count length from 1.

Example: len('spam') yields 4.

chr, ord

Every character on your keyboard has an associated number. (Google "ascii table" if you'd like to see them all.) The ord function returns the number associated with a character, and chr returns the character associated with a number.

Example: ord('S') returns 83 and chr(83) returns 'S'.


Infix

Infix functions, also called operators, come between their inputs.

in 

Return True if the left string is part of the right string, False otherwise.

Examples: 's' in 'spam' yields True, 'sa' in 'spam' yields False.

+  

Concatenate the left and right strings.

Example: 'spam ' + 'and' + ' eggs' yields 'spam and eggs'

<, >, <=, >= , ==, !=

Of two strings, the lesser is the one that comes before the other when the two are placed in alphabetical order. Upper case comes before lower case. (Does that seem strange? Please do go read about the ASCII Table.)

Examples: 'spam' < 'eggs' yields False, 'spam' <= 'spam' yields True and 'spam' < 'Spam' yields False.
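The ord function shows why upper case comes first. A quick sketch:

```python
# Upper case letters have smaller numbers than lower case letters.
print(ord('S'), ord('s'))    # 83 115

# So a string that starts with 'S' is less than one that starts with 's'.
print('Spam' < 'spam')       # True
print('spam' < 'Spam')       # False

# chr and ord together let us step to the next letter.
print(chr(ord('a') + 1))     # b
```

Strings are compared character by character, and each pair of characters is compared by number.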


Suffix

Suffix functions follow their arguments. One particular variety of a suffix function is a method. The syntax of a method call is: object.method(). Understand that the object which precedes the dot is the first argument of the function. So we say that the method is called on the object.

Python provides many string methods. Some replicate prefix functions. Here are a few.

count 

Count the number of occurrences of a sub-string within a string. As with len, we count from 1.

Examples: 'aabccc'.count('c') yields 3, 'abaccaba'.count('aba') yields 2.

find 

Return the index of the first character of the first occurrence of a sub-string within a string. If the string contains no occurrences of the sub-string, return -1.

Examples: 'spam a lot'.find('s') yields 0, 'spam a lot'.find('a') yields 2, 'spam a lot'.find('lo') yields 7, 'spam a lot'.find('z') yields -1.

upper, lower 

Convert to either all upper or all lower case. Examples: 'abc'.upper() yields 'ABC' and 'ABC'.lower() yields 'abc'.
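A word of caution about find, in sketch form. The -1 it returns for "not found" is also a perfectly good index, so test for it before you use it.

```python
a_str = 'spam a lot'

# Found: use the index to slice from the match onward.
where = a_str.find('lot')
if where != -1:
    print(a_str[where:])      # lot

# Not found: find returns -1 ...
missing = a_str.find('z')
print(missing)                # -1

# ... and -1 is a valid index, so this gives the last character, not a 'z'.
print(a_str[missing])         # t

# count needs no such care; zero simply means none.
print(a_str.count('a'))       # 2
print(a_str.count('z'))       # 0
```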


Don't think these are all the string functions! A Google search will reveal many more.

Strings are Immutable!

None of the functions above that return a string change the argument string or strings. Instead if they return a string, they create a new string. Consider for instance concatenation:

>>> a_str = 'spam'

>>> a_str

'spam'

>>> a_str + ' and eggs'

'spam and eggs'

>>> a_str

'spam'

The concatenation of a_str and ' and eggs' did not modify a_str, as we saw when we asked for the value of a_str at the end. Instead a new string was produced - 'spam and eggs'; and since that new string wasn't captured by a variable, its value was lost.

If we wish a_str to hold the new string, here's the pattern we want:

>>> a_str = a_str + ' and eggs'

The right side creates a new string, and that new string becomes the new value of a_str.

Consider now the index operator. Perhaps you've tried something like a_str[3] = 'm' in the hopes that you can change the character at index 3 to an 'm'. Well, let's try that.

>>> a_str = 'spat'

Drat! I meant 'spam', not 'spat'. So let's try to fix it.

>>> a_str[3] = 'm'

Traceback (most recent call last):

  File "<pyshell#5>", line 1, in <module>

    a_str[3] = 'm'

TypeError: 'str' object does not support item assignment

The message, paraphrased, is that strings are immutable. We can't change them. Instead we must make new strings.
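So how do we get 'spam' from 'spat'? We build a new string from slices of the old one. Here's a sketch; replace_at is a hypothetical helper of my own, not something Python provides.

```python
a_str = 'spat'

# Keep everything before index 3, then add the character we want.
a_str = a_str[:3] + 'm'
print(a_str)   # spam


def replace_at(s, i, new_char):
    # return a new string like s but with new_char at index i
    # (hypothetical helper; assumes i is a valid non-negative index)
    return s[:i] + new_char + s[i + 1:]


print(replace_at('spat', 3, 'm'))   # spam
print(replace_at('spam', 0, 'S'))   # Spam
```

The original string is never touched. We slice out the parts we want to keep and concatenate them around the replacement.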

Space Strip

Let's write a little code together. The task I have in mind is to strip spaces from the start and the end of a string.  For instance, '   spam a lot     ' should become 'spam a lot'.

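How should we approach this task? As always, the question is: What functions do we need? Here's my suggestion: one function to strip initial spaces, one to reverse a string, and one that puts those two together to strip both ends.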

How will we strip initial spaces?  Here's an idea: find the index of the first non-space and then slice from that index through to the end. 

def init_strip(s):

    # strip all initial spaces from string s

    i = 0

    while i < len(s):

        if s[i] != ' ':

            first = i

            break

        i += 1

    return s[first:]

Let's give it a try:

>>> init_strip('   spam')

'spam'

Seems good! We'd better try edge cases though. How about the empty string? Or a string that contains only spaces? We should get the empty string back for both.

>>> init_strip('   ')

Traceback (most recent call last):

  File "<stdin>", line 1, in <module>

  File "<stdin>", line 10, in init_strip

UnboundLocalError: local variable 'first' referenced before assignment

What is this! A crash? Ah, I see why. We never did find a non-space, so first was never assigned a value.  How do we fix that? Give first a default value. What should that be? len(s) should work. If that default is never overwritten, we'll slice from len(s), which is greater than the index of the final character; and a slice like that returns the empty string.

The function fixed:

def init_strip(s):

    # strip all initial spaces from string s

    i = 0

    first = len(s)

    while i < len(s):

        if s[i] != ' ':

            first = i

            break

        i += 1

    return s[first:]

Test cases:

>>> init_strip('   spam')

'spam'

>>> init_strip('   ')

''

>>> init_strip('')

''

I think we're good. The edge cases are now handled correctly.

Now we need a function to reverse a string. Here's a clever solution. Notice that we concatenate on the left.

def rev(s):

    # reverse string s

    r = ''

    for c in s:

        r = c + r

    return r

Let's test. Notice that the one edge case - the empty string - is handled correctly; the reverse of the empty string is the empty string.

>>> rev('spam')

'maps'

>>> rev('   spam')

'maps   '

>>> rev('   ')

'   '

>>> rev('')

''

Finally, let's write a function that strips all initial and terminal spaces. Its body is a one-liner.

def strip(s):

    # strip all initial and terminal spaces from string s

    return rev(init_strip(rev(init_strip(s))))

That's a beauty, isn't it? Good code is like Lego: simple pieces that fit together in well-defined ways. Let's see it at work.

>>> strip('  spam')

'spam'

>>> strip('spam   ')

'spam'

>>> strip('  spam a lot     ')

'spam a lot'

>>> strip('   ')

''

>>> strip('')

''

The code above illustrates nicely the mantra of Think Functional! Write functions. Make them single purpose. Test each; include all edge cases. The solution to your problem is the composition of the functions you've written.
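One more way to test: compare against something we already trust. Python strings have a built-in strip method, and when it's passed ' ' it removes spaces from both ends. The sketch below repeats the three functions so that it runs on its own, then checks that ours agrees with Python's on a handful of cases. The list of cases is my own choice.

```python
def init_strip(s):
    # strip all initial spaces from string s
    i = 0
    first = len(s)
    while i < len(s):
        if s[i] != ' ':
            first = i
            break
        i += 1
    return s[first:]


def rev(s):
    # reverse string s
    r = ''
    for c in s:
        r = c + r
    return r


def strip(s):
    # strip all initial and terminal spaces from string s
    return rev(init_strip(rev(init_strip(s))))


test_cases = ['  spam', 'spam   ', '  spam a lot     ', '   ', '', 'spam']
for case in test_cases:
    # our function and the built-in method should give the same answer
    assert strip(case) == case.strip(' ')
print('all cases agree')
```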

Files

Let's end with a little file work. I'll only tell you as much as I need you to know to write the functions in the function set. But you'll get what's most important.

Understand that when I say just "file", I mean "data file". Programs are stored in files too, but when we say "file" we won't mean a file of that sort.

So, what is a file? A place to store data. Of what sorts? Strings. Files are strings. But unlike the strings we store in variables, they aren't lost when the program ceases execution. Files are persistent.

I use repl.it with my students. Below is a snapshot of the IDE. (Your IDE probably has such a view as well.) Look to the left. You'll see two files, main.py and class_data.txt. The first is the program file, the second the data file. The contents of the data file are shown below. (If you want to grab the class data, go to the Gradebook project. It's also at repl in a project I've begun.)

Read a Line

We need to know how to access the data in a file, and we need to know how to write to a file. (I'll assume throughout that the data file and the program file reside in the same directory. That way we don't have to specify a path.) Let's have a bit of code. I assume that you've created a file named "class_data.txt" and put the class data from the Gradebook project in it.

data_file = open('class_data.txt', 'r') # 'r' for read only

a_line = data_file.readline()

print(a_line)

data_file.close()

On the first line, we have the new function open. (You knew it was a function, of course, since it's followed immediately by parentheses.) The open function was sent two arguments, 'class_data.txt' and 'r'. They're both strings. The first is the name of the file that will be opened. The second specifies what we'll do with that open file; the 'r' is for read only, which means that we'll only get data from the file and not alter it. The open function returns an open file and data_file then becomes a name of that open file.

The second line calls the readline() method on data_file, and that method returns the first line of data. In string form of course. readline always returns strings. (Good question: how does Python know that it's hit the end of a line? Answer below.) If you grabbed the class data from the Gradebook project and put it in 'class_data.txt' (go do that now if you haven't already), the call to print on the third line will give us this output:

last, first, hw, hw, project, quiz, hw, hw, hw, quiz, hw, hw, project

On the last line, we have our second new name, close. It's a method, called on data_file. Always close your files when you're done with them.

Now let's do it again, but this time let's read two lines. (Are you excited? I'm excited!)

data_file = open('class_data.txt', 'r') # 'r' for read only

a_line = data_file.readline()

print(a_line)

another_line = data_file.readline()

print(another_line)

data_file.close()

Here's my output:

last, first, hw, hw, project, quiz, hw, hw, hw, quiz, hw, hw, project


Skald, Jacart, 57, 58, 71, 93, 56, 86, 90, 99, 55, 99, 88


What's up with the empty lines? Each line that we read from the file ends with a backslash n - '\n'. That's how Python marks the end of a line. (Google 'Python escape sequence' for more on this.) When Python's print function encounters '\n', it doesn't print those two characters; instead it jumps to a new line. So when we printed a_line and another_line, Python jumped to a new line for the '\n' at the end of each line read; and then it jumped to a new line again, since that's what Python print always does when it's done.

The upshot of this is that when you read lines from a file that contains data you'll need to process, you'll have to strip that '\n' off. That should be a simple task given the work you've done with strings. (Do note that the '\n' counts as a single character to Python.)
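Here's one way that strip might go, as a sketch. The helper name chop_newline is made up, and the line is a stand-in string so that the example runs without a file.

```python
def chop_newline(a_line):
    # return a_line without its final '\n', if it has one
    # (hypothetical helper; the slice a_line[-1:] is safe even for '')
    if a_line[-1:] == '\n':
        return a_line[:-1]
    return a_line


print(len('\n'))                                   # 1
print(chop_newline('spam and eggs\n'))             # spam and eggs
print(chop_newline('spam and eggs'))               # spam and eggs
print(chop_newline('') == '')                      # True
```

I used the slice a_line[-1:] instead of the index a_line[-1] on purpose. The index would crash on the empty string; the slice just returns the empty string.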

We have a final question to answer before we're done. Yes, we can read a line from a file. But how do we read them all? There must be some way to automate that, right?

Read All Lines

Of course there's a way to automate that. What we need is iteration; and to store the lines read we need a list. Like this:

data_file = open('class_data.txt', 'r')

line_list = []

a_line = data_file.readline()

while a_line != '':  # '' means we've reached the end of the file

    line_list.append(a_line)  # store the line before reading the next

    a_line = data_file.readline()

data_file.close()

How does this work? The key idea is that if we try a readline at the end of the file, we get the empty string. So if we read and store lines in a list until readline returns the empty string, we will have read the whole file; and in the snippet above, that's just what we've done.
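If you'd like to watch the read-until-empty idea run without the class data, here's a self-contained sketch. It first makes a small throwaway file in a temporary folder (the write calls are explained in the next section; the file name and its two lines are my own inventions), then reads every line back. Each line is appended before the next is read, so the final empty string is not stored.

```python
import os
import tempfile

# Make a small throwaway file so the example needs nothing else.
folder = tempfile.mkdtemp()
path = os.path.join(folder, 'demo_data.txt')
demo_file = open(path, 'w')
demo_file.write('spam\n')
demo_file.write('eggs\n')
demo_file.close()

# Read lines until readline hands back the empty string.
data_file = open(path, 'r')
line_list = []
a_line = data_file.readline()
while a_line != '':
    line_list.append(a_line)
    a_line = data_file.readline()
data_file.close()

print(line_list)        # ['spam\n', 'eggs\n']
print(len(line_list))   # 2
```

Note that each stored line still ends with its '\n'.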

How to Write to a File

I'll keep it simple. The idea here is very much like a file read.

First we need to create a file to which we'll write. Like this:

data_file = open('data.txt', 'w')

'data.txt' is the name of the file; the 'w' specifies that we intend to write to a file. Note that if a file with that name already exists, it will be overwritten.

Next we write to the file. Be careful here. The data that we write to a file must be a string. So assume that the variable line contains a string. We then write the contents of line to data_file like this:

data_file.write(line + '\n')

Note that a '\n' was placed at the end. This is not mandatory, but it is usual. We do so to mark the end of the line.

How will you write multiple lines of data to a file? Multiple calls to write of course. Probably you'll iterate.

Last, you should close the file when you're done. That's done as before.

data_file.close()
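Put together, a write with iteration might look like the sketch below. The values and the temporary folder are assumptions of mine so that the example runs anywhere. Note the call to str: not everything in the list is a string, and write accepts only strings. To check the result I use the read method, which returns the whole file as one string; we haven't covered it in this chapter.

```python
import os
import tempfile

values = ['spam', 57, 2.5]   # made-up data; not all strings

# Write to a throwaway folder so we don't clobber a real data.txt.
folder = tempfile.mkdtemp()
path = os.path.join(folder, 'data.txt')

data_file = open(path, 'w')
for value in values:
    # convert to a string first, then mark the end of the line
    data_file.write(str(value) + '\n')
data_file.close()

# Read it all back to see what landed in the file.
data_file = open(path, 'r')
contents = data_file.read()
data_file.close()

print(contents == 'spam\n57\n2.5\n')   # True
```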