Python Essentials

My Essential Python Notes

Garen Ikezian published on
14 min, 2717 words

Tags: Python

Working with Files in Python Cheatsheet

Working with files in python is syntatically intuitive. To achieve file operations, there are are built-in functions and there are also functions from the module os that can be of use.

We can start with with some well-known built-in functions:

open()

It is a built-in function that returns a file object. The second parameter sets the text mode.

What is a text mode?

It is a setting for file handling that specifies how a file will be opened or used.

Each mode specifies how the file is accessed. For example:

'r' means "opens the file for reading only," 'a' means "opens the file for appending text only," and so on.

Note: The text modes for Python are identical to that of C's standard library function [fopen()](https://www.man7.org/linux/man-pages/man3/fopen.3.html)*

Here are some common text modes that you can specify:

  • r:

    • Opens the file for reading only (default if not specified).
    • It will return an exception if the file does not exist (returns FileNotFoundError).
  • a:

    • Opens the file for appending text only. It does not read the file.
    • File will be created if it does not exist.
  • w:

    • Opens the file to overwrite only. It cannot read the file.
    • File will be created if it does not exist.
  • x:

    • Opens the file for writing only (opened only once for the same file). It never overwrites and reads the file.
    • It will return an exception if the file exists (returns FileExistsError).

Adding a '+' to the mode will enable both reading and writing.

  • r+:

    • Opens the file for both overwriting and writing.
    • If the file does not exist, it will return an exception (returns FileNotFoundError).
  • a+:

    • Opens the file for both appendng and reading.
    • If the file does not exist, it will create one.
  • x+:

    • Creates a new file and allows both reading and writing.
    • If the file already exists, it will return an exception.
  • w+:

    • Opens the file for both overwriting and reading.
    • If the file does not exist, it will create one.

Note: Do not confuse open() with os.open(). os.open() returns a file descriptor while open() returns a file object.

write()

It is a built-in function in Python.

If the mode in open() is set to the modes r+, w, or a, write() can be called.

Otherwise, the interpreter will complain that the file is not meant to be written:

io.UnsupportedOperation: not writable

read()

Reads the whole file all at once. It is ideal for reading short texts.

    file = open("text.txt", "r")

    content = file.readline()

readline()

Reads a file line by line with every call. If the function is used inside a loop, it will read the whole line including the newline character (notice the end in the print function).

Ex:

    file = open("text.txt", "r")
    # This will read only 5 lines.
    for i in range(0, 5):
       print(file.read(), end='\0') 

Basic code examples:

To read an existing file:

file = open("file.txt", "r")
file.read()
file.close()

To overwrite an existent file:

file = open("file.txt", "w")
file.write("Hello world!")
file.close()

The two ways to open a file

What you might notice online is that a file can be opened either like this:

    file = open("text.txt", "r")
    file.read()
    file.close()

Or this:

    with open("text.txt") as file
        print(file.read())

The latter has two advantages:

  • It is more concise.
  • It manages exceptions better when the files are closed.

Some os module functions to know of include:

  • os.remove(): Removes the specified file in the current directory.

  • os.rename(): Renames the specified file ("src") to a new one ("dst").

  • os.mkdir(): Creates a new directory in the current working directory (cwd).

  • os.getcwd(): Returns the name of the cwd.

  • os.chdir(): Changes the current working directory to the specified directory. It can either be relative or absolute paths.

    • Note: Typing os.chdir always returns None. So typing:
       import os
       print(os.chdir("..").getcwd())
    

    Will not work because it is syntactically incorrect. It must be done proceedurally like this:

     import os
     print(os.getcwd())
     os.chdir("..")
     print(os.getcwd())
    
  • os.rmdir(): Deletes the specified directory (it will only the directory if it is empty).

  • os.listdir(): Lists all the files and subdirectories in a specified directory. It returns a list of files/dirs found in the cwd.

  • os.pardir: It is a constant string that gives us the cwd's parent directory name.

The os.path submodule is reserved for path-related functions. Almost all of the function's arguments accept pathnames.

Such functions include:

  • os.path.exists()
    • Checks if the path in question exists.
  • os.path.isdir()
    • Checks if the directory in the cwd exists.
  • os.path.isfile()
    • Checks if the file specified exists in the cwd.
  • os.path.getsize()
    • Returns the size of the specified file in bytes.
      • If you would like a human-readable output for file sizes, we can use a module called humanize (See link). For example:
        import humanize
        
        disk_sizes_list = [1, 100, 999, 1000,1024, 2000,2048, 3000, 9999, 10000, 2048000000, 9990000000, 9000000000000000000000]
            for size in disk_sizes_list:
                natural_size = humanize.naturalsize(size)
                binary_size = humanize.naturalsize(size, binary=True)
                print(f" {natural_size} \t| {binary_size}\t|{size}")
        
        Will output:
        1 Byte     | 1 Byte      |1
        100 Bytes  | 100 Bytes   |100
        999 Bytes  | 999 Bytes   |999
        1.0 kB     | 1000 Bytes  |1000
        1.0 kB     | 1.0 KiB     |1024
        ...
        
  • os.path.abspath()
    • Returns the absolute path of the file specified.
  • os.path.join()
    • Returns a string of a concatenated relative path with either a forward slash (/) or a backslash (). It is used for OS compatibility purposes.
  • os.path.getmtime(): Returns a Unix time of a particular path passed.

The datetime module involves manipulating date and time.

There is a datetime object in the datetime module (datetime.datetime).

Working with CSV Files

There is a csv module to parse csv files.

Such well-known functions in the csv module include

  • csv.reader(): Returns a list from each row in the csv text file.
  • csv.writer(): Returns an instance of a csv writer class.
  • csvwriter.writerow(row): Writes the a single row at a time (even if there are multiple rows in the list, it will be treated as a single row and the brackets are included. It is ideal if called inside a loop).
  • csvwriter.writerows(row): Writes multiple rows at a time.
  • csv.DictReader(): Reads csv file and creates a dict object in each order.
  • csv.DictWriter(): Writes a dict object to a csv file. The fieldname parameter is optional.

DictReader creates an object that operates like a regular reader (an object that iterates over lines in the given CSV file), but also maps the information it reads into a dictionary where keys are given by the optional fieldnames parameter. If we omit the fieldnames parameter, the values in the first row of the CSV file will be used as the keys. So, in this case, the first line of the CSV file has the keys and so there's no need to pass fieldnames as a parameter.

Using Regex in Python

Functions

Check out the re module doc for more info.

Inside the re module, there are functions like:

  • re.search(): Finds the first match anywhere in the string.
  • re.match(): Finds the match only at the BEGINNING of the string.
    • Here are explicit examples:
    import re
    
    print("**With re.search():**")
    print(re.search("a", "abc"))
    print(re.search("b", "abc"))
    
    print("**With re.match():**")
    print(re.match("a", "abc"))
    print(re.match("b", "abc"))
    
    Will output:
    **With re.search():**
    <re.Match object; span=(0, 1), match='a'>
    <re.Match object; span=(1, 2), match='b'>
    **With re.match():**
    <re.Match object; span=(0, 1), match='a'>
    None
    
  • re.fullmatch(): Checks for entire string to be a match. It does not return a substring.
  • re.findall(): It is like re.search, but it finds all the matches anywhere in the string. It returns a list or tuples.

Their differences from the docs are outlined here.

They all return a re.Match object.

There are functions for the re.Match object.

Match.groups: Returns a tuple of groups if the parentheses () exist in the passed regex (like (FirstWord,)). Otherwise, it will return ().

The Match object can be treated as an array where index 0 is the fully matched string while the latter indices are the groups.

Special characters

Some special characters to note are:

  • . (dot): Accepts any single character except for a newline character (unless if dotall has been specified)

  • ^ (caret): Matches at the start of the string. If inside the [], it means "not the following".

  • $: Matches at the end of the string.

  • []: If the reggex is with other characters, it is like .. But the only difference is that it has to match the characters passed. For example:

    • [Pp]: P or p
    • [0-9]: 0 to 9
    • |: A match that either this substring or that substring. Ex: cat|dog.
    • [a-z]: only lowercase a to z
    • [A-Z]: only uppercase A to Z
      • Note that [aA-zZ]is valid while [Aa-Zz]is invalid as the ASCII character 'a' comes before 'A' and not the other way around.
  • (): Creates a group. It is used to make tuples with match.groups()

  • * (asterisk): Matches everything including repetitions

  • +: Matches one or more occurrrences of the character that comes before it.

  • ?: Matches zero or more occurrences of the character that comes before it.

Escaping/Matching characters

  • \ (escape character): Escapes the wildcard characters. It can escape a special regex character or a special string character. It is for these reasons that raw strings are important.
  • \w: Matches alphanumeric characters. It matches letters, numbers, an underscore but NOT whitespace characters. It is equivalent to [a-zA-Z0-9_]
  • \s: Matches whitespace characters. It is equivalent to [ \t\n\r\f\v]
  • \d: Matches digits. It is equivalent to [0-9]
  • \b: Matches word boundaries.

Check regex101.com for more.

Rawstrings, IGNORECASE, and DOTALL

Rawstrings do not accept any special characters. We specify it using the letter r before the string. Ex:

result = re.search(r"ion", "occupation")
print(result)
# <re.Match object; span=(7, 10), match='ion'>

It is highly recommended to use rawstrings for regex stuff.

We can pass in values like IGNORECASE and DOTALL in our third parameters.

IGNORECASE: Ignores the difference between uppercase and lowercase letters. DOTALL: The . special can take a newline character if passed.

Environment Variables

To access the environment variables, we type in os.environ.

os.environ will return a mapping object. It is an instance of os._Environ class which itself is a subclass of collections.abc.MutableMapping. It is made to behave like dict but it is not related to the dict type.

If a variable passed exists in our environment (type env to see all the environment variables), we can type:

os.environ["PATH"]

This returns the paths for the PATH variable.

If the variable did not exist, it will print to stderr. It will complain that the key does not exist.

A workaround for there is the get function. It is written like so.

os.environ.get("PATH")

It can accept two arguments. Here, the second argument is assigned to None. It will return by default if a particular variable did not exist unless states othewise..

The sys module

Let's say you want to take in command line arguments before running a Python file. In C, we pass these in our main function:

int main(int argc, char* argv[]){
    return 0;
}

With Python, we simply import sys:

    import sys

    first_arg = sys.argv[0] #Returns the filename
    second_arg = sys.argv[1] #Returns the passing argument (if it exists)
    #Note that if the 2nd argument is not passed, it will return an IndexError exception

    print("This is the first arg", first_arg)
    print("This is the second arg", second_arg)

Exit Status

It is an integer number of a terminated process. It is not an inherent feature in the Python runtime environment as it relates to how operating systems handle processes and their termination. Do note that there are conventions and they are not fully standardized.

The most common accepted exit status numbers are:

0: Successful 1: General errors (failed to execute)

So if you wish to make custom exits, we can use the sys.exit function with the sys to raise an exception.

Ex:

    import sys

    def greet_user(name):
    if not name.strip():
        print("Error: Name cannot be empty.")
        sys.exit(1)  # Exit with status 1 for error
    else:
        print(f"Hello, {name}!")
        sys.exit(0)  # Exit with status 0 for success

    # Directly call the function
    name = input("Enter your name: ")
    greet_user(name)

To see the last exit status in Linux, type echo $? in your bash terminal. For Windows, type echo $LASTEXITCODE in powershell.

The more you know!

You can use the inspect module to find the source file and the documentation of a particular function.

Ex:

print(inspect.getfile(os.environ.get))

This will give us /usr/lib/python3.10/_collections_abc.py. It's very neat. Check out the inspect module for more.

subprocess modules

It is used to run Linux/Windows commands.

subprocess.run sends ICMP packets that are executed within a script.

Ex:

    subprocess.run(["ls", "-l"])

If we want to manually check the exit status. The subprocess module has a variable called returncode. We use it as such:

    result = subprocess.run(["ls", "This_file_does_not_exist"])
    print(result.returncode)
    # Prints 2

If we want to take a "screenshot" of the output when we passed in our commands, we need to set capture_output to true. We set to true in order to order the use of attributes stdout and stderr.

We use it like so:

    result = subprocess.run(["host", "8.8.8.8"], capture_output=True)
    # Now it's stored in the stdout attribute as it won't output an error.

    #Print the stdout 
    print(result.stdout)
    # Returns 
    # b'8.8.8.8.in-addr.arpa domain name pointer dns.google.\n'

Note that the letter "b" is meant to say, "This is not a proper string, it is an array of bytes"

Bytes literals are always prefixed with 'b' or 'B'; they produce an instance of the bytes type instead of the str type. They may only contain ASCII characters; bytes with a numeric value of 128 or greater must be expressed with escapes. link

In order to translate an arary of bytes into a regular Python string, we use the decode() function. It will turn into an UTF-8 encoding by default.

    print(result.stdout.decode())
    # Prints (including the newline character)

    #b'8.8.8.8.in-addr.arpa domain name pointer dns.google.
    #

The get() function in the object dict

According to pydoc:

| get(self, key, default=None, /)

| Return the value for key if key is in the dictionary, else default.

And the Python manual

The dict datatype in Python has a get function. We use this function in order to avoid errors. If the second parameter is not passed, it will return None.

If the key does not exist in the dictionary, it will return the 2nd parameter if passed.

    usernames = {}
    name = "good_user"
    usernames[name] = usernames.get(name, 0) + 1
    print(usernames)