Lecture 10

2022-10-26 | Week 5 | by Siddarth Krishnamoorthy

Hi! Siddarth here. This lecture note covers Garbage Collection from Data Palooza and the first 20 slides from Function Palooza. We continue from last week.

Memory Safety
Garbage collection
Destructors, Finalizers and Disposers
Function Palooza
- Definitions

Memory Safety

We have had an introduction to memory safety in Lecture 8. As a recap, memory safe languages prevent any memory operations that could lead to undefined behaviour, while memory unsafe languages allow such operations. To see how memory safety (or lack thereof) can cause problems:

class Authenticator {
public:
  bool valid_login(string& user_name, string& pw) {
    string valid_user = "kippy", valid_pw = "woot!";
    lowercase(user_name, lower_name_);
    if (lower_name_ == valid_user && pw == valid_pw)
      let_user_in_ = true;

    return let_user_in_;
  }

 char lower_name_[8];
 bool let_user_in_ = false;
};

int main() {
  string user_name, pw;
  cin >> user_name >> pw;

  Authenticator auth;
  if (auth.valid_login(user_name, pw))
    cout << "Access granted, " << username << endl;
  else
    cout << "Access denied!\n";
}

If user_name is larger than 8, then the value of let_user_in_ is overriden and since C++ interprets any non-zero values as true, we find that any input of size larger than 8 can cause unintended access.

Do memory leaks lead to undefined behaviour? No! Since memory leaks only lead to unused memory not being freed, they don’t lead to undefined behaviour. Even if the system runs out of memory due to the memory leak, the program is predictably terminated. Therefore, we see no undefined behaviour.

Garbage collection

Garbage collection is the automatic reclamation of memory which was allocated by a program, but is no longer referenced in code. In languages with garbage collection (like Python), the programmer does not need to explicitly control object destruction - the languages automatically handles that for the programmer. When a value or an object on the heap is no longer referrenced, the program (eventually) detects this at run time and frees the memory associated with it.

Garbage collection has multiple advantages:

It eliminates memory leaks. This ensures memory allocated for objects is freed once it’s no longer needed
It eliminates dangling pointers and the use of dead objects. This prevents access to objects after they have been de-allocated
It eliminates double-free bugs. This eliminates inadvertent attempts to free memory more than once
It eliminates manual memory management. This simplifies code by eliminating manual deletion of memory

Garbage collection was pioneered in LISP in the early 1960s.

How should objects be garbage collected?

A good rule of thumb is that we should only garbage collect an object when there are no longer any references to that object. For example, if an object goes out of scope, it should be destroyed.

public void do_some_work() {
    Nerd nerd = new Nerd("Jen");
    ...
} // nerd goes out of scope

Another example, if an object reference (pointer) is overriden, then the old value should be deleted (as long as it’s not being used anywhere else).

public void do_some_work() {
    Nerd nerd = new Nerd("Jen");
    ...
    // we overwrite an obj ref
    nerd = new Nerd("Rick");
    // or 
    nerd = null; 
}

Approaches for garbage collection

Broadly speaking, there are three major approaches for garbage collection. They are

Mark and Sweep
Mark and Compact
Reference counting

In approaches 1 and 2, bulk garbage collection is done when the program runs low on memory (note that the program is temporarily frozen while this happens). In approach 3, the individual objects are garbage collected when their reference count equals zero.

Mark and Sweep

Mark and Sweep runs in two phases: a mark phase and a sweep phase. In the mark phase, the program identifies all objects that are still referred to and thus considered to be in-use. In the sweep phase, the algorithm scans all heap memory from start to finish, and frees all blocks not marked as being ‘in-use.’

Mark and Sweep was invented by John McCarthy (the inventor of LISP) in 1960!

Mark

During the mark phase, our goal is to discover all active objects that are still being used. We consider an object in-use (and its memory not reclaimable) if it meets one of two criteria:

it is one of a key set of root objects. Root objects include global variables, local variables across all stack frames, and parameters on the call stack.
it is reachable from a root object. If an object can be transitively reached via one or more pointers/references from a root object (e.g.,robot object points to battery).

During the first part of the mark phase, the garbage collector identifies all root objects and adds their object references to a queue for investigation. During the second part, the garbage collector uses the queue to breadth-first search from the root objects and mark all the reachable objects as “in-use”. To do this, each object is augmented with a bit (hidden from the programmer usually) which is set by the GC to mark that it is in use. Once all reachable objects have been marked, all unmarked objects can be disposed of.

Here is the pseudocode for the mark algorithm.

# Pseudocode for the Mark algorithm
def mark():
  roots = get_all_root_objs()
  candidates = new Queue()
  for each obj_ref in roots:
    candidates.enqueue(obj_ref) 

  while not candidates.empty():
    c = candidates.dequeue()
    for r in get_obj_refs_in_object(c):
      if not is_marked(r):
        mark_as_in_use(r)
        candidates.enqueue(r)

How does the GC find unmarked objects? This is done in the sweep phase.

Sweep

In the sweep phase, we traverse all memory blocks in the heap (each block holds a single object/value/array) and examine each object’s “in-use” flag.

Rather than use a queue, the mark and sweep algorithm can use a clever pointer manipulation trick instead. You can think of this as a breadth-first or depth-first traversal. Since all blocks in the heap are linked together top-to-bottom in a linked list, we only need to traverse linked list and remove those objects which are not “in-use”.

Here is the pseudocode for the sweep algorithm.

# Pseudocode for the Sweep algorithm
def sweep():
    p = pointer_to_first_block_in_heap()
    end = end_of_heap()
    while p < end:
        if is_object_in_block_in_use(p):
            reset_in_use(p)      # remove the mark, object lives
        else:
            free(p)                        # free this block/object
        p = p.next

Here are some pros/cons of Mark and Sweep

Pros:

It’s relatively simple
It has no trouble with cyclic references

Cons:

Program must be paused during GC, causing “freezes”
Not great for real-time apps because of freezes
Dealing with large amounts of data can lead to thrashing pages
Classic mark and sweep can cause memory fragmentation

Mark and Compact

Mark and compact is a twist on mark and sweep designed to address the issue of memory fragmentation. In Mark and Compact, the mark phase proceeds exactly the same as in mark and sweep. However, once we are done marking, we don’t sweep away unmarked objects! Instead, we compact all marked/in-use objects to a new contiguous memory block. Then we can adjust the pointers to the proper relocated addresses. Our original block of memory can be treated as if it’s empty and can be reused without dealing with any sweeping.

Here are some pros/cons of Mark and compact.

Pros:

It eliminates memory fragmentation
It has no trouble with cyclic references

Cons:

Much more complex to implement
Not great for real-time apps because of freezes
Requires more RAM to deal wih compaction
The algorithm is even slower than mark and sweep!

Reference counting

In reference counting GC, every object has a (hidden) reference counter attached to it that tracks how many references there are to it. Every time a new reference is created to an object, the languages secretly increments the counter, and vice versa. When the reference count of an object reacher zero, then it can be destroyed. When an object is destroyed, all objects transitively referenced by that object must also have their reference counts decremented. This can cause a cascading deletion of objects, all being freed at once (which is slow).

To mitigate this, instead of destroying an object as soon as its counter becomes zero, we can add it to a list of pending objects, and then reclaim memory regularly over time.

Here are some pros/cons of reference counting GC.

Pros:

It is simple
Memory reclamation doesn’t cause hangs, and is (usually) real-time
Efficient memory usage since blocks are freed immediately.

Cons:

Updating reference counts must be done in a thread-safe way
Need to update reference counts on each pointer operation
Cascading deletions can lead to uneven memory usage patterns at run time
References take up alot of extra space
Requires explicit handling of cyclic references

Destructors, Finalizers and Disposers

Many objects hold resources (e.g.: dynamic objects, temp files) which need to be released when their lifetime ends. There are three ways this is handled in modern languages, namely destructors, finalizers, and disposers.

Destructors

Destructors are only used in languages with manual memory management, like C++. There are deterministic rules that govern when destructors are run, so the programmer can ensure all of them will run, and control when they run. Since the programmer can control when they run, you can use destructors to release critical resources at the right times: e.g., freeing other objects, closing network connections, deleting files, etc.

Here is an example of destructors in action in C++.

void doSomeProcessing() {
   TempFile *t = new TempFile();
  
   ...
 
   if (dont_need_temp_file_anymore()) 
     delete t; // explicit call to a destructor
  
   ...
}

void otherFunc() {
  NetworkConnection n("www.ucla.edu");

  ...
} // destructor for n called when it goes out of scope

Finalizers

In GC languages, memory is reclaimed automatically by the garbage collector. So finalizers are used to release unmanaged resources like file handles or network connections, which aren’t garbage collected. Unlike a destructor, a finalizer may not run at a predictable time or at all, since objects can be garbage collected at any time (or not at all)! Since they can’t be counted on to run, they’re considered a last-line of defense for freeing resources, and often not used at all! We’ll learn more about finalizers when we cover Object Oriented Programming.

Here are some examples of finalizers in Java and Python.

// Java finalization example
public class SomeClass {
 
  // called by the garbage collector
  protected void finalize() throws Throwable
  {
    // Free unmanaged resources held by SomeObj 
    ...
  }
}

# Python finalizer method
class SomeClass :
  ...

  # called by the garbage collector
  def __del__(self):
     # Finalization code goes here
     ...

Disposers

A disposal method is a function that the programmer must manually call to free non-memory resources (e.g., network connections). You use disposal methods in GC languages because you can’t count on a finalizer to run!. Disposal provides a guaranteed way to release unmanaged resources when needed. However, we run the risk of forgetting to call the disposer.

Here is an example of a disposer in C#.

// C# dispose example
public class FontLoader : IDisposable
{
  ...
    
  public void Dispose()
  {
     // do manual disposal here, e.g., free
     // temp files, close network sockets, etc.
  }
}

...

var f = new FontLoader(...);
... // use f to draw fonts
f.Dispose();

Function Palooza

In Function Palooza, we will be doing a deep dive into functions. We’ll understand how functions pass arguments and receive parameters, how languages return value(s) from functions, how languages communicate errors across functions, how functions can be passed as arguments, returned and stored in variables, and how to design functions to operate on a variety of different types of inputs.

Definitions

Actual and Formal Parameters

The arguments in the definition of a function are called the formal parameters of the function. The arguments we pass to a function when we call it are called the actual parameters of the function. For example

double net_worth(double assets,  // formal parameters
                 double debt) {
  return assets – debt;
}

int main() {
 cout << "Your net worth is: " <<
   net_worth(10000, 3500); // actual parameters
}

Positional and Named Parameters

In languages that support positional parameters, the order of the arguments must match the order of the formal parameters. For example,

bool sum(float arr[], int n) {
  float sum = 0;
  for (int i=0; i < n; ++i) {
    sum += arr[i];
  }
  return sum;
}

int main() {
 float arr[3] = {78,99,65};

 cout << "The sum is: " <<
   sum(arr, 3);
}

In languages that support named parameters, the call can explicitly specify the name of each formal parameter (called an argument label) for each argument. For example,

def net_worth(assets,debt):
    return assets - debt

print("Your net worth is: ",
      net_worth(assets=10000,
                debt=3500))

print("Their net worth is: ",
      net_worth(debt=45000,
                assets=19000))

Many languages (like C++) support a combination of positional and named parameters.

print("Your net worth is: ",
      net_worth(10000,
                debt=3500))

Positional parameters allow for a less wordy syntax since we don’t need to specify an argument label for each argument. The disadvantage is that we have to pass the arguments in the same order as the formal parameters, and this can lead to bugs when we pass arguments in an incorrect order. With named parameters, you can add parameters and shift their order around more easily since each parameter is named. A change in order won’t cause a bug. It also makes code more readable, if a bit more verbose, since you know what each argument is.

Default Parameters

Most languages let you specify default values for specified formal parameters, making these parameters optional in the function call. For example,

double net_worth(double assets, 
                 double debt = 0) {
  return assets – debt;
}

int main() {
 cout << "Your net worth is: " <<
   net_worth(10000, 3500);
}

One or more formal parameters may have a default value. This makes passing the argument optional. If you decide to omit the associated argument to a formal parameter, the provided default value will be used. In languages (like C++ or Python) which do not have mandatory argument labels, default parameters must all be place at the end of the parameter list. This means that a definition like the following would be illegal in a language like Python.

double net_worth(double assets, 
                 double debt = 0,
                 double inheritance = 0) {
  return assets – debt + inheritance;
}

However, in languages with mandatory argument labels (like Swift), default values can be used for any parameter.

// Swift optional parameters
func net_worth(assets: Double, debt: Double=0,
               inheritance: Double) -> Double 
  { return assets-debt+inheritance }

func main() {
 print(net_worth(assets: 10000, inheritance: 500))
}

Some languages like Python or FORTRAN allow to have optional parameters without default values! A function can check if a given argument was present when the function was called and act accordingly. Here is an example in Python.

# Python optional parameters
def net_worth(assets, debt,**my_optionals):
    total_worth = assets - debt
    if "inheritance" in my_optionals: // checking if argument exists
        total_worth = total_worth + 
                      my_optionals["inheritance"]
    return total_worth

print("Net-worth: ", net_worth(10000, 2000))
print("Net-worth: ",
      net_worth(10000, 2000, inheritance=50000))

Here is another example, now in FORTRAN.

! Fortran function with an optional parameter
real function net_worth(assets,debt,inheritance)
    real :: assets
    real :: debt
    real, optional :: inheritance
    
    real :: total_worth

    total_worth = assets - debt
    if (present(inheritance)) THEN
        total_worth = total_worth + inheritance
    end if
    net_worth = total_worth  ! specify return value
end function net_worth

Python optionals aren’t great because you can’t easily look at the function prototype and figure out what options there are!

Variadic functions

A variadic function is a complex term that just means you can pass a variable number of arguments to a function. The canonical use case for such a function is printing a bunch of values.

# Python's print function is variadic
# You can pass any # of arguments to it!
print(1)
print(1,"a")
print(1,3.14159,"c",4,"foobar")

To implement variadic functions, most languages gather variadic arguments and add them to a container (e.g., to an array, dictionary or tuple) and pass that container to the function for processing. A notable exception is C++ which requires use of convoluted library functions to access variadic arguments!

Here are some examples of variadic functions in various languages.

Java

In Java, you may have zero or more fixed parameters. For the variadic parameters, Java creates an array containing every variadic argument and passes it to the variadic parameter.

// Java variadics use an array-like approach
public class VariadicExample {
 private void f(String regular, int... vparams) {
   System.out.println(regular);
   for (int i = 0; i < vparams.length; i++)
     System.out.println(vparams[i]);
 }

 public void someFunc() {
   f("Four #s",2,4,6,8); 
 }
}

In Java, all variadic parameters must have the same type. To deal with variadic parameters of differing types in Java, we can take advantage of the fact that all boxed primitives are derived from the Object class. So we could make a variadic of type Object, and then use type reflection (instanceof) to differentiate between types and specialise behaviour by argument.

JavaScript

In JavaScript we don’t specify variadic parameters in the function declaration, you just specify fixed parameters. JavaScript provides a builtin arguments array which is populated with all arguments (fixed and variadic).

// JavaScript variadics

function f(fixed1, fixed2) {
  console.log("Fixed: " + fixed1 + " " + fixed2)
  console.log("Varargs:")
  for (var i=2; i<arguments.length; i++)
    console.log(arguments[i])
}

f(1,"two","three",4.01);  
  // Fixed: 1 two
  // Varargs: three 4.01

Python

Python created a tuple containing each variadic argument and then passes it to the function. You can then enumerate over the tuple as desired.

# Python variadics
def f(fixed1, *args):
  print("First param: ", fixed1)
  print("Everything after first arg:")
  for arg in args:
    print(arg)

f(1,"two","three",4.0)

C++

Variadics in C++ are a little more tricky. Variadic arguments are identified by the formal parameter of .... C++ doesn’t have any way of determining the number of variable arguments, so we have to somehow specify this ourselves (i.e. by passing a fixed parameter with the number of arguments).

// C++ variadics are janky ☺
#include <stdarg.h>
 
void vprint(int count, ...) {
    va_list args;
    va_start(args, count);
    while (count--) {
      cout << va_arg(args, double) << " ";
    }
    va_end(args);
}
 
int main() {
  vprint(3, 3.14159, 2.718, 32.0);
}

To start processing variadic arguments, you have to create a va_list argument and then you have to call the va_start function with the name of the last fixed parameter. You can then call the va_arg function to get to each argument, and also advance to the next one. You finally call va_end to finish processing the variable arguments. You also have to specify the type of each value in C++, it won’t know it otherwise (this is because C++ doesn’t have type reflection).

Lecture 10

Table of Contents

Memory Safety

Garbage collection

How should objects be garbage collected?

Approaches for garbage collection

Mark and Sweep

Mark

Sweep

Mark and Compact

Reference counting

Destructors, Finalizers and Disposers

Destructors

Finalizers

Disposers

Function Palooza

Definitions

Actual and Formal Parameters

Positional and Named Parameters

Default Parameters

Variadic functions