Lecture 09

2023-05-01 | Week 5 | edited by Matt Wang

(originally written 2022-10-24 by Matt Wang and Siddarth Krishnamoorthy)

Heya! Matt here. This lecture note covers scoping and memory safety from Data Palooza.

Scoping

Scope and “In Scope”

The scope of a variable is the part of a program where the variable is valid (i.e., can be accessed). Those parts can be lines, statements, expressions, instructions, or other units!

A variable is “in scope” in a part of a program if it is currently accessible by name.

For this example,

void foo() {
  int x;
  cout << x;
}

we’d say:

  • “The scope of the x variable is the function foo().”
  • “The x variable is in scope within the foo function because it is defined at the top of the function.”

Scoping strategies describe when variables are in-scope!

Here’s a simple example from C++:

void foo() {
  int x;
  cout << x; // Just fine, x is in foo’s scope!
}

void bar() {
  cout << x; // ERROR! x isn’t in bar’s scope!
}

C++ uses a lexical scoping strategy that scopes to blocks (we’ll explain this soon).

Lexical Environments

Here’s a slightly more in-depth example:

string dinner = "burgers";     // this variable's scope is global - it's in scope everywhere

void party(int drinks) {       // drinks is in scope for the entire party() function
  cout << "Partay! w00t";
  if (drinks > 2) {
    bool puke = true;          // puke is only in scope for this if statement
    cout << "Puked " << dinner;
  }
}

void study(int hrs) {          // hrs is in scope for the entire study() function; but, not the same hrs as main()!
  int drinks = 2;              // drinks is also in scope for study(); but, not the same drinks as party()!
  cout << "Study for " << hrs;
  party(drinks+1);
}

int main() {
  int hrs = 10;                // this hrs is only in scope in main(); different from the hrs in study()!
  study(hrs-1);
}

Functions have scope too: a function’s name is in scope immediately after it’s declared.

As the program runs, it maintains a lexical environment: a tracker of what’s in scope and what’s not!

  • for example, in the main() function, the lexical environment only contains the variable hrs and the three functions main(), study(), and party()
  • but, once we call study(hrs-1) and enter a new environment, the lexical environment changes: we get a different hrs (the one in the argument), as well as a new variable drinks
  • for a full trace, play through the animations on slide 83!

Many languages have slightly different scoping strategies! Differences include:

  • how do constructs like functions, control flow (if statements, loops), or classes affect variable scope?
  • what do you do if you try to define a new variable with a name that already exists?
  • when does a variable’s scope end?

By the end of this section, we’ll know how to answer these questions for all sorts of languages :)

Lifetime

There is one related but different concept: a lifetime.

A variable’s lifetime describes when a variable can be accessed. It includes times when the variable is in scope, but also times when it’s not in scope (but can still be accessed indirectly).

Values also have lifetimes. Variables and the values they point to can have different lifetimes!

Take a look at our previous example, but slightly modified:

void study(int how_long) {
  while (how_long-- > 0)
    cout << "Study!\n";
  cout << "Partay!\n";
}

int main() {
  int hrs = 10;
  study(hrs);
  cout << "I studied " << hrs <<
        " hours!";
}

The scope of hrs is only within main(); when we call into study, we can no longer access it. But, its lifetime also extends to study(), since we can access it indirectly - through the variable how_long!
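
Relatedly, a value can outlive the variable that first named it. Here’s a minimal Python sketch (the names are purely illustrative):

def cook():
  meal = ["burgers"]  # the variable meal's lifetime ends when cook() returns...
  return meal

leftovers = cook()    # ...but the list value lives on, now reachable via leftovers!
print(leftovers)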

Python lets you manually end a variable’s lifetime early:

def main():
  var = "I exist"
  ...
  del var    # var no longer exists!
  print(var) # NameError!

Lexical Scoping

Lexical scoping is the most-used scoping paradigm (almost every language you’ve used employs it). The core tenet is that scope follows how your code is organized: each variable is defined in some context, and if we can’t find the variable in the current context, we move to the enclosing context - until we run out (and hit the global context).

More concretely, languages like Python use the LEGB rule. We look for a variable in this order (a short sketch follows the list):

  1. First, look at the local scope. In Python, this is either an expression or a function.
  2. If it’s not there, look at successive enclosing scopes. Repeat this until…
  3. You hit the global scope.
  4. If it’s still not there, look at the built-in scope - this contains things like print.
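
Here’s a minimal sketch with all four LEGB levels in action (the variable names are just for illustration):

x = "global"            # G: defined at the module's top level

def enclosing():
  y = "enclosing"       # E: the enclosing function's scope
  def local():
    z = "local"         # L: the innermost (local) scope
    print(z, y, x)      # resolved via L, E, and G respectively
    print(len(z))       # len isn't in L/E/G, so B: the built-in scope!
  local()

enclosing()             # prints "local enclosing global", then 5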

Importantly, once a context is over, the variables in that context can’t be accessed!

To get a feel of lexical scoping, let’s go through a few examples.

In Python, one context type is an expression:

sum([x*x for x in range(10)])     # this works :)
sum([x*x for x in range(10)] + x) # x isn't in scope!

In this example, x is defined in the context of an expression (a list comprehension). Its scope is only in that expression!

In C++, one context type is a block (roughly: any set of braces, { and }):

if (drinks > 2) {
  int puke = 5;
  // ...
}
cout << puke; // error - puke is not in scope!

Note that Python does not scope with respect to blocks!

def a():
  if True:
    x = 5
  print(x) # this works!

However, both Python and C++ scope to functions:

def a():
  x = 5

print(x) # error - x is out of scope!

and in C++:

void snore(int n) {
  int i = 0;
  while (i++ < n) { /* ... */ }
}
cout << i; // error - i is not in scope!

Other relevant scoping constructs include:

  • classes: think about private functions and member variables! (a sketch follows this list)
  • namespaces: think about using namespace std; versus having to type out std::cout
  • the global scope: the last resort!
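
As a taste of the first one: in Python, a class body is its own scope, but (unlike an enclosing function) it does not enclose the methods defined inside it. A minimal sketch, with illustrative names:

class Nerd:
  school = "UCLA"          # lives in the class's scope

  def brag(self):
    # the class scope doesn't enclose methods: a bare `school` here
    # would be a NameError! we go through self (or Nerd) instead
    print("I go to", self.school)

Nerd().brag()              # prints "I go to UCLA"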

Shadowing

When a variable is redefined in a different context, lexical scoping languages typically use an approach called shadowing. In an inner context, the redefined variable “replaces” all uses of the outer context variable; once the inner context is finished, we return to the outer context variable.

Here’s a classic example from the Wikipedia page on shadowing for Python:

x = 0                  # a global x

def outer():
  x = 1                # shadows the global x
  def inner():
    x = 2              # shadows outer's x
    print("inner:", x)
  inner()
  print("outer:", x)   # prints 1 - inner's x is gone

outer()
print("global:", x)    # prints 0 - the global was never touched!

and in C++:

int main(){
  int x = 42;
  int sum = 0;

  for (int i = 0; i < 10; i++) {
    int x = i;
    std::cout << "x: " << x << '\n'; // prints values of i from 0 to 9
    sum += x;
  }

  std::cout << "sum: " << sum << '\n';
  std::cout << "x:   " << x   << '\n'; // prints out 42
}

Shadowing is not great practice (though there are legitimate uses!). However, you’ll encounter it frequently when reading code - and you may even shadow variables accidentally! For example, consider this function in Python:

def a(input):  # the parameter shadows the built-in input!
  print(input)
  a = input()  # error - input isn't the built-in function anymore!

So, you have to be careful with shadowing!!

Dynamic Scoping

Dynamic scoping takes a different approach: the value of a variable is always the most recently defined (or redefined) version, regardless of the lexical scope! For dynamic scoping, what matters is the chronological order in which variables are defined, not how the code is structured.

Here’s a quick example:

func foo() {
  x = x + 1;
  print x
}

func bar() {
  x = 10;
  foo();
}

func main() {
  x = 1;
  foo();
  bar();
}

// prints:
// 2
// 11

The interesting note is that foo() and bar() didn’t need any arguments or parameters. After main() defines x once, its value “persists” across foo(), bar(), and the subsequent call to foo()! Neat!

This seems to behave like Brewin…

These days, few languages use it. Here’s one example from Emacs Lisp:

(setq a 100)  ; sets a to 100

; prints the value of a
(defun print_value_of_a ()
  (print a))

; define local variable a, then
; call print_value_of_a
(let ((a -42))
  (print_value_of_a))

; this prints: -42

Memory Safety (and Management)

Memory safety is a key property of strongly-typed languages. In short, memory safe languages prevent any memory operations that could lead to undefined behaviour, while memory unsafe languages allow such operations. These include:

  • out-of-bound array indexes and unconstrained pointer arithmetic
  • casting values to incompatible types
  • use of uninitialized variables and pointers
  • use of dangling pointers to dead objects (use-after-free, double-free, …)

These are the biggest source of security vulnerabilities in low-level code. In memory safe languages, the operations above are either caught at compile time (e.g., pointer arithmetic or invalid casts) or turned into runtime errors (e.g., out-of-bounds array indexes).
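
For instance, Python (a memory safe language) turns an out-of-bounds index into a well-defined runtime error instead of undefined behaviour:

grades = [90, 85, 77]
try:
  print(grades[10])       # out-of-bounds: no UB, just an exception!
except IndexError as err:
  print("caught:", err)   # caught: list index out of range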

Do memory leaks lead to undefined behaviour? No! A memory leak just means unused memory is never freed. Even if the system runs out of memory because of the leak, the program is predictably terminated; so, we see no undefined behaviour.

A large innovation in programming languages over the past 50 years is garbage collection, a form of automatic memory management. Among other things, it helps ensure memory safety in a program; so, we’ll devote a huge chunk of content to it! We’ll also contrast it with other approaches like smart pointers, manual memory management, etc.

Garbage Collection

This is unironically one of Matt’s favourite lecture topics. If you’re interested in software engineering, a ton of great work goes into optimizing GCs and memory management. And, it lets us talk about Rust!

Garbage collection is the automatic reclamation of memory which was allocated by a program, but is no longer referenced in code. In languages with garbage collection (like Python), the programmer does not need to explicitly control object destruction - the language automatically handles that for the programmer. When a value or an object on the heap is no longer referenced, the program (eventually) detects this at run time and frees the memory associated with it.

Garbage collection has multiple advantages:

  • It eliminates memory leaks. This ensures memory allocated for objects is freed once it’s no longer needed
  • It eliminates dangling pointers and the use of dead objects. This prevents access to objects after they have been de-allocated
  • It eliminates double-free bugs. This eliminates inadvertent attempts to free memory more than once
  • It eliminates manual memory management. This simplifies code by eliminating manual deletion of memory

However, reasoning about garbage collection is tricky - it’s one of our first language behaviours that’s “nondeterministic”.

Approaches to garbage collection

A good rule of thumb is that we should only garbage collect an object when there are no longer any references to that object. For example, if an object goes out of scope, it should be destroyed.

public void do_some_work() {
    Nerd nerd = new Nerd("Jen");
    ...
} // nerd goes out of scope

As another example, if an object reference (pointer) is overwritten, then the old object should be deleted (as long as it’s not referenced anywhere else).

public void do_some_work() {
    Nerd nerd = new Nerd("Jen");
    ...
    // we overwrite an obj ref
    nerd = new Nerd("Rick");
    // or
    nerd = null;
}

We’ll focus on three (of many) major approaches to GC:

  • mark and sweep
  • mark and compact
  • reference counting

The first two are part of a class of GC algorithms called “tracing algorithms”, and operate in “bulk” - they do all of their work at once. Reference counting, in contrast, spreads its work out, doing a little bit on every reference change.

Mark and Sweep

Mark and Sweep runs in two phases: a mark phase and a sweep phase. In the mark phase, the program identifies all objects that are still referred to and thus considered to be in-use. In the sweep phase, the algorithm scans all heap memory from start to finish, and frees all blocks not marked as being ‘in-use.’

Mark

During the mark phase, our goal is to discover all active objects that are still being used. We consider an object in-use (and its memory not reclaimable) if it meets one of two criteria:

  • it is one of a key set of root objects. Root objects include global variables, local variables across all stack frames, and parameters on the call stack.
  • it is reachable from a root object, i.e., it can be transitively reached via one or more pointers/references from a root object (e.g., a robot object points to its battery).

At a high-level, the GC identifies all root objects and adds their object references to a queue. Then, the GC uses the queue to breadth-first search from the root objects and mark all the reachable objects as “in-use”. To do this, each object is augmented with a bit which is set by the GC to mark that it is in use. Once all reachable objects have been marked, all unmarked objects can be disposed of.

Here’s some pseudocode for a possible mark algorithm:

# Pseudocode for the Mark algorithm
def mark():
  roots = get_all_root_objs()
  candidates = new Queue()
  for obj_ref in roots:
    mark_as_in_use(obj_ref)   # objects referenced by roots are in-use too!
    candidates.enqueue(obj_ref)

  while not candidates.empty():
    c = candidates.dequeue()
    for r in get_obj_refs_in_object(c):
      if not is_marked(r):
        mark_as_in_use(r)
        candidates.enqueue(r)

Sweep

In the sweep phase, we traverse all memory blocks in the heap (each block holds a single object/value/array) and examine each object’s “in-use” flag. We then free all objects that are not in-use!

Here’s some pseudocode for the sweep algorithm:

# Pseudocode for the Sweep algorithm
def sweep():
  p = pointer_to_first_block_in_heap()
  end = end_of_heap()
  while p < end:
    if is_object_in_block_in_use(p):
      reset_in_use(p)  # remove the mark; the object lives on
    else:
      free(p)          # free this block/object
    p = p.next

While we haven’t seen any other approaches yet, we can already note some general pros and cons.

Pros:

  • relatively simple
  • no trouble with cyclic references (more on this later!)

Cons:

  • program must be paused during GC, causing “freezes” (this is often called a “stop-the-world” GC)
    • thus, this is bad for real-time programs
  • dealing with large amounts of data can lead to thrashing (at the cache/page-level)
  • can cause memory fragmentation – we’ll have chunks of empty memory (more on that in a moment!)
  • and, it’s unpredictable!

Mark and Compact

Mark and Compact is part of the same family of algorithms as mark and sweep. The difference is the second half of the algorithm, which is designed to prevent fragmentation. Instead of sweeping, we compact all marked/in-use objects to a new contiguous memory block. Then we can adjust the pointers to the proper relocated addresses. Our original block of memory can be treated as if it’s empty, and can be reused without dealing with any sweeping.
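
Here’s some pseudocode for a possible compact phase, in the same style as the mark and sweep pseudocode above (all of the helper functions are hypothetical):

# Pseudocode for the Compact algorithm
def compact():
  # first pass: pick each live object's new (contiguous) address
  new_addr = start_of_target_region()
  forwarding = new Map()
  for obj in marked_objects_in_address_order():
    forwarding[address_of(obj)] = new_addr
    new_addr += size_of(obj)

  # second pass: rewrite every reference (including the roots')
  # to point at its target's relocated address
  for obj in marked_objects_in_address_order():
    for ref in get_obj_refs_in_object(obj):
      ref.target = forwarding[ref.target]

  # third pass: actually move the objects, and clear the marks
  for obj in marked_objects_in_address_order():
    move(obj, forwarding[address_of(obj)])
    reset_in_use(obj)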

The pros and cons are pretty similar to mark and sweep. In comparison, the only differences are:

  • it deals with fragmentation much better
  • but, it’s more complex to implement, requires more RAM, and is slower!

Interlude: Unpredictability

These garbage collection methods only run when there’s memory pressure (i.e., when the program needs more memory). This means that GC is non-deterministic: you can’t necessarily predict when (or even if) GC will run.

Immediately, this should make you shudder! Non-determinism is hard to reason about and affects performance. It also means you can’t rely on the garbage collector to do things like release sockets, files, or other resources - since you don’t know when GC will run!

In other words, you can’t really rely on destructors! More on that in a bit :)
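
In Python, for example, you shouldn’t rely on garbage collection (or an object’s __del__ finalizer) to release a file or socket; instead, release resources deterministically, e.g., with a context manager:

def risky():
  log = open("app.log", "w")  # when is this file actually closed?
  log.write("w00t\n")         # ...whenever GC (maybe) finalizes log!

def better():
  with open("app.log", "w") as log:  # closed as soon as the block exits,
    log.write("w00t\n")              # whether or not GC ever runs!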

Reference Counting

Reference counting takes a different approach: each object has a hidden counter that tracks how many references there are to it. Every time a reference is created or destroyed, we simply update the counter.

That sounds complicated - but, note that references can only change:

  • with assignment (=)
  • when an object goes out of scope

That’s it! Really! If you don’t believe us, think about why (or, come to office hours).
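
In fact, CPython uses reference counting, and sys.getrefcount lets you peek at the hidden counter (it reports one extra reference, since passing the object as an argument temporarily creates one):

import sys

nerds = ["Jen"]
print(sys.getrefcount(nerds))  # 2: nerds, plus the temporary argument

more_nerds = nerds             # assignment (=) bumps the count...
print(sys.getrefcount(nerds))  # 3

del more_nerds                 # ...and dropping a reference decrements it
print(sys.getrefcount(nerds))  # 2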

However, there are two hidden catches:

  1. cyclic references are a huge problem; since the objects in a cycle reference each other, their counters never reach zero (see the sketch after this list)
    • languages that use RC need to explicitly deal with cycles in a different way.
  2. a “cascade” can happen: if one object is destroyed, it could remove the last reference to another object, which then gets destroyed; that, in turn, could remove the last reference to yet another object… causing a large chunk of deletion all at once (which could slow your program down)!
    • to remedy this, some languages amortize deletion over time – but this has similar issues with non-determinism as tracing algorithms!
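
Here’s a minimal Python sketch of the cycle problem. Since the two objects point at each other, their counters never reach zero on their own; CPython ships a separate cycle detector (the gc module) to clean them up:

import gc

class Node:
  def __init__(self):
    self.partner = None

a, b = Node(), Node()
a.partner = b  # a references b...
b.partner = a  # ...and b references a: a cycle!

del a, b       # both counts drop to 1 (each other) - never 0!
print(gc.collect() > 0)  # True: the cycle detector reclaims them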

In summary, some pros and cons.

Pros:

  • simple
  • usually real-time (since reclamation is usually instant)
  • more efficient memory usage, since blocks are freed immediately

Cons:

  • updating counts needs to be thread-safe (this is a huge issue!)
  • updating on every operation could be expensive (both in time and space)
  • cascading deletions
  • requires explicit cycle handling