Wednesday, May 11, 2016

A first look at Rust

What is Rust?

Rust is one of the crop of system languages that target roughly the same space as C: running on bare metal, system utilities and libraries, and applications. Like most of them, it incorporates features from the last half century of language research - modern macros, algebraic data types, type inference, user-controlled polymorphism, variable sized arrays, and some degree of inheritance. I think it made a nice set of choices, but that's honestly more a matter of taste than anything else. I'll look at those in some detail, and explain why you'd want to learn Rust.

What sets Rust apart from other such languages is that the compiler also keeps track of what code can read and write an object, which it refers to as the lifetime of the object. It uses that to prevent you from creating multiple references that can change an object. This changes how you program in a number of ways.

While it's not the first language to do this - Cordy and Holt's Concurrent Euclid didn't allow aliases, but it also didn't enforce that restriction - I believe it's the only language currently under active development and in production use that does this. That it tracks lifetimes means you don't need either an explicit free operation for heap-allocated storage or a garbage collector! This makes Rust well suited to bare metal and limited environments. And it also helps with threading: one of the community sayings goes something like 'Everyone knows shared mutability is bad, but while most languages deal with "mutability", Rust deals with "shared".'

The language

Before digging into what sets Rust apart from its competition, let's talk about what sets it apart from its predecessors - namely C. It draws a lot from C, but you can also see bits of functional programming languages and the P-languages in it.

Syntax

It is a curly-brace language, with a C-like syntax. Unlike C, blocks are expressions, taking the value of the last expression in the block. Semi-colons separate expressions, and an empty expression has a value of (). This means a missing or extra semicolon can change the value and type of a block, which can be a PITA.
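To make the semicolon point concrete, here is a minimal sketch. The function names classify and unit_block are made up for illustration. The first uses an if/else chain as an expression; the second shows that a block whose last item ends in a semicolon has the value ().

```rust
// Hypothetical helper: each branch is a block, and the block's last
// expression (written without a trailing semicolon) is its value.
fn classify(n: i32) -> &'static str {
    let label = if n < 0 {
        "negative"
    } else if n == 0 {
        "zero"
    } else {
        "positive"
    };
    label
}

// Hypothetical helper: the inner block ends with a semicolon, so its
// value is (), the unit type, rather than the integer computed inside it.
fn unit_block() -> () {
    let value = {
        let _ = 1 + 1;
    };
    value
}
```

Dropping the semicolon after the last line of a branch, or adding one, is exactly the kind of change that turns a &'static str into a () and produces a type error.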

Statements control blocks, not other statements. This means you can leave off the parentheses around the condition on an if or while - and in fact, by default adding them generates a warning. It also means that else if is syntax, not just fallout from the syntax of the if statement. There is no until. Like C, for loops are syntactic sugar, but they process an iterator instead of bundling up the parts of a while. So you write:
for element in iterator {
    do_something_with(element);
}
There is also a loop statement, which expects to exit with a break or return. The claim is that having an explicit forever loop helps the compiler. And oh yeah - loops have labels that can be used with break and continue. Which eliminates my only reason for ever using a goto in C.
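As a rough sketch of both features, the two functions below (names invented for this example) use a labeled break to leave a nested for loop and a plain loop that exits with break.

```rust
// Hypothetical example: find the first pair (i, j), i <= j, whose product
// is `target`, searching 1..=limit. The label lets the inner loop break
// out of the outer one directly.
fn find_factors(target: u32, limit: u32) -> Option<(u32, u32)> {
    let mut found = None;
    'outer: for i in 1..=limit {
        for j in i..=limit {
            if i * j == target {
                found = Some((i, j));
                break 'outer;
            }
        }
    }
    found
}

// Hypothetical example: `loop` runs forever until an explicit break.
fn first_power_of_two_at_least(n: u32) -> u32 {
    let mut p = 1;
    loop {
        if p >= n {
            break;
        }
        p *= 2;
    }
    p
}
```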

Data types

These days, type inferencing seems to be de rigueur. It solves the major issue with static typing - that you have to declare types for all your variables. That's not usually true if the compiler can figure it out for you. In Rust, this only works for local variables. Function arguments and return values, as well as static variables, constants and structure elements have to have type information. It still saves a lot of typing.

The major win - at least as far as I'm concerned - is the adoption of a proper sum data type. If you're not familiar with them, this is a cross between C's enums and unions. The difference is that each variant of an enum can carry its own data, declared like a structure or tuple structure, like so:
enum Point<T> {
    Null,
    Line(T),
    Plane(T, T),
    Space {x: T, y: T, z: T},
}
This creates the type Point, of which instances are the value Null, or the Line and Plane tuple structures and the Space structure for points in a space of 0, 1, 2 and 3 dimensions. Oh yes, you can have a struct with unnamed members - a tuple struct. And as you can see here, you get generics as well.

There is a match statement to let you distinguish between the different variants in an enum - complete with a compile error if you don't deal with them all. And an if let sugar to test and disassemble a variable in a single statement. There are some other interesting additions - a tuple type, obviously, an update syntax for structures, namespace issues, etc. - but this is the big one.
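Here is a small sketch of match and if let applied to the Point enum from above (repeated so the example stands alone). The functions dimensions and line_coordinate are names invented for this illustration.

```rust
enum Point<T> {
    Null,
    Line(T),
    Plane(T, T),
    Space { x: T, y: T, z: T },
}

// Hypothetical helper: match must cover every variant. Removing an arm
// makes the compiler reject the function.
fn dimensions<T>(p: &Point<T>) -> usize {
    match p {
        Point::Null => 0,
        Point::Line(_) => 1,
        Point::Plane(_, _) => 2,
        Point::Space { .. } => 3,
    }
}

// Hypothetical helper: if let tests for one variant and pulls its
// contents apart in a single step.
fn line_coordinate(p: &Point<f64>) -> Option<f64> {
    if let Point::Line(x) = p {
        Some(*x)
    } else {
        None
    }
}
```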

A second nice gain over C is traits. A trait is similar to a type class in Haskell or an interface in Java. You can implement a trait for a struct or enum, and write functions whose arguments are required to have specific traits instead of types. Or a mix of them. You can use a similar technique to add methods to them, which you use with the object.method syntax. And yes, you can have traits that inherit features from other traits.
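A minimal sketch of those pieces, with all names (Shape, Named, Square, total_area) made up for the example: a trait with a default method, a trait that builds on another, an ordinary method added with impl, and a generic function constrained by a trait rather than a concrete type.

```rust
// A trait with one required method and one default method.
trait Shape {
    fn area(&self) -> f64;

    fn describe(&self) -> String {
        format!("shape with area {:.1}", self.area())
    }
}

// "Inheritance" between traits: anything that is Named must also be a Shape.
trait Named: Shape {
    fn name(&self) -> String;

    fn label(&self) -> String {
        format!("{}: {}", self.name(), self.describe())
    }
}

// A tuple struct holding the side length.
struct Square(f64);

// An ordinary method, called as square.side().
impl Square {
    fn side(&self) -> f64 {
        self.0
    }
}

impl Shape for Square {
    fn area(&self) -> f64 {
        self.0 * self.0
    }
}

impl Named for Square {
    fn name(&self) -> String {
        String::from("square")
    }
}

// The argument is constrained by a trait, not by a specific type.
fn total_area<T: Shape>(shapes: &[T]) -> f64 {
    shapes.iter().map(|s| s.area()).sum()
}
```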

Another interesting feature is that size is considered part of the type of an array. That Pascal did this in the '70s was considered a serious problem, since you couldn't write code that handled arbitrary sized arrays, like say a matrix multiplication routine. Rust gets around that by having a Vec type that has a dynamic size. There is also a slice type that both can be coerced into, so you can write one function to deal with either. And of course you can get an iterator from all three.
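As a quick sketch, one function taking a slice can be handed a fixed-size array, a Vec, or part of either. The name sum_all is invented for the example.

```rust
// Hypothetical helper: &[i32] is a slice, so the caller's container and
// its length are not part of this function's signature.
fn sum_all(values: &[i32]) -> i32 {
    values.iter().sum()
}

// Fixed-size array: the length 3 is part of the type [i32; 3].
fn sum_of_array() -> i32 {
    let array: [i32; 3] = [1, 2, 3];
    sum_all(&array)
}

// Vec: growable at run time, coerced to a slice the same way.
fn sum_of_vec_tail() -> i32 {
    let mut vector = vec![4, 5];
    vector.push(6);
    sum_all(&vector[1..]) // a sub-slice covering the last two elements
}
```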

Error handling

The sum type gets its name from the fact that the number of possible values is the sum of the number of values of its parts. In contrast, a struct is a product type, as the number of possible values is the product of the number of values of its parts. Together, these give Rust algebraic data types. To see why this is a big deal, consider an optional result with or without an error message. In Rust, they would be:
enum Option<T> {
    Some(T),
    None,
}

enum Result<T, E> {
    Ok(T),
    Err(E),
}
So where a C function to find a substring might return a pointer to a string or a null if it didn't find it, a Rust version would return an Option<&str>. It would return None if it didn't find it, and Some(string) if it did. Similarly, character reading functions in C return the character or -1 on error, and set errno in the latter case - creating enough headaches that modern languages throw exceptions instead, which still aren't very clean. In Rust, you return a Result<u8, std::io::Error>. The beauty here is that once you get the result, you can't use it until you've dealt with the possibility of an error. You can't dereference an Option<&str> like a &str - that's a type error. Sure, if you want a quick-and-dirty look at them, you can use .unwrap() to get the result if there wasn't an error, and panic if there was. Or you can use .expect(&str) to do the same, only printing your own panic string. And there's a try! macro which will return the error to your caller if it can't get the result.
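A sketch of what that looks like in practice. The functions find_word, parse_port and describe are invented for illustration, and the example uses the ? operator, which later releases of Rust added as shorthand for what try! does.

```rust
use std::num::ParseIntError;

// Hypothetical substring search: None when the needle is absent,
// Some(slice of the haystack) when it is found.
fn find_word<'a>(haystack: &'a str, needle: &str) -> Option<&'a str> {
    haystack
        .find(needle)
        .map(|start| &haystack[start..start + needle.len()])
}

// Hypothetical parser: `?` returns the error to the caller if parsing
// fails, otherwise it yields the parsed value.
fn parse_port(text: &str) -> Result<u16, ParseIntError> {
    let port = text.trim().parse::<u16>()?;
    Ok(port)
}

// The Option has to be taken apart before the &str inside can be used.
fn describe(haystack: &str, needle: &str) -> String {
    match find_word(haystack, needle) {
        Some(word) => format!("found {}", word),
        None => String::from("not found"),
    }
}
```

Calling find_word("hello", "xyz").unwrap() would compile, but panic at run time, which is why unwrap is best kept for quick experiments.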

Throw in that each has a nice collection of methods for combining Options and Results as well as converting between them, so you can easily chain together a number of functions that expect or return or both these types of values, and you wind up with code that is only slightly more verbose than code that you'd write without any error handling.
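For a feel of that chaining, here is a short sketch using a few of the standard methods (and_then, ok, map, ok_or_else). The function names and the error message are assumptions made for the example.

```rust
// Hypothetical helper: take the first whitespace-separated word and try
// to parse it. `ok()` converts the Result from parse into an Option.
fn first_number(text: &str) -> Option<i32> {
    text.split_whitespace()
        .next()
        .and_then(|word| word.parse::<i32>().ok())
}

// Hypothetical helper: transform the value if present, then convert the
// Option into a Result carrying an error message.
fn first_number_doubled(text: &str) -> Result<i32, String> {
    first_number(text)
        .map(|n| n * 2)
        .ok_or_else(|| format!("no leading number in {:?}", text))
}
```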

Lifetime tracking

Now let's look at what makes Rust unique - the lifetime and sharing tracking. Those two sound different, but having done a little Rust, I'll report they are intimately related. And the concept of immutable data ties into them as well.

The Rust compiler keeps track of which variables can be accessed, and whether or not they can be written to, for all code. While it tries to infer lifetimes the way it does types, this again isn't always possible. So there's a way to declare lifetimes and apply them to variables. The goal is to prevent you from using variables that have been deallocated, or to have two different references to a variable that allow you to change it, as the first of these is clearly a bug, and the second is a code smell.
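Here is a small sketch of a declared lifetime, using invented function names. The 'a annotation on longest says the returned reference is only valid while both arguments are, which is something the compiler can't infer from the signature on its own.

```rust
// The result borrows from one of the arguments, so it may not outlive
// either of them.
fn longest<'a>(a: &'a str, b: &'a str) -> &'a str {
    if a.len() >= b.len() {
        a
    } else {
        b
    }
}

// Hypothetical caller: `b` is dropped at the end of the inner block, so
// the borrowed result is copied into an owned String before that happens.
fn longest_of_owned() -> String {
    let a = String::from("lifetime");
    let result;
    {
        let b = String::from("scope");
        result = longest(&a, &b).to_string();
    }
    result
}

// The use-after-free case is rejected at compile time:
//
//     let r;
//     { let s = String::from("temp"); r = &s; } // error: `s` does not live long enough
//     println!("{}", r);
```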

Having immutable variables - or references that are immutable - allows you to have multiple live references to a variable. Clearly, if all references are mutable, then having multiple references would imply shared mutable data, which we want to avoid. But with immutable data - or references that treat it that way - you can create a second reference to an object without that problem. This is called borrowing the object. The part of the compiler that keeps track of lifetimes and references is referred to as the borrow checker.
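A minimal sketch of the rule, with made-up function names: any number of shared (immutable) borrows may coexist, but a mutable borrow has to be the only live reference. This compiles on current compilers because the shared borrows are no longer used by the time the mutable one is taken.

```rust
// Hypothetical helper that needs exclusive access to change the string.
fn append_len(target: &mut String, len: usize) {
    target.push_str(&format!(" {}", len));
}

fn shared_then_exclusive() -> String {
    let mut s = String::from("shared");

    // Two shared borrows at once are fine: neither can change `s`.
    let r1 = &s;
    let r2 = &s;
    let len = r1.len() + r2.len();

    // r1 and r2 are not used again, so an exclusive borrow is allowed now.
    // Using r1 after this call would be a compile-time error.
    append_len(&mut s, len);
    s
}
```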

Since the compiler keeps track of lifetimes, you don't have to. So there's no explicit call like free in C. There is drop, but it doesn't release a borrow, among other oddities. Note that this all happens in the compiler. This isn't done by reference counting or a garbage collector. In fact, the system has no runtime requirements. There is a drop call that deals with external resources, and can free memory - but only if it's not borrowed. And there are reference-counted containers should you need them, and they have the usual issues that come with reference counting. But if you don't need them, you won't see those issues.
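To illustrate, here is a sketch with an invented Guard type. Its Drop implementation runs automatically when the value goes out of scope, and the second function shows the reference-counted Rc container alongside an explicit drop call.

```rust
use std::cell::Cell;
use std::rc::Rc;

// Hypothetical type that records when it is dropped.
struct Guard<'a> {
    dropped: &'a Cell<bool>,
}

impl<'a> Drop for Guard<'a> {
    fn drop(&mut self) {
        self.dropped.set(true);
    }
}

// No free call anywhere: the compiler inserts the drop at the closing brace.
fn drop_at_scope_end() -> bool {
    let flag = Cell::new(false);
    {
        let _guard = Guard { dropped: &flag };
    } // _guard goes out of scope here and Drop::drop runs
    flag.get()
}

// Reference counting is opt-in: the counts before and after dropping one handle.
fn rc_counts() -> (usize, usize) {
    let a = Rc::new(String::from("shared"));
    let b = Rc::clone(&a);
    let while_shared = Rc::strong_count(&a);
    drop(b); // gives up one handle; the String lives on through `a`
    (while_shared, Rc::strong_count(&a))
}
```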

While this is all good stuff, it's unfortunately not free. Since the compiler worries about these things, you have to as well. This isn't bad, because you should be doing that anyway. But having the compiler do it formally causes problems that you might not have expected. This is much like static type checking. Even in languages with the weakest dynamic type checking, you have to worry about it. Sure, if the type system does the right things, it saves you work. But if it doesn't, you either have to do the right thing by hand or wind up chasing obscure bugs later. And every so often you have to spend a little time making the type checker happy on something that seems like it ought to just work.

For instance, you may know that you're finished with a variable, but the compiler won't. So while an action might be perfectly safe, the compiler isn't always able to figure it out. So your first code will get an error message, and you'll have to do something to let the compiler know you're done - like a drop or putting it in a nested scope. Or maybe you need to reorder statements, because while you know that the borrow that the compiler wants last won't be used until after the other one runs, the compiler doesn't - and insists that you do them in the order that won't generate errors.
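The nested-scope trick can be sketched like this, with a hypothetical function name. Newer compilers work out the end of a borrow on their own in simple cases like this one, so treat it as an illustration of the technique rather than something this exact code requires.

```rust
fn push_after_borrow() -> Vec<i32> {
    let mut values = vec![1, 2, 3];

    // The shared borrow `head` lives only inside this block; what leaves
    // the block is a plain copy of the integer.
    let first = {
        let head = &values[0];
        *head
    };

    // With the borrow gone, mutating the vector is allowed.
    values.push(first * 10);
    values
}
```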

While this may seem like a pain, I appreciate it. Sure, I may "know" these things are true. Now. But what if things change later? You - or some other maintainer - may not realize the assumption that's implicit in your original ordering, and break it. This would introduce bugs - probably very subtle ones. Like type checking, I believe that a little time up front dealing with these issues during development saves time later debugging them during maintenance.

This part of learning Rust seems to be ubiquitous, and is known as "fighting the borrow checker". Reports are that this gets to be less of a problem as you get more experience. I look forward to finding out.

Preprocessing

Rust has nice precompilation facilities. I don't believe there's an explicit preprocessor, but what they do works nicely.

There's no explicit inclusion facility. Instead, when you declare a module in a file with no body, the compiler will look for either *module*.rs or *module*/mod.rs to use at that point. This is easy to use and straightforward. There's also a use statement to import names from outside the module. It uses the :: namespace conventions. It's fairly standard, and includes the ability to have multiple names in the last level of the namespace as well as a wild card, though they don't mix.
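A sketch of modules and use, with the module and function names invented for the example. The modules are written inline here so it fits in one file; writing mod geometry; with no body instead would make the compiler load the contents from geometry.rs or geometry/mod.rs.

```rust
mod geometry {
    pub mod shapes {
        pub fn square_area(side: f64) -> f64 {
            side * side
        }

        pub fn rect_area(width: f64, height: f64) -> f64 {
            width * height
        }
    }
}

// Several names imported from the last level of the path at once.
use self::geometry::shapes::{rect_area, square_area};

fn total_area() -> f64 {
    square_area(2.0) + rect_area(2.0, 3.0)
}
```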

Conditional compilation support is also on a per-object basis, not textual. Objects declared at the top level can have attributes assigned to them. Attributes can serve the purposes of a #pragma, but can also cause the compiler to skip the object - be it a function, variable or module - they are applied to. But this happens per-object, not by eliding text. One of the more interesting attributes allows you to specify traits that the compiler should derive for your data types, like equality checking, ordering and dumping a human-readable version for debugging purposes. There are also macros that check the same things attributes do to create constant expressions at compile time, allowing you to create conditional code in an expression without having to resort to eliding text.
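The three mechanisms might look like this in a small sketch. The type and function names are made up; derive, cfg and cfg! are the standard attribute and macro.

```rust
// The compiler derives debug printing, cloning, equality and ordering.
#[derive(Debug, Clone, PartialEq, PartialOrd)]
struct Version {
    major: u32,
    minor: u32,
}

// Exactly one of these two functions is compiled, depending on the target.
#[cfg(target_os = "windows")]
fn path_separator() -> char {
    '\\'
}

#[cfg(not(target_os = "windows"))]
fn path_separator() -> char {
    '/'
}

// cfg! evaluates the same kind of condition to a constant bool inside an
// expression, so both branches are still parsed and type checked.
fn build_kind() -> &'static str {
    if cfg!(debug_assertions) {
        "debug"
    } else {
        "release"
    }
}
```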

And yes, Rust has macros. Real ones, not the simple textual substitution that C has. They are hygienic, work with the parse tree, and are in general well-behaved. They are marked in the source by ending in an !, like println! and try!. The language doesn't allow variadic functions, so macros are used for those.
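As a sketch of how a macro stands in for a variadic function, here is a made-up max_of! macro that accepts one or more expressions and recurses over them. The variables a and b introduced inside it can't collide with names at the call site, which is the hygiene mentioned above.

```rust
// Hypothetical macro: max_of!(x), max_of!(x, y), max_of!(x, y, z), ...
macro_rules! max_of {
    // Base case: a single expression is its own maximum.
    ($x:expr) => {
        $x
    };
    // Recursive case: compare the first expression with the max of the rest.
    ($x:expr, $($rest:expr),+) => {{
        let a = $x;
        let b = max_of!($($rest),+);
        if a > b { a } else { b }
    }};
}

fn largest() -> i32 {
    max_of!(3, 9, 4)
}
```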

Macros do have their own language, and aren't quite as powerful as Rust. If you want that, you can get it via a mechanism known as compiler plugins.

Environment

While there is a stand-alone Rust compiler that works about like you'd expect, there's also cargo, which works with projects. It provides tools for finding and using dependent packages, building the various parts of a project, running tests, generating documentation from comments (a lasting legacy of the literate programming movement) and the usual things. It seems to be up to modern best practices for such things.

There is a Rust mode for emacs along with the usual facilities that one wants in an editor. There's also support for Rust in a number of other editors, plugins for a variety of IDEs, and at least one Rust-specific IDE. It shouldn't be hard to find something you can work with if you want it.

The one thing I really miss is that there's no REPL for Rust. Being able to query a running system for the value, type information and other documentation is an invaluable development tool. On the other hand, the Rust compiler is very fast, so maybe not so big a deal. This also mitigates one of my other complaints - the compiler seems to give up on generating errors earlier than most. But it's so fast I've gotten into the bad habit of correcting a few and then rerunning the compiler, even though the messages are both very informative and don't often cascade.

Summary

So far, I'm impressed. Like I said, it seems to have picked the right elements from the post-C languages. The language isn't perfect. For example, even though you can pass a function as an argument, you can't use the method invocation syntax for the value, but have to deal with the self argument in the function that it's being passed to. There are other issues as well.

More distressing is that the language doesn't appear ready for production use. They have three release channels, and recommend that "unless you have a specific reason, you should be using the stable release channel. These releases are intended for a general audience." However, I continually ran into issues with the stable channel for which the recommended work-around was to use a different channel. The most damning being that you can't build the Rust standard library with the stable channel compiler. If it's not good enough for that task yet, do you want to risk it being good enough for yours?

But the borrow checker seems like a great idea. If you're looking for a new language that will change the way you think while programming, Rust is a good candidate.