How to parse command-line arguments. Four canonical documents.

A programming issue I've been dealing with quite intensively lately is how to properly parse command-line arguments in programs I write. This article will introduce four canonical documents on how to do that.

As a preliminary remark, let me say that I'm well aware that pre-built parsers exist for many programming languages, and I don't mean to discourage their use. This article is specifically about resources that provide general, language-agnostic information on how such a parser should behave, in case you want to or have to build your own.

I first encountered parsing command-line arguments as a considerable problem while putting together a Bash script to facilitate random note-taking in the terminal. Later, when I wrote dibm, by far most of the programming effort went into the argument parsing code (which, admittedly, is still a bit obscure in the current version – though it gets the job done – and will have to be reworked before the release of 1.2.0). Because parsing command-line arguments is a fundamental part of writing command-line utilities, not really knowing how to do it was quickly beginning to give me headaches. For dibm, relying on a pre-built parser was not a viable solution, for several reasons. So, in this case, I would have to do it manually and I wanted to do it right.

Going POSIX

Now, when you're writing programs for Unix-like systems, the closest equivalent of getting something right is, in many cases, to follow the POSIX standard. Parsing command-line arguments is no exception and POSIX covers it quite well.

This isn't to say that POSIX should be strictly followed at all times. Sometimes there are good reasons to deviate from the standard. There is no shortage of Unix-like systems doing it, and as I'll argue later in this article, there are actually some good reasons to do that when parsing command-line arguments as well.

Still, POSIX provides a solid base that shouldn't be tossed aside lightly, especially if, like me, you're not an experienced programmer with a ton of Unix expertise. And while it may seem daunting at times, especially when you're starting out, following a well-defined standard will actually make things easier once you get the gist of it. It will save a lot of time in the long run and also help to improve code quality.

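There are essentially three POSIX documents to look into as far as parsing command-line arguments is concerned: the Utility Conventions, the specification of the getopts utility, and the rationale for the Utility Syntax Guidelines in the Rationale volume.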

The Utility Conventions include a set of 14 pretty clear-cut Utility Syntax Guidelines, 12 of which deal with command-line arguments. They cover nearly all the basics of how parsing should be done to be POSIX-compliant. There is, however, one rather fundamental aspect that these guidelines don't cover. And this is where the specification of getopts comes into play.

The getopts utility is meant to be provided as a shell built-in by POSIX-compliant systems. Looking into its specification is advisable for two reasons: One, it offers some clues as to what diagnostic messages from parsing should look like. Two, and much more importantly, it contains an express definition of what can indicate the end of options in a set of arguments:

Any of the following shall identify the end of options: the first "--" argument that is not an option-argument, finding an argument that is not an option-argument and does not begin with a '-', or encountering an error.

The critical part here is between the two commas. What it means is: the first operand that is found also identifies the end of options, just like -- does. The only difference is that -- itself is not counted as an operand, whereas the first operand of course is. This is not obvious from the Utility Syntax Guidelines alone. However, as I discovered later, the rationale for Guideline 9 provided in section A.12.2 of the POSIX Rationale volume is very explicit about this, saying:

Unless explicitly stated otherwise in the utility description, Guideline 9 requires applications to put options before operands, and requires utilities to accept any such usage without misinterpreting operands as options. For example, if an implementation of the printf utility supports a -e option as an extension, the command:

printf %s -e

must output the string "-e" without interpreting the -e as an option. Similarly, the command:

ls myfile -l

must interpret the -l argument as a second file operand, not as a -l option.

This isn't exactly trivial, especially if you're used to tools from GNU, which by default accept options after operands and therefore don't follow this rule.
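To make the end-of-options rule concrete, here is a minimal sketch of a parser that stops at the first operand or at --, whichever comes first. It is an illustration under stated assumptions, not a reference implementation: the function name parse_posix is hypothetical, the option string borrows the getopts convention of a trailing colon for options that take an option-argument, and the wording of the error messages is made up for this example.

```python
def parse_posix(args, optstring):
    """Split args into (options, operands), stopping at the first operand.

    optstring follows the getopts convention: a letter followed by ':'
    takes an option-argument. A leading ':' (silent mode) is not handled.
    """
    # Build a table: option letter -> does it take an option-argument?
    takes_arg = {}
    i = 0
    while i < len(optstring):
        has_arg = i + 1 < len(optstring) and optstring[i + 1] == ":"
        takes_arg[optstring[i]] = has_arg
        i += 2 if has_arg else 1

    options = []
    pos = 0
    while pos < len(args):
        arg = args[pos]
        if arg == "--":
            pos += 1  # the delimiter ends the options and is not an operand
            break
        if arg == "-" or not arg.startswith("-"):
            break  # the first operand ends the options as well
        # Walk through the letters grouped behind a single hyphen.
        j = 1
        while j < len(arg):
            letter = arg[j]
            if letter not in takes_arg:
                # Error wording is an assumption made for this sketch.
                raise ValueError(f"unknown option -- {letter}")
            if takes_arg[letter]:
                value = arg[j + 1:]  # option-argument attached, as in -oout
                if not value:
                    pos += 1  # option-argument is the next argument
                    if pos >= len(args):
                        raise ValueError(f"option requires an argument -- {letter}")
                    value = args[pos]
                options.append((letter, value))
                break
            options.append((letter, None))
            j += 1
        pos += 1
    # Everything that is left is an operand, even if it begins with '-'.
    return options, args[pos:]


print(parse_posix(["myfile", "-l"], "l"))   # ([], ['myfile', '-l'])
print(parse_posix(["-l", "--", "-x"], "l"))  # ([('l', None)], ['-x'])
```

With this behavior, the ls example from the rationale comes out as required: -l after myfile is returned as a second operand, not as an option.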

Beyond POSIX

While dealing with POSIX, I more or less accidentally discovered that The Art of Unix Programming by Eric S. Raymond also contains a chapter on command-line options. It is a very useful supplement to what POSIX specifies because it explicitly talks about those things that are largely out of the scope of a formal standard: traditions, conventions, styles and where they originated, as well as best practices. The most immediately useful part of it is probably The -a to -z of Command-Line Options, a survey of which letters are conventionally used for which kinds of options. Besides that, Raymond provides a comparison of the Unix style, GNU style and what he calls the X toolkit style of command-line options that is also worth reading.

Apropos GNU style: As you may have noticed, especially if you're usually using GNU tools, POSIX does not define so-called long options. I don't have a strong opinion on whether long options are good or bad and I tend to think that such general judgement is not applicable here. The good thing about GNU-style long options (double hyphen plus keyword) is that, contrary to the X toolkit style or MIT style (single hyphen plus keyword), they are completely compatible with POSIX. Also, keywords are certainly more memorable than single letters bound to a certain functionality. Even more so if there are some reserved keywords like in GNU.
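One way to see why single-hyphen keywords sit less comfortably with POSIX is Guideline 5, which allows several one-letter options to be grouped behind one hyphen. The following sketch (the helper name expand_group is hypothetical) shows how a parser following that guideline would read an X-toolkit-style option such as -display. A double-hyphen keyword cannot be mistaken for such a group.

```python
def expand_group(arg):
    """Read a single-hyphen argument as a group of one-letter options."""
    return ["-" + letter for letter in arg[1:]]


# An X-toolkit-style keyword option, read as grouped single-letter options:
print(expand_group("-display"))
# ['-d', '-i', '-s', '-p', '-l', '-a', '-y']
```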

On the other hand, keyword-based options facilitate feature creep and, as a consequence, interfaces that are hard to handle. POSIX says that "[e]ach option name should be a single alphanumeric character (the alnum character classification) from the portable character set." The portable character set is a subset of ASCII that contains all the letters of the English alphabet (uppercase and lowercase) and the digits 0 through 9. (It contains more than that, but letters and digits are what matters here.) Within this scheme, the maximum possible number of options is 62, and things will begin to look ridiculously complicated way before that. (Raymond also says something about this.) With keywords, by contrast, there is no readily calculable limit to the number of options. And because keywords are descriptive, they invite adding further options much more than the POSIX scheme does.
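The figure of 62 is simply the number of uppercase letters, lowercase letters and digits added together. As a quick check, using the constants from Python's standard string module:

```python
import string

# Letters and digits that qualify as POSIX option names.
option_names = string.ascii_uppercase + string.ascii_lowercase + string.digits

print(len(option_names))  # 26 + 26 + 10 = 62
```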

A prominent example offering the worst of both worlds is rsync. Leaving aside its daemon mode, this tool has 49 POSIX-style short options, plus -@, which isn't alphanumeric. 47 of these options have GNU-style long-option equivalents. On top of that, there are another 80 GNU-style long options.[1]

The GNU Coding Standards demand that a program should at least accept the --help and --version options. I think this is a good idea in general because it reduces ambiguity. Any sane program for which these options are valid will interpret them as meaning ‘show help text’ and ‘show version information’, respectively. A -h option, on the other hand, might or might not have that meaning. For example, it has a very different meaning in chown and chgrp, where it is used to handle symbolic links. And, as Raymond notes, -h being the equivalent of ‘show help text’...

...is actually less common than one might expect offhand — for much of Unix's early history developers tended to think of on-line help as memory-footprint overhead they couldn't afford. Instead they wrote manual pages[.]

Similarly, -v might increase output verbosity and -V show a program's version, or vice versa (like with chattr), or both might be interpreted as something entirely different. See, for example, the definitions of -v in grep or awk, and that of -V in GNU's implementation of tar. In contrast, --version is about as clear-cut as it gets.

As supporting these two long options as aliases to -h and -V is not a violation of POSIX, I do that whenever I write a program that takes arguments.
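A minimal sketch of how such aliasing could look is to translate the two long options into their short equivalents before the actual parsing starts. The names LONG_ALIASES and translate_long_aliases are hypothetical, and the sketch assumes a program none of whose options takes a separate option-argument. Otherwise the translation would have to happen inside the parser itself.

```python
# The only two long options this sketch knows about.
LONG_ALIASES = {"--help": "-h", "--version": "-V"}


def translate_long_aliases(args):
    """Replace --help and --version by -h and -V, up to the end of options."""
    result = []
    for index, arg in enumerate(args):
        if arg == "--" or arg == "-" or not arg.startswith("-"):
            # End of options reached: copy the remainder unchanged.
            result.extend(args[index:])
            break
        result.append(LONG_ALIASES.get(arg, arg))
    return result


print(translate_long_aliases(["-v", "--version", "file", "--help"]))
# ['-v', '-V', 'file', '--help']
```

Note that the --help after file is left alone, since it follows the first operand and is therefore an operand itself.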


[1] Counted using grep and wc on the options summary from the manual page of rsync 3.1.3. Beware that wc -l only counts lines that are terminated by a newline. Also, the -h short option was counted twice because its meaning differs depending on whether it is specified with or without any other arguments.