How to parse command-line arguments. Four canonical documents.
A programming issue I've been dealing with quite intensively lately is how to properly parse command-line arguments in programs I write. This article will introduce four canonical documents on how to do that.
As a preliminary remark, let me say that I'm well aware that pre-built parsers exist for many programming languages, and I don't mean to discourage their use. This article is specifically about resources that provide general, language-agnostic information on how such a parser should behave, in case you want to or have to build your own.
I first encountered parsing command-line arguments as a considerable problem while putting together a Bash script to facilitate random note-taking in the terminal. Later, when I wrote dibm, the argument parsing code took up by far the largest share of the programming effort (and, admittedly, it is still a bit obscure in the current version – though it gets the job done – and will have to be reworked before the release of 1.2.0). Because parsing command-line arguments is a fundamental part of writing command-line utilities, not really knowing how to do it was quickly beginning to give me headaches. For dibm, relying on a pre-built parser was not a viable solution, for several reasons. So, in this case, I would have to do it manually and I wanted to do it right.
Going POSIX
Now, when you're writing programs for Unix-like systems, the closest equivalent of getting something right is, in many cases, to follow the POSIX standard. Parsing command-line arguments is no exception and POSIX covers it quite well.
This isn't to say that POSIX should be strictly followed at all times. Sometimes there are good reasons to deviate from the standard. There is no shortage of Unix-like systems doing it, and as I'll argue later in this article, there are actually some good reasons to do that when parsing command-line arguments as well.
Still, POSIX provides a solid base that shouldn't be tossed aside lightly, especially if, like me, you're not an experienced programmer with a ton of Unix expertise. And while it may seem daunting at times, especially when you're starting out, following a well-defined standard will actually make things easier once you get the gist of it. It will save a lot of time in the long run and also help to improve code quality.
There are essentially three POSIX documents to look into as far as parsing command-line arguments is concerned:
- the Utility Conventions in the Base Definitions volume
- the specification of getopts
- section A.12.2 of the Rationale volume
The Utility Conventions include a set of 14 pretty clear-cut Utility Syntax Guidelines, 12 of which deal with command-line arguments. They cover nearly all the basics of how parsing should be done to be POSIX-compliant. There is, however, one rather fundamental aspect that these guidelines don't cover. And this is where the specification of getopts comes into play.
The getopts utility is meant to be provided as a shell built-in by POSIX-compliant systems. Looking into its specification is advisable for two reasons: One, it offers some clues as to what diagnostic messages from parsing should look like. Two, and much more importantly, it contains an express definition of what can indicate the end of options in a set of arguments:
Any of the following shall identify the end of options: the first "--" argument that is not an option-argument, finding an argument that is not an option-argument and does not begin with a '-', or encountering an error.
The critical part is the one between the two commas. What it means is that the first operand found also identifies the end of options, just as -- does. The difference is that -- itself is not counted as an operand. This is not obvious from the Utility Syntax Guidelines alone. However, as I discovered later, the rationale for Guideline 9 provided in section A.12.2 of the POSIX Rationale volume is very explicit about this, saying:
Unless explicitly stated otherwise in the utility description, Guideline 9 requires applications to put options before operands, and requires utilities to accept any such usage without misinterpreting operands as options. For example, if an implementation of the printf utility supports a -e option as an extension, the command:
printf %s -e
must output the string "-e" without interpreting the -e as an option. Similarly, the command:
ls myfile -l
must interpret the -l argument as a second file operand, not as a -l option.
This isn't exactly trivial, especially if you're used to tools from GNU, which by default don't follow this rule.
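To make the first-operand rule concrete, here is a minimal sketch built on the getopts shell built-in. The utility it parses for is hypothetical: the function name parse_args, the -a flag and the -o option with its option-argument are made up for illustration only.

```sh
# Hypothetical utility: -a is a flag, -o takes an option-argument.
# The body runs in a subshell, so nothing leaks into the caller.
parse_args() (
    OPTIND=1    # restart getopts in case it was used before
    # The leading ':' selects silent error reporting, so we print
    # our own diagnostics.
    while getopts ':ao:' opt; do
        case $opt in
            a)  printf 'option: a\n' ;;
            o)  printf 'option: o=%s\n' "$OPTARG" ;;
            :)  printf 'missing argument for -%s\n' "$OPTARG" >&2; exit 2 ;;
            \?) printf 'unknown option -%s\n' "$OPTARG" >&2; exit 2 ;;
        esac
    done
    # getopts stopped at "--", at the first operand, or on an error.
    # Everything from OPTIND onwards is an operand.
    shift $((OPTIND - 1))
    for operand in "$@"; do
        printf 'operand: %s\n' "$operand"
    done
)
```

Called as parse_args -a file -o x, this reports a as an option and file, -o and x as operands, because file already ends the options. Called as parse_args -a -- -o x, it reports a as an option and -o and x as operands, while the -- itself is consumed.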
Beyond POSIX
While dealing with POSIX, I more or less accidentally discovered that The Art of Unix Programming by Eric S. Raymond also contains a chapter on command-line options. It is a very useful supplement to what POSIX specifies because it explicitly talks about those things that are largely out of the scope of a formal standard: traditions, conventions, styles and where they originated, as well as best practices. The most immediately useful part of it is probably The -a to -z of Command-Line Options, a survey of which letters are conventionally used for which kinds of options. Besides that, Raymond provides a comparison of the Unix style, GNU style and what he calls the X toolkit style of command-line options that is also worth reading.
Apropos GNU style: As you may have noticed, especially if you're usually using GNU tools, POSIX does not define so-called long options. I don't have a strong opinion on whether long options are good or bad, and I tend to think that such a general judgement is not applicable here. The good thing about GNU-style long options (double hyphen plus keyword) is that, unlike the X toolkit style or MIT style (single hyphen plus keyword), they are completely compatible with POSIX. Also, keywords are certainly more memorable than single letters bound to a certain functionality. Even more so if there are some reserved keywords, as in GNU.
On the other hand, keyword-based options facilitate feature creep and, as a consequence, interfaces that are hard to handle. POSIX says that
[e]ach option name should be a single alphanumeric character (the alnum character classification) from the portable character set.
The portable character set is a subset of ASCII that contains all the letters of the English alphabet (uppercase and lowercase) and the digits 0 through 9. (It contains more than that, but letters and digits are what matters here.) Within this scheme, the maximum possible number of options is 62, and things will begin to look ridiculously complicated way before that. (Raymond also says something about this.) But with keywords, there is no readily calculable limit to the number of options and they invite implementing additional options much more than the POSIX scheme because they are descriptive.
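The limit of 62 is simply 26 uppercase letters plus 26 lowercase letters plus 10 digits. As a small sketch, a hand-written parser could check candidate option names against that set. The names portable_alnum, max_posix_option_names and is_posix_option_name are my own and not part of any standard.

```sh
# All letters and digits of the portable character set.
portable_alnum='ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789'

# Prints how many option names the POSIX scheme allows (26 + 26 + 10).
max_posix_option_names() {
    printf '%s\n' "${#portable_alnum}"
}

# Succeeds if $1 is exactly one character from the set above.
is_posix_option_name() {
    [ "${#1}" -eq 1 ] || return 1
    case $portable_alnum in
        *"$1"*) return 0 ;;    # quoted, so $1 is matched literally
        *)      return 1 ;;
    esac
}
```

With these definitions, is_posix_option_name a and is_posix_option_name 7 succeed, while is_posix_option_name @ and is_posix_option_name ab fail. The characters are listed explicitly rather than as ranges like A-Z so that the check does not depend on the current locale.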
A prominent example offering the worst of both worlds is rsync. Leaving aside its daemon mode, this tool has 49 POSIX-style short options, plus -@, which isn't alphanumeric. 47 of these options have GNU-style long-option equivalents. On top of that, there are another 80 GNU-style long options.[1]
The GNU Coding Standards demand that a program should at least accept the --help and --version options. I think this is a good idea in general because it reduces ambiguity. Any sane program for which these options are valid will interpret them as meaning ‘show help text’ and ‘show version information’, respectively. A -h option, on the other hand, might or might not have that meaning. For example, it has a very different meaning in chown and chgrp, where it is used to handle symbolic links. And, as Raymond notes, -h being the equivalent of ‘show help text’...
...is actually less common than one might expect offhand — for much of Unix's early history developers tended to think of on-line help as memory-footprint overhead they couldn't afford. Instead they wrote manual pages[.]
Similarly, -v might increase output verbosity and -V show a program's version, or vice versa (like with chattr), or both might be interpreted as something entirely different. See, for example, the definitions of -v in grep or awk, and that of -V in GNU's implementation of tar. By contrast, --version is about as clear-cut as it gets.
As supporting these two long options as aliases to -h and -V is not a violation of POSIX, I do that whenever I write a program that takes arguments.
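A minimal sketch of what this can look like in a hand-written parsing loop follows. Everything about the utility is hypothetical: the name mytool, its -q flag, the usage text and the version string are placeholders. The sketch deliberately leaves out grouped short options such as -qV and options that take arguments.

```sh
# Hypothetical utility "mytool"; all names and messages are placeholders.
mytool() (
    quiet=0
    while [ $# -gt 0 ]; do
        case $1 in
            -h|--help)    printf 'usage: mytool [-hqV] [file...]\n'; exit 0 ;;
            -V|--version) printf 'mytool 0.0.0\n'; exit 0 ;;
            -q)           quiet=1; shift ;;
            --)           shift; break ;;    # explicit end of options
            -?*)          printf 'mytool: unknown option: %s\n' "$1" >&2
                          exit 2 ;;
            *)            break ;;           # first operand ends the options
        esac
    done
    printf 'quiet=%s\n' "$quiet"
    for operand in "$@"; do
        printf 'operand: %s\n' "$operand"
    done
)
```

Here mytool --help and mytool -h print the same usage line, and mytool --version and mytool -V print the same version line. In mytool file --help, the --help comes after the first operand and is therefore treated as a second operand, in line with Guideline 9. A lone - does not match the -?* pattern and is treated as an operand.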
[1] Counted using grep and wc on the options summary from the manual page of rsync 3.1.3. Beware that wc -l only counts lines that are terminated by a newline. Also, the -h short option was counted twice because its meaning differs depending on whether it is specified with or without any other arguments.
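A count of this kind can be sketched roughly as follows. These are not the exact commands I used. The sample summary belongs to a made-up tool, and its layout (optional leading spaces, short option first, long-only options indented) is an assumption. The function names are hypothetical as well.

```sh
# Made-up options summary for illustration; not taken from any real tool.
sample_summary() {
    printf '%s\n' \
        ' -v, --verbose   say more' \
        ' -q, --quiet     say less' \
        '     --dry-run   change nothing'
}

# Counts lines on stdin whose first token is a short option:
# a hyphen followed by something other than a second hyphen.
count_short_options() {
    grep -c '^[[:space:]]*-[^-]'
}

# Counts lines on stdin whose first token is a long option,
# that is, long options without a short equivalent.
count_long_only_options() {
    grep -c '^[[:space:]]*--'
}
```

With this sample, sample_summary | count_short_options prints 2 and sample_summary | count_long_only_options prints 1. Using grep -c instead of piping into wc -l also sidesteps the newline caveat, at least with GNU grep, which counts a final line even if it lacks the terminating newline.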