How to parse command-line arguments. Four canonical documents.
A programming issue I've been dealing with quite intensively lately is how to properly parse command-line arguments in programs I write. This article will introduce four canonical documents on how to do that.
As a preliminary remark, let me say that I'm well aware of the fact that pre-built parsers exist for many programming languages and don't mean to discourage their use. This article is specifically about resources that provide general, language-agnostic information on how such a parser should behave, in case you want to or have to build your own.
I first encountered parsing command-line arguments as a considerable problem while putting together a Bash script to facilitate random note-taking in the terminal. Later, when I wrote dibm, by far most of the programming effort went into the argument parsing code (which, admittedly, is still a bit obscure in the current version – though it gets the job done – and will have to be reworked before the release of 1.2.0). Because parsing command-line arguments is a fundamental part of writing command-line utilities, not really knowing how to do it was quickly beginning to give me headaches. For dibm, relying on a pre-built parser was not a viable solution, for several reasons. So, in this case, I would have to do it manually and I wanted to do it right.
Now, when you're writing programs for Unix-like systems, the closest equivalent of getting something right is, in many cases, to follow the POSIX standard. Parsing command-line arguments is no exception and POSIX covers it quite well.
This isn't to say that POSIX should be strictly followed at all times. Sometimes there are good reasons to deviate from the standard. There is no shortage of Unix-like systems doing it, and as I'll argue later in this article, there are actually some good reasons to do that when parsing command-line arguments as well.
Still, POSIX provides a solid base that shouldn't be tossed aside lightly, especially if, like me, you're not an experienced programmer with a ton of Unix expertise. And while it may seem daunting at times, especially when you're starting out, following a well-defined standard will actually make things easier once you get the gist of it. It will save a lot of time in the long run and also help to improve code quality.
There are essentially three POSIX documents to look into as far as parsing command-line arguments is concerned:
- the Utility Conventions in the Base Definitions volume
- the specification of the getopts utility in the Shell & Utilities volume
- section A.12.2 of the Rationale volume
The Utility Conventions include a set of 14 pretty clear-cut Utility
Syntax Guidelines, 12 of which deal with command-line arguments. They cover
nearly all the basics of how parsing should be done to be POSIX-compliant. There
is, however, one rather fundamental aspect that these guidelines don't cover.
And this is where the specification of
getopts comes into play.
The getopts utility is meant to be provided as a shell built-in
by POSIX-compliant systems. Looking into its specification is advisable for two
reasons: One, it offers some clues as to what diagnostic messages from parsing
should look like. Two, and much more importantly, it contains an express
definition of what can indicate the end of options in a set of arguments:
Any of the following shall identify the end of options: the first "--" argument that is not an option-argument, finding an argument that is not an option-argument and does not begin with a '-', or encountering an error.
The critical part is between the two commas here. What it means is: The first
operand that is found also identifies the end of options, just like "--"
does, except that "--" itself is not counted as an operand. This is not obvious from the
Utility Syntax Guidelines alone. However, as I discovered later, the
rationale for Guideline 9 provided in section A.12.2 of the POSIX
Rationale volume is very explicit about this, saying:
Unless explicitly stated otherwise in the utility description, Guideline 9 requires applications to put options before operands, and requires utilities to accept any such usage without misinterpreting operands as options. For example, if an implementation of the printf utility supports a -e option as an extension, the command:
printf %s -e
must output the string "-e" without interpreting the -e as an option. Similarly, the command:
ls myfile -l
must interpret the -l argument as a second file operand, not as a -l option.
This isn't exactly trivial, especially if you're used to tools from GNU, which by default don't follow this rule and accept options after operands.
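To make the end-of-options rules more concrete, here is a minimal Python sketch of a parser that follows them. It is only an illustration under stated assumptions: the function name parse_posix is made up, the option string borrows the getopts notation (a letter followed by a colon takes an option-argument), and the error messages are placeholders rather than wording prescribed by POSIX.

```python
def parse_posix(args, optstring):
    """Split args into (options, operands), stopping at the end of options.

    Hypothetical helper for illustration only. optstring uses the getopts
    notation: "ab:c" means -a and -c are flags, -b takes an option-argument.
    """
    # Build a map of option name -> whether it takes an option-argument.
    takes_arg = {}
    i = 0
    while i < len(optstring):
        has_arg = i + 1 < len(optstring) and optstring[i + 1] == ":"
        takes_arg[optstring[i]] = has_arg
        i += 2 if has_arg else 1

    options = []
    pos = 0
    while pos < len(args):
        arg = args[pos]
        if arg == "--":
            pos += 1  # the delimiter ends the options and is not an operand
            break
        if arg == "-" or not arg.startswith("-"):
            break  # the first operand ends the options as well
        # arg is a group of one or more short options, e.g. -abc
        j = 1
        while j < len(arg):
            name = arg[j]
            if name not in takes_arg:
                raise ValueError(f"unknown option -- {name}")  # placeholder text
            if takes_arg[name]:
                value = arg[j + 1:]  # attached form: -ovalue
                if not value:  # separate form: -o value
                    pos += 1
                    if pos >= len(args):
                        raise ValueError(f"option requires an argument -- {name}")
                    value = args[pos]
                options.append((name, value))
                break
            options.append((name, None))
            j += 1
        pos += 1
    return options, args[pos:]
```

With this sketch, parse_posix(["myfile", "-l"], "l") yields no options and the two operands myfile and -l, which is the behavior the ls example from the rationale asks for.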
While dealing with POSIX, I more or less accidentally discovered that The
Art of Unix Programming by Eric S. Raymond also contains a section on
command-line options. It is a very useful supplement to what POSIX specifies
because it explicitly talks about those things that are largely out of the scope
of a formal standard: traditions, conventions, styles and where they originated,
as well as best practices. The most immediately useful part of it is probably
The -a to -z of Command-Line Options, a survey of which letters are
conventionally used for which kinds of options. Besides that, Raymond provides
a comparison of the Unix style, GNU style and what he calls the
X toolkit style of command-line options that is also worth reading.
Apropos GNU style: As you may have noticed, especially if you're usually using GNU tools, POSIX does not define so-called long options. I don't have a strong opinion on whether long options are good or bad and I tend to think that such general judgement is not applicable here. The good thing about GNU-style long options (double hyphen plus keyword) is that, contrary to the X toolkit style or MIT style (single hyphen plus keyword), they are completely compatible with POSIX. Also, keywords are certainly more memorable than single letters bound to a certain functionality. Even more so if there are some reserved keywords like in GNU.
On the other hand, keyword-based options facilitate feature creep and, as a
consequence, interfaces that are hard to handle. POSIX says that
"[e]ach option name should be a single alphanumeric character
(the alnum character classification) from the portable character set." The
portable character set is a subset of ASCII that contains all the letters of the
English alphabet (uppercase and lowercase) and the digits 0 through 9. (It
contains more than that, but letters and digits are what matters here.) Within
this scheme, the maximum possible number of options is 62, and things will begin
to look ridiculously complicated way before that. (Raymond also says something
about this.) But with keywords, there is no readily calculable limit to the
number of options and they invite implementing additional options much more than
the POSIX scheme because they are descriptive.
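The figure of 62 is easy to check: 26 uppercase letters, 26 lowercase letters and 10 digits. The following Python lines are just a sketch of that arithmetic; the names PORTABLE_ALNUM and is_posix_option_name are invented for this example and don't come from any standard.

```python
import string

# Letters and digits of the portable character set: 26 + 26 + 10 characters.
PORTABLE_ALNUM = string.ascii_letters + string.digits


def is_posix_option_name(name):
    """Check whether name is a single alphanumeric character (hypothetical helper)."""
    return len(name) == 1 and name in PORTABLE_ALNUM
```

A parser could use such a check to reject option names like @ or multi-character keywords when strict adherence to the guideline is wanted.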
A prominent example offering the worst of both worlds is rsync.
Counting out its daemon mode, this tool has 49 POSIX-style short options, plus
-@, which isn't alphanumeric. 47 of these options have GNU-style
long-option equivalents. On top of that, there are another 80 GNU-style long options.
The GNU Coding Standards
demand that a program should at least accept the --help and
--version options. I think this is a good idea in general because
it reduces ambiguity. Any sane program for which these options are valid will
interpret them as meaning ‘show help text’ and ‘show version information’,
respectively. The -h option, on the other hand, might or might not
have that meaning. For example, it has a very different meaning in
chgrp, where it is used to handle symbolic
links. And, as Raymond notes,
-h being equivalent to ‘show help text’
...is actually less common than one might expect offhand — for much of Unix's early history developers tended to think of on-line help as memory-footprint overhead they couldn't afford. Instead they wrote manual pages[.]
-v might increase output verbosity and
-V show a program's version, or vice versa (like with
chattr), or both might be interpreted as something entirely
different. See, for example, the definition of -v in
awk and that of
-V in GNU's
--version is about as
clear-cut as it gets.
As supporting these two long options as aliases to the corresponding short options
is not a violation of POSIX, I do that whenever I write a program that takes options.
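As a rough illustration of that approach, the following Python sketch accepts the two long options next to short ones. It relies on getopt.getopt() from the standard library, which stops at the first non-option argument. The choice of -h and -V as the short counterparts is an assumption made for this example, and parse_standard_options is a made-up name.

```python
import getopt


def parse_standard_options(args):
    """Return (actions, operands) for a hypothetical utility.

    Assumption: -h and -V are the short options for help and version.
    """
    # getopt.getopt() stops scanning at the first non-option argument.
    opts, operands = getopt.getopt(args, "hV", ["help", "version"])
    actions = []
    for name, _value in opts:
        if name in ("-h", "--help"):
            actions.append("help")
        elif name in ("-V", "--version"):
            actions.append("version")
    return actions, operands
```

Because scanning stops at the first operand, a call with ["file", "--version"] reports no actions and treats both arguments as operands.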
 Counted using
wc on the options summary from the manual page of rsync 3.1.3. Beware that
wc -l only counts lines that are terminated by a newline. Also, the
-h short option was counted twice because its meaning differs
depending on whether it is specified with or without any other arguments.