The missing guide for Arch Linux PKGBUILD's pkgver() version numbers

Pacman's version comparison algorithm was designed over a decade ago to properly sort many categories of real-world version numbers, and is now set in stone, quirks and all. Later on, the AUR developed pkgver() conventions and templates which turn Git commits into version numbers that would sort properly in Pacman. But what are Pacman's requirements for sorting real-world version numbers, how does Pacman's version comparison algorithm work, and how are AUR pkgver() built around the algorithm?

How Pacman compares versions

vercmp is a command-line utility which takes two string arguments and compares them using Pacman's version comparison algorithm.

The vercmp executable exposes the algorithm used by Pacman to determine whether a different package version is newer than what you have currently installed. Sadly, https://man.archlinux.org/man/vercmp.8 (as well as the pacman manpage) is inadequate and fails to explain the algorithm, only providing a few examples.

Requirements for comparing versions

Pacman needs to compare the versions of real-world software programs and its own conventions correctly:

Pacman's version comparison algorithm also has incidental properties that I don't consider to be first principles. However, AUR pkgver() depend on certain ones to generate unusual-looking unintuitive version numbers that nonetheless sort properly in Pacman.

Algorithm implementation

The algorithm is implemented in alpm_pkg_vercmp() in the Pacman source code (:lib/libalpm/version.c). The file is 260 lines of code, with multiple functions dedicated to different aspects of version comparison. The algorithm is written in raw C, with glorious null-terminated strings, and string slicing implemented via const-incompatible null byte insertion. 😿

Epoch, version, and release

parseEVR() parses Arch package versions using the format [epoch:]version[-release]. More specifically, all characters after the last hyphen form the release (even if there are colons afterwards), and the epoch is "0" unless the first non-digit is a colon. If no epoch is present, the epoch is labeled as 0.

parseEVR() allows only numbers in the epoch field. It is usually absent, but can be used as a "major version" to ensure that newer program versions compare higher, even if the newer program's version number (stored in the version field) is lower than in older versions.

The release field is an optional location for "build metadata". A version with no release field is considered equal to otherwise-identical versions with any release field, but two otherwise-identical versions with different release fields use the release field to break ties.

Comparing versions

Each field is then compared using rpmvercmp(). Missing epochs are assumed to be 0, and missing releases are assumed to be equal to any numbered release.

rpmvercmp() decomposes its input into "segments", where each segment starts with 0 or more "separator" characters (any non-alphanumeric character), which are followed by 1 or more "body" characters (each body contains either alphabetic characters or numeric characters, so "1a" is 2 segments). The input may be terminated by a "dangling" segment with only separator characters and no body (but realistic version numbers will not have a dangling segment).

This can be modeled as the regex ([^a-zA-Z0-9]* ([a-zA-Z]+ | [0-9]+) )* [^a-zA-Z0-9]* more or less.

Both inputs are split into segments (including dangling segments), starting at the beginning. The algorithm loops over segments from both inputs, starting with the first segment from each, until either input runs out of segments entirely (one or both segments are absent).

Each loop iteration receives one segment from each version, for as long as both versions have segments remaining:

The function returns immediately if the loop finds a pair of segments that compare unequal. Otherwise the loop stops (without stripping separators) when one or both inputs reach the end of line, or breaks (after stripping separators) when one or both inputs reach a final dangling segment.

At this point, one of these is true:

The segments are compared as follows:

The algorithm is complete.

All dangling segments compare equal to one another, but come after "segment with text" and "no segment" and before "segment with number".

Unfortunately this algorithm has a cycle, caused by how more leading separators wins a version comparison (even if followed by a losing body) if both segments have bodies, but gets ignored if one or both segments are empty after trimming.

Note that 1. and 1... are interchangeable, because the dangling separators get stripped out either way.

The Pacman developers commented, "Fun example :) Like I said, having multiple delimiters in a row doesn't make a lot of sense, so that is pretty much undefined behaviour"

Testing the requirements

Dangling segments and multiple separators don't occur in real-world version numbers and can be ignored. Does this algorithm properly order real-world versions?

Yes, "beta" < "".

Yes, "" < ".1".

Yes, ".1" < ".2".

Yes, "" < ".alpha"

What is PKGBUILD and pkgver?

PKGBUILD files are shell scripts defining variables and functions used by makepkg to build a binary package. The pkgver variable serves as the version number of the PKGBUILD and the package produced. All PKGBUILD files contain a pkgver variable, storing the package's version at the time the file was written. However, this is insufficient for VCS/-git packages tracking the latest commit in a Git repository, where the version of software built by a PKGBUILD can change even when the PKGBUILD does not. To accommodate this, makepkg also supports a pkgver() function, which when run produces the current version of the package.

If pkgver is a variable only, then an unmodified PKGBUILD and pkgver means the package has not been updated. But if a pkgver() function is present, then an AUR helper trying to determine if an installed package is outdated must re-clone/pull the VCS repo listed in source=(...) and call pkgver() again, even if the PKGBUILD and pkgver are unmodified.

If a pkgver() function is present, then running makepkg to build the PKGBUILD into a binary package also rewrites the PKGBUILD file with a new value for the pkgver variable. A few fixed-version packages like qt5-base and qt5-wayland use this property by defining a pkgver() function to automatically recompute complex version numbers. Unlike -git packages in the AUR, these PKGBUILDs build a fixed version of the source code, and their pkgver() functions return a fixed value.

Building a pkgver() so Pacman sorts Git repositories correctly

Git repositories in the wild have a lot of variance; some don't have tags, some have tags that sort properly, and some have tags in the wrong order. And some repositories start with no tags, but create tags later on when they make their first release.

Requirements for comparing versions

What are the requirements for generating version numbers from a Git repository?

How can we achieve these criteria, given how Pacman works?

Arch Wiki templates

The Arch wiki provides copy-paste snippets of example pkgver() functions, but fails to explain the underlying concepts (what git describe outputs, what the sed expression does, how the resulting expression is evaluated by vercmp and pacman).

Untagged Git repositories

In a Git repo where the history of master has no tags, the recommended pkgver() counts commits:

pkgver() {
  cd "$pkgname"
  printf "r%s.%s" "$(git rev-list --count HEAD)" "$(git rev-parse --short HEAD)"
}

This produces a string r{number of commits}.{commit hash}.

Any letter would work equally well for the version comparison algorithm, r was chosen because it sounds like "revision". But what is the purpose of a letter?

Tagged Git repositories

If the repo has tags like 0.2.5 which begin with a number (no leading "v" prefix like v0.2.5), git describe --long --tags can be used as the root source for the version:

pkgver() {
  cd "$pkgname"
  git describe --long --tags | sed 's/\([^-]*-g\)/r\1/;s/-/./g'
}

git describe --long produces a string with format {most recent tag}-{commits since tag}-g{commit hash}.

git checkout master
git describe --long --tags  # v2.4-25-ga240b43
git checkout v2.4  # or git checkout HEAD~25
git describe --long --tags  # v2.4-0-g51e51f4

The sed expression turns it into {most recent tag}.r{commits since tag}.g{commit hash}.

Testing the requirements

"r###" < "r###+1"? Trivially so, as the "r" segment is the same, but the "number of commits" segment increases.

"r###" < "1.0.r###"? Yes. The version of the untagged repository starts with a "r" segment. The version of the tagged repository starts with a numeric segment (taken from the tag), which comes after.

"1.0.r###" < "1.0.r###+1"? Yes. "most recent tag" is unchanged, ".r" is unchanged, and "commits since tag" increases.

"1.0.r###" < "1.1.r###"? Yes. "most recent tag" increases.

"1.0.r###" < "1.0.1.r###"? Yes. "1"="1", ".0"=".0", and ".r" < ".1".

Why not "."?

If tag-based versions were {most recent tag}.{commits since tag}..., then if the most recent tag changes from 1.0 to 1.0.1, the version would change from "1.0.###" to "1.0.1.###", where ".1" sorts before ".###" despite 1.0.1 being a newer program version.

This was first brought up by @diabonas:archlinux.org:

You need it because otherwise 1.0.500 (where 500 is the revision count) would be newer than 1.0.1.30 (again, 30 is the revision count) - this doesn't happen with 1.0.r500, which is older than 1.0.1.r30 because a letter is always older than a digit

Why not "r"?

It's 1.0.1.r30 - the dot is important as 1.0.1r30 would be older than 1.0.1, but 1.0.1.r30 is newer - it's a revision after 1.0.1 after all. And yeah, 1.0r31 is a revision after 1.0, but before the next upstream release 1.0.1, whole 1.0.1.r31 is a revision after 1.0.1, so newer than 1.0.r30

The Arch Wiki is wrong

The Arch wiki's stated requirements for generating version numbers are:

It is recommended to have following version format: RELEASE.rREVISION where REVISION is a monotonically increasing number that uniquely identifies the source tree (VCS revisions do this).

The Arch wiki is wrong; given the "RELEASE.RELEASE.rREVISION" convention recommended by the wiki, for Pacman to properly identify older and newer packages, REVISION does not need to be globally monotonic, only within a given RELEASE. And the Arch wiki even breaks its own rules: the example "Git with tags" pkgver()'s REVISION is not monotonic except within a given RELEASE (Git tag).

Even if the Arch wiki was changed to say that REVISION needs to be monotonic within a given RELEASE, it states that 0.1.r456 > r454 but 0.1.456 < 454, without explaining the algorithm used to compare revisions. This only serves to confuse the reader.