Expand VictoriaMetrics WITH templates to canonical PromQL

PromQL query with WITH templates:

Resulting canonical PromQL:

Tutorial for WITH templates

Let's look at the following real query from the Node Exporter Full dashboard:

((node_memory_MemTotal_bytes{instance=~"$node:$port", job=~"$job"} - node_memory_MemFree_bytes{instance=~"$node:$port", job=~"$job"}) /
node_memory_MemTotal_bytes{instance=~"$node:$port", job=~"$job"}) * 100

It is clear the query calculates the percentage of used memory for the given $node, $port and $job. Isn't it? :)

What's wrong with this query? It copy-pastes the same label filters for each time series, which makes it easy to mistype them when modifying the query. Let's simplify the query with WITH templates:

WITH (
    commonFilters = {instance=~"$node:$port",job=~"$job"}
)
(node_memory_MemTotal_bytes{commonFilters} - node_memory_MemFree_bytes{commonFilters}) /
    node_memory_MemTotal_bytes{commonFilters} * 100
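
To make the substitution concrete, here is a minimal Python sketch that expands the commonFilters template by plain text replacement. It only illustrates the idea: the helper name expand_filters is hypothetical, and this is not how VictoriaMetrics actually parses or expands queries.

```python
# Hypothetical illustration: expand a named filter template by text substitution.
COMMON_FILTERS = 'instance=~"$node:$port",job=~"$job"'

TEMPLATE = (
    "(node_memory_MemTotal_bytes{commonFilters} - node_memory_MemFree_bytes{commonFilters}) /\n"
    "    node_memory_MemTotal_bytes{commonFilters} * 100"
)


def expand_filters(query: str, name: str, filters: str) -> str:
    """Replace every `{name}` placeholder with `{filters}`."""
    return query.replace("{" + name + "}", "{" + filters + "}")


expanded = expand_filters(TEMPLATE, "commonFilters", COMMON_FILTERS)
print(expanded)
```

The printed query contains the label filters three times again, which is exactly the repetition the template hides.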

Now the label filters are located in a single place instead of three. The query still mentions the node_memory_MemTotal_bytes metric twice and {commonFilters} three times. WITH templates may improve this:

WITH (
    my_resource_utilization(free, limit, filters) = (limit{filters} - free{filters}) / limit{filters} * 100
)
my_resource_utilization(node_memory_MemFree_bytes, node_memory_MemTotal_bytes, {instance=~"$node:$port",job=~"$job"})

Now the template function my_resource_utilization may be used for monitoring arbitrary resources - memory, CPU, network, storage, you name it.
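
One way to picture a template function is as an ordinary function that returns query text. The Python sketch below mirrors my_resource_utilization under that assumption. It is an illustration only, not the VictoriaMetrics implementation, and the disk example reuses the filesystem metric names purely for demonstration.

```python
def my_resource_utilization(free: str, limit: str, filters: str) -> str:
    """Build the expanded query text for the given metric names and filters."""
    return f"({limit}{filters} - {free}{filters}) / {limit}{filters} * 100"


FILTERS = '{instance=~"$node:$port",job=~"$job"}'

# The same template serves different resources.
memory = my_resource_utilization(
    "node_memory_MemFree_bytes", "node_memory_MemTotal_bytes", FILTERS
)
disk = my_resource_utilization(
    "node_filesystem_avail_bytes", "node_filesystem_size_bytes", FILTERS
)

print(memory)
print(disk)
```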

Since the resource utilization metric is frequently used in monitoring, we made it a built-in template and gave it a short name, ru, so you can use it without a WITH section. Try expanding the following query at the top of this page:

ru(node_filesystem_avail_bytes, node_filesystem_size_bytes)

Unlike my_resource_utilization, ru() doesn't accept filters. This makes it possible to pass arbitrary expressions as its arguments, as in the following query:

# Calculate network utilization
WITH (
    maxRate = 1e9 / 8, # Gigabit network
    commonFilters = {instance=~"$node:$port",job=~"$job"},
    networkUtilization(bytesTotal, maxRate) = ru(maxRate - rate(bytesTotal{commonFilters}[5m]), maxRate)
) networkUtilization(node_network_receive_bytes_total, maxRate)
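
The arithmetic behind this query can be checked with a few numbers. The Python sketch below assumes the simplified formula (limit - free) / limit * 100 from my_resource_utilization. The built-in ru may apply extra handling that is not shown here, and the receive rate is a made-up sample value.

```python
def ru(free: float, limit: float) -> float:
    """Simplified resource utilization in percent (assumed formula)."""
    return (limit - free) / limit * 100


MAX_RATE = 1e9 / 8  # gigabit link expressed in bytes per second
receive_rate = 25_000_000.0  # hypothetical rate(...) result, bytes per second

# "Free" bandwidth is whatever the current rate leaves unused.
utilization = ru(MAX_RATE - receive_rate, MAX_RATE)
print(f"{utilization:.1f}%")
```

Passing maxRate - rate(...) as the free argument is what turns a "free resource" formula into a "used bandwidth" percentage.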

The query above contains comments starting with #. Comments may help humans to understand complex queries better.

There is yet another built-in template function: ttf(freeResources). It estimates the time in seconds until the given freeResources value reaches zero. For instance, the following query may help with capacity planning for disk space:

ttf(free_disk_space)

ttf uses additional functions from extended PromQL, which are available only in VictoriaMetrics, so the expanded query won't work in Prometheus.
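
The idea behind such an estimate can be sketched as a linear extrapolation: divide the current free amount by the speed at which it shrinks. The Python function below is an assumption-laden illustration with a hypothetical name (time_to_zero). The built-in ttf relies on VictoriaMetrics-specific functions and is not defined this way.

```python
def time_to_zero(free_now: float, free_before: float, interval_seconds: float) -> float:
    """Linearly extrapolate the seconds left until the free amount hits zero.

    Returns infinity when the free amount is not decreasing.
    """
    slope = (free_now - free_before) / interval_seconds  # units per second
    if slope >= 0:
        return float("inf")
    return free_now / -slope


# Hypothetical sample: 120 GB free an hour ago, 100 GB free now.
estimate = time_to_zero(free_now=100e9, free_before=120e9, interval_seconds=3600)
print(f"{estimate / 3600:.1f} hours left")
```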

Let's take another nice query from the Node Exporter Full dashboard:

(((count(count(node_cpu_seconds_total{instance=~"$node:$port",job=~"$job"}) by (cpu))) - avg(sum by (mode)(irate(node_cpu_seconds_total{mode='idle',instance=~"$node:$port",job=~"$job"}[5m])))) * 100) / count(count(node_cpu_seconds_total{instance=~"$node:$port",job=~"$job"}) by (cpu))

Do you understand what this mess does? Is it manageable? :) WITH templates are happy to help in a few iterations.

1. Extract commonFilters:

WITH (
    commonFilters = {instance=~"$node:$port",job=~"$job"}
) (((count(count(node_cpu_seconds_total{commonFilters}) by (cpu))) - avg(sum by (mode)(irate(node_cpu_seconds_total{mode='idle',commonFilters}[5m])))) * 100) / count(count(node_cpu_seconds_total{commonFilters}) by (cpu))

2. Extract count(count(...) by (cpu)):

WITH (
    commonFilters = {instance=~"$node:$port",job=~"$job"},
    cpuCount = count(count(node_cpu_seconds_total{commonFilters}) by (cpu))
) ((cpuCount - avg(sum by (mode)(irate(node_cpu_seconds_total{mode='idle',commonFilters}[5m])))) * 100) / cpuCount

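3. Extract the irate part. Since the mode='idle' filter leaves only one mode, avg(sum by (mode)(...)) averages a single value and can be written as sum(...). It is clear now that this part calculates the number of idle CPUs: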

WITH (
    commonFilters = {instance=~"$node:$port",job=~"$job"},
    cpuCount = count(count(node_cpu_seconds_total{commonFilters}) by (cpu)),
    cpuIdle = sum(irate(node_cpu_seconds_total{mode='idle',commonFilters}[5m]))
) ((cpuCount - cpuIdle) * 100) / cpuCount

4. Use the ru function:

WITH (
    commonFilters = {instance=~"$node:$port",job=~"$job"},
    cpuCount = count(count(node_cpu_seconds_total{commonFilters}) by (cpu)),
    cpuIdle = sum(irate(node_cpu_seconds_total{mode='idle', commonFilters}[5m]))
) ru(cpuIdle, cpuCount)

5. Put node_cpu_seconds_total{commonFilters} into its own template:

WITH (
    cpuSeconds = node_cpu_seconds_total{instance=~"$node:$port",job=~"$job"},
    cpuCount = count(count(cpuSeconds) by (cpu)),
    cpuIdle = sum(irate(cpuSeconds{mode='idle'}[5m]))
) ru(cpuIdle, cpuCount)

Now the query is much clearer than the initial one.
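
The steps above can be sanity-checked with sample numbers. The Python sketch below uses made-up per-CPU idle rates and the simplified (limit - free) / limit * 100 formula, which is an assumption about ru. It compares the arithmetic of the original expression with the refactored one.

```python
# Hypothetical irate(...) results for mode='idle', one value per CPU core.
idle_rate_by_cpu = {"0": 0.75, "1": 0.5, "2": 1.0, "3": 0.25}

# count(count(...) by (cpu)): the number of distinct CPUs.
cpu_count = len(idle_rate_by_cpu)

# Original form: avg(sum by (mode)(...)). Only the 'idle' mode exists here,
# so the average is taken over a single sum.
sum_by_mode = {"idle": sum(idle_rate_by_cpu.values())}
avg_of_sums = sum(sum_by_mode.values()) / len(sum_by_mode)

# Refactored form: a plain sum(...), named cpuIdle in the query.
cpu_idle = sum(idle_rate_by_cpu.values())

original = ((cpu_count - avg_of_sums) * 100) / cpu_count
simplified = (cpu_count - cpu_idle) / cpu_count * 100

print(original, simplified)
```

Both forms give the same CPU utilization for this sample, which is why the shorter version is a safe replacement when only the idle mode is selected.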

WITH templates may be nested and may be put anywhere. Try expanding the following query:

WITH (
    f(a, b) = WITH (
        f1(x) = b-x,
        f2(x) = x+x
    ) f1(a)*f2(b)
) f(foo, with(x=bar) x)
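
Nested templates behave much like nested functions in a programming language: the inner templates can see the arguments of the outer one. The Python sketch below mirrors the query above with closures that build query text. It is an illustration only, and the exact formatting of the query expanded by VictoriaMetrics may differ.

```python
def f(a: str, b: str) -> str:
    """Mirror of the outer template f(a, b) with two nested templates."""

    def f1(x: str) -> str:
        # The nested template refers to the outer argument b.
        return f"({b} - {x})"

    def f2(x: str) -> str:
        return f"({x} + {x})"

    return f"{f1(a)} * {f2(b)}"


# with(x=bar) x is an inline WITH expression that evaluates to bar.
inner = "bar"
result = f("foo", inner)
print(result)
```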

WITH nesting may help when copy-pasting complex functions from trusted sources such as Stack Overflow.

How to use all this stuff?

It is already available in VictoriaMetrics out of the box. Just start using VictoriaMetrics as long-term remote storage for Prometheus and explore your metrics in Grafana via the standard Prometheus datasource.
Prometheus continues writing all the metrics into its local storage after remote storage is added to its config, so it is safe to try VictoriaMetrics at any time.