tabjson.3am - Man Page

direct processing of JSON streams with GAWK

Synopsis

@load "tabjson"

BEGIN { JPAT[1] = "/path/1"; JPAT[2] = "/path/2"; ...; JPAT[n] = "/path/n" }
BEGIN { JUNWIND = n2 }
# rules with J[1], J[2], ..., J[n]
# rules with J[n2], JUINDEX, and JULENGTH

tabjson::sprint()

Description

Tabjson works by providing a tabular view to a stream of line-delimited JSON objects. The "columns" of the virtual table are declared by filling the JPAT array variable. Each element in this array is an expression similar to JSON pointer expressions as specified in RFC6901. The sole difference between JSON pointer expressions and tabjson expressions is that the latter uses 1-indexed array access to respect AWK conventions.

Given the following input:

{ "name": "Beth", "pay_rate": 4.00, "hours_worked": 0 }
{ "name": "Dan", "pay_rate": 3.75, "hours_worked": 0 }
{ "name": "Kathy", "pay_rate": 4.00, "hours_worked": 10 }
{ "name": "Mark", "pay_rate": 5.00, "hours_worked": 20 }
{ "name": "Mary", "pay_rate": 5.50, "hours_worked": 22 }
{ "name": "Susie", "pay_rate": 4.25, "hours_worked": 18 }

And the following AWK code:

BEGIN {
    JPAT[1] = "/name"
    JPAT[2] = "/pay_rate"
    JPAT[3] = "/hours_worked"
}
J[3] > 0 { print J[1], J[2] * J[3] }

The output will be:

Kathy 40
Mark 100
Mary 121
Susie 76.5

JPAT is similar to FPAT. What happened here is that we taught tabjson how to find the field contents. Behind the scenes, a decision tree is built when JPAT changes before a call to getline. This decision tree will instruct tabjson which element in the J array must be updated as it encounters values in the JSON token stream.

In other words, the algorithm is one-pass, but also robust. The algorithm is not sensitive to the order of fields in the input JSON. Given the following input:

{ "foo": 1, "bar": 2 }
{ "bar": 3, "foo": 4 }
{ "foo": 5 }
{ "bar": 6 }

And the following AWK code:

BEGIN {
    JPAT[1] = "/foo"
    JPAT[2] = "/bar"
}
{ print "foo=" J[1], "bar=" J[2] }

The output will be:

foo=1 bar=2
foo=4 bar=3
foo=5 bar=
foo= bar=6

Besides arbitrary ordering (as it should be given JSON mandates no ordering), tabjson also supports arbitrary nesting for extracted fields (cf. RFC6901 for full path specification).

Updating JSON documents

You can update the JSON document in the current record just by changing the values in the J array variable. When you’re ready to serialize a new JSON document, just call tabjson::sprint(). This function returns the new serialized JSON document.

Given the following input:

{ "name": "Beth", "pay_rate": 4.00, "hours_worked": 0 }
{ "name": "Dan", "pay_rate": 3.75, "hours_worked": 0 }
{ "name": "Kathy", "pay_rate": 4.00, "hours_worked": 10 }
{ "name": "Mark", "pay_rate": 5.00, "hours_worked": 20 }
{ "name": "Mary", "pay_rate": 5.50, "hours_worked": 22 }
{ "name": "Susie", "pay_rate": 4.25, "hours_worked": 18 }

And the following AWK code:

BEGIN {
    JPAT[1] = "/hours_worked"
    JPAT[2] = "/pay_rate"
}
J[1] > 15 { J[2] += 0.5 }
{ print tabjson::sprint() }

The output will be:

{ "name": "Beth", "pay_rate": 4.00000000000000, "hours_worked": 0.00000000000000 }
{ "name": "Dan", "pay_rate": 3.75000000000000, "hours_worked": 0.00000000000000 }
{ "name": "Kathy", "pay_rate": 4.00000000000000, "hours_worked": 10.0000000000000 }
{ "name": "Mark", "pay_rate": 5.50000000000000, "hours_worked": 20.0000000000000 }
{ "name": "Mary", "pay_rate": 6.00000000000000, "hours_worked": 22.0000000000000 }
{ "name": "Susie", "pay_rate": 4.75000000000000, "hours_worked": 18.0000000000000 }

This process is cheap as it only performs string concatenation from the offsets stored during the parsing phase. In other words, only the changed parts of the JSON document need to be re-serialized. This strategy allows you to perform a small number of modifications in big JSON documents pretty efficiently. Also unreferenced fields suffer no loss of representation (e.g. numerical values not mentioned in JPAT are fully preserved bit-by-bit).

Tabjson will also preserve boolean types. If a value was originally boolean and you set J[n] to 1 or 0, the field will stay a boolean. However any numeric value other than 1 or 0 will coerce the type to a number.

Values deleted from J are converted to the null JSON token.

Unwinding arrays

Another feature from tabjson is the ability to emit multiple AWK records for a single JSON document that contains an array field. This feature was inspired by the MongoDB aggregation framework’s $unwind operator. Just specify the array field to flatten by filling the JUNWIND variable. For instance, given the following data set:

{ "_id" : "jane", "joined" : "2011-03-02", "likes" : ["golf", "racquetball"] }
{ "_id" : "joe", "joined" : "2012-07-02", "likes" : ["tennis", "golf", "swimming"] }

And the following AWK program:

BEGIN {
    JPAT[1] = "/likes"
    JUNWIND = 1 #< index in JPAT array
}
{ print NR, tabjson::sprint() }

The output will be:

1 { "_id" : "jane", "joined" : "2011-03-02", "likes" : "golf" }
2 { "_id" : "jane", "joined" : "2011-03-02", "likes" : "racquetball" }
3 { "_id" : "joe", "joined" : "2012-07-02", "likes" : "tennis" }
4 { "_id" : "joe", "joined" : "2012-07-02", "likes" : "golf" }
5 { "_id" : "joe", "joined" : "2012-07-02", "likes" : "swimming" }

And this other AWK program:

BEGIN {
    JPAT[1] = "/likes"
    JUNWIND = 1
}
{ likes[J[1]] += 1 }
END {
    PROCINFO["sorted_in"] = "@val_num_desc"
    for (like in likes) {
        print like, likes[like]
    }
}

Will output:

golf 2
tennis 1
swimming 1
racquetball 1

As each document is synthesized for that array element, the variable JUINDEX will hold the index from that element in the array. If that field wasn’t an array and the document is just being passed through, JUINDEX has the value 0. If JUINDEX's value is different than 0, JULENGTH is also set with the array length. These two variables let you delimit array elements that have origin in the same document.

Using JUINDEX and JULENGTH it’s also easy to create rules that compute means or only apply to the last element of an array, for instance.

Examples

Skip pong messages in certain journalD logs

journalctl SYSLOG_IDENTIFIER=my-app -o json -f | gawk -l jsontab -e '
    BEGIN { JPAT[1] = "/MESSAGE/heartbeat_ts" }
    1 in J { next }
    1'

Convert FEBE’s cookie backup format to curl’s cookiejar format

@load "tabjson"

BEGIN {
    JPAT[EXPIRES            = 1] = "/expires"
    JPAT[DOMAIN             = 2] = "/host"
    JPAT[INCLUDE_SUBDOMAINS = 3] = "/isDomain"
    JPAT[HTTPS_ONLY         = 4] = "/isSecure"
    JPAT[NAME               = 5] = "/name"
    JPAT[PATH               = 6] = "/path"
    JPAT[VALUE              = 7] = "/value"

    OFS="\t"

    print "# Netscape HTTP Cookie File"
    print "# http://softwarebychuck.com/febe/febe.html"
}
{
    print J[DOMAIN],
        J[INCLUDE_SUBDOMAINS] ? "TRUE" : "FALSE",
        J[PATH],
        J[HTTPS_ONLY] ? "TRUE" : "FALSE",
        J[EXPIRES],
        J[NAME],
        J[VALUE]
}

Limitations

Shared references

It’s not possible to reference the same field twice as the algorithm employed by tabjson works in a decode-at-most-once fashion.

# will raise an error
JPAT[1] = "/extra"
JPAT[2] = "/extra/previous_company"

Gaps in JPAT

JPAT must be an array of elements without gaps from 1 to N.

# wrong
JPAT[1] = "/foo"
JPAT[3] = "/bar"

Features that haven’t been implemented yet

  • Support for non-line delimited JSON streams.
  • MPFR support.
  • Respect OFMT.
  • Non-UTF8 locales.
  • New field insertion.

See Also

Author

Vinícius dos Santos Oliveira

Info

2021-06-09 tabjson 0.1.0 GNU Awk Extension Modules