ETS Match Specification Pitfall: Maps

:: erlang, programming

During my work, I had a problem where I needed to efficiently write tree-shaped data. I also needed the ability to query ranges of the data. I reached for ETS tables, using the ordered_set table setting, and relied on match specifications to read ranges from the table.

ETS tables can have different storage backends. See the documentation for ets:new/2 for details. These backends reside within in a union type.

In many cases, an ETS match specification will scan a whole table. I specifically wanted to avoid this. The documentation for ets:match/2 says this:

If the key is specified in the pattern, the match is very efficient. If the key is not specified, that is, if it is a variable or an underscore, the entire table must be searched. The search time can be substantial if the table is very large.

I learned of an interesting caveat to the above statement; When using ordered_set tables, matching can be made more efficient by specifying a partial key!

(Note: Code snippets are based on this commit of erlang/otp)

enum ms_key_boundness {
    /* Order significant, larger means more "boundness" => less iteration */
    MS_KEY_UNBOUND           = 0,
    MS_KEY_PARTIALLY_BOUND   = 1,
    MS_KEY_BOUND             = 2,
    MS_KEY_IMPOSSIBLE        = 3
};

/*
 * This structure is filled in by analyze_pattern() for the select
 * functions.
 */
struct mp_info {
    enum ms_key_boundness key_boundness;
    Eterm least;                /* The lowest matching key (possibly
                                 * partially bound expression) */
    Eterm most;                 /* The highest matching key (possibly
                                 * partially bound expression) */
    Binary *mp;                 /* The compiled match program */
};

(source)

The analyze_pattern() function determines the ‘boundness’ of the key, and will recursively calculate the least and greatest values that can match the provided partial key.

One interesting bit of the code that calculates key boundness is as follows:

else if (key != am_Underscore && db_is_variable(key) < 0 && !db_has_map(key)) {
    *keyp = key;
    return MS_KEY_PARTIALLY_BOUND;
}

(source)

If the key contains a map, it cannot be considered fully bound. Consider this comment in the db_is_fully_bound() function:

/* Like in Erlang code, "literal" maps in a pattern match any
* map that has the given elements, so they must be considered
* variable. */

(source)

The explanation is intuitive, but the ramifications are important. If you have an ordered_set ETS table, consider changing out maps (or Elixir structs) for tuples and records instead.

I’ve encountered this issue at least twice now. By being clever about the types and ordering of ETS key terms, one can massively improve the performance of ets:select/2 calls, even when the ETS tables contain thousands of rows.


Tag: erlang