public-inbox-extindex-format - Man Page

external index format description

Description

The extindex is an index-only evolution of the per-inbox SQLite and Xapian indices used by public-inbox-v2-format(5) and public-inbox-v1-format(5).  It exists to facilitate searches across multiple inboxes as well as to reduce index space when messages are cross-posted to several existing inboxes.

It transparently indexes messages across any combination of v1 and v2 inboxes and data about inboxes themselves.

Directory Layout

While inspired by v2, there is no git blob storage nor msgmap.sqlite3 DB.

Instead, there is an ALL.git (all caps) git repo which treats every indexed v1 inbox or v2 epoch as a git alternate.

As with v2 inboxes, it uses over.sqlite3 and Xapian "shards" for WWW and IMAP use.  Several exclusive new tables are added to deal with "Xref3 Deduplication" and metadata.

Unlike v1 and v2 inboxes, it is NOT designed to map to a NNTP newsgroup.  Thus it lacks msgmap.sqlite3 to enforce the unique Message-ID requirement of NNTP.

Index Overview and Definitions

  $SCHEMA_VERSION - DB schema version (for Xapian)
  $SHARD - Integer starting with 0 based on parallelism

  foo/                              # "foo" is the name of the index
  - ei.lock                         # lock file to protect global state
  - ALL.git                         # empty, alternates for inboxes
  - ei$SCHEMA_VERSION/$SHARD        # per-shard Xapian DB
  - ei$SCHEMA_VERSION/over.sqlite3  # overview DB for WWW, IMAP
  - ei$SCHEMA_VERSION/misc          # misc Xapian DB

File and directory names are intentionally different from analogous v2 names to ensure extindex and v2 inboxes can easily be distinguished from each other.

Xref3 Deduplication

Due to cross-posted messages being the norm in the large Linux kernel development community and Xapian indices being the primary consumer of storage, it makes sense to deduplicate indexing as much as possible.

The internal storage format is based on the NNTP "Xref" tuple, but with the addition of a third element: the git blob OID. Thus the triple is expressed in string form as:

        $NEWSGROUP_NAME:$ARTICLE_NUM:$OID

If no newsgroup is configured for an inbox, the inboxdir of the inbox is used.

This data is stored in the xref3 table of over.sqlite3.

misc XAPIAN DB

In addition to the numeric Xapian shards for indexing messages, there is a new, in-development Xapian index for storing data about inboxes themselves and other non-message data.  This index allows us to speed up operations involving hundreds or thousands of inboxes.

Benefits

In addition to providing cross-inbox search capabilities, it can also replace per-inbox Xapian shards (but not per-inbox over.sqlite3).  This allows reduction in disk space, open file handles, and associated memory use.

Caveats

Relocating v1 and v2 inboxes on the filesystem will require extindex to be garbage-collected and/or reindexed.

Configuring and maintaining stable newsgroup names before any messages are indexed from every inbox can avoid expensive reindexing and rely exclusively on GC.

Locking

flock(2) locking exclusively locks the empty ei.lock file for all non-atomic operations.

Thanks

Thanks to the Linux Foundation for sponsoring the development and testing.

See Also

public-inbox-v2-format(5)

Referenced By

lei-add-external(1), lei-store-format(5), public-inbox-extindex(1), public-inbox-glossary(7), public-inbox-index(1).

1993-10-02 public-inbox.git public-inbox user manual