fapolicyd.timing - Man Page
fapolicyd manual decision timing report
Description
The /run/fapolicyd/fapolicyd.timing file contains one manual decision timing run from fapolicyd(8). The daemon writes this file when a privileged manual timing stop request ends an armed timing run. A stop request while timing is not armed writes a short status report instead of timing data. The daemon may also stop collection and write the report when a timing counter would overflow.
Timing collection is disabled by default. Manual timing requests are honored only when the active daemon configuration has timing_collection=manual in fapolicyd.conf(5). The command fapolicyd-cli --timing-start arms a run, and fapolicyd-cli --timing-stop stops the run and prints this report. --timer-start and --timer-stop are accepted aliases.
The daemon stores aggregate counters only. It does not store one timing record per decision. Each decision worker owns local aggregate blocks containing a call count, total time, maximum time, and fixed latency buckets for each stage. The report snapshots and aggregates those blocks.
Run Summary
The first lines summarize the run:
- Mode
Active timing collection mode.
- Timing run
Wall-clock start and stop time.
- Duration
Elapsed wall-clock run time in H:MM:SS format.
- Workers
Number of decision worker timing blocks included in the snapshot.
- Max queue depth
Highest internal event queue depth observed during the timing run. The daemon temporarily resets only the max queue depth counter when timing starts and restores the previous high-water mark after timing stops if the previous value was larger. The Queueing section repeats this value next to queue wait timing.
- Decisions
Number of completed fanotify decisions timed during the run.
- Throughput
Timed decisions per second over the wall-clock run duration. This includes time when fapolicyd was armed but no events were waiting to be decided.
- Active decision rate
Timed decisions per second using the sum of decision:total latency instead of wall-clock duration. With more than one decision worker, this is worker-time, not elapsed wall-clock time.
- Stop reason
Printed only when the run stopped automatically because a timing counter would overflow.
TL;DR
The TL;DR section appears immediately after the run summary and before Overall decision latency . It surfaces two or three dominant findings drawn from the same data used by the later Derived observations section, such as helper dominance, response formatting cost, or queueing health. The detailed observations remain in their original location later in the report.
Overall Decision Latency
The Overall decision latency section describes end-to-end decision latency from the point a decision worker starts processing a dequeued fanotify event until the decision path returns after response handling.
The section prints average and maximum latency using compact latency units, approximate percentile buckets, and cumulative percentages for useful thresholds such as <=50us , <=100us , <=500us , <=1ms , and >10ms . Percentiles are bucket approximations, not exact per-decision percentiles. The 95th percentile is a useful tail-latency marker: roughly 95% of observed calls were at or below that bucket, so a high p95 points to a recurring slow path rather than a single worst-case outlier.
When high-end tail buckets are observed, the section also prints a compact tail line for >10ms , >25ms , >50ms , >100ms , and >250ms . The line includes both count and percentage so long outliers do not disappear inside a single >10ms bucket.
Queueing
The Queueing section describes work waiting before a decision worker starts processing an event. It is based on time_in_queue:total when enqueue timestamps were observed during the timing run.
- avg wait
Average observed time in the internal userspace event queue.
- max wait
Longest observed wait in the internal userspace event queue.
- p95 bucket
Approximate bucket containing the 95th percentile queue wait.
- total queued time
Sum of observed queue wait time. This is useful for spotting producer/consumer imbalance, but it is not decision CPU time.
- max queue depth
Highest internal event queue depth observed during the timing run.
If no enqueue timestamps were observed, the section prints not observed and still reports max queue depth.
Decision Phase Timing
The Decision phase timing section shows the main decision-thread phases when they were observed:
event_build:total evaluation:total response:total
- Phase
Displayed phase name.
- Calls
Number of times the phase was measured.
- Calls/Dec
Average phase calls per timed decision.
- Total
Total measured phase time.
- Avg
Average phase call time.
- Max
Slowest observed phase call.
- p95 bucket
Approximate bucket containing the 95th percentile phase call.
- Notes
Short deterministic notes. For example, a response phase dominated by syslog or debug formatting is marked syslog/debug-heavy .
These phase rows are the best first view for explaining where decision-thread time is being spent. Helper rows are separate because some lazy helper operations can be requested from more than one phase.
Lazy Helper Attribution
The report introduces helper attribution with:
Lazy helper attribution: Helper timings are attributed to the active logical driver: evaluation or response. Combined totals are evaluation + response.
The driver table and combined helper table follow this note.
Lazy Helper Attribution by Driver
The Lazy helper attribution by driver section shows helper time split by the logical context that caused it: evaluation or response.
- Helper
Short helper path.
- Eval total
Total helper time caused while the rule evaluation driver was active.
- Response total
Total helper time caused while response, audit metadata, or syslog/debug formatting was active.
- Combined
Total helper time from all driver contexts.
- Response %
Percentage of combined helper time caused by response-side work. This is useful for deciding whether manual/debug reporting is driving MIME or trust lookup cost.
Combined Lazy Helper Attribution
The Combined lazy helper attribution section groups helper-related stages into stable combined paths. For helpers that are split by driver, the combined row is calculated by adding the phase-specific evaluation and response rows together. These rows answer how often a helper was needed and how expensive that helper was when amortized over all decisions.
- Helper path
Short combined helper path.
- Calls
Number of helper calls measured.
- Calls/Dec
Average helper calls per timed decision.
- Total
Total measured helper time.
- Avg/call
Average measured helper call time.
- Amort/Dec
Total helper time divided by all timed decisions. This shows how much a lazy or rare helper contributes to a typical decision in the measured workload.
- Max
Slowest observed helper call.
- p95 bucket
Approximate bucket containing the 95th percentile helper call.
The current helper groups are:
mime_detection:total mime_detection:fast_classification mime_detection:gather_elf mime_detection:libmagic_fallback trust_db_lookup:total trust_db_lookup:read trust_db_lookup:lock_wait hash_ima:total hash_sha:total proc_detail_lookup
Derived Observations
The Derived observations section prints short deterministic observations when the required data was present in the run. Examples include:
- Wall-clock throughput compared with active decision rate, which can show that the workload was mostly idle while timing was armed.
- Queue wait and maximum queue depth compared with q_size when queue timing and configuration data are available.
- Response formatting share when syslog/debug formatting dominates response:total .
- MIME helper cost and libmagic fallback share.
- Rare but expensive integrity work from hash_ima or hash_sha , reported as percent of decisions, average time when called, and amortized decision cost.
- Trust database lock wait compared with trust database read time.
Stage Tail Summary
When hot detailed stage rows have observations above 10ms, the report prints a Stage tail summary after the detailed table. Rows are sorted by >10ms count, limited to the top five rows, and extended with any additional row that has a nonzero >50ms count. Near-identical parent/child tail rows may be suppressed to keep the section focused.
stage:name: >10ms count/percent, >25ms count/percent, >50ms count/percent
The detailed table remains the canonical total-time ranking; the tail summary only adds high-end resolution for rows that crossed one of the tail thresholds. Thresholds with zero observations are omitted.
Detailed Stage Table
The Detailed stage timing table is sorted by total measured time by default. It answers which measured operations consumed the most total time during the run. This is the detailed view and is intentionally kept below the summary and helper attribution sections.
The decision:total row is included when observed and participates in the same total-time sort as the other rows.
- Stage
Measured operation name. The current stage set is listed in Decision Flow.
- Calls
Number of times the operation was measured.
- Calls/Dec
Average measured calls per timed decision.
- Total
Total measured time for the stage.
- Avg
Average measured call time for the stage.
- Max
Slowest observed call for the stage.
- p95 bucket
Approximate bucket containing the 95th percentile call for the stage.
Stages with zero calls are hidden from the table and listed under Not observed . If a stage is not observed, no measured call to that operation occurred during the timing window. For example, an absent evaluation:trust_db_lookup:total and response:trust_db_lookup:total line means no measured trust database lookup occurred from those drivers in that run.
Decision Flow
Stage names use a colon-separated hierarchy:
phase:operation:child
Rows ending in :total are parent scopes. Child operation timings are useful contributors, but they can be nested, lazy, or shared by more than one code path, so child rows are not expected to add up to the parent total.
The main decision-thread view is:
decision:total time_in_queue:total event_build:total event_build:cache_flush event_build:proc_fingerprint event_build:fd_stat evaluation:lock_wait evaluation:total evaluation:mime_detection:total evaluation:mime_detection:fast_classification evaluation:mime_detection:gather_elf evaluation:mime_detection:libmagic_fallback evaluation:fd_path_resolution evaluation:proc_detail_lookup evaluation:hash_ima:total evaluation:hash_sha:total evaluation:trust_db_lookup:total evaluation:trust_db_lookup:lock_wait evaluation:trust_db_lookup:read response:total response:syslog_debug_format:total response:mime_detection:total response:mime_detection:fast_classification response:mime_detection:gather_elf response:mime_detection:libmagic_fallback response:trust_db_lookup:total response:trust_db_lookup:lock_wait response:trust_db_lookup:read response:audit_metadata:total response:fanotify_write
Lazy helper operations are attributed to the active logical driver when the helper is called. Rule evaluation uses evaluation: , response logging and audit response work use response: . Helper timing outside those logical drivers is treated as an instrumentation bug and is not reported as a separate phase.
Stages
- decision:total
Decision worker latency after a fanotify event has been dequeued, including event build, policy evaluation, optional logging or audit preparation, and response handling. Queue wait time is reported separately.
- time_in_queue:total
Time a fanotify event spent in the internal userspace queue between enqueue by the fanotify reader thread and dequeue by a decision worker. Events already in the queue when timing is armed do not have an enqueue timestamp and are not included in this stage.
- event_build:total
Construction of the internal event from fanotify metadata. This includes subject and object cache lookups, initial subject and object fingerprints, and some pattern-state work.
- event_build:cache_flush
Object cache flush performed on the decision path when the global flush flag is set.
- event_build:proc_fingerprint
Initial /proc/<pid> stat used to identify the subject process and detect stale subject cache entries.
- event_build:fd_stat
Stat of the fanotify file descriptor used to fingerprint the object.
- evaluation:lock_wait
Time spent waiting for the rule lock before policy evaluation.
- evaluation:total
Rule list evaluation loop. Lazy subject, object, trust, hash, or type lookups caused by rule fields can appear as child operation timings and also contribute to this total.
- evaluation:proc_detail_lookup
On-demand subject procfs detail lookups, including /proc/<pid>/status , /proc/<pid>/exe , subject executable type, session id, and auid lookups. These can be driven lazily by rules, syslog/debug formatting, or audit metadata.
- evaluation:fd_path_resolution
Path resolution for the object file descriptor.
For MIME and trust DB helper rows below, * means the driver prefix evaluation , response .
- *:mime_detection:total
Object type and ELF/script detection.
- *:mime_detection:fast_classification
In-house classification before libmagic fallback. This includes empty-file, ELF/script, shebang, common magic number, common text format, and device checks.
- *:mime_detection:gather_elf
ELF/script header scan performed by gather_elf().
- *:mime_detection:libmagic_fallback
Full libmagic fallback used when the faster type checks cannot classify the object.
- evaluation:hash_ima:total
IMA digest collection from the security.ima xattr through get_ima_hash().
- evaluation:hash_sha:total
SHA-style file digest collection through get_hash_from_fd2().
- *:trust_db_lookup:total
Total trust database lookup through check_trust_database(). This includes update-thread lock wait, long-term LMDB read setup, the primary LMDB lookup, optional hash work requested by trust lookup, and the top-level /usr symlink retry path.
- *:trust_db_lookup:lock_wait
Time spent waiting for the trust database update-thread lock before LMDB read setup. This is a future multi-decision-thread contention signal.
- *:trust_db_lookup:read
LMDB read setup and trust database read work after the update-thread lock is held. This currently includes the primary lookup, optional hash work requested by trust lookup, the top-level /usr symlink retry path, and read cleanup.
- response:total
Work after the rule loop has completed. This includes syslog/debug formatting, rule/source bookkeeping, decision metric updates, response selection, audit metadata preparation, fanotify response writes, and descriptor close work.
- response:syslog_debug_format:total
Formatting of syslog or debug output after a decision. Lazy field lookups needed only for logging may appear as nested stage timings.
- response:audit_metadata:total
Preparation of fanotify audit response metadata before the response write. When audit metadata needs object trust, trust lookup time can be nested here.
- response:fanotify_write
The actual write(2) of the fanotify response to the kernel.
Notes
The notes footer reports the largest queued-time contributor when queue wait was observed, the largest helper contributor, the largest decision phase contributor, and the slowest observed row by maximum time.
For cache hit, miss, collision, eviction, subject deferral, and early subject eviction counters, see fapolicyd.metrics(5).
If adding a timing count, total, or bucket counter would overflow, fapolicyd disarms timing, records an overflow stop reason, and writes the timing report instead of allowing counters to wrap and skew the data.
Files
- /run/fapolicyd/fapolicyd.timing
Manual decision timing report.
- /run/fapolicyd/fapolicyd.state
Runtime state report containing timing control state.
- /run/fapolicyd/fapolicyd.metrics
Runtime metrics report containing cache counters.
See Also
fapolicyd(8), fapolicyd-cli(8), fapolicyd.conf(5), fapolicyd.state(5), fapolicyd.metrics(5), and fapolicyd.trust(5).
Referenced By
fapolicyd(8), fapolicyd-cli(8), fapolicyd.conf(5), fapolicyd.metrics(5), fapolicyd.state(5).