pdfgrep - Man Page

search PDF files for a regular expression

Examples (TL;DR)

Find lines that match pattern in a PDF: pdfgrep pattern file.pdf
Include file name and page number for each matched line: pdfgrep [-H|--with-filename] [-n|--page-number] pattern file.pdf
Do a case-insensitive search for lines that begin with file_name and return the first 3 matches: pdfgrep [-m|--max-count] 3 [-i|--ignore-case] '^file_name' file.pdf
Find pattern in files with a .pdf extension in the current directory recursively: pdfgrep [-r|--recursive] pattern
Find pattern in files that match a specific glob in the current directory recursively: pdfgrep [-r|--recursive] --include '*book.pdf' pattern

Synopsis

pdfgrep [OPTION...] PATTERN FILE...
pdfgrep [OPTION...] {-e PATTERN|-f FILE}... FILE...
pdfgrep [OPTION...] -r|-R PATTERN [FILE|DIR...]
pdfgrep [OPTION...] -r|-R {-e PATTERN|-f FILE}... [FILE|DIR...]

Description

Search for PATTERN in each PDF FILE and print matching lines. By default, PATTERN is an extended regular expression.

pdfgrep tries to be mostly compatible with GNU grep, with some PDF-specific distinctions and additional options. Most notably, -n prints page numbers instead of line numbers.

Options

General Information

--help: Print a short summary of the options.
-V, --version: Show version information.

Pattern Interpretation

-F, --fixed-strings: Interpret PATTERN as a list of fixed strings separated by newlines, any of which is to be matched.
-P, --perl-regexp: Interpret PATTERN as a Perl compatible regular expression (PCRE2). See pcre2syntax(3) for a quick overview.

Matching Control

-e PATTERN, --regexp=PATTERN: Use PATTERN as the pattern to search for. If this option is specified multiple times or combined with --file, all patterns are tried in turn until one of them matches.
-f FILE, --file=FILE: Read patterns from FILE, one per line. If FILE contains multiple patterns or if this option is applied multiple times or combined with -e, all patterns are tried in turn until one of them matches. An empty pattern list matches nothing.
-i, --ignore-case: Ignore case distinctions in both the PATTERN and the input files.
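
As a minimal sketch of how -f can be used, the function below writes its pattern arguments to a temporary file, one per line, and passes that file to pdfgrep. The name search_any and its argument order are hypothetical and chosen only for this illustration; mktemp(1) is assumed to be available.

```shell
# Hypothetical helper (not part of pdfgrep): search FILE for any of the
# given patterns by passing them to pdfgrep through a pattern file.
# Usage: search_any FILE PATTERN...
search_any() {
    file=$1
    shift
    patterns=$(mktemp) || return 2
    # One pattern per line, as -f expects.
    printf '%s\n' "$@" > "$patterns"
    pdfgrep -f "$patterns" "$file"
    rc=$?
    rm -f "$patterns"
    # Pass pdfgrep's own exit status back to the caller.
    return "$rc"
}
```

For example, search_any report.pdf 'foo' 'ba[rz]' would print the lines of report.pdf that match either pattern.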

General Output Control

-c, --count

Suppress normal output. Instead print the number of matches for each input file. Note that unlike grep, multiple matches on the same page will be counted individually.

-p, --page-count

Like -c, but prints the number of matches per page. Implies -n.

--color WHEN

Surround file names, page numbers and matched text with escape sequences to display them in color on the terminal. WHEN can be:

always	Always use colors, even when stdout is not a terminal.
never	Do not use colors.
auto	Use colors only when stdout is a terminal (this is the default).

-L, --files-without-match

Suppress normal output. Instead print the name of each input file that doesn’t contain a match. This works well with -Z, but many other output options like -n or -c are ignored when -L is specified.

-l, --files-with-matches

Suppress normal output. Instead print the name of each input file that contains a match. This works well with -Z, but many other output options like -n or -c are ignored when -l is specified.

-m, --max-count NUM

Stop reading a file after NUM matches. When the -c or --count option is also used, pdfgrep does not output a count greater than NUM.

-o, --only-matching

Print only the matched part of a line without any surrounding context.

-q, --quiet

Suppress all normal output to stdout. Exit immediately with exit status 0 if a match is found, even in case of errors. Use this if you only care about the presence of matches, not their number or content.

Line Prefix Control

-H, --with-filename

Print the file name for each match. This is the default setting when there is more than one file to search.

-h, --no-filename

Suppress the prefixing of file name on output. This is the default setting when there is only one file to search.

-n, --page-number[=TYPE]

Prefix each match with the number of the page where it was found.

The optional argument TYPE controls how page numbers are determined. If present, it can be one of the following:

index	The index of the page in the PDF, starting at 1 for the first page. This is the default.
label	The PDF page label. This can be an arbitrary string per page, set by the PDF creator (e.g. roman numerals for the first few pages).

-Z, --null

Output a null byte (called NUL in ASCII and '\0' in C) instead of the colon that usually separates a filename from the rest of the line. This option makes the output unambiguous in the presence of colons, spaces or newlines in the filename. It can be used in conjunction with commands such as xargs -0 or perl -0.

--match-prefix-separator SEP

Change the colon used to separate filename, page number and text in the output to SEP, which can be an arbitrary string. This is useful when filenames contain colons, but only for interactive usage. For scripting, --null should be used.

Context Control

-A NUM, --after-context=NUM: Print NUM lines of context after matching lines. Contiguous groups of matches are separated by a line containing --. With -o, this option has no effect.
-B NUM, --before-context=NUM: Print NUM lines of context before matching lines. Contiguous groups of matches are separated by a line containing --. With -o, this option has no effect.
-C NUM, --context=NUM: Print NUM lines of context before and after matching lines. Contiguous groups of matches are separated by a line containing --. With -o, this option has no effect.

File Selection

-r, --recursive: Recursively search all files (restricted by --include and --exclude) under each directory, following symlinks only if they are on the command line.
-R, --dereference-recursive: Same as -r, but follows all symlinks.
--exclude=GLOB: Skip files whose base name matches GLOB. See glob(7) for wildcards you can use. You can use this option multiple times to exclude more patterns. It takes precedence over --include. Note that includes and excludes apply only to files found via --recursive and not to files named in the argument list.
--include=GLOB: Only search files whose base name matches GLOB. See --exclude for details. The default is *.[Pp][Dd][Ff].

Other Options

--cache

Use a cache for the rendered text to speed up the operation on large files.

--password=PASSWORD

Use PASSWORD to decrypt the PDF files. Can be specified multiple times; all passwords will be tried on all PDFs. Note that this password will show up in your command history and in the output of ps(1), so do not use this option if the security of PASSWORD is important.

--page-range=RANGE

Limit search to a specified set of pages. RANGE is a comma separated list of either a single page number or a range expression of the form PAGE1-PAGE2. Example: 2-3,5,7-10.

--debug

Enable debug output. Note: Due to limitations of poppler before version 0.30.0, some debug output is also printed without --debug when using such a poppler version.

--warn-empty

Print a warning to stderr if a PDF contains no searchable text. This is the case for PDFs that consist only of images, for example scanned documents.

--unac

Remove accents and ligatures from both the search pattern and the PDF documents. This is useful if you want to search for a word containing "ae", but the PDF uses the single character "æ" instead. See unac(3) and unaccent(1) for details.

This option is experimental and only available if pdfgrep is compiled with unac support.
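
To illustrate --page-range, the following sketch wraps pdfgrep in a small function that limits the search to a given set of pages and prefixes each match with its page number. The function name search_pages and its argument order are hypothetical.

```shell
# Hypothetical helper (not part of pdfgrep): search only the pages in RANGE.
# Usage: search_pages RANGE PATTERN FILE
search_pages() {
    # $1 is a comma-separated list of pages or PAGE1-PAGE2 ranges,
    # for example 2-3,5,7-10.
    pdfgrep -n --page-range="$1" -e "$2" "$3"
}
```

For example, search_pages 2-3,5,7-10 pattern foo.pdf searches pages 2, 3, 5 and 7 through 10 of foo.pdf.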

Exit Status

Normally, the exit status is 0 if at least one match is found, 1 if no match is found and 2 if an error occurred. But if the --quiet or -q option is used and a match was found, pdfgrep will return 0 regardless of errors.
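
A script can branch on these exit codes. The sketch below uses -q so that only the exit status is consulted; the function name pdf_has_match and the messages it prints are hypothetical. Because -q returns 0 as soon as a match is found, an error in another input file may go unreported.

```shell
# Hypothetical helper (not part of pdfgrep): report whether FILE contains
# a match for PATTERN, based only on pdfgrep's exit status.
# Usage: pdf_has_match PATTERN FILE
pdf_has_match() {
    pdfgrep -q -e "$1" "$2"
    rc=$?
    case "$rc" in
        0) echo "match" ;;
        1) echo "no match" ;;
        *) echo "error" ;;    # 2 indicates that an error occurred
    esac
    return "$rc"
}
```

For example, pdf_has_match pattern foo.pdf && echo "found it" prints "match" followed by "found it" when foo.pdf contains the pattern.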

Environment Variables

The behavior of pdfgrep is affected by the following environment variable.

GREP_COLORS: Specifies the colors and other attributes used to highlight various parts of the output. The syntax and values are like GREP_COLORS of grep. See grep(1) for more details. Currently only the capabilities mt, ms, mc, fn, ln and se are used by pdfgrep, where mt, ms and mc have the same effect.
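
As a sketch of how GREP_COLORS might be set for a single invocation, the function below applies one color scheme and forces colored output. The function name pdfgrep_colored and the chosen SGR values are illustrative assumptions; the capability names come from the list above, and their values follow the GREP_COLORS syntax described in grep(1).

```shell
# Hypothetical wrapper (not part of pdfgrep): run pdfgrep with a fixed
# color scheme. The SGR values are arbitrary choices for illustration:
#   ms=01;32  matched text in bold green
#   fn=35     file names in magenta
#   ln=34     the "ln" capability (line numbers in grep) in blue
#   se=36     separators in cyan
pdfgrep_colored() {
    GREP_COLORS='ms=01;32:fn=35:ln=34:se=36' pdfgrep --color=always -n "$@"
}
```

For example, pdfgrep_colored pattern foo.pdf | less -R keeps the colors when paging, since --color=always emits escape sequences even when stdout is not a terminal.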

Files

${XDG_CACHE_HOME}/pdfgrep/*: Cache files written and used when --cache is enabled. At most 200 cache entries older than a day are retained.

Examples

Print the first ten lines matching pattern and print their page number:

pdfgrep -n --max-count 10 pattern foo.pdf

Search all .pdf files whose names begin with foo recursively in the current directory:

pdfgrep -r --include "foo*.pdf" pattern

Search all PDFs in the current directory for foo that also contain bar:

pdfgrep -Z --files-with-matches "bar" *.pdf | xargs -0 pdfgrep -H foo

Search all .pdf files that are smaller than 12M recursively in the current directory:

find . -name "*.pdf" -size -12M -print0 | xargs -0 pdfgrep pattern

Note that in contrast to the previous examples, this task cannot be solved with pdfgrep alone; the Unix tools find(1) and xargs(1) are needed because pdfgrep itself has no option to exclude files by size. As the example shows, it does not need one.

Search all .pdf files in the current directory in parallel on a multicore CPU:

find . -name "*.pdf" -print0 | parallel -q0 pdfgrep -H foobar

This uses GNU parallel(1) in addition to find(1) to search multiple files in parallel on multicore processors. Doing this can lead to a good speedup if you have multiple files to search and an underused CPU.