      GT,R,dis  : GT discount for r <= R, where r is the frequency of an n-gram.
                  Linear discount for those r > R, i.e. r'=r*dis
                  0 << dis < 1.0, for example 0.999
      ABS,[dis] : Absolute discount r'=r-dis. dis is optional.
                  0 << dis < cut[k]+1.0, normally dis < 1.0.
      LIN,[dis] : Linear discount r'=r*dis. dis is optional.
                  0 < dis < 1.0

Note

-n must be given before -c and -b. -c must supply the right number of cut-offs, and -d must appear exactly N times, specifying the discounts for 1-gram, 2-gram, ..., N-gram respectively.
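The ordering and count rules above can be checked before running the tool. The following Python sketch is only an illustration: check_slmbuild_args is a hypothetical helper, it looks only at the short options -n, -c, -b and -d, and it assumes that "the right number of cut-offs" means exactly N comma-separated values, as in the example command.

```python
import shlex


def check_slmbuild_args(argv):
    """Return a list of rule violations for a list of slmbuild arguments."""
    errors = []
    n = None
    cutoffs = None
    discounts = []
    i = 0
    while i < len(argv):
        opt = argv[i]
        value = argv[i + 1] if i + 1 < len(argv) else None
        if opt == "-n" and value is not None:
            n = int(value)
            i += 2
        elif opt in ("-c", "-b") and value is not None:
            # -n must already have been seen at this point.
            if n is None:
                errors.append(f"{opt} given before -n")
            if opt == "-c":
                cutoffs = value.split(",")
            i += 2
        elif opt == "-d" and value is not None:
            discounts.append(value)
            i += 2
        else:
            i += 1  # other options and file names are ignored here
    if n is None:
        errors.append("-n is missing")
        return errors
    if cutoffs is None or len(cutoffs) != n:
        errors.append(f"-c must list exactly {n} cut-offs")
    if len(discounts) != n:
        errors.append(f"-d must appear exactly {n} times")
    return errors


command = ("slmbuild -l -n 3 -o all.slm -w 200000 -c 0,3,2 "
           "-d GT,8,0.9995 -d ABS -d ABS -b 10,11,12 -e 9 all.id3gram")
problems = check_slmbuild_args(shlex.split(command))
```

An empty list means the three rules hold; each string in a non-empty list names one violated rule.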

BREAKER-IDs could be SentenceTokens or ParagraphTokens. Conceptually, these ids have no meaning when they appear in the middle of an n-gram.

EXCLUDE-IDs could be ambiguous ids. Conceptually, n-grams that contain those ids are meaningless.

We cannot erase n-grams according to BREAKER-IDs and EXCLUDE-IDs directly from the IDNGRAM file, because some of the low-level information in it is still useful.
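One literal reading of the two rules above is sketched below in Python. The helper name is hypothetical, and treating "the middle" as any position that is neither first nor last is an assumption; slmbuild's actual handling may differ. The check is meant to be applied while counting, leaving the IDNGRAM file itself untouched.

```python
def is_meaningful(ngram, breaker_ids, exclude_ids):
    """Return False for n-grams that the two rules treat as meaningless.

    Assumption: a breaker id only invalidates an n-gram when it sits at
    an interior position (neither the first nor the last word id).
    """
    # Any excluded id makes the whole n-gram meaningless.
    if any(word_id in exclude_ids for word_id in ngram):
        return False
    # A breaker id strictly inside the n-gram spans a sentence boundary.
    interior = ngram[1:-1]
    return not any(word_id in breaker_ids for word_id in interior)


breakers = {10, 11, 12}
excludes = {9}
kept = [g for g in [(5, 10, 6), (10, 5, 6), (5, 9, 6), (5, 6, 7)]
        if is_meaningful(g, breakers, excludes)]
```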

Example

The following example reads 'all.id3gram' and writes the trigram model 'all.slm'.

At the 1-gram level, use Good-Turing discount with cut-off 0, R=8, dis=0.9995. At the 2-gram level, use Absolute discount with cut-off 3, dis calculated automatically. At the 3-gram level, use Absolute discount with cut-off 2, dis calculated automatically. Word ids 10, 11 and 12 are breakers (sentence/paragraph/paper breakers, etc.). The exclude id is 9. The lexicon contains 200000 words. The resulting language model uses -log(pr).

slmbuild -l -n 3 -o all.slm -w 200000 -c 0,3,2 -d GT,8,0.9995 -d ABS -d ABS -b 10,11,12 -e 9 all.id3gram

Author

Originally written by Phill.Zhang <phill.zhang@sun.com>. Currently maintained by Kov.Chai <tchaikov@gmail.com>.

Info

2026-01-17 perl v5.42.0 User Contributed Perl Documentation
