[Search for users] [Overall Top Noters] [List of all Conferences] [Download this site]

Conference thebay::joyoflex

Title:The Joy of Lex
Notice:A Notes File even your grammar could love
Moderator:THEBAY::SYSTEM
Created:Fri Feb 28 1986
Last Modified:Mon Jun 02 1997
Last Successful Update:Fri Jun 06 1997
Number of topics:1192
Total number of notes:42769

333.0. "Hyphenation algorithms?" by DRCS::GODWIN () Wed Mar 11 1987 13:01

    Typesetting systems often include a hyphenation facility using
    hyphenation dictionaries, logical hyphenation or a combination
    of both.
    
    I am curious about "logical hyphenation". What are the typical rules
    that form the basis of such a system? 

    Peter
    
T.RTitleUserPersonal
Name
DateLines
333.1You'll be sorry you asked!DRAGON::MCVAYPete McVay, VRO TelecomWed Mar 11 1987 17:0468
    Knuth uses an algorithm called "Liang's method", published in 1982.
    It is supposedly extremely efficient and works in any language that
    uses the Roman alphabet.  It avoids such horror stories as "The-rapists
    who pre-ached on wee-knights."

    Here's a direct quote from "The TeXbook":

    Here's how it works, using the word 'hyphenation' as an example...
    The given word is first extended by special markers at either end;
    in this case we obtain

                                .hyphenation.

    if '.' denotes the special marker.  The extended word has subwords

                          . h y p h e n a t i o n .
    of length one,

                     .h hy yp ph he en na at ti io on n.
    of length two,

                 .hy hyp yph phe hen ena nat ati tio ion on.

    of length three, and so on.  Each subword of length k is a pattern
    that defines k+1 small integer values relating to the desirability
    of hyphens in the positions between and adjacent to its letters.
     We can show these values by attaching them as subscripts; for example,
    '0h0e2n0' means that the values corresponding to the subword 'hen'
    are 0, 0, 2, and 0, where the 2 relates to hyphens between the 'e'
    and the 'n'.  The interletter values are entirely zero for all subwords
    [except for patterns provided in a specia 'pattern dictionary' to
    the hyphenation utility]...and in this case, only the subwords
    
                         0h0y3p0h0 0h0e2n0 0h0e0n0a4
                   0h0e0n5a0t0 1n0a0 0n2a0t0 1t0i0o0 2i0o0
    
    occur as special patterns.  TeX now computes the maximum interletter
    value that occurs at each subword touching each interletter position.
    for example, between 'e' and 'n' there are four relevant values in this
    case (2 from 0h0e2n0, 0 from 0h0e0n0a4, 0 from 0h0e0n5a0t0, and 1 from
    1n0a0); the maximum of these is 2.  The result of all the maximizations
    is 
    
                          .0h0y3p0h0e2n5a4t2i0o0n0.

    Now comes the final step: a hyphen is considered to be acceptable
    between two letters if the associated interletter value is ODD.
    Thus, two potential breakpoints have been found: 'hy-phen-ation'.
    Similarly, the word 'concatenation' contains the patterns
    
               0o2n0 0o0n1c0 1c0a0 1n0a0 0n2a0t0 1t0i0o0 2i0o0
    
    and this yields '0c0o2n1c0a0t0e1n2a1t2i0o0n0', i.e., 'con-cate-na-tion'.

    ...Plain TeX loads exactly 4447 patterns into TeX's memory, beginning
    with '0.0a0c0h4' and ending with '4z1z2' and '0z4z0y0'.  The
    interletter values in these patterns are all between 0 and 5; a
    large odd value like the 5 in '0h5e0l0o0' forces desirable hyphen
    points in words like 'bach-e-lor' and 'ech-e-lon', while a large
    even value like the 4 in '0h0a0c0h4' suppresses undesirable hyphens
    in words like 'tooth-aches'.  Liang derived these patterns by preparing
    a special version of Webster's Pocket Dictionary (Merriam, 1966)
    that contains about 50,000 words including derived forms.  Then
    he checked a a preliminary set of patterns obtained from this data
    against an up-to-date hyphenation dictionary of about 115,000 words
    obtained from a publisher; errors found in this run led to the addition
    of about 1000 words like 'camp-fire', 'Af-ghan-i-stan', and
    'bio-rhythm' to the pocket dictionary list.
333.2oh, THAT KnuthDELNI::GOLDSTEINWAC-E Ideology & PlanningWed Mar 18 1987 18:092
    If you understood all that at first reading,
    You may be able to use TeX!
333.3WRONGO::PARMENTERNo CIA spooks in White HouseTue Sep 20 1988 21:0711
    
    The Government Printing Office Style Manual lists the rules for
    most of the roman alphabet languages.  _Words Into Type_ lists a
    less complete set of these rules, and _The Chicago Manual of Style_
    also has a listing of the rules.
    
    These rules are in a natural language (English), not a computer
    language.  So it is left as an exercise for the reader to write
    the software which actually does the line breaking.
    
    David
333.4other division rules?SUBWAY::KABELdoryphoreThu Jan 30 1992 15:2115