[ art / civ / cult / cyb / diy / drg / feels / layer / lit / λ / q / r / sci / sec / tech / w / zzz ] archive provided by lainchan.jp

lainchan archive - /λ/ - 22736



File: 1490880941587-0.png (3.01 KB, 300x26, out-1.png)

File: 1490880941587-1.png (22.61 KB, 300x202, out-2.png)

File: 1490880941587-2.png (22.68 KB, 300x202, out-3.png)

No.22736

I started working on a text justification algorithm last night, as a fun exercise to see how they work. I've intentionally not looked at the Knuth-Plass algorithm.

I'm doing this in Haskell. I won't give all the code, but enough to follow what I'm doing, hopefully. Here are some basic types:

  -- | A line of text is a non-empty list of words interspersed with spaces of varying size.
data Line = Line String [(Int, String)]

-- | A text justifier takes a line width, list of word (and word fragment) sizes, a minimum space width, a list of words, and produces a list of lines
type Justifier = Int -> [(String, Int)] -> Int -> [String] -> [Line]

out-1.png was just a test, to make sure I could actually render things to an image.

  justify1 :: Justifier
justify1 _ _ iota (w:ws) = [Line w [(iota, w') | w' <- ws]]
justify1 _ _ _ [] = []

out-2.png is a simple greedy ragged-right: put words on a line until they don't fit any more.

  justify2 :: Justifier
justify2 width sizes iota = go ([], 0) where
go (sofar, len) (w:ws) =
let newlen = len + iota + fromMaybe 0 (lookup w sizes)
in if newlen > width
then case reverse sofar of
(word:rest) -> toLine word rest : go ([], 0) (w:ws)
[] -> Line w [] : go ([], 0) ws
else go (w:sofar, newlen) ws
go (sofar, _) [] = case reverse sofar of
(word:rest) -> [toLine word rest]
[] -> []

toLine word rest = Line word [(iota, s) | s <- rest]

out-3.png is what I have decided to call "web browser text justification", it's the awful algorithm which makes justification such a no-no on webpages. First, allocate words to lines with ragged-right, then evenly spread spaces to use up the extra space. It can be implemented as a simple modification of the ragged right function.

  justify3 :: Justifier
justify3 = padWords justify2

padWords :: Justifier -> Justifier
padWords justifier width sizes iota = padWords' width sizes . justifier width sizes iota

padWords' :: Int -> [(String, Int)] -> [Line] -> [Line]
padWords' width sizes = go where
go [] = []
go [lastLine] = [lastLine]
go (l@(Line w rest):ls) =
let slack = width - lineLen sizes l
gaps = lineWords l - 1
wordSlack = slack `div` gaps
extraSlack = slack - wordSlack * gaps
extraSlackPos = 42 `mod` (gaps - 1)
in Line w (go' wordSlack extraSlack extraSlackPos rest) : go ls

go' wordSlack extraSlack extraSlackPos ((gap,w):ws)
| extraSlackPos == 0 = (gap + wordSlack + extraSlack, w) : go' wordSlack 0 0 ws
| otherwise = (gap + wordSlack, w) : go' wordSlack extraSlack extraSlackPos ws
go' _ _ _ [] = []

  No.22737

File: 1490881198503-0.png (22.81 KB, 200x126, out-4.png)

File: 1490881198503-1.png (22.43 KB, 200x126, out-5.png)

(cont)

This is the limit of what we can do without hyphenation.

out-3.png uses a very naive hyphenation strategy, it just splits words wherever it likes (the fragments' function returns all breakings of a word)

  justify4 :: Justifier
justify4 = hyphenated fragments'

fragments' :: String -> [(String, String)]
fragments' s0 = (map go . init . tail) (zip (inits s0) (tails s0)) where
go (h, t) = (h ++ "-", t)

hyphenated :: (String -> [(String, String)]) -> Justifier
hyphenated hyphenator = padWords justifier where
justifier width sizes iota = go ([], 0) where
go ([], _) (w:ws) = go ([w], fromMaybe 0 (lookup w sizes)) ws
go (sofar, len) (w:ws)
| fits len w = go (w:sofar, len + iota + fromMaybe 0 (lookup w sizes)) ws
| otherwise = case dropWhile (not . fits len . fst) . sortOn (Down . length . fst) $ hyphenator w of
((h,t):_) -> case reverse (h:sofar) of
(word:rest) -> toLine word rest : go ([], 0) (t:ws)
_ -> error "unreachable"
[] -> case reverse sofar of
(word:rest) -> toLine word rest : go ([], 0) (w:ws)
[] -> Line w [] : go ([], 0) ws
go (sofar, _) [] = case reverse sofar of
(word:rest) -> [toLine word rest]
[] -> []

fits len w = len + iota + fromMaybe 0 (lookup w sizes) <= width

toLine word rest = Line word [(iota, s) | s <- rest]

out-4.png uses the Knuth-Liang hyphenation algorithm, here with the Latin rules because I am typesetting Lorem Ipsum. This produces much prettier results.

  justify5 :: Justifier
justify5 = hyphenated knuthHyphenator

knuthHyphenator :: String -> [(String, String)]
knuthHyphenator w =
let prefixes = (init . scanl1 (++)) (hyphenate latin w)
in map (\prefix -> (prefix++"-", drop (length prefix) w)) prefixes

  No.22738

File: 1490881430076.png (22.3 KB, 200x126, out-6.png)

(cont)

>>22737
Whoops, I meant out-4.png and out-5.png there.

Until now, all these algorithms have been greedy. Once they commit a word (or a fragment of a word) to a line, that decision is never revisited. I suspect that out-5.png is the limit of what can be achieved with greedy algorithms: it looks good, but not great.

This brings me to my current version of the algorithm. When a word doesn't fit on a line in its entirety, all possible hyphenations are tried (including just shoving the word onto the next line entirely). Afterwards, the least-bad paragraph is picked. out-6.png shows the result.

  justify6 :: Justifier
justify6 = leastBad justifier where
justifier width sizes iota = map (padWords' width sizes) . go ([], 0) where
go ([], _) (w:ws) = go ([w], fromMaybe 0 (lookup w sizes)) ws
go (sofar, len) (w:ws)
| fits len w = go (w:sofar, len + iota + fromMaybe 0 (lookup w sizes)) ws
| otherwise =
(case reverse sofar of
(word:rest) -> [toLine word rest : ls | ls <- go ([], 0) (w:ws)]
[] -> [toLine w [] : ls | ls <- go ([], 0) ws])
++
[ toLine word rest : ls
| (h,t) <- knuthHyphenator w
, fits len h
, let (word:rest) = reverse (h:sofar)
, ls <- go ([], 0) (t:ws)
]
go (sofar, _) [] = case reverse sofar of
(word:rest) -> [[toLine word rest]]
[] -> [[]]

fits len w = len + iota + fromMaybe 0 (lookup w sizes) <= width

toLine word rest = Line word [(iota, s) | s <- rest]

leastBad :: (Int -> [(String, Int)] -> Int -> [String] -> [[Line]]) -> Justifier
leastBad justifier width sizes iota ws = case sortOn badness (justifier width sizes iota ws) of
(leastBad:_) -> leastBad
[] -> []
where
-- the badness of a paragraph is the pair @(sum of line badnesses,
-- number of hyphenated lines)@.
badness ls =
let
-- the badness of a line is the difference between the biggest
-- space and the smallest possible space, raised to the third
-- power. This is nonlinear so that one particularly bad line
-- is worse than a few slightly bad lines.
lbadness (Line _ rest) = (maximum (iota : map fst rest) - iota) ^ 3
lhyphenated (Line w rest) = last (last (w : map snd rest)) == '-'
in (sum (map lbadness ls), length (filter lhyphenated ls))

I am pretty happy with tis, but I suspect there is more I could achieve. I'm not really sure what yet, though.

  No.22745

File: 1490970214517-0.png (64.09 KB, 200x51, out-rich-rr3.png)

File: 1490970214517-1.png (60.62 KB, 200x89, out-rich-rl2.png)

File: 1490970214517-2.png (62.71 KB, 200x59, out-rich-justify3.png)

(cont)

I've added a couple of new features to my algorithm: rich text (bold, italic, and monospace), variable line-lengths, and variable left-indents. The attached images show some different configurations of the same string, changing formatting every three words.

For this, I had to change the line type to include fonts, and allow spaces before even the first word:

  type Line = NonEmpty (Int, String, Font) 

I measure the width of every word and partial word in the input, with the exception of monospaced spans, which never get broken over lines:

  type RichText = [(String, Font)]

-- | Font styles.
data Font = Normal | Bold | Italic | Monospace
deriving (Bounded, Enum, Eq, Ord, Read, Show)

-- | Get the size of every word in a string, font-aware.
getWordSizes :: RichText -> IO [((String, Font), Int)]
getWordSizes = fmap (nub . concat) . mapM go where
go w@(_, Monospace) = do
size <- getStringSize w
pure [(w, size)]
go (s, f) =
let wsize w = let wf = (w, f) in getStringSize wf >>= \size -> pure (wf, size)
in mapM wsize (concatMap fragments (words s))

-- | Get the size of a string.
getStringSize :: (String, Font) -> IO Int
getStringSize (s, f) = do
((x1,_), _, (x2,_), _) <- GD.measureString (fontName f) fontSize 0 (0, 0) s 0
pure (x2-x1)

The algorithm isn't actually that much more complex than justify6 I had in >>22738:

  indentsAndLineLengths
:: (Int -> (Int, Int))
-- ^ Argument is the line number (starting from 0). First result is
-- the left indent, second is the line length.
-> (Paragraph -> Paragraph)
-- ^ Post-processing function.
-> Justifier
indentsAndLineLengths lenf postf _ sizes iota = leastBad iota . map postf . go 0 ([], 0) . concatMap pre where
-- pre-processing: split up spans into words
pre w@(_, Monospace) = [w]
pre (s, f) = [(s', f) | s' <- words s]

go n ([], _) (w:ws) = go n ([w], wordSize sizes w) ws
go n (sofar, len) (w:ws)
| fits n len w = go n (w:sofar, len + iota + wordSize sizes w) ws
| otherwise =
(case reverse sofar of
(word:rest) -> [toLine n word rest : ls | ls <- go (n+1) ([], 0) (w:ws)]
[] -> [toLine n w [] : ls | ls <- go (n+1) ([], 0) ws])
++
[ toLine n word rest : ls
| (h,t) <- hyphens w
, fits n len h
, let (word:rest) = reverse (h:sofar)
, ls <- go (n+1) ([], 0) (if null (fst t) then ws else (t:ws))
]
go n (sofar, _) [] = case reverse sofar of
(word:rest) -> [[toLine n word rest]]
[] -> [[]]

fits n len w = len + iota + wordSize sizes w <= snd (lenf n)

toLine n (w,f) rest = (fst (lenf n), w, f):|[(iota, s, f') | (s, f') <- rest]

hyphens (w, Monospace) = [((w, Monospace), ("", Monospace))] -- never hyphenate monospaced spans
hyphens (w, f) = map (\(h,t) -> ((h,f),(t,f))) (knuthHyphenator w)

This perfoms no justification by itself. The postf parameter can adjust word spacing and line indents to get ragged left or justified text.

I don't have any more ideas for how to improve this. I think I might call it done and go read the Knuth-Plass paper to see how close I got.

  No.22754

Very nice, OP. Next add images inline with the text :)
I'm curious, how are you rendering these in Haskell?

  No.22755

>>22754
I'm using the GD bindings (https://hackage.haskell.org/package/gd) to produce png files. All the sizes and distances are in terms of pixels: allowed line length, size of words, length of spaces between words.

>images

I think I can see how to do that. Say you have an image Xpx wide and want a result Ypx wide. You render your text with the first N lines (Y - X - some gap)px wide and the rest Ypx wide. Then the renderer just puts the image in the top-let, offsets the first N lines by (X + some gap) pixels, and renders the remaining lines normally. N will be (height of image / height of line).

  No.22756

File: 1491063789663.png (32.99 KB, 200x194, out-shape-lain.png)

>>22754
Here you go, lain. It was just as I said:

  -- | Justify text to flow around a lain image.
lainJ :: Int -> Justifier
lainJ lineHeight width0 sizes = indentsAndLineLengths lenf justify width0 sizes where
justify = padWords sizes (snd . lenf)
lenf n
| n <= lainImageHeight `div` lineHeight = (lainImageWidth + lainImageGap, width0 - lainImageWidth - lainImageGap)
| otherwise = (0, width0)

-- | Render text + lain image.
lainR :: [((String, Font), Int)] -> String -> Paragraph -> IO ()
lainR sizes fname ls0 = do
img <- renderImage sizes ls0
lainImg <- GD.loadGifFile lainImageFile
GD.copyRegion (0,0) (lainImageWidth,lainImageHeight) lainImg (0,0) img
GD.savePngFile fname img

-- | Pixel height of the lain image.
lainImageHeight :: Int
lainImageHeight = 300

-- | Pixel width of the lain image.
lainImageWidth :: Int
lainImageWidth = 250

-- | Pixel size of horizontal gap between lain image and text
lainImageGap :: Int
lainImageGap = 15

-- | Filename of the lain image.
lainImageFile :: String
lainImageFile = "lain.gif"

  No.22757

>>22756
Lainon delivers! That was quick, and looks fuarrrking nice too.

  No.22761

File: 1491082456523.png (1.13 MB, 32x200, doc.png)

This will be the last update, unless I have any fantastic new idea. I wrote some combinators on images and rendered this thread.

I think the result is pretty nice.