Specify the algorithm for splitting text into cells

This commit is contained in:
Kovid Goyal
2025-04-10 10:47:07 +05:30
parent b32a5492c5
commit 3f919cbc56
2 changed files with 124 additions and 2 deletions

View File

@@ -191,8 +191,7 @@ have is the Unicode standard. Unfortunately, the Unicode standard has a new
version almost every year and actually changes the width assigned to some
characters in different versions. Furthermore, to actually get the "correct"
width for a string using that standard one has to do grapheme segmentation,
which is an `extremely complex algorithm
<https://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries>`__.
which is a :ref:`complex algorithm, specified below <gseg>`.
Expecting all terminals and all terminal programs to have both up-to-date
character databases and a bug free implementation of this algorithm is not
realistic.
@@ -344,3 +343,123 @@ their interactions with multicell characters.
**Delete lines** (``CSI M`` aka ``DL``)
When deleting ``n`` lines at cursor position ``y`` any multicell character
that intersects the deleted lines must be erased.
.. _gseg:
The algorithm for splitting text into cells
------------------------------------------------
.. note::
Notation: :code:`[start, stop, step]` means the integeres from :code:`start`
to :code:`stop` in increments of :code:`step`. When the step is not
specified, it defaults to one.
Here, we specify how a terminal must split up text into cells, where a cell is
a width one unit in the character grid the terminal displays.
The basis for the algorithm is the
`Grapheme segmentation algorithm <https://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries>`__
from the Unicode standard. However, that algorithm alone is insufficient to
fully specify text handling for terminals. The full algorithm is specified
below. When a terminal receives a Unicode character:
#. First check if the character is an ASCII control code, and handle it
appropriately. ASCII control codes are the characters less than 32 and the
character 127 (DEL). The NUL character (0) must be discarded.
#. Next, check if the character is *invalid*, and if it is, discard it
and finish processing. Invalid characters are characters with Unicode category :code:`Cc or Cs`
and 66 additional characters: :code:`[0xfdd0, 0xfdef]`, :code:`[0xfffe, 0x10ffff-1, 0x10000]`
and :code:`[0xffff, 0x10ffff, 0x10000]`.
#. Next, check if there is a previous cell before the
current cursor position. This means either the cursor is at x > 0 in which
case the previous cell is at x-1 on the same line, or the previous cell is
the last cell of the previous line, provided there is no line break
between the previous and current lines.
#. Next, calculate the width in cells of the received
character, which can be 0, 1, or 2 depending on the character properties in
the Unicode standard.
#. If there is no previous cell and the character width is zero, the character
is discarded and processing of the character is finished.
#. If there is a previous cell, the
`Grapheme segmentation algorithm UAX29-C1-1 <https://www.unicode.org/reports/tr29/#C1-1>`__
is used to determine if there is a grapheme boundary between the previous cell and the current character.
#. If there is no boundary the current character is added to the previous
cell and processing of the character is finished. See the :ref:`var_select`
section below for handling of Unicode Variation selectors.
#. If there is a boundary, but the width of the current character is zero
it is added to the previous cell and processing is finished.
#. The character is added to the current cell and the cursor is moved forward
(right) by either 1 or 2 cells depending on the width of the character.
It remains to specify how to calculate the width in cells of a Unicode
character. To do this, characters are divided into various classes, as
described by the rules below, in order of decreasing priority:
#. Regional indicators: 26 characters starting at :code:`0x1F1E6`. These all
have width 2
#. Doublewidth: Parse `EastAsianWidth.txt
<https://www.unicode.org/Public/UCD/latest/ucd/EastAsianWidth.txt>`__ from
the Unicode standard. All characters marked :code:`W` or :code:`F` have
width two. All characters in the following ranges have width two *unless*
they are marked as :code:`A` in :code:`EastAsianWidth.txt`: :code:`[0x3400,
0x4DBF], [0x4E00, 0x9FFF], [0xF900, 0xFAFF], [0x20000, 0x2FFFD], [0x30000, 0x3FFFD]`
.. _wide_emoji_rule:
#. Wide Emoji: Parse `emoji-sequences.txt
<https://www.unicode.org/Public/emoji/latest/emoji-sequences.txt>`__ from
the Unicode standard. All :code:`Basic_Emoji` have width two unless they are
followed by :code:`FE0F` in the file. The leading copdepoints in all
:code:`RGI_Emoji_Modifier_Sequence` and :code:`RGI_Emoji_Tag_Sequence` have width two.
All codepoints in :code:`RGI_Emoji_Flag_Sequence` have width two.
#. Marks: These are all zero width characters. They are characters with Unicode
categories whose first Letter is :code:`M` or :code:`S`. Additionally,
characters with Unicode category: :code:`Cf`. Finally, they include
all modifier codepoints from :code:`RGI_Emoji_Modifier_Sequence` in the
:ref:`Wide emoji rule <wide_emoji_rule>`.
#. All remaining codepoints have a width of one cell.
.. _var_select:
Unicode variation selectors
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
There are two codepoints (:code:`U+FE0E` and :code:`U+FE0F`) that can actually
alter the width of the previous codepoint. When adding a codepoint to the
previous cell these have to be handled specially.
``U+FE0E`` - Variation Selector 15
When the previous cell has width two and the last character in the previous
cell is one of the ``Basic_Emoji`` codepoint from the :ref:`Wide emoji rule
<wide_emoji_rule>` that is *not* followed by ``FEOF`` then the width of the
previous cell is decreased to one.
``U+FE0F`` - Variation Selector 16
When the previous cell has width one and the last character in the previous
cell is one of the ``Basic_Emoji`` codepoint from the :ref:`Wide emoji rule
<wide_emoji_rule>` that is followed by ``FEOF`` then the width of the
previous cell is increased to two.
Note that the rule for ``U+FE0E`` is particularly problematic for terminals as
it means that the width of a string cannot be determined without knowing the
width of the screen it will be rendered on. This is because when there is only
one cell left on the current line and a wide emoji is received it wraps onto
the next line. If subsequently a ``U+FE0E`` is received, the emoji becomes one
cell wide but it is *not* moved back to the previous line.
To avoid this issue, it is recommended applications detect when ``U+FE0E`` is
present and in such cases use the width part of the text sizing protocol
to control rendering.