Specify the algorithm for splitting text into cells

2026-07-27 02:31:45 +02:00 · 2025-04-10 10:47:07 +05:30
parent b32a5492c5
commit 3f919cbc56
2 changed files with 124 additions and 2 deletions
--- a/docs/text-sizing-protocol.rst
+++ b/docs/text-sizing-protocol.rst
@@ -191,8 +191,7 @@ have is the Unicode standard. Unfortunately, the Unicode standard has a new
 version almost every year and actually changes the width assigned to some
 characters in different versions. Furthermore, to actually get the "correct"
 width for a string using that standard one has to do grapheme segmentation,
-which is an `extremely complex algorithm
+which is a :ref:`complex algorithm, specified below <gseg>`.
-<https://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries>`__.
 Expecting all terminals and all terminal programs to have both up-to-date
 character databases and a bug free implementation of this algorithm is not
 realistic.
@@ -344,3 +343,123 @@ their interactions with multicell characters.
 **Delete lines** (``CSI M`` aka ``DL``)
    When deleting ``n`` lines at cursor position ``y`` any multicell character
    that intersects the deleted lines must be erased.
 .. _gseg:
 The algorithm for splitting text into cells
 ------------------------------------------------
 .. note::
   Notation: :code:`[start, stop, step]` means the integers from :code:`start`
   to :code:`stop` in increments of :code:`step`. When the step is not
   specified, it defaults to one.
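As a rough illustration of this notation, here is a minimal Python sketch.
The helper name ``inclusive_range`` is hypothetical, and treating ``stop`` as
inclusive is an assumption; it is the reading under which the ranges given
for invalid characters in this section add up to the stated 66.

```python
def inclusive_range(start, stop, step=1):
    """Integers from start to stop in increments of step (stop assumed inclusive)."""
    return list(range(start, stop + 1, step))


print(inclusive_range(1, 7, 3))  # [1, 4, 7]

# The three ranges listed for invalid characters, counted under this reading.
total = (
    len(inclusive_range(0xFDD0, 0xFDEF))
    + len(inclusive_range(0xFFFE, 0x10FFFF - 1, 0x10000))
    + len(inclusive_range(0xFFFF, 0x10FFFF, 0x10000))
)
print(total)  # 66
```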
 Here, we specify how a terminal must split up text into cells, where a cell is
 a unit of width one in the character grid the terminal displays.
 The basis for the algorithm is the
 `Grapheme segmentation algorithm <https://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries>`__
 from the Unicode standard. However, that algorithm alone is insufficient to
 fully specify text handling for terminals. The full algorithm is specified
 below. When a terminal receives a Unicode character:
 #. First check if the character is an ASCII control code, and handle it
   appropriately. ASCII control codes are the characters less than 32 and the
   character 127 (DEL). The NUL character (0) must be discarded.
 #. Next, check if the character is *invalid*, and if it is, discard it
   and finish processing. Invalid characters are characters with Unicode category :code:`Cc` or :code:`Cs`
   and 66 additional characters: :code:`[0xfdd0, 0xfdef]`, :code:`[0xfffe, 0x10ffff-1, 0x10000]`
   and :code:`[0xffff, 0x10ffff, 0x10000]`.
 #. Next, check if there is a previous cell before the
   current cursor position. This means either the cursor is at x > 0 in which
   case the previous cell is at x-1 on the same line, or the previous cell is
   the last cell of the previous line, provided there is no line break
   between the previous and current lines.
 #. Next, calculate the width in cells of the received
   character, which can be 0, 1, or 2 depending on the character properties in
   the Unicode standard.
 #. If there is no previous cell and the character width is zero, the character
   is discarded and processing of the character is finished.
 #. If there is a previous cell, the
   `Grapheme segmentation algorithm UAX29-C1-1 <https://www.unicode.org/reports/tr29/#C1-1>`__
   is used to determine if there is a grapheme boundary between the previous cell and the current character.
 #. If there is no boundary the current character is added to the previous
   cell and processing of the character is finished. See the :ref:`var_select`
   section below for handling of Unicode Variation selectors.
 #. If there is a boundary, but the width of the current character is zero
   it is added to the previous cell and processing is finished.
 #. The character is added to the current cell and the cursor is moved forward
   (right) by either 1 or 2 cells depending on the width of the character.
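The steps above can be sketched in Python roughly as follows. This is a
simplified illustration, not the terminal's implementation: it lays out a
single unwrapped line, skips every ASCII control code instead of acting on
it, and ignores variation selectors. The names ``is_invalid``,
``split_into_cells``, ``width_of`` and ``has_boundary`` are hypothetical; the
width calculation and the grapheme boundary test are passed in as callbacks,
and the toy callbacks at the end are stand-ins based only on canonical
combining classes.

```python
import unicodedata


def is_invalid(ch):
    """Step 2: category Cc or Cs, plus the 66 additional codepoints."""
    cp = ord(ch)
    if unicodedata.category(ch) in ("Cc", "Cs"):
        return True
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFF) >= 0xFFFE


def split_into_cells(text, width_of, has_boundary):
    """Return a list of [text, width] cells for one unwrapped line."""
    cells = []
    for ch in text:
        cp = ord(ch)
        if cp < 32 or cp == 127:
            continue  # step 1, simplified: control codes (including NUL) are dropped
        if is_invalid(ch):
            continue  # step 2
        prev = cells[-1] if cells else None  # step 3
        width = width_of(ch)  # step 4
        if prev is None:
            if width == 0:
                continue  # step 5: nothing to attach a zero width character to
        elif not has_boundary(prev[0], ch) or width == 0:
            prev[0] += ch  # steps 6 to 8: extend the previous cell
            continue
        cells.append([ch, width])  # step 9: start a new cell


    return cells


# Toy stand-ins (assumptions): real code needs the width rules and UAX #29.
def toy_width(ch):
    return 0 if unicodedata.combining(ch) else 1


def toy_boundary(prev_text, ch):
    return unicodedata.combining(ch) == 0


# Leading U+0306 is discarded; the second one joins the cell holding "a".
cells = split_into_cells("\u0306a\u0306b\0", toy_width, toy_boundary)
print(cells)
```

Keeping the width and boundary functions as parameters keeps the control flow
of the steps visible without committing to a particular Unicode database
version.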
 It remains to specify how to calculate the width in cells of a Unicode
 character. To do this, characters are divided into various classes, as
 described by the rules below, in order of decreasing priority:
 #. Regional indicators: 26 characters starting at :code:`0x1F1E6`. These all
   have width two.
 #. Doublewidth: Parse `EastAsianWidth.txt
   <https://www.unicode.org/Public/UCD/latest/ucd/EastAsianWidth.txt>`__ from
   the Unicode standard. All characters marked :code:`W` or :code:`F` have
   width two. All characters in the following ranges have width two *unless*
   they are marked as :code:`A` in :code:`EastAsianWidth.txt`: :code:`[0x3400,
   0x4DBF], [0x4E00, 0x9FFF], [0xF900, 0xFAFF], [0x20000, 0x2FFFD], [0x30000, 0x3FFFD]`
 .. _wide_emoji_rule:
 #. Wide Emoji: Parse `emoji-sequences.txt
   <https://www.unicode.org/Public/emoji/latest/emoji-sequences.txt>`__ from
   the Unicode standard. All :code:`Basic_Emoji` have width two unless they are
   followed by :code:`FE0F` in the file. The leading codepoints in all
   :code:`RGI_Emoji_Modifier_Sequence` and :code:`RGI_Emoji_Tag_Sequence` have width two.
   All codepoints in :code:`RGI_Emoji_Flag_Sequence` have width two.
 #. Marks: These are all zero width characters. They are characters with Unicode
   categories whose first letter is :code:`M` or :code:`S`, as well as
   characters with Unicode category :code:`Cf`. Finally, they include
   all modifier codepoints from :code:`RGI_Emoji_Modifier_Sequence` in the
   :ref:`Wide emoji rule <wide_emoji_rule>`.
 #. All remaining codepoints have a width of one cell.
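A minimal sketch of this priority order in Python, transcribing the rules
literally. Several things here are assumptions: ``char_width`` is a
hypothetical name, ``unicodedata`` reflects whatever Unicode version ships
with the Python interpreter instead of the latest ``EastAsianWidth.txt``, and
the two emoji sets are expected to be built by the caller from
``emoji-sequences.txt`` (they default to empty here).

```python
import unicodedata

# Ranges that are wide unless EastAsianWidth.txt marks the character as A.
WIDE_UNLESS_AMBIGUOUS = (
    (0x3400, 0x4DBF), (0x4E00, 0x9FFF), (0xF900, 0xFAFF),
    (0x20000, 0x2FFFD), (0x30000, 0x3FFFD),
)


def char_width(ch, wide_emoji=frozenset(), emoji_modifiers=frozenset()):
    """Width in cells of one character, checking rules in decreasing priority."""
    cp = ord(ch)
    # Rule 1: the 26 regional indicators.
    if 0x1F1E6 <= cp < 0x1F1E6 + 26:
        return 2
    # Rule 2: doublewidth, from the East Asian Width property.
    eaw = unicodedata.east_asian_width(ch)
    if eaw in ("W", "F"):
        return 2
    if eaw != "A" and any(lo <= cp <= hi for lo, hi in WIDE_UNLESS_AMBIGUOUS):
        return 2
    # Rule 3: wide emoji, using the caller-supplied codepoint set.
    if cp in wide_emoji:
        return 2
    # Rule 4: marks, with the category test transcribed literally from the text.
    category = unicodedata.category(ch)
    if category[0] in "MS" or category == "Cf" or cp in emoji_modifiers:
        return 0
    # Rule 5: everything else.
    return 1


print(char_width("a"), char_width("\u4e2d"), char_width("\u0306"))  # 1 2 0
```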
 .. _var_select:
 Unicode variation selectors
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 There are two codepoints (:code:`U+FE0E` and :code:`U+FE0F`) that can actually
 alter the width of the previous codepoint. When adding a codepoint to the
 previous cell these have to be handled specially.
 ``U+FE0E`` - Variation Selector 15
  When the previous cell has width two and the last character in the previous
  cell is one of the ``Basic_Emoji`` codepoints from the :ref:`Wide emoji rule
  <wide_emoji_rule>` that is *not* followed by ``FE0F`` then the width of the
  previous cell is decreased to one.
 ``U+FE0F`` - Variation Selector 16
  When the previous cell has width one and the last character in the previous
  cell is one of the ``Basic_Emoji`` codepoints from the :ref:`Wide emoji rule
  <wide_emoji_rule>` that is followed by ``FE0F`` then the width of the
  previous cell is increased to two.
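The two rules can be sketched as a small Python function. The function name
and the two set parameters are hypothetical: ``emoji_default`` stands for the
``Basic_Emoji`` codepoints listed without a trailing ``FE0F`` in
``emoji-sequences.txt``, and ``text_default`` for those listed with it. The
example sets below contain one codepoint each, purely for illustration.

```python
VS15 = 0xFE0E
VS16 = 0xFE0F


def adjust_cell_width(last_cp, width, selector, text_default, emoji_default):
    """Return the new width of the previous cell after a variation selector."""
    if selector == VS15 and width == 2 and last_cp in emoji_default:
        return 1  # wide emoji narrowed to text presentation
    if selector == VS16 and width == 1 and last_cp in text_default:
        return 2  # narrow character widened to emoji presentation
    return width  # any other combination leaves the width unchanged


text_default = {0x2764}    # listed as "2764 FE0F" in emoji-sequences.txt
emoji_default = {0x1F600}  # listed without FE0F

print(adjust_cell_width(0x1F600, 2, VS15, text_default, emoji_default))  # 1
print(adjust_cell_width(0x2764, 1, VS16, text_default, emoji_default))   # 2
```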
 Note that the rule for ``U+FE0E`` is particularly problematic for terminals as
 it means that the width of a string cannot be determined without knowing the
 width of the screen it will be rendered on. This is because when there is only
 one cell left on the current line and a wide emoji is received it wraps onto
 the next line. If subsequently a ``U+FE0E`` is received, the emoji becomes one
 cell wide but it is *not* moved back to the previous line.
 To avoid this issue, it is recommended that applications detect when ``U+FE0E`` is
 present and in such cases use the width part of the text sizing protocol
 to control rendering.
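To make the wrapping problem with ``U+FE0E`` concrete, here is a toy Python
model. It is only an illustration under simplifying assumptions: ``layout``
is a hypothetical helper, each item is a ``(width, narrowed)`` pair where
``narrowed`` marks a wide emoji that is later followed by ``U+FE0E``, and the
wrap decision is made with the original width because the selector has not
arrived yet.

```python
def layout(items, columns):
    """Return rows of final cell widths for the given screen width."""
    rows = [[]]
    for width, narrowed in items:
        if sum(rows[-1]) + width > columns:
            rows.append([])  # not enough room left: wrap before placing the cell
        # U+FE0E arrives after placement, so the cell shrinks where it already is.
        rows[-1].append(1 if narrowed else width)
    return rows


# Two narrow characters, then a wide emoji followed by U+FE0E.
items = [(1, False), (1, False), (2, True)]

print(layout(items, 3))  # [[1, 1], [1]]
print(layout(items, 4))  # [[1, 1, 1]]
```

The same text ends up on two lines at three columns and on one line at four,
even though every cell is one column wide after the selector is applied.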
--- a/kitty_tests/datatypes.py
+++ b/kitty_tests/datatypes.py
@@ -682,3 +682,6 @@ class TestDataTypes(BaseTest):
        s.draw('a' * s.columns)
        s.draw('\u0306')
        self.ae(str(s.line(0)), 'a' * s.columns + '\u0306')
        s.reset()
        s.draw('\0')
        self.ae(str(s.line(0)), '')