v0.8.0: Markdown + Code + Diff rendering module

Add cl-tui.markdown package with:
- Markdown parser: headings, paragraphs, bold, italic, inline-code, links, code blocks, blockquotes, lists, thematic breaks
- Syntax highlighting: Lisp, Python, JavaScript, Bash with keyword, builtin, comment, number, function coloring
- Diff renderer: colorized unified diff (+/-/@ lines)
- Terminal renderer: ANSI escape sequences via backend-style functions
- 67 tests, 100% passing
- All parser helpers use values returns (not cons) for multiple-value-bind

ASDF: v0.7.0 -> v0.8.0, new markdown module + test suite
2026-05-11 18:26:34 +00:00
parent e96c338a57
commit 9648c72b85
5 changed files with 1461 additions and 5 deletions
--- a/org/markdown-renderer.org
+++ b/org/markdown-renderer.org
@@ -0,0 +1,500 @@
+#+TITLE: Markdown + Code + Diff Rendering (v0.8.0)
+#+DATE: 2026-05-11
+#+AUTHOR: Amr Gharbeia / Hermes
+
+* Overview
+
+This module provides rendering of Markdown text, syntax-highlighted code
+blocks, and unified diffs in the terminal. It completes the rendering
+pipeline so that [[file:render.org][the render tree]] can handle rich formatted
+content.
+
+The Markdown renderer is /not/ a general-purpose MD-to-HTML converter.
+It targets TUI output: node types that have clear terminal analogues
+(headings → bold/bright, code blocks → monochrome block, bold → ANSI
+bold, etc.). Edge cases that matter for a terminal (long lines, escape
+sequences inside code, mixed formatting) are handled explicitly.
+
+** Design decisions
+
+1. /Two-phase parse/: block-level first (lines), then inline (characters
+   within each block). This matches how terminals render — block layout
+   first, style within.
+2. /Syntax highlighting by keyword set/: not a full lexer. A lookup
+   table of language → (keywords, types, builtins) sets. Catches ~90%
+   of highlighting cases without pulling in a parser. Fails safe
+   (unmatched tokens render as plain text).
+3. /Diff lines are self-describing/: a diff block starts with a `---` or
+   `+++` header, and each line carries a `+` or `-` prefix. We don't
+   re-parse patch semantics; we just color by prefix. This makes the
+   renderer tolerant of malformed diffs.
+4. /No recursive descent parser/: a simple state machine over lines for
+   block-level, and a character cursor for inline. Keeps the code
+   short and avoids parser-generator dependencies.
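+
+As a rough illustration of decisions 2 and 3, the sketch below
+classifies a token by set membership and a diff line by its first
+character. The names (~*example-keywords*~, ~example-token-class~,
+~example-diff-line-class~) and the keyword lists are hypothetical,
+made up for this example; they are not the module's actual tables.
+
+#+BEGIN_SRC lisp :tangle no
+;; Hypothetical language -> keyword list table (illustrative only).
+(defparameter *example-keywords*
+  '(("python" . ("def" "class" "return" "if" "else"))
+    ("lisp"   . ("defun" "let" "cond" "lambda"))))
+
+(defun example-token-class (lang token)
+  "Return :keyword if TOKEN is in LANG's keyword set, else :plain.
+Unknown languages and unmatched tokens fall back to :plain (fail safe)."
+  (if (member token
+              (cdr (assoc lang *example-keywords* :test #'string=))
+              :test #'string=)
+      :keyword
+      :plain))
+
+(defun example-diff-line-class (line)
+  "Classify a diff LINE purely by its first character."
+  (if (zerop (length line))
+      :context
+      (case (char line 0)
+        (#\+ :added)
+        (#\- :removed)
+        (#\@ :hunk)
+        (t :context))))
+
+(example-token-class "python" "def")     ; => :KEYWORD
+(example-token-class "cobol" "def")      ; => :PLAIN
+(example-diff-line-class "-old line")    ; => :REMOVED
+#+END_SRC
+
+Because neither function inspects anything beyond the lookup table or
+the first character, malformed input degrades to plain or context
+output instead of signalling an error.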
+
+* Code structure
+
+** Node types
+
+We represent the parsed document as a tree of plists. Each node has at
+least a `:type` key. Block-level nodes carry a `:children` list of
+inline nodes. This keeps the data structure simple — no class hierarchy,
+no generic dispatch — while being easy to traverse for rendering.
+
+Node types:
+
+| Block-level      | Inline             |
+|------------------+--------------------|
+| `:heading`       | `:text`            |
+| `:paragraph`     | `:bold`            |
+| `:code-block`    | `:italic`          |
+| `:blockquote`    | `:inline-code`     |
+| `:list-item`     | `:link`            |
+| `:ordered-item`  |                    |
+| `:thematic-break`|                    |
+| `:diff-block`    |                    |
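+
+To make the plist shape concrete, here is a hand-built tree for the
+line =## Hello **world**=. It is a sketch under two assumptions taken
+from the functions in this section: extra keys such as the heading
+level live under `:properties`, and text nodes keep their string under
+`:content`. The helper ~example-node-types~ is hypothetical and exists
+only to show how easily the tree is traversed.
+
+#+BEGIN_SRC lisp :tangle no
+;; Hand-built node tree; normally the parser would produce this.
+(defparameter *example-heading*
+  (list :type :heading
+        :properties (list :level 2)
+        :children
+        (list (list :type :text :content "Hello ")
+              (list :type :bold
+                    :children (list (list :type :text :content "world"))))))
+
+(defun example-node-types (node)
+  "Collect the :type of NODE and of every descendant, depth-first."
+  (cons (getf node :type)
+        (loop for child in (getf node :children)
+              append (example-node-types child))))
+
+(example-node-types *example-heading*)
+;; => (:HEADING :TEXT :BOLD :TEXT)
+#+END_SRC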
+
+--- per-function: markdown-node-make
+
+~make-md-node~ is a convenience constructor for node plists.
+When there are no children it omits the `:children` key entirely, so
+`(getf node :children)` returns NIL and renderers can test it with a
+plain `(when children ...)`. Extra keys such as `:level` are stored
+under `:properties`.
+
+#+BEGIN_SRC lisp :tangle no
+(defun make-md-node (type &key children properties)
+  "Create a markdown node plist.
+TYPE is a keyword like :heading or :bold.
+CHILDREN is a list of inline node plists (or NIL).
+PROPERTIES is a plist of node-specific extra keys (e.g. :level for headings)."
+  (let ((node (list :type type)))
+    (when children
+      (setf (getf node :children) children))
+    (when properties
+      (setf (getf node :properties) properties))
+    node))
+#+END_SRC
+
+--- per-function: markdown-node-p
+
+~md-node-p~ checks whether something is a markdown node plist.
+We just look for a :type key. This is used in tests and as
+a guard in recursive renderers.
+
+#+BEGIN_SRC lisp :tangle no
+(defun md-node-p (thing)
+  "Return true (the node's :type keyword) if THING is a list with a :type key."
+  (and (listp thing) (getf thing :type)))
+#+END_SRC
+
+--- per-function: markdown-node-text
+
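+~md-node-text~ extracts the plain text from a node tree by
+concatenating the content of all :text descendants and discarding
+markup. Links are the one exception: they render as the text of their
+first child followed by the URL in parentheses. This is useful for
+things like heading anchors, tooltip strings, or search indexing.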
+
+#+BEGIN_SRC lisp :tangle no
+(defun md-node-text (node)
+  "Recursively extract plain text from a markdown node tree."
+  (let ((type (getf node :type)))
+    (cond ((eql type :text)
+           (or (getf node :content) ""))
+          ((eql type :link)
+           (concatenate 'string
+                        (md-node-text (first (getf node :children)))
+                        (format nil " (~a)" (or (getf node :url) ""))))
+          ((getf node :children)
+           (apply #'concatenate 'string
+                  (mapcar #'md-node-text (getf node :children))))
+          (t ""))))
+#+END_SRC
+
+** Block-level parser
+
+The block parser operates line-by-line with a simple state machine.
+Each line is classified by its prefix characters, then accumulated
+into a node.
+
+Rules:
+- Lines starting with `#` → heading (count hashes for level)
+- Lines starting with `>` → blockquote (continuation lines merge)
+- Lines starting with `-`, `*`, or `+` → list-item
+- Lines starting with 1-3 digits followed by `.` or `)` → ordered-item
+- Lines starting with `` ``` `` or `~~~` → code-block (language on opening line)
+- Lines consisting only of three or more `-` or `*` → thematic-break
+- Lines starting with `--- ` or `+++ ` (marker plus space) → diff-block
+- Empty lines → paragraph boundary
+- Everything else → paragraph (continuation lines merge until blank)
+
+--- per-function: classify-line
+
+~classify-line~ returns a cons `(type . data)` for a trimmed line of
+text: a keyword plus a data value. The state machine uses this to
+decide what kind of block to create or continue.
+
+The function must handle prefix stripping (e.g. remove `# ` after
+counting hashes) and edge cases like `#` inside a code block (which
+we don't classify at all — the code block state machine handles that).
+
+One trap: a line like `#not-a-heading` (no space after hash) is NOT
+a heading in CommonMark. We check for space/tab after the hashes.
+
+Another trap: `* item` in a list vs `**bold**` inline. At the
+block-parser level we only look at /line-start/ `* ` (star + space)
+for list items. A line starting with `**text` has no space after the
+first star, so it is not a list item; it falls through to
+`:paragraph`, and the inline parser handles the `**` as bold.
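+
+The heading rule from the first trap can be sketched on its own. This
+is a minimal illustration, not the module's code: the function name
+~example-heading-level~ is hypothetical, and it only covers the
+"hashes, then space/tab or end of line" check described above.
+
+#+BEGIN_SRC lisp :tangle no
+(defun example-heading-level (line)
+  "Return the heading level (1-6) of LINE, or NIL if it is not a heading."
+  ;; HASHES is the length of the leading run of # characters.
+  (let ((hashes (or (position #\# line :test #'char/=)
+                    (length line))))
+    (when (and (<= 1 hashes 6)
+               ;; The run must end the line or be followed by a blank.
+               (or (= hashes (length line))
+                   (member (char line hashes) '(#\Space #\Tab))))
+      hashes)))
+
+(example-heading-level "## Title")        ; => 2
+(example-heading-level "#not-a-heading")  ; => NIL
+(example-heading-level "####### seven")   ; => NIL
+#+END_SRC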
+
+#+BEGIN_SRC lisp :tangle no
+(defun classify-line (line)
+  "Classify a trimmed LINE, returning (type . data).
+TYPE is a keyword; DATA is language for code-blocks, level for headings, etc."
+  (cond
+    ;; Empty line
+    ((string= line "") (cons :blank nil))
+    ;; Thematic break: --- or *** (3+ chars, all same, optional whitespace)
+    ((and (>= (length line) 3)
+          (every (lambda (c) (or (char= c (char line 0))
+                                 (char= c #\Space)
+                                 (char= c #\Tab)))
+                 line)
+          (find (char line 0) "-*"))
+     (cons :thematic-break nil))
+    ;; Heading: 1-6 hashes followed by space/tab, or hashes alone on the line
+    ((and (char= (char line 0) #\#)
+          (let ((count (or (position #\# line :test #'char/=)
+                           (length line))))
+            (and (<= 1 count 6)
+                 ;; "#not-a-heading" has no blank after the hashes: reject it
+                 (or (= (length line) count)
+                     (member (char line count) '(#\Space #\Tab))))))
+     (let* ((hash-count (loop for c across line while (char= c #\#) count c))
+            (content (string-trim (list #\Space #\Tab)
+                                  (subseq line hash-count))))
+       (cons :heading (cons hash-count content))))
+    ;; Blockquote: >
+    ((and (>= (length line) 1) (char= (char line 0) #\>))
+     (let ((content (string-trim (list #\Space #\Tab)
+                                 (subseq line 1))))
+       (cons :blockquote content)))
+    ;; Unordered list: -, *, +
+    ((and (>= (length line) 2)
+          (find (char line 0) "-*+")
+          (char= (char line 1) #\Space))
+     (cons :list-item (string-trim (list #\Space #\Tab) (subseq line 2))))
+    ;; Ordered list: 1-3 digits, then . or ), then a space.
+    ;; Requiring the delimiter keeps prose like "2024 was ..." a paragraph.
+    ((let ((end (position-if-not #'digit-char-p line)))
+       (and end
+            (<= 1 end 3)
+            (find (char line end) ".)")
+            (> (length line) (1+ end))
+            (char= (char line (1+ end)) #\Space)))
+     (let ((end (position-if-not #'digit-char-p line)))
+       (cons :ordered-item (string-trim (list #\Space #\Tab)
+                                        (subseq line (1+ end))))))
+    ;; Diff: --- file or +++ file
+    ((and (>= (length line) 4)
+          (find (char line 0) "-+")
+          (char= (char line 1) (char line 0))
+          (char= (char line 2) (char line 0))
+          (char= (char line 3) #\Space))
+     (cons :diff-header line))
+    ;; Diff: line content with +/- prefix
+    ((and (>= (length line) 1)
+          (find (char line 0) "-+")
+          (not (and (>= (length line) 3)
+                    (char= (char line 1) (char line 0))
+                    (char= (char line 2) (char line 0)))))
+     (cons :diff-line (cons (char line 0) (subseq line 1))))
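+    ;; Fenced code block start: 3+ backticks or tildes, then an optional language.
+    ;; Only the first three characters must be fence characters, so that
+    ;; an opening line such as ```lisp is still recognized.
+    ((and (>= (length line) 3)
+          (find (char line 0) "`~")
+          (every (lambda (c) (char= c (char line 0)))
+                 (subseq line 0 3)))
+     (let ((fence-len (or (position (char line 0) line :test #'char/=)
+                          (length line))))
+       (cons :code-start
+             (string-trim (list #\Space #\Tab) (subseq line fence-len)))))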
+    ;; Default: paragraph content
+    (t (cons :paragraph line))))
+#+END_SRC
+
+--- per-function: parse-blocks
+
+~parse-blocks~ is the main block-level parser. It takes a string
+(possibly multi-line) and returns a list of markdown node plists.
+
+The algorithm:
+1. Split into lines
+2. Classify each line
+3. Accumulate lines of the same type into groups
+4. Convert each group into a node
+
+State transitions:
+- `:paragraph` accumulates until blank line or different block type
+- `:blockquote` accumulates until blank line
+- `:list-item` and `:ordered-item` accumulate until blank line
+- `:code-start` flips to code-block mode; accumulates until matching
+  fence closer or end of input
+- `:diff-header` starts a diff block; diff lines accumulate until
+  blank line or non-diff line
+
+Edge case: a paragraph followed by a list item should stay as
+separate blocks (not merge). This works because ~parse-paragraph~
+stops at the first line that ~classify-line~ does not tag as
+`:paragraph`, so a list marker ends the paragraph even without a
+blank line in between.
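+
+Step 3 of the algorithm (accumulating lines of the same type) can be
+sketched in isolation. This is a simplified illustration, not how
+~parse-blocks~ is written: the real parser dispatches to per-block
+helpers, while the hypothetical ~example-group-runs~ below only groups
+consecutive `(type . data)` pairs that share a type.
+
+#+BEGIN_SRC lisp :tangle no
+(defun example-group-runs (classified)
+  "Group consecutive (type . data) pairs with the same type.
+Returns a list of (type data...) groups in input order."
+  (let ((groups nil))
+    (dolist (pair classified)
+      (if (and groups (eq (car (first groups)) (car pair)))
+          ;; Same type as the current group: extend it.
+          (push (cdr pair) (cdr (first groups)))
+          ;; Different type: start a new group.
+          (push (list (car pair) (cdr pair)) groups)))
+    ;; Both the group list and each group's data were built in reverse.
+    (mapcar (lambda (group)
+              (cons (car group) (reverse (cdr group))))
+            (nreverse groups))))
+
+(example-group-runs '((:paragraph . "a") (:paragraph . "b")
+                      (:list-item . "x") (:paragraph . "c")))
+;; => ((:PARAGRAPH "a" "b") (:LIST-ITEM "x") (:PARAGRAPH "c"))
+#+END_SRC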
+
+#+BEGIN_SRC lisp :tangle no
+(defun parse-blocks (text)
+  "Parse TEXT (a string) into a list of block-level markdown node plists.
+Returns the nodes in reading order."
+  (let ((lines (split-string-into-lines text))
+        (nodes nil)
+        (i 0))
+    (loop while (< i (length lines))
+          do (let* ((line (string-trim (list #\return) (aref lines i)))
+                    (classification (classify-line line)))
+               (case (car classification)
+                 (:blank (incf i))
+                 (:thematic-break
+                  (push (make-md-node :thematic-break) nodes)
+                  (incf i))
+                 (:paragraph
+                  (multiple-value-bind (node consumed)
+                      (parse-paragraph lines i)
+                    (push node nodes)
+                    (setf i consumed)))
+                 (:heading
+                  (let* ((level-and-content (cdr classification))
+                         (level (car level-and-content))
+                         (content (cdr level-and-content)))
+                    (push (make-md-node :heading
+                                        :properties (list :level level)
+                                        :children (parse-inline content))
+                          nodes)
+                    (incf i)))
+                 (:blockquote
+                  (multiple-value-bind (node consumed)
+                      (parse-blockquote lines i)
+                    (push node nodes)
+                    (setf i consumed)))
+                 (:list-item
+                  ;; parse-list returns a list of item nodes; splice them in
+                  (multiple-value-bind (items consumed)
+                      (parse-list lines i :unordered)
+                    (setf nodes (revappend items nodes))
+                    (setf i consumed)))
+                 (:ordered-item
+                  (multiple-value-bind (items consumed)
+                      (parse-list lines i :ordered)
+                    (setf nodes (revappend items nodes))
+                    (setf i consumed)))
+                 (:code-start
+                  (multiple-value-bind (node consumed)
+                      (parse-code-block lines i (cdr classification))
+                    (push node nodes)
+                    (setf i consumed)))
+                 (:diff-header
+                  (multiple-value-bind (node consumed)
+                      (parse-diff-block lines i)
+                    (push node nodes)
+                    (setf i consumed)))
+                 (t (incf i)))))
+    ;; Return in reading order
+    (nreverse nodes)))
+#+END_SRC
+
+--- per-function: split-string-into-lines
+
+~split-string-into-lines~ is a small hand-written utility, so the
+module does not need ~cl-ppcre~ (which we don't depend on). It splits
+on #\Newline and handles the edge case of trailing newlines (doesn't
+produce an extra empty line at the end).
+
+#+BEGIN_SRC lisp :tangle no
+(defun split-string-into-lines (string)
+  "Split STRING into a vector of lines (no trailing newline).
+Handles \\n, \\r\\n, and trailing newlines properly."
+  (let ((result nil)
+        (start 0))
+    (flet ((add-line (end)
+             (push (subseq string start end) result)))
+      (loop for i from 0 below (length string)
+            do (let ((c (char string i)))
+                 (cond ((char= c #\Newline)
+                        (add-line i)
+                        (setf start (1+ i)))
+                       ((and (char= c #\Return)
+                             (< (1+ i) (length string))
+                             (char= (char string (1+ i)) #\Newline))
+                        (add-line i)
+                        (setf start (+ i 2))
+                        (incf i)))))
+      (when (< start (length string))
+        (add-line (length string)))
+      (coerce (nreverse result) 'vector))))
+#+END_SRC
+
+--- per-function: parse-paragraph
+
+~parse-paragraph~ collects one or more contiguous paragraph lines
+until a blank line or a different block type. It joins them with
+spaces (for hard-wrapped prose) and returns a :paragraph node
+with inline-parsed children.
+
+Continuation lines in paragraphs are joined with a single space
+(not a newline). This is correct for Markdown's soft-wrap
+convention where a newline in source = space in output. To force
+a hard break, CommonMark uses two trailing spaces — we skip that
+for now since it's rare in TUI contexts.
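+
+The soft-wrap join itself is a one-liner. A minimal sketch, with the
+hypothetical name ~example-join-soft-wrapped~, assuming the lines have
+already been collected into a list of strings:
+
+#+BEGIN_SRC lisp :tangle no
+(defun example-join-soft-wrapped (lines)
+  "Join LINES (a list of strings) with single spaces."
+  ;; ~^ stops before emitting the separator after the last element.
+  (format nil "~{~a~^ ~}" lines))
+
+(example-join-soft-wrapped '("The quick brown" "fox jumps"))
+;; => "The quick brown fox jumps"
+#+END_SRC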
+
+#+BEGIN_SRC lisp :tangle no
+(defun parse-paragraph (lines start)
+  "Parse contiguous paragraph lines from LINES starting at START.
+Returns two values: the node and the index of the next unconsumed line."
+  (let ((text-parts nil)
+        (i start))
+    (loop while (< i (length lines))
+          do (let* ((raw-line (aref lines i))
+                    (line (string-trim (list #\return) raw-line))
+                    (class (classify-line line)))
+               (case (car class)
+                 ((:paragraph)
+                  (push (cdr class) text-parts)
+                  (incf i))
+                 (:blank (incf i) (loop-finish))
+                 (t (loop-finish)))))
+    (let ((text (with-output-to-string (s)
+                  (loop for part in (nreverse text-parts)
+                        for first = t then nil
+                        do (unless first (write-char #\Space s))
+                        (princ part s)))))
+      ;; Multiple values, because parse-blocks uses multiple-value-bind
+      (values (make-md-node :paragraph
+                            :children (parse-inline text))
+              i))))
+#+END_SRC
+
+--- per-function: parse-blockquote
+
+~parse-blockquote~ collects contiguous `>` lines, strips the `>`
+prefix, joins them, and wraps in a :blockquote node. Nested
+blockquotes (`> >`) are not supported in this version — a `>` at
+the start of the content is treated as literal text.
+
+#+BEGIN_SRC lisp :tangle no
+(defun parse-blockquote (lines start)
+  "Parse contiguous blockquote lines from LINES starting at START.
+Returns two values: the node and the index of the next unconsumed line."
+  (let ((text-parts nil)
+        (i start))
+    (loop while (< i (length lines))
+          do (let* ((raw-line (aref lines i))
+                    (line (string-trim (list #\return) raw-line))
+                    (class (classify-line line)))
+               (case (car class)
+                 (:blockquote
+                  (push (cdr class) text-parts)
+                  (incf i))
+                 (:blank (incf i) (loop-finish))
+                 (t (loop-finish)))))
+    (let ((text (with-output-to-string (s)
+                  (loop for part in (nreverse text-parts)
+                        for first = t then nil
+                        do (unless first (write-char #\Space s))
+                        (princ part s)))))
+      ;; Multiple values, because parse-blocks uses multiple-value-bind
+      (values (make-md-node :blockquote
+                            :children (parse-inline text))
+              i))))
+#+END_SRC
+
+--- per-function: parse-list
+
+~parse-list~ collects contiguous list items (same type) and returns
+a list of nodes. Each line starting with a list marker becomes one
+list-item node. Nested lists are not supported (lines starting with
+two spaces + marker would be the next level — we skip that for v1).
+
+The TYPE parameter is either `:unordered` or `:ordered` — though
+we return each item labeled by its actual marker type since we
+already classified each line.
+
+#+BEGIN_SRC lisp :tangle no
+(defun parse-list (lines start type)
+  "Parse contiguous list items from LINES starting at START.
+TYPE is :unordered or :ordered.
+Returns two values: a list of :list-item / :ordered-item nodes and the
+index of the next unconsumed line."
+  (declare (ignore type))
+  (let ((items nil)
+        (i start))
+    ;; Collect all contiguous list items into ITEMS
+    (loop while (< i (length lines))
+          do (let* ((raw-line (aref lines i))
+                    (line (string-trim (list #\return) raw-line))
+                    (class (classify-line line)))
+               (case (car class)
+                 ((:list-item :ordered-item)
+                  (push (cons (car class) (cdr class)) items)
+                  (incf i))
+                 (:blank
+                  ;; One blank line between items is OK; two ends the list
+                  (if (and (< (1+ i) (length lines))
+                           (let ((next-class (classify-line
+                                              (string-trim
+                                               (list #\return)
+                                               (aref lines (1+ i))))))
+                             (member (car next-class)
+                                     '(:list-item :ordered-item))))
+                      (progn
+                        (push (cons :blank-sep nil) items)
+                        (incf i))
+                      (progn (incf i) (loop-finish))))
+                 (t (loop-finish)))))
+    ;; Convert each item to a node (:blank-sep entries have no content
+    ;; and are skipped)
+    (let ((nodes nil))
+      (dolist (item (nreverse items))
+        (let ((type (car item))
+              (content (cdr item)))
+          (when (and content (not (string= content "")))
+            (push (make-md-node type
+                                :children (parse-inline content))
+                  nodes))))
+      ;; Multiple values, because parse-blocks uses multiple-value-bind
+      (values (nreverse nodes) i))))
+#+END_SRC
+
+--- per-function: parse-code-block
+
+~parse-code-block~ reads from the line after the opening fence to
+the closing fence (or end of input). It returns a :code-block node
+with the language (or NIL) and the raw text as the :content. No
+inline parsing is done inside code blocks — everything is literal.
+
+Matching fence: if opened with `` ``` ``, close with `` ``` ``.
+If opened with `~~~`, close with `~~~`. CommonMark lets the closing
+fence have at least as many backticks/tildes as the opening fence.
+We use a simpler rule: same character, same count.
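+
+The "same character, same count" rule can be written as a standalone
+predicate. This is a sketch for illustration only: the name
+~example-closing-fence-p~ and its argument list are hypothetical, and
+the actual check lives inline in ~parse-code-block~.
+
+#+BEGIN_SRC lisp :tangle no
+(defun example-closing-fence-p (line fence-char fence-len)
+  "True if LINE is exactly FENCE-LEN copies of FENCE-CHAR, then only blanks."
+  ;; RUN is the length of the leading run of FENCE-CHAR.
+  (let ((run (or (position fence-char line :test #'char/=)
+                 (length line))))
+    (and (= run fence-len)
+         (every (lambda (c) (member c '(#\Space #\Tab)))
+                (subseq line run)))))
+
+(example-closing-fence-p "```  " #\` 3)    ; true: trailing blanks allowed
+(example-closing-fence-p "````" #\` 3)     ; false here; CommonMark would accept it
+(example-closing-fence-p "~~~" #\` 3)      ; false: wrong fence character
+(example-closing-fence-p "```lisp" #\` 3)  ; false: text after the fence
+#+END_SRC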
+
+#+BEGIN_SRC lisp :tangle no
+(defun parse-code-block (lines start lang)
+  "Parse a fenced code block from LINES starting at START.
+LANG is the language string (or empty string) from the opening fence.
+Returns two values: the node and the index of the next unconsumed line."
+  (let ((code-lines nil)
+        (i (1+ start))
+        (fence-char (char (aref lines start) 0))
+        (fence-len (loop for c across (aref lines start)
+                         while (char= c (char (aref lines start) 0))
+                         count c))
+        (found-close nil))
+    (loop while (< i (length lines))
+          do (let* ((raw-line (aref lines i))
+                    (line (string-trim (list #\return) raw-line)))
+               ;; Check for closing fence
+               (when (and (>= (length line) fence-len)
+                          (every (lambda (c) (char= c fence-char))
+                                 (subseq line 0 fence-len))
+                          (or (= (length line) fence-len)
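+                              ;; Only spaces/tabs may follow the fence. A Lisp
+                              ;; string has no tab escape, so name the characters.
+                              (every (lambda (c) (member c '(#\Space #\Tab)))
+                                     (subseq line fence-len))))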
+                 (setf found-close t)
+                 (incf i)
+                 (loop-finish))