diff --git a/docs/examples/text_decoration.md b/docs/examples/text_decoration.md
new file mode 100644
index 0000000..02f0dd6
--- /dev/null
+++ b/docs/examples/text_decoration.md
@@ -0,0 +1,288 @@
+# Adding New Text Decorators
+
+A step-by-step example of adding a new text decoration style via **inline rules**.
+
+## Goal
+
+Text surrounded by double carets (e.g. `^^like this^^`) will be given the `<small>` tag in the output HTML.
+
+## Inline Rules
+
+Markdown-It processes inline sequences of text in **two** passes, each with its own list of rules:
+
+- Tokenization
+- Post Processing
+
+The Tokenization phase is responsible for **identifying** inline markers, like `**` (bold/strong text) or `^^` (our new "small text" delimiter).
+It is unaware of marker nesting, or whether markers form matched pairs.
+
+The Post Processing phase handles **matching** pairs of tokens.
+This phase holds a lot of hidden complexity.
+Base Markdown supports a single asterisk for italics/emphasis, double asterisk for bold/strong text, and triple asterisk for both styles combined.
+Even if a new plugin isn't implementing such a nuanced delimiter, an awareness of the complexity helps the developer inject code in the proper locations.
+
+> [!IMPORTANT]
+> Every matched-pair inline marker should provide **both** a tokenization and post-processing rule.
+
+## Entry Point
+
+The new rule will be named `smalltext`.
+The plugin entry point will look like the following:
+
+```typescript
+export default function smalltext_plugin(md: MarkdownIt) {
+  md.inline.ruler.after("emphasis", "smalltext", smalltext_tokenize)
+  md.inline.ruler2.after("emphasis", "smalltext", smalltext_postProcess)
+}
+
+function smalltext_tokenize(state: StateInline, silent: boolean) {
+  return false
+}
+
+function smalltext_postProcess(state: StateInline) {
+  return false
+}
+```
+
+Note the use of `ruler2` to register the post-processing step.
+This pattern is unique to matched-pair inline marker rules:
+it isn't seen anywhere else in the library (e.g. for block or core rules).
+
+## Tokenization
+
+All that needs to happen here is identifying the string `^^`,
+adding a Token to `state.tokens`,
+and adding a Delimiter to `state.delimiters`.
+
+> [!TIP]
+> A `delimiter` points to a token and provides extra information:
+>
+> - whether that token is a valid choice for opening or closing styled text
+> - a pointer to the matching end token
+> - information about how many characters long the token is (useful for disambiguating italics and bold)
+>
+> Most of this information is used in the `balance_pairs` post-processing rule.
+> So long as the `delimiters` array is constructed well in the tokenization phase,
+> the developer doesn't need to worry about the complexity within `balance_pairs`.
+
+```typescript
+function smalltext_tokenize(state: StateInline, silent: boolean) {
+  const start = state.pos
+  const marker = state.src.charCodeAt(start)
+
+  if (silent) {
+    return false
+  }
+
+  if (marker !== 0x5e /* ^ */) {
+    return false
+  }
+
+  const scanned = state.scanDelims(state.pos, true)
+  let len = scanned.length
+  const ch = String.fromCharCode(marker)
+
+  if (len < 2) {
+    return false
+  }
+
+  let token
+
+  if (len % 2) {
+    token = state.push("text", "", 0)
+    token.content = ch
+    len--
+  }
+
+  for (let i = 0; i < len; i += 2) {
+    token = state.push("text", "", 0)
+    token.content = ch + ch
+
+    state.delimiters.push({
+      marker,
+      length: 0, // disable "rule of 3" length checks meant for emphasis
+      token: state.tokens.length - 1,
+      end: -1, // This pointer is filled in by the core balance_pairs post-processing rule
+      open: scanned.can_open,
+      close: scanned.can_close,
+      jump: 0
+    })
+  }
+
+  state.pos += scanned.length
+
+  return true
+}
+```
+
+Note the `scanDelims` call.
+It determines whether a given sequence of characters (`^` in this case) can start or end an inline styling sequence.
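The way the rule above divides a run of carets can be sketched in isolation. The helper below is hypothetical (`splitCaretRun` is not part of markdown-it) and only mirrors the length arithmetic of `smalltext_tokenize`; it assumes the run length has already been measured and it creates no tokens or delimiters.

```typescript
// Hypothetical helper for illustration only; not a markdown-it API.
// Returns the text-token contents the tokenizer would emit for a run of
// `len` consecutive carets.
function splitCaretRun(len: number): string[] {
  // Runs shorter than two carets are rejected (the rule returns false).
  if (len < 2) {
    return []
  }

  const pieces: string[] = []
  let remaining = len

  // An odd-length run gives up its first caret as plain text.
  if (remaining % 2) {
    pieces.push("^")
    remaining--
  }

  // Each remaining pair becomes one "^^" token; in the real rule every such
  // token also gets its own entry in `state.delimiters`.
  for (let i = 0; i < remaining; i += 2) {
    pieces.push("^^")
  }

  return pieces
}
```

For instance, `splitCaretRun(5)` yields `["^", "^^", "^^"]`, while `splitCaretRun(4)` yields `["^^", "^^"]` with no leftover caret.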
+
+A single caret will have no meaning in this plugin,
+so much of the complexity in this rule is removed:
+
+- For an odd-numbered length of carets, the first caret is added as plain text
+- The `length` property of the delimiters is always set to zero, skipping unnecessary logic in the `balance_pairs` rule
+
+Note also that **no matching was attempted in the tokenization phase**.
+The `end` property is always set to `-1`.
+The `balance_pairs` rule does all the heavy lifting later on, behind the scenes.
+
+## Post Processing
+
+### Grunt Work
+
+The main logic of this rule will go into a utility function, called `postProcess`.
+The top-level rule function gets a confusing bit of grunt work:
+
+```typescript
+function smalltext_postProcess(state: StateInline) {
+  const tokens_meta = state.tokens_meta
+  const max = state.tokens_meta.length
+
+  postProcess(state, state.delimiters)
+
+  for (let curr = 0; curr < max; curr++) {
+    if (tokens_meta[curr]?.delimiters) {
+      postProcess(state, tokens_meta[curr]?.delimiters || [])
+    }
+  }
+
+  // post-process return value is unused
+  return false
+}
+
+function postProcess(state: StateInline, delimiters: StateInline.Delimiter[]) {
+  return
+}
+```
+
+> [!TIP]
+> What is `tokens_meta`?
+>
+> Every time a token with a positive `nesting` value is pushed to the inline state's tokens (i.e. an opening tag),
+> the inline state does the following:
+>
+> - throws the current `delimiters` array onto a stack
+> - creates a new, empty `delimiters` array, exposing it as `state.delimiters`
+> - gives the open-tag token a `token_meta` object with the new `delimiters` array
+> - *also* stores the `token_meta` object in `state.tokens_meta`
+>
+> The intrepid reader will notice that in the tokenization rule, **the created delimiters were likely being pushed to different arrays** throughout execution.
+>
+> Now, in post-processing, each `delimiters` array will only hold delimiters at matching nesting levels.
+>
+> If the details of this implementation are of interest, check out [the source](https://github.com/markdown-it/markdown-it/blob/0fe7ccb4b7f30236fb05f623be6924961d296d3d/lib/rules_inline/state_inline.mjs#L60).
+
+### Main Logic
+
+As previously mentioned, `balance_pairs` took care of building out and cleaning up the delimiter data.
+This post-processing rule will mainly read that data and rewrite the existing tokens as appropriate:
+
+```typescript
+function postProcess(state: StateInline, delimiters: StateInline.Delimiter[]) {
+  let token
+  const loneMarkers: number[] = []
+  const max = delimiters.length
+
+  for (let i = 0; i < max; i++) {
+    const startDelim = delimiters[i]
+
+    if (startDelim.marker !== 0x5e /* ^ */) {
+      continue
+    }
+
+    // balance_pairs wrote the appropriate `end` pointer value here.
+    // If it's still -1, there was a balancing problem,
+    // and the delimiter can be ignored.
+    if (startDelim.end === -1) {
+      continue
+    }
+
+    const endDelim = delimiters[startDelim.end]
+
+    token = state.tokens[startDelim.token]
+    token.type = "smalltext_open"
+    token.tag = "small"
+    token.nesting = 1
+    token.markup = "^^"
+    token.content = ""
+
+    token = state.tokens[endDelim.token]
+    token.type = "smalltext_close"
+    token.tag = "small"
+    token.nesting = -1
+    token.markup = "^^"
+    token.content = ""
+
+    if (
+      state.tokens[endDelim.token - 1].type === "text" &&
+      state.tokens[endDelim.token - 1].content === "^"
+    ) {
+      loneMarkers.push(endDelim.token - 1)
+    }
+  }
+
+  // If a marker sequence has an odd number of characters, it is split
+  // like this: `^^^^^` -> `^` + `^^` + `^^`, leaving one marker at the
+  // start of the sequence.
+  //
+  // So, we have to move all those markers after subsequent closing tags.
+  //
+  while (loneMarkers.length) {
+    const i = loneMarkers.pop() || 0
+    let j = i + 1
+
+    while (j < state.tokens.length && state.tokens[j].type === "smalltext_close") {
+      j++
+    }
+
+    j--
+
+    if (i !== j) {
+      token = state.tokens[j]
+      state.tokens[j] = state.tokens[i]
+      state.tokens[i] = token
+    }
+  }
+}
+```
+
+The lone-marker handling is a point of interest.
+While a five- or seven-character sequence of carets is unlikely,
+it could still be matched with a different string of carets elsewhere in the line of text.
+Due to how tokenization runs,
+both the opening **and** closing sequences are split, leaving the lone caret at the start:
+
+```
+^^^^^^^hey this text would actually be small^^^^^^^
+
+gets parsed somewhat like this:
+
+^ ^^ ^^ ^^ hey this text would actually be small ^ ^^ ^^ ^^
+|   |   |                                        | |    |
+|   |   opening tag                              | |    open and close
+|   open and close                               | balanced closing tag
+lone caret                                       lone caret
+```
+
+Because the very first caret in the opening sequence is *not* placed within the `<small>` tags,
+neither should the first caret in the closing sequence.
+The end of the post-processing rule handles that edge case.
+
+## Conclusion
+
+That's everything!
+
+This rule is almost a verbatim copy of the [strikethrough rule](https://github.com/markdown-it/markdown-it/blob/0fe7ccb4b7f30236fb05f623be6924961d296d3d/lib/rules_inline/strikethrough.mjs) in the core library.
+If a full-on emphasis-style rule is desired, the [source code](https://github.com/markdown-it/markdown-it/blob/0fe7ccb4b7f30236fb05f623be6924961d296d3d/lib/rules_inline/emphasis.mjs) isn't much longer,
+thanks in large part to the heavy lifting that [balance_pairs](https://github.com/markdown-it/markdown-it/blob/0fe7ccb4b7f30236fb05f623be6924961d296d3d/lib/rules_inline/balance_pairs.mjs) accomplishes.
+
+> [!CAUTION]
+>
+> If the plugin being developed is a "standalone" inline element without an open/close pair
+> (think about links `[text](url)` or images `![alt text](source "title")`),
+> **the post-processing infrastructure can be safely ignored**!
+> Markdown parsing is complicated enough.
+> Please don't introduce any unnecessary complexity!
+