diff --git a/docs/examples/document_post_processing.md b/docs/examples/document_post_processing.md
new file mode 100644
index 0000000..d6fd0e0
--- /dev/null
+++ b/docs/examples/document_post_processing.md
@@ -0,0 +1,130 @@
+# Document-Wide Post Processing
+
+An overview of how to tweak and augment the token stream just before rendering.
+
+## Goal
+
+The output document will be surrounded by `<section>` tags.
+Second-level headings (`h2`) will also trigger section breaks (i.e. `</section><section>`) immediately preceding the heading.
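+
+For example, a document like this:
+
+```markdown
+# Title
+
+## Part One
+
+## Part Two
+```
+
+should render roughly like so (whitespace loosened for readability):
+
+```html
+<section>
+<h1>Title</h1>
+</section>
+<section>
+<h2>Part One</h2>
+</section>
+<section>
+<h2>Part Two</h2>
+</section>
+```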
+
+## Core Rules
+
+The top-level rule pipeline that turns raw Markdown into a token array consists of **core rules**.
+The *block* and *inline* rule pipelines each run within a "wrapper" rule in the core pipeline.
+The wrapper rules appear relatively early in the [core pipeline](https://github.com/markdown-it/markdown-it/blob/0fe7ccb4b7f30236fb05f623be6924961d296d3d/lib/parser_core.mjs#L19).
+
+```javascript
+const _rules = [
+  ['normalize', r_normalize],
+  ['block', r_block],
+  ['inline', r_inline],
+  ['linkify', r_linkify],
+  ['replacements', r_replacements],
+  ['smartquotes', r_smartquotes],
+  ['text_join', r_text_join]
+]
+```
+
+Core rules typically do *not* scan through the source text or interpret Markdown syntax.
+Rather, they usually modify or augment the token stream after an initial pass over the Markdown is complete.
+
+> [!NOTE]
+> The `normalize` rule is an exception.
+> It modifies the raw Markdown (`state.src`),
+> *normalizing* (as the name implies) idiosyncrasies like platform-specific newlines and null characters.
+
+Core rules can do much more,
+but "post-processing" tasks are the most common use case.
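+
+To make "augment the token stream" concrete, here is a toy core rule (a hypothetical example, unrelated to the plugin built below) that upper-cases every text token after parsing is complete:
+
+```typescript
+// Type import path may vary with the markdown-it version in use
+import type StateCore from "markdown-it/lib/rules_core/state_core.mjs"
+
+// Never touches state.src: by this point, parsing is already done
+function shout(state: StateCore) {
+  for (const blockToken of state.tokens) {
+    // Inline text lives in the children of "inline" tokens
+    for (const child of blockToken.children ?? []) {
+      if (child.type === "text") {
+        child.content = child.content.toUpperCase()
+      }
+    }
+  }
+}
+```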
+
+## Entry Point
+
+The new rule will be called `sectionize`.
+The plugin entry point will look like the following:
+
+```typescript
+import type MarkdownIt from "markdown-it"
+import type StateCore from "markdown-it/lib/rules_core/state_core.mjs"
+
+export default function sectionize_plugin(md: MarkdownIt) {
+  md.core.ruler.push("sectionize", sectionize)
+}
+
+function sectionize(state: StateCore) {
+  return
+}
+```
+
+The new rule is pushed to the very end of the core pipeline.
+While there are valid reasons to insert plugin rules elsewhere in the pipeline,
+pushing to the end is a good default choice.
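+
+For contrast, a rule can also be anchored at a specific spot with `ruler.after` (a hypothetical placement, not something this plugin needs). Inside `sectionize_plugin`, that would look like:
+
+```typescript
+// Run sectionize immediately after the inline wrapper rule,
+// rather than at the very end of the core pipeline
+md.core.ruler.after("inline", "sectionize", sectionize)
+```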
+
+> [!IMPORTANT]
+> When in doubt, always put plugin rules at the end of the pipeline.
+> This strategy minimizes the risk of breaking other rules' assumptions about state.
+
+In this case specifically, surrounding the document with `<section>` tags will **increase the nesting level** of every other token in the document.
+Certain rules might iterate over the token stream and keep a running-total nesting level,
+making assumptions about nesting level zero (for example).
+Placing the new rule at the very end keeps it from affecting those other rules.
+
+## Section Insertion Logic
+
+Because we will be inserting tokens into the token array,
+we will iterate *backwards* over the existing array so that our index pointer isn't affected by the insertions.
+
+```typescript
+function sectionize(state: StateCore) {
+  // Iterate backwards since we're splicing elements into the array
+  for (let i = state.tokens.length - 1; i >= 0; i--) {
+    const token = state.tokens[i]
+
+    if (token.type === "heading_open" && token.tag === "h2") {
+      // Close the running section and open a new one just before the heading
+      const { open, close } = getSectionPair(state)
+      state.tokens.splice(i, 0, close, open)
+    }
+  }
+
+  // ...The plugin isn't quite done yet
+}
+
+function getSectionPair(state: StateCore) {
+  const open = new state.Token("section_open", "section", 1)
+  open.block = true
+  const close = new state.Token("section_close", "section", -1)
+  close.block = true
+
+  return { open, close }
+}
+```
+
+At this point, the tokens array now has a `</section><section>` pair immediately preceding each `<h2>`.
+However, the document itself is not yet wrapped in an overarching section.
+
+There are two cases to consider:
+
+- The document originally started with an `h2`, so it now starts with `</section><section>`
+- The document did not start with an `h2`
+
+Both cases are addressed with just a few lines of code:
+
+```typescript
+function sectionize(state: StateCore) {
+  // ...iteration logic from above
+
+  if (state.tokens[0].type === "section_close") {
+    // The stray close at the front becomes the overarching section's
+    // close at the back; the open that followed it is now first
+    state.tokens.push(state.tokens.shift()!)
+  } else {
+    const { open, close } = getSectionPair(state)
+    state.tokens.unshift(open)
+    state.tokens.push(close)
+  }
+}
+```
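+
+With that, the rule is complete. As a sanity check, here is a hypothetical usage sketch (exact output whitespace may differ):
+
+```typescript
+import MarkdownIt from "markdown-it"
+
+const md = new MarkdownIt().use(sectionize_plugin)
+
+console.log(md.render("# Title\n\nIntro.\n\n## Part One\n\nBody."))
+// Roughly:
+// <section>
+// <h1>Title</h1>
+// <p>Intro.</p>
+// </section>
+// <section>
+// <h2>Part One</h2>
+// <p>Body.</p>
+// </section>
+```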
+
+## Conclusion
+
+That's it: simple augmentation tasks like sectionization are straightforward to implement with core rule plugins.
+No traversal of `state.src` is required,
+because this rule runs *after* all of the block and inline rule sets.
+
+With careful rule positioning (defaulting to the end of the pipeline when in doubt),
+post-processing rules are some of the simplest to write.
diff --git a/docs/examples/text_decoration.md b/docs/examples/text_decoration.md
new file mode 100644
index 0000000..02f0dd6
--- /dev/null
+++ b/docs/examples/text_decoration.md
@@ -0,0 +1,288 @@
+# Adding New Text Decorators
+
+A step-by-step example of adding a new text decoration style via **inline rules**.
+
+## Goal
+
+Text surrounded by double-carets (e.g. `^^like this^^`) will be given the `<small>` tag in the output HTML.
+
+## Inline Rules
+
+Markdown-It processes inline sequences of text in **two** passes, each with its own list of rules:
+
+- Tokenization
+- Post Processing
+
+The Tokenization phase is responsible for **identifying** inline markers, like `**` (bold/strong text) or `^^` (our new "small text" delimiter).
+It is unaware of marker nesting or whether markers form matched pairs.
+
+The Post Processing phase handles **matching** pairs of tokens.
+This phase holds a lot of hidden complexity.
+Base Markdown supports a single asterisk for italics/emphasis, a double asterisk for bold/strong text, and a triple asterisk for both styles combined.
+Even if a new plugin isn't implementing such a nuanced delimiter, an awareness of the complexity helps the developer inject code in the proper locations.
+
+> [!IMPORTANT]
+> Every matched-pair inline marker should provide **both** a tokenization and a post-processing rule.
+
+## Entry Point
+
+The new rule will be named `smalltext`.
+The plugin entry point will look like the following:
+
+```typescript
+import type MarkdownIt from "markdown-it"
+// Type import paths may vary with the markdown-it version in use
+import type StateInline from "markdown-it/lib/rules_inline/state_inline.mjs"
+
+export default function smalltext_plugin(md: MarkdownIt) {
+  md.inline.ruler.after("emphasis", "smalltext", smalltext_tokenize)
+  md.inline.ruler2.after("emphasis", "smalltext", smalltext_postProcess)
+}
+
+function smalltext_tokenize(state: StateInline, silent: boolean) {
+  return false
+}
+
+function smalltext_postProcess(state: StateInline) {
+  return false
+}
+```
+
+Note the use of `ruler2` to register the post-processing step.
+This pattern is unique to matched-pair inline marker rules:
+it isn't seen anywhere else in the library (e.g. for block or core rules).
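+
+Even with both rules still stubbed out, the plugin can already be wired up (a hypothetical usage sketch):
+
+```typescript
+import MarkdownIt from "markdown-it"
+
+const md = new MarkdownIt().use(smalltext_plugin)
+
+console.log(md.renderInline("^^like this^^"))
+// Prints "^^like this^^" unchanged for now; once both rules below are
+// implemented, it should print: <small>like this</small>
+```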
+
+## Tokenization
+
+All that needs to happen here is identifying the string `^^`,
+adding a Token to `state.tokens`,
+and adding a Delimiter to `state.delimiters`.
+
+> [!TIP]
+> A `delimiter` points to a token and provides extra information:
+>
+> - whether that token is a valid choice for opening or closing styled text
+> - a pointer to the matching end token
+> - information about how many characters long the token is (useful for disambiguating italics and bold)
+>
+> Most of this information is used in the `balance_pairs` post-processing rule.
+> So long as the `delimiters` array is constructed well in the tokenization phase,
+> the developer doesn't need to worry about the complexity within `balance_pairs`.
+
+```typescript
+function smalltext_tokenize(state: StateInline, silent: boolean) {
+  const start = state.pos
+  const marker = state.src.charCodeAt(start)
+
+  // Never match during a silent (validation-only) scan
+  if (silent) {
+    return false
+  }
+
+  if (marker !== 0x5e /* ^ */) {
+    return false
+  }
+
+  const scanned = state.scanDelims(state.pos, true)
+  let len = scanned.length
+  const ch = String.fromCharCode(marker)
+
+  if (len < 2) {
+    return false
+  }
+
+  let token
+
+  if (len % 2) {
+    // Odd run of carets: emit the first one as plain text
+    token = state.push("text", "", 0)
+    token.content = ch
+    len--
+  }
+
+  for (let i = 0; i < len; i += 2) {
+    token = state.push("text", "", 0)
+    token.content = ch + ch
+
+    state.delimiters.push({
+      marker,
+      length: 0, // disable "rule of 3" length checks meant for emphasis
+      token: state.tokens.length - 1,
+      end: -1, // This pointer is filled in by the core balance_pairs post-processing rule
+      open: scanned.can_open,
+      close: scanned.can_close
+    })
+  }
+
+  state.pos += scanned.length
+
+  return true
+}
+```
+
+Note the `scanDelims` call.
+It determines whether a given sequence of characters (`^` in this case) can start or end an inline styling sequence.
+
+A single caret will have no meaning in this plugin,
+so much of the complexity in this rule is removed:
+
+- For an odd-numbered run of carets, the first caret is added as plain text
+- The `length` property of the delimiters is always set to zero, skipping unnecessary logic in the `balance_pairs` rule
+
+Note also that **no matching was attempted in the tokenization phase**.
+The `end` property is always set to `-1`.
+The `balance_pairs` rule does all the heavy lifting later on, behind the scenes.
+
+## Post Processing
+
+### Grunt Work
+
+The main logic of this rule will go into a utility function called `postProcess`.
+The top-level rule function gets a confusing bit of grunt work:
+
+```typescript
+function smalltext_postProcess(state: StateInline) {
+  const tokens_meta = state.tokens_meta
+  const max = tokens_meta.length
+
+  // Handle the top-level delimiters...
+  postProcess(state, state.delimiters)
+
+  // ...then each nested delimiters array (see the tip below)
+  for (let curr = 0; curr < max; curr++) {
+    const meta = tokens_meta[curr]
+
+    if (meta?.delimiters) {
+      postProcess(state, meta.delimiters)
+    }
+  }
+
+  // post-process return values are unused
+  return false
+}
+
+function postProcess(state: StateInline, delimiters: StateInline.Delimiter[]) {
+  return
+}
+```
+
+> [!TIP]
+> What is `tokens_meta`?
+>
+> Every time a token with a positive `nesting` value is pushed to the inline state's tokens (i.e. an opening tag),
+> the inline state does the following:
+>
+> - throws the current `delimiters` array onto a stack
+> - creates a new, empty `delimiters` array, exposing it as `state.delimiters`
+> - creates a `token_meta` object holding the new `delimiters` array
+> - stores that `token_meta` object in `state.tokens_meta`, an array that runs parallel to `state.tokens`
+>
+> The intrepid reader will notice that in the tokenization rule, **the created delimiters were likely being pushed to different arrays** throughout execution.
+>
+> Now, in post-processing, each `delimiters` array only holds delimiters at matching nesting levels.
+>
+> If the details of this implementation are of interest, check out [the source](https://github.com/markdown-it/markdown-it/blob/0fe7ccb4b7f30236fb05f623be6924961d296d3d/lib/rules_inline/state_inline.mjs#L60).
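+
+For a concrete case where those nested arrays matter, consider a marker pair inside a link label (a hypothetical check, assuming the finished plugin and the `md` instance from earlier):
+
+```typescript
+md.renderInline("[a ^^small^^ link](https://example.com)")
+// The ^^ delimiters are pushed while the parser is nested inside link_open,
+// so they land in a tokens_meta delimiters array rather than the top-level
+// state.delimiters. Expected output:
+// <a href="https://example.com">a <small>small</small> link</a>
+```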
+
+### Main Logic
+
+As previously mentioned, `balance_pairs` took care of building out and cleaning up the delimiter data.
+This post-processing rule will mainly read the data and add tokens as appropriate:
+
+```typescript
+function postProcess(state: StateInline, delimiters: StateInline.Delimiter[]) {
+  let token
+  const loneMarkers: number[] = []
+  const max = delimiters.length
+
+  for (let i = 0; i < max; i++) {
+    const startDelim = delimiters[i]
+
+    if (startDelim.marker !== 0x5e /* ^ */) {
+      continue
+    }
+
+    // balance_pairs wrote the appropriate `end` pointer value here.
+    // If it's still -1, there was a balancing problem,
+    // and the delimiter can be ignored.
+    if (startDelim.end === -1) {
+      continue
+    }
+
+    const endDelim = delimiters[startDelim.end]
+
+    token = state.tokens[startDelim.token]
+    token.type = "smalltext_open"
+    token.tag = "small"
+    token.nesting = 1
+    token.markup = "^^"
+    token.content = ""
+
+    token = state.tokens[endDelim.token]
+    token.type = "smalltext_close"
+    token.tag = "small"
+    token.nesting = -1
+    token.markup = "^^"
+    token.content = ""
+
+    if (
+      state.tokens[endDelim.token - 1].type === "text" &&
+      state.tokens[endDelim.token - 1].content === "^"
+    ) {
+      loneMarkers.push(endDelim.token - 1)
+    }
+  }
+
+  // If a marker sequence has an odd number of characters, it is split
+  // like this: `^^^^^` -> `^` + `^^` + `^^`, leaving one marker at the
+  // start of the sequence.
+  //
+  // So, we have to move all those markers after subsequent closing tags.
+  //
+  while (loneMarkers.length) {
+    const i = loneMarkers.pop()!
+    let j = i + 1
+
+    while (j < state.tokens.length && state.tokens[j].type === "smalltext_close") {
+      j++
+    }
+
+    j--
+
+    if (i !== j) {
+      token = state.tokens[j]
+      state.tokens[j] = state.tokens[i]
+      state.tokens[i] = token
+    }
+  }
+}
+```
+
+The lone-marker handling is a point of interest.
+While a five- or seven-character run of carets is unlikely,
+it could still be matched with a different run of carets elsewhere in the line of text.
+Due to how tokenization runs,
+both the opening **and** closing sequences are split, leaving the lone caret at the start:
+
+```
+^^^^^^^hey this text would actually be small^^^^^^^
+
+gets parsed somewhat like this:
+
+^ ^^ ^^ ^^ hey this text would actually be small ^ ^^ ^^ ^^
+| |  |                                           | |  |
+| |  opening tag                                 | |  open and close
+| open and close                                 | balanced closing tag
+lone caret                                       lone caret
+```
+
+Because the very first caret in the opening sequence is *not* placed within the `<small>` tags,
+neither should the first caret in the closing sequence be.
+The end of the post-processing rule handles that edge case.
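+
+To see the edge case in action (again a hypothetical check, assuming the finished plugin):
+
+```typescript
+md.renderInline("^^^^^hey^^^^^")
+// Five carets split into ^ + ^^ + ^^ on each side, and both lone carets
+// end up outside the tags: ^<small><small>hey</small></small>^
+```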
+
+## Conclusion
+
+That's everything!
+
+This rule is almost a verbatim copy of the [strikethrough rule](https://github.com/markdown-it/markdown-it/blob/0fe7ccb4b7f30236fb05f623be6924961d296d3d/lib/rules_inline/strikethrough.mjs) in the core library.
+If a full-on emphasis-style rule is desired, the [source code](https://github.com/markdown-it/markdown-it/blob/0fe7ccb4b7f30236fb05f623be6924961d296d3d/lib/rules_inline/emphasis.mjs) isn't much longer,
+thanks in large part to the heavy lifting that [balance_pairs](https://github.com/markdown-it/markdown-it/blob/0fe7ccb4b7f30236fb05f623be6924961d296d3d/lib/rules_inline/balance_pairs.mjs) accomplishes.
+
+> [!CAUTION]
+>
+> If the plugin being developed is a "standalone" inline element without an open/close pair
+> (think of links `[text](url)` or images `![alt text](source "title")`),
+> **the post-processing infrastructure can be safely ignored**!
+> Markdown parsing is complicated enough.
+> Please don't introduce any unnecessary complexity!