
Merge 6eb127d6db into 0fe7ccb4b7

pull/1092/merge
Neill Robson 4 months ago
committed by GitHub
commit 2d2f085b53
  1. docs/examples/document_post_processing.md (+130)
  2. docs/examples/text_decoration.md (+288)

docs/examples/document_post_processing.md

@@ -0,0 +1,130 @@
# Document-Wide Post Processing
An overview of how to tweak and augment the token stream just before rendering.
## Goal
The output document will be surrounded by `<section>` tags. Second-level headings (`h2`) will also trigger section breaks (i.e. `</section><section>`) immediately preceding the heading.
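For example, a document consisting of `# Title`, an introductory paragraph, and an `## Overview` section would render to roughly the following HTML (a hand-written sketch; exact whitespace may differ):
```html
<section><h1>Title</h1>
<p>Introductory paragraph.</p>
</section><section><h2>Overview</h2>
<p>Overview content.</p>
</section>
```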
## Core Rules
The top-level rule pipeline turning raw Markdown into a token array consists of **core rules**.
The *block* and *inline* rule pipelines are run within a single "wrapper" rule in the core pipeline.
The wrapper rules appear relatively early in the [core pipeline](https://github.com/markdown-it/markdown-it/blob/0fe7ccb4b7f30236fb05f623be6924961d296d3d/lib/parser_core.mjs#L19).
```javascript
const _rules = [
  ['normalize', r_normalize],
  ['block', r_block],
  ['inline', r_inline],
  ['linkify', r_linkify],
  ['replacements', r_replacements],
  ['smartquotes', r_smartquotes],
  ['text_join', r_text_join]
]
```
Core rules typically do *not* scan through the source text or interpret Markdown syntax.
Rather, they usually modify or augment the token stream after an initial pass over the Markdown is complete.
> [!NOTE]
> The `normalize` rule is an exception.
> It modifies the raw markdown (`state.src`),
> *normalizing* (as the name implies) idiosyncrasies like platform-specific newlines and null characters.
Core rules can do much more,
but "post-processing" tasks are the most common use case.
## Entry Point
The new rule will be called `sectionize`.
The plugin entry point will look like the following:
```typescript
export default function sectionize_plugin(md: MarkdownIt) {
  md.core.ruler.push("sectionize", sectionize)
}

function sectionize(state: StateCore) {
  return
}
```
The new rule is pushed to the very end of the core pipeline.
While there are valid reasons to insert plugin rules elsewhere in the pipeline,
pushing to the end is a good default choice.
> [!IMPORTANT]
> When in doubt, always put plugin rules at the end of the pipeline.
> This strategy minimizes the potential of breaking other rules' assumptions about state.
In this case specifically, surrounding the document with `<section>` tags will **increase the nesting level** of every other token in the document.
Certain rules might iterate over the token stream and keep a running-total nesting level,
making assumptions about nesting level zero (for example).
Placing the new rule at the very end keeps it from affecting those other rules.
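If a different position were ever required, the same ruler API also offers `before` and `after`.
As a purely illustrative alternative, the registration inside the entry point could anchor the rule right after the `inline` wrapper rule instead:
```typescript
export default function sectionize_plugin(md: MarkdownIt) {
  // Illustrative alternative: sectionize would then run before
  // linkify, replacements, smartquotes, and text_join.
  md.core.ruler.after("inline", "sectionize", sectionize)
}
```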
## Section Insertion Logic
Because we will be inserting tokens into the token array,
we will iterate *backwards* over the existing array so that our index pointer isn't affected by the insertions.
```typescript
function sectionize(state: StateCore) {
  // Iterate backwards since we're splicing elements into the array
  for (let i = state.tokens.length - 1; i >= 0; i--) {
    const token = state.tokens[i]
    if (token.type === "heading_open" && token.tag === "h2") {
      const { open, close } = getSectionPair(state)
      state.tokens.splice(i, 0, close, open)
    }
  }
  // ...The plugin isn't quite done yet
}

function getSectionPair(state: StateCore) {
  const open = new state.Token("section_open", "section", 1)
  open.block = true
  const close = new state.Token("section_close", "section", -1)
  close.block = true
  return { open, close }
}
```
At this point, the token array has a `</section><section>` pair immediately preceding each `<h2>`.
However, the document itself is not yet wrapped in an overarching section.
There are two cases to consider:
- The document originally started with an `h2`, so it now starts with `</section>`
- The document did not start with an `h2`
Both cases are addressed with just a few lines of code:
```typescript
function sectionize(state: StateCore) {
  // ...iteration logic from above

  // Optional chaining guards against an empty document
  if (state.tokens[0]?.type === "section_close") {
    state.tokens.push(state.tokens.shift()!)
  } else {
    const { open, close } = getSectionPair(state)
    state.tokens.unshift(open)
    state.tokens.push(close)
  }
}
```
## Conclusion
That's it: simple augmentation tasks like sectionization are straightforward to implement with core rule plugins.
No traversal of `state.src` is required,
because this rule is running *after* all of the block and inline rule sets.
With a careful selection of rule positioning (defaulting to the end of the pipeline when in doubt),
post-processing rules are some of the simplest to write.
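To see the finished plugin in action, here is a quick usage sketch (the import path is illustrative):
```typescript
import MarkdownIt from "markdown-it"
// Hypothetical local module containing the plugin code above
import sectionize_plugin from "./sectionize"

const md = new MarkdownIt().use(sectionize_plugin)

// section_open/section_close have no dedicated renderer rule,
// so the default renderer emits the <section> tags from each token's tag.
console.log(md.render("# Title\n\nIntro.\n\n## First topic\n\nDetails."))
```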

docs/examples/text_decoration.md

@@ -0,0 +1,288 @@
# Adding New Text Decorators
A step-by-step example of adding a new text decoration style via **inline rules**.
## Goal
Text surrounded by double-carets (e.g. `^^like this^^`) will be given the `<small>` tag in the output HTML.
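For example, `Read the ^^fine print^^ carefully.` should render to roughly the following (a hand-written sketch):
```html
<p>Read the <small>fine print</small> carefully.</p>
```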
## Inline Rules
Markdown-It processes inline sequences of text in **two** passes, each with its own list of rules:
- Tokenization
- Post Processing
The Tokenization phase is responsible for **identifying** inline markers, like `**` (bold/strong text) or `^^` (our new "small text" delimiter).
It is unaware of marker nesting, or whether markers form matched pairs.
The Post Processing phase handles **matching** pairs of tokens.
This phase holds a lot of hidden complexity.
Base Markdown supports a single asterisk for italics/emphasis, double asterisk for bold/strong text, and triple asterisk for both styles combined.
Even if a new plugin isn't implementing such a nuanced delimiter, an awareness of the complexity helps the developer inject code in the proper locations.
> [!IMPORTANT]
> Every matched-pair inline marker should provide **both** a tokenization and post-processing rule.
## Entry Point
The new rule will be named `smalltext`.
The plugin entry point will look like the following:
```typescript
export default function smalltext_plugin(md: MarkdownIt) {
  md.inline.ruler.after("emphasis", "smalltext", smalltext_tokenize)
  md.inline.ruler2.after("emphasis", "smalltext", smalltext_postProcess)
}

function smalltext_tokenize(state: StateInline, silent: boolean) {
  return false
}

function smalltext_postProcess(state: StateInline) {
  return false
}
```
Note the use of `ruler2` to register the post-processing step.
This pattern is unique to matched-pair inline marker rules:
it isn't seen anywhere else in the library (e.g. for block or core rules).
## Tokenization
All that needs to happen here is identifying the string `^^`,
adding a Token to `state.tokens`,
and adding a Delimiter to `state.delimiters`.
> [!TIP]
> A `delimiter` points to a token and provides extra information:
>
> - whether that token is a valid choice for opening or closing styled text
> - a pointer to the matching end token
> - information about how many characters long the token is (useful for disambiguating italics and bold)
>
> Most of this information is used in the `balance_pairs` post-processing rule.
> So long as the `delimiters` array is constructed well in the tokenization phase,
> the developer doesn't need to worry about the complexity within `balance_pairs`.
```typescript
function smalltext_tokenize(state: StateInline, silent: boolean) {
  const start = state.pos
  const marker = state.src.charCodeAt(start)

  if (silent) {
    return false
  }

  if (marker !== 0x5e /* ^ */) {
    return false
  }

  const scanned = state.scanDelims(state.pos, true)
  let len = scanned.length
  const ch = String.fromCharCode(marker)

  if (len < 2) {
    return false
  }

  let token

  if (len % 2) {
    token = state.push("text", "", 0)
    token.content = ch
    len--
  }

  for (let i = 0; i < len; i += 2) {
    token = state.push("text", "", 0)
    token.content = ch + ch

    state.delimiters.push({
      marker,
      length: 0, // disable "rule of 3" length checks meant for emphasis
      token: state.tokens.length - 1,
      end: -1, // This pointer is filled in by the built-in balance_pairs post-processing rule
      open: scanned.can_open,
      close: scanned.can_close,
      jump: 0
    })
  }

  state.pos += scanned.length
  return true
}
```
Note the `scanDelims` call.
It handles determining whether a given sequence of characters (`^` in this case) can start or end an inline styling sequence.
A single caret will have no meaning in this plugin,
so much of the complexity in this rule is removed:
- For an odd-numbered length of carets, the first caret is added as plain text
- The `length` property of the delimiters is always set to zero, skipping unnecessary logic in the `balance_pairs` rule
Note also that **no matching was attempted in the tokenization phase**.
The `end` property is always set to `-1`.
The `balance_pairs` rule does all the heavy lifting later on, behind the scenes.
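While developing, `parseInline` is a handy way to peek at the intermediate token stream (a debugging sketch; the sample string is arbitrary):
```typescript
import MarkdownIt from "markdown-it"

const md = new MarkdownIt().use(smalltext_plugin)

// parseInline returns block-level "inline" tokens; their children are the inline tokens.
const [inline] = md.parseInline("^^tiny^^ text", {})

for (const child of inline.children ?? []) {
  console.log(child.type, JSON.stringify(child.content))
}
```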
## Post Processing
### Grunt Work
The main logic of this rule will go into a utility function called `postProcess`.
The top-level rule function gets a confusing bit of grunt work:
```typescript
function smalltext_postProcess(state: StateInline) {
  const tokens_meta = state.tokens_meta
  const max = state.tokens_meta.length

  postProcess(state, state.delimiters)

  for (let curr = 0; curr < max; curr++) {
    if (tokens_meta[curr]?.delimiters) {
      postProcess(state, tokens_meta[curr]?.delimiters || [])
    }
  }

  // post-process return value is unused
  return false
}

function postProcess(state: StateInline, delimiters: StateInline.Delimiter[]) {
  return
}
```
> [!TIP]
> What is `tokens_meta`?
>
> Every time a token with a positive `nesting` value is pushed to the inline state's tokens (i.e. an opening tag),
> the inline state does the following:
>
> - throws the current `delimiters` array onto a stack
> - creates a new, empty `delimiters` array, exposing it as `state.delimiters`
> - creates a `token_meta` object that holds the new `delimiters` array
> - stores that `token_meta` object in `state.tokens_meta`, parallel to the open-tag token's position in `state.tokens`
>
> The intrepid reader will notice that in the tokenization rule, **the created delimiters were likely being pushed to different arrays** throughout execution.
>
> Now, in post-processing, each `delimiters` array will only hold delimiters at matching nesting levels.
>
> If the details of this implementation are of interest, check out [the source](https://github.com/markdown-it/markdown-it/blob/0fe7ccb4b7f30236fb05f623be6924961d296d3d/lib/rules_inline/state_inline.mjs#L60).
### Main Logic
As previously mentioned, `balance_pairs` took care of building out and cleaning up the delimiter data.
This post-processing rule will mainly read the data and add tokens as appropriate:
```typescript
function postProcess(state: StateInline, delimiters: StateInline.Delimiter[]) {
  let token
  const loneMarkers = []
  const max = delimiters.length

  for (let i = 0; i < max; i++) {
    const startDelim = delimiters[i]

    if (startDelim.marker !== 0x5e /* ^ */) {
      continue
    }

    // balance_pairs wrote the appropriate `end` pointer value here.
    // If it's still -1, there was a balancing problem,
    // and the delimiter can be ignored.
    if (startDelim.end === -1) {
      continue
    }

    const endDelim = delimiters[startDelim.end]

    token = state.tokens[startDelim.token]
    token.type = "smalltext_open"
    token.tag = "small"
    token.nesting = 1
    token.markup = "^^"
    token.content = ""

    token = state.tokens[endDelim.token]
    token.type = "smalltext_close"
    token.tag = "small"
    token.nesting = -1
    token.markup = "^^"
    token.content = ""

    if (
      state.tokens[endDelim.token - 1].type === "text" &&
      state.tokens[endDelim.token - 1].content === "^"
    ) {
      loneMarkers.push(endDelim.token - 1)
    }
  }

  // If a marker sequence has an odd number of characters, it is split
  // like this: `^^^^^` -> `^` + `^^` + `^^`, leaving one marker at the
  // start of the sequence.
  //
  // So, we have to move all those markers after subsequent closing tags.
  //
  while (loneMarkers.length) {
    const i = loneMarkers.pop() || 0
    let j = i + 1

    while (j < state.tokens.length && state.tokens[j].type === "smalltext_close") {
      j++
    }

    j--

    if (i !== j) {
      token = state.tokens[j]
      state.tokens[j] = state.tokens[i]
      state.tokens[i] = token
    }
  }
}
```
The lone-marker handling is a point of interest.
While a five- or seven-character sequence of carets is unlikely,
it could still be matched with a different string of carets elsewhere in the line of text.
Due to how tokenization runs,
both the opening **and** closing sequences are split, leaving the lone caret at the start of each run:
```
^^^^^^^hey this text would actually be small^^^^^^^

gets parsed somewhat like this:

^ ^^ ^^ ^^ hey this text would actually be small ^ ^^ ^^ ^^
| |     |                                        | |  |
| |     opening tag                              | |  open and close
| open and close                                 | balanced closing tag
lone caret                                       lone caret
```
Because the very first caret in the opening sequence is *not* placed within the `<small>` tags,
neither should the first caret in the closing sequence be.
The end of the post-processing rule handles that edge case.
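Concretely, the seven-caret example above should come out roughly like this (a hand-worked sketch, not captured program output):
```html
<p>^<small><small><small>hey this text would actually be small</small></small></small>^</p>
```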
## Conclusion
That's everything!
This rule is almost a verbatim copy of the [strikethrough rule](https://github.com/markdown-it/markdown-it/blob/0fe7ccb4b7f30236fb05f623be6924961d296d3d/lib/rules_inline/strikethrough.mjs) in the core library.
If a full-on emphasis-style rule is desired, the [source code](https://github.com/markdown-it/markdown-it/blob/0fe7ccb4b7f30236fb05f623be6924961d296d3d/lib/rules_inline/emphasis.mjs) isn't much longer,
thanks in large part to the heavy lifting that [balance_pairs](https://github.com/markdown-it/markdown-it/blob/0fe7ccb4b7f30236fb05f623be6924961d296d3d/lib/rules_inline/balance_pairs.mjs) accomplishes.
> [!CAUTION]
>
> If the plugin being developed is a "standalone" inline element without an open/close pair
> (think about links `[text](url)` or images `![alt text](source "title")`),
> **the post-processing infrastructure can be safely ignored**!
> Markdown parsing is complicated enough.
> Please don't introduce any unnecessary complexity!
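To try the finished plugin end to end, here is a quick usage sketch (the import path is illustrative).
No custom renderer rule is needed: the default renderer falls back to emitting unknown paired tokens from their `tag`.
```typescript
import MarkdownIt from "markdown-it"
// Hypothetical local module containing the plugin code above
import smalltext_plugin from "./smalltext"

const md = new MarkdownIt().use(smalltext_plugin)

console.log(md.render("Read the ^^fine print^^ carefully."))
// Roughly: <p>Read the <small>fine print</small> carefully.</p>
```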