# Document-Wide Post Processing

An overview of how to tweak and augment the token stream just before rendering.

## Goal

The output document will be surrounded by `<section>` tags. Second-level headings (`h2`) will also trigger section breaks (i.e. `</section><section>`) immediately preceding the heading.
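
For a document with an intro paragraph followed by two `h2` sections, the rendered output should end up shaped roughly like this (a sketch; exact whitespace will differ):

```html
<section>
<p>Intro…</p>
</section><section>
<h2>First Topic</h2>
<p>…</p>
</section><section>
<h2>Second Topic</h2>
<p>…</p>
</section>
```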

## Core Rules

The top-level rule pipeline turning raw Markdown into a token array consists of **core rules**.
The *block* and *inline* rule pipelines are each run within a single "wrapper" rule in the core pipeline.
The wrapper rules appear relatively early in the [core pipeline](https://github.com/markdown-it/markdown-it/blob/0fe7ccb4b7f30236fb05f623be6924961d296d3d/lib/parser_core.mjs#L19).

```javascript
const _rules = [
  ['normalize', r_normalize],
  ['block', r_block],
  ['inline', r_inline],
  ['linkify', r_linkify],
  ['replacements', r_replacements],
  ['smartquotes', r_smartquotes],
  ['text_join', r_text_join]
]
```

Core rules typically do *not* scan through the source text or interpret Markdown syntax.
Rather, they usually modify or augment the token stream after an initial pass over the Markdown is complete.

> [!NOTE]
> The `normalize` rule is an exception.
> It modifies the raw Markdown (`state.src`),
> *normalizing* (as the name implies) idiosyncrasies like platform-specific newlines and null characters.

Core rules can do much more,
but "post-processing" tasks are the most common use case.

## Entry Point

The new rule will be called `sectionize`.
The plugin entry point will look like the following:

```typescript
export default function sectionize_plugin(md: MarkdownIt) {
  md.core.ruler.push("sectionize", sectionize)
}

function sectionize(state: StateCore) {
  return
}
```

The new rule is pushed to the very end of the core pipeline.
While there are valid reasons to insert plugin rules elsewhere in the pipeline,
pushing to the end is a good default choice.

> [!IMPORTANT]
> When in doubt, always put plugin rules at the end of the pipeline.
> This strategy minimizes the potential of breaking other rules' assumptions about state.

In this case specifically, surrounding the document with `<section>` tags will **increase the nesting level** of every other token in the document.
Certain rules might iterate over the token stream and keep a running-total nesting level,
making assumptions about nesting level zero (for example).
Placing the new rule at the very end keeps it from affecting those other rules.
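
If a specific position is ever required, the core `Ruler` offers `before`/`after` in addition to `push`; for instance, inside the plugin entry point one could write the following (shown only as an alternative, not what this plugin does):

```typescript
// Alternative registration: insert immediately after the built-in `inline`
// wrapper rule instead of pushing to the end of the core pipeline.
md.core.ruler.after("inline", "sectionize", sectionize)
```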

## Section Insertion Logic

Because we will be inserting tokens into the token array,
we will iterate *backwards* over the existing array so that our index pointer isn't affected by the insertions.

```typescript
function sectionize(state: StateCore) {
  // Iterate backwards since we're splicing elements into the array
  for (let i = state.tokens.length - 1; i >= 0; i--) {
    const token = state.tokens[i]

    if (token.type === "heading_open" && token.tag === "h2") {
      const { open, close } = getSectionPair(state)
      state.tokens.splice(i, 0, close, open)
    }
  }

  // ...The plugin isn't quite done yet
}

function getSectionPair(state: StateCore) {
  const open = new state.Token("section_open", "section", 1)
  open.block = true
  const close = new state.Token("section_close", "section", -1)
  close.block = true

  return { open, close }
}
```

At this point, the tokens array has a `</section><section>` pair immediately preceding each `<h2>`.
However, the document itself is not yet wrapped in an overarching section.
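
For a short document that begins with a paragraph and then has a single `h2`, the token types at this point look roughly like this:

```
paragraph_open, inline, paragraph_close,
section_close, section_open,          <-- inserted by the loop above
heading_open (h2), inline, heading_close,
paragraph_open, inline, paragraph_close
```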

There are two cases to consider:

- The document originally started with an `h2`, so it now starts with `</section>`
- The document did not start with an `h2`

Both cases are addressed with just a few lines of code:

```typescript
function sectionize(state: StateCore) {
  // ...iteration logic from above

  if (state.tokens[0].type === "section_close") {
    state.tokens.push(state.tokens.shift()!)
  } else {
    const { open, close } = getSectionPair(state)
    state.tokens.unshift(open)
    state.tokens.push(close)
  }
}
```
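
A quick way to sanity-check the finished rule (a sketch; the plugin's import path is an assumption, and the `<section>` tags are emitted by the default renderer's `renderToken` fallback, so no renderer rule is needed):

```typescript
import MarkdownIt from "markdown-it"
import sectionize_plugin from "./sectionize" // hypothetical local module

const md = new MarkdownIt().use(sectionize_plugin)

// Expected shape, roughly:
// <section><h1>Title</h1><p>Intro.</p></section><section><h2>Part One</h2><p>Body.</p></section>
console.log(md.render("# Title\n\nIntro.\n\n## Part One\n\nBody."))
```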

## Conclusion

That's it: simple augmentation tasks like sectionization are straightforward to implement with core rule plugins.
No traversal of `state.src` is required,
because this rule is running *after* all of the block and inline rule sets.

With a careful selection of rule positioning (defaulting to the end of the pipeline when in doubt),
post-processing rules are some of the simplest to write.

# Adding New Text Decorators

A step-by-step example of adding a new text decoration style via **inline rules**.

## Goal

Text surrounded by double-carets (e.g. `^^like this^^`) will be given the `<small>` tag in the output HTML.
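
For example, `The fine print is ^^hardly worth reading^^.` should render (inside the usual paragraph tags) as `The fine print is <small>hardly worth reading</small>.`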

## Inline Rules

Markdown-It processes inline sequences of text in **two** passes, each with its own list of rules:

- Tokenization
- Post Processing

The Tokenization phase is responsible for **identifying** inline markers, like `**` (bold/strong text) or `^^` (our new "small text" delimiter).
It is unaware of marker nesting, or whether markers form matched pairs.

The Post Processing phase handles **matching** pairs of tokens.
This phase holds a lot of hidden complexity.
Base Markdown supports a single asterisk for italics/emphasis, a double asterisk for bold/strong text, and a triple asterisk for both styles combined.
Even if a new plugin isn't implementing such a nuanced delimiter, an awareness of the complexity helps the developer inject code in the proper locations.
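
As a rough sketch of that division of labor for the input `^^tiny^^`, using the token types this plugin will eventually produce:

```
after tokenization:      text("^^")  text("tiny")  text("^^")   + two entries in state.delimiters
after post-processing:   smalltext_open  text("tiny")  smalltext_close
```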

> [!IMPORTANT]
> Every matched-pair inline marker should provide **both** a tokenization and post-processing rule.

## Entry Point

The new rule will be named `smalltext`.
The plugin entry point will look like the following:

```typescript
export default function smalltext_plugin(md: MarkdownIt) {
  md.inline.ruler.after("emphasis", "smalltext", smalltext_tokenize)
  md.inline.ruler2.after("emphasis", "smalltext", smalltext_postProcess)
}

function smalltext_tokenize(state: StateInline, silent: boolean) {
  return false
}

function smalltext_postProcess(state: StateInline) {
  return false
}
```

Note the use of `ruler2` to register the post-processing step.
This pattern is unique to matched-pair inline marker rules:
it isn't seen anywhere else in the library (e.g. for block or core rules).

## Tokenization

All that needs to happen here is identifying the string `^^`,
adding a Token to `state.tokens`,
and adding a Delimiter to `state.delimiters`.

> [!TIP]
> A `delimiter` points to a token and provides extra information:
>
> - whether that token is a valid choice for opening or closing styled text
> - a pointer to the matching end token
> - information about how many characters long the token is (useful for disambiguating italics and bold)
>
> Most of this information is used in the `balance_pairs` post-processing rule.
> So long as the `delimiters` array is constructed well in the tokenization phase,
> the developer doesn't need to worry about the complexity within `balance_pairs`.
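
Roughly, each entry in `state.delimiters` carries the fields used below (a sketch based on how this plugin fills them in; see the library's own typings for the authoritative shape):

```typescript
// Sketch of the delimiter fields this plugin relies on.
interface SmalltextDelimiter {
  marker: number  // char code of the marker character (0x5e for "^")
  length: number  // marker run length; set to 0 here to skip emphasis-specific checks
  token: number   // index of the corresponding token in state.tokens
  end: number     // index of the matching delimiter; -1 until balance_pairs fills it in
  open: boolean   // whether this run may open a styled span
  close: boolean  // whether this run may close a styled span
  jump: number    // internal skip optimization in older releases; 0 is safe
}
```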

```typescript
function smalltext_tokenize(state: StateInline, silent: boolean) {
  const start = state.pos
  const marker = state.src.charCodeAt(start)

  if (silent) {
    return false
  }

  if (marker !== 0x5e /* ^ */) {
    return false
  }

  const scanned = state.scanDelims(state.pos, true)
  let len = scanned.length
  const ch = String.fromCharCode(marker)

  if (len < 2) {
    return false
  }

  let token

  if (len % 2) {
    token = state.push("text", "", 0)
    token.content = ch
    len--
  }

  for (let i = 0; i < len; i += 2) {
    token = state.push("text", "", 0)
    token.content = ch + ch

    state.delimiters.push({
      marker,
      length: 0, // disable "rule of 3" length checks meant for emphasis
      token: state.tokens.length - 1,
      end: -1, // This pointer is filled in by the built-in balance_pairs post-processing rule
      open: scanned.can_open,
      close: scanned.can_close,
      jump: 0
    })
  }

  state.pos += scanned.length

  return true
}
```

Note the `scanDelims` call.
It determines whether a given sequence of characters (`^` in this case) can start or end an inline styling sequence.

A single caret will have no meaning in this plugin,
so much of the complexity in this rule is removed:

- For an odd-numbered length of carets, the first caret is added as plain text
- The `length` property of the delimiters is always set to zero, skipping unnecessary logic in the `balance_pairs` rule

Note also that **no matching was attempted in the tokenization phase**.
The `end` property is always set to `-1`.
The `balance_pairs` rule does all the heavy lifting later on, behind the scenes.
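
For example, by the end of the tokenization pass over `^^tiny^^`, the state looks roughly like this, and `balance_pairs` then links the pair up (a sketch):

```typescript
// After smalltext_tokenize has run over the whole line (sketch):
//   state.tokens:      [ text("^^"), text("tiny"), text("^^") ]
//   state.delimiters:  [ { token: 0, open: true,  close: false, end: -1 },
//                        { token: 2, open: false, close: true,  end: -1 } ]
//
// After the built-in balance_pairs rule:
//   state.delimiters[0].end === 1  // the opener now points at its matching closer
```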

## Post Processing

### Grunt Work

The main logic of this rule will go into a utility function called `postProcess`.
The top-level rule function gets a confusing bit of grunt work:

```typescript
function smalltext_postProcess(state: StateInline) {
  const tokens_meta = state.tokens_meta
  const max = state.tokens_meta.length

  postProcess(state, state.delimiters)

  for (let curr = 0; curr < max; curr++) {
    if (tokens_meta[curr]?.delimiters) {
      postProcess(state, tokens_meta[curr]?.delimiters || [])
    }
  }

  // post-process return value is unused
  return false
}

function postProcess(state: StateInline, delimiters: StateInline.Delimiter[]) {
  return
}
```

> [!TIP]
> What is `tokens_meta`?
>
> Every time a token with a positive `nesting` value is pushed to the inline state's tokens (i.e. an opening tag),
> the inline state does the following:
>
> - throws the current `delimiters` array onto a stack
> - creates a new, empty `delimiters` array, exposing it as `state.delimiters`
> - creates a `token_meta` object for the open-tag token, holding the new `delimiters` array
> - stores that `token_meta` object in `state.tokens_meta` (at the same index as the token)
>
> The intrepid reader will notice that in the tokenization rule, **the created delimiters were likely being pushed to different arrays** throughout execution.
>
> Now, in post-processing, each `delimiters` array will only hold delimiters at matching nesting levels.
>
> If the details of this implementation are of interest, check out [the source](https://github.com/markdown-it/markdown-it/blob/0fe7ccb4b7f30236fb05f623be6924961d296d3d/lib/rules_inline/state_inline.mjs#L60).

### Main Logic

As previously mentioned, `balance_pairs` took care of building out and cleaning up the delimiter data.
This post-processing rule will mainly read that data and rewrite the marker tokens as appropriate:

```typescript
function postProcess(state: StateInline, delimiters: StateInline.Delimiter[]) {
  let token
  const loneMarkers: number[] = []
  const max = delimiters.length

  for (let i = 0; i < max; i++) {
    const startDelim = delimiters[i]

    if (startDelim.marker !== 0x5e /* ^ */) {
      continue
    }

    // balance_pairs wrote the appropriate `end` pointer value here.
    // If it's still -1, there was a balancing problem,
    // and the delimiter can be ignored.
    if (startDelim.end === -1) {
      continue
    }

    const endDelim = delimiters[startDelim.end]

    token = state.tokens[startDelim.token]
    token.type = "smalltext_open"
    token.tag = "small"
    token.nesting = 1
    token.markup = "^^"
    token.content = ""

    token = state.tokens[endDelim.token]
    token.type = "smalltext_close"
    token.tag = "small"
    token.nesting = -1
    token.markup = "^^"
    token.content = ""

    if (
      state.tokens[endDelim.token - 1].type === "text" &&
      state.tokens[endDelim.token - 1].content === "^"
    ) {
      loneMarkers.push(endDelim.token - 1)
    }
  }

  // If a marker sequence has an odd number of characters, it is split
  // like this: `^^^^^` -> `^` + `^^` + `^^`, leaving one marker at the
  // start of the sequence.
  //
  // So, we have to move all those markers after subsequent closing tags.
  //
  while (loneMarkers.length) {
    const i = loneMarkers.pop() || 0
    let j = i + 1

    while (j < state.tokens.length && state.tokens[j].type === "smalltext_close") {
      j++
    }

    j--

    if (i !== j) {
      token = state.tokens[j]
      state.tokens[j] = state.tokens[i]
      state.tokens[i] = token
    }
  }
}
```

The lone-marker handling is a point of interest.
While a five- or seven-character sequence of carets is unlikely,
it could still be matched with a different string of carets elsewhere in the line of text.
Due to how tokenization runs,
both the opening **and** closing sequences are split, leaving the lone caret at the start:

```
^^^^^^^hey this text would actually be small^^^^^^^

gets parsed somewhat like this:

^ ^^ ^^ ^^ hey this text would actually be small ^ ^^ ^^ ^^
| |  |                                            | |  |
| |  opening tag                                  | |  open and close
| open and close                                  | balanced closing tag
lone caret                                        lone caret
```

Because the very first caret in the opening sequence is *not* placed within the `<small>` tags,
neither should the first caret in the closing sequence be.
The end of the post-processing rule handles that edge case.
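
With both rules registered, the plugin can be exercised end to end (a sketch; the plugin's import path is an assumption, and no renderer rule is needed because the default `renderToken` fallback emits the `<small>` tag):

```typescript
import MarkdownIt from "markdown-it"
import smalltext_plugin from "./smalltext" // hypothetical local module

const md = new MarkdownIt().use(smalltext_plugin)

console.log(md.renderInline("terms apply ^^see store for details^^"))
// expected, roughly: terms apply <small>see store for details</small>
```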

## Conclusion

That's everything!

This rule is almost a verbatim copy of the [strikethrough rule](https://github.com/markdown-it/markdown-it/blob/0fe7ccb4b7f30236fb05f623be6924961d296d3d/lib/rules_inline/strikethrough.mjs) in the core library.
If a full-on emphasis-style rule is desired, the [source code](https://github.com/markdown-it/markdown-it/blob/0fe7ccb4b7f30236fb05f623be6924961d296d3d/lib/rules_inline/emphasis.mjs) isn't much longer,
thanks in large part to the heavy lifting that [balance_pairs](https://github.com/markdown-it/markdown-it/blob/0fe7ccb4b7f30236fb05f623be6924961d296d3d/lib/rules_inline/balance_pairs.mjs) accomplishes.

> [!CAUTION]
>
> If the plugin being developed is a "standalone" inline element without an open/close pair
> (think of links `[text](url)` or images `![alt](url)`),
> **the post-processing infrastructure can be safely ignored**!
> Markdown parsing is complicated enough.
> Please don't introduce any unnecessary complexity!