# Adding New Text Decorators

A step-by-step example of adding a new text decoration style via inline rules.
## Goal

Text surrounded by double-carets (e.g. `^^like this^^`) will be given the `<small>` tag in the output HTML.
## Inline Rules

Markdown-It processes inline sequences of text in two passes, each with its own list of rules:

- Tokenization
- Post Processing

The Tokenization phase is responsible for identifying inline markers, like `**` (bold/strong text) or `^^` (our new "small text" delimiter). It is unaware of marker nesting, or of whether markers form matched pairs.

The Post Processing phase handles matching pairs of tokens. This phase holds a lot of hidden complexity: base Markdown supports a single asterisk for italic/emphasized text, a double asterisk for bold/strong text, and a triple asterisk for both styles combined. Even if a new plugin isn't implementing such a nuanced delimiter, an awareness of that complexity helps the developer inject code in the proper locations.

> [!IMPORTANT]
> Every matched-pair inline marker should provide both a tokenization rule and a post-processing rule.
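To make the two passes concrete, here is a rough sketch of how `^^hi^^` would move through them (illustrative only; real tokens carry more fields than shown here):

```ts
// Sketch of the two passes over the inline text "^^hi^^".
//
// After Tokenization — three flat text tokens, plus delimiter records
// noting that the first and last could open/close a styled run:
//   tokens:     [ text("^^"), text("hi"), text("^^") ]
//   delimiters: [ { token: 0, open: true, ... }, { token: 2, close: true, ... } ]
//
// After Post Processing — the matched pair is rewritten into tags:
//   tokens:     [ smalltext_open, text("hi"), smalltext_close ]
```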
## Entry Point

The new rule will be named `smalltext`. The plugin entry point will look like the following:

```ts
// Type-only imports; the StateInline import path may differ slightly
// depending on the markdown-it version and types package in use.
import type MarkdownIt from "markdown-it"
import type StateInline from "markdown-it/lib/rules_inline/state_inline.mjs"

export default function smalltext_plugin(md: MarkdownIt) {
  md.inline.ruler.after("emphasis", "smalltext", smalltext_tokenize)
  md.inline.ruler2.after("emphasis", "smalltext", smalltext_postProcess)
}

function smalltext_tokenize(state: StateInline, silent: boolean) {
  return false
}

function smalltext_postProcess(state: StateInline) {
  return false
}
```
Note the use of `ruler2` to register the post-processing step. This pattern is unique to matched-pair inline marker rules: it isn't seen anywhere else in the library (e.g. for block or core rules).
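With the entry point in place, the plugin registers like any other markdown-it plugin. A minimal usage sketch (the module path `./smalltext` is an assumption, and the render call only produces `<small>` tags once the rules below are filled in):

```ts
import MarkdownIt from "markdown-it"
import smalltext_plugin from "./smalltext" // hypothetical module containing the plugin

const md = new MarkdownIt().use(smalltext_plugin)

console.log(md.renderInline("^^like this^^"))
// => <small>like this</small>  (once tokenization and post-processing exist)
```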
## Tokenization

All that needs to happen here is identifying the string `^^`, adding a Token to `state.tokens`, and adding a Delimiter to `state.delimiters`.
> [!TIP]
> A delimiter points to a token and provides extra information:
>
> - whether that token is a valid choice for opening or closing styled text
> - a pointer to the matching end token
> - information about how many characters the token is (useful for disambiguating italics and bold)
>
> Most of this information is used in the `balance_pairs` post-processing rule. So long as the `delimiters` array is constructed well in the tokenization phase, the developer doesn't need to worry about the complexity within `balance_pairs`.
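For reference, each record pushed to `state.delimiters` in the code below has roughly this shape (a sketch of markdown-it's delimiter fields; older versions also carried a `jump` field, which the code below sets for compatibility):

```ts
// Rough shape of a markdown-it inline delimiter record (sketch).
interface Delimiter {
  marker: number // char code of the delimiter character, e.g. 0x5e for "^"
  length: number // length of the original delimiter run ("rule of 3" checks)
  token: number  // index of the corresponding token in state.tokens
  end: number    // index of the matching delimiter, filled in by balance_pairs
  open: boolean  // can this delimiter open a styled run?
  close: boolean // can this delimiter close a styled run?
}
```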
```ts
function smalltext_tokenize(state: StateInline, silent: boolean) {
  const start = state.pos
  const marker = state.src.charCodeAt(start)

  // This rule doesn't participate in validation (silent) scans.
  if (silent) {
    return false
  }

  if (marker !== 0x5e /* ^ */) {
    return false
  }

  // Scan the whole run of carets at this position.
  const scanned = state.scanDelims(state.pos, true)
  let len = scanned.length
  const ch = String.fromCharCode(marker)

  if (len < 2) {
    return false
  }

  let token

  if (len % 2) {
    // Odd-length run: emit the first caret as plain text.
    token = state.push("text", "", 0)
    token.content = ch
    len--
  }

  for (let i = 0; i < len; i += 2) {
    token = state.push("text", "", 0)
    token.content = ch + ch

    state.delimiters.push({
      marker,
      length: 0, // disable "rule of 3" length checks meant for emphasis
      token: state.tokens.length - 1,
      end: -1, // this pointer is filled in by the core balance_pairs post-processing rule
      open: scanned.can_open,
      close: scanned.can_close,
      jump: 0
    })
  }

  state.pos += scanned.length
  return true
}
```
Note the `scanDelims` call. It handles determining whether a given sequence of characters (`^` in this case) can start or end an inline styling sequence. A single caret will have no meaning in this plugin, so much of the complexity in this rule is removed:

- For an odd-length run of carets, the first caret is added as plain text
- The `length` property of the delimiters is always set to zero, skipping unnecessary logic in the `balance_pairs` rule
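To illustrate `scanDelims` (a sketch; the exact flanking rules live in markdown-it's source), here is what it reports for a couple of inputs:

```ts
// For state.src = "^^hi^^":
//   state.scanDelims(0, true) // => { can_open: true,  can_close: false, length: 2 }
//   state.scanDelims(4, true) // => { can_open: false, can_close: true,  length: 2 }
//
// For state.src = "^hi":
//   state.scanDelims(0, true) // => length: 1, so the tokenizer bails out (len < 2)
```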
Note also that no matching was attempted in the tokenization phase. The `end` property is always set to `-1`; the `balance_pairs` rule does all the heavy lifting later on, behind the scenes.
## Post Processing

### Grunt Work

The main logic of this rule will go into a utility function called `postProcess`. The top-level rule function gets a confusing bit of grunt work:
```ts
function smalltext_postProcess(state: StateInline) {
  const tokens_meta = state.tokens_meta
  const max = state.tokens_meta.length

  // Handle the top-level delimiters...
  postProcess(state, state.delimiters)

  // ...and the delimiters collected at each deeper nesting level.
  for (let curr = 0; curr < max; curr++) {
    if (tokens_meta[curr]?.delimiters) {
      postProcess(state, tokens_meta[curr]?.delimiters || [])
    }
  }

  // post-process return value is unused
  return false
}

function postProcess(state: StateInline, delimiters: StateInline.Delimiter[]) {
  return
}
```
> [!TIP]
> What is `tokens_meta`?
>
> Every time a token with a positive `nesting` value is pushed to the inline state's tokens (i.e. an opening tag), the inline state does the following:
>
> - throws the current `delimiters` array onto a stack
> - creates a new, empty `delimiters` array, exposing it as `state.delimiters`
> - gives the open-tag token a `token_meta` object with the new `delimiters` array
> - also stores the `token_meta` object in `state.tokens_meta`
>
> The intrepid reader will notice that in the tokenization rule, the created delimiters were likely being pushed to different arrays throughout execution. Now, in post-processing, each `delimiters` array will only hold delimiters at matching nesting levels.
>
> If the details of this implementation are of interest, check out the source.
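As a concrete illustration (hedged; the linked source is authoritative), a link creates one such nesting level, so inner and outer delimiters land in separate arrays:

```ts
// Sketch for the inline text: "^^a [b ^^c^^ d](url) e^^"
//
// - The outer ^^ delimiters are pushed to the top-level state.delimiters.
// - When the link's open tag (nesting: +1) is pushed, a fresh delimiters
//   array is created; the inner ^^ pair is collected there, reachable via
//   state.tokens_meta at that token's index.
//
// The grunt-work loop above therefore calls postProcess once per array,
// so pairs are only matched against delimiters at their own nesting level.
```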
### Main Logic

As previously mentioned, `balance_pairs` took care of building out and cleaning up the delimiter data. This post-processing rule will mainly read the data and add tokens as appropriate:
```ts
function postProcess(state: StateInline, delimiters: StateInline.Delimiter[]) {
  let token
  const loneMarkers: number[] = []
  const max = delimiters.length

  for (let i = 0; i < max; i++) {
    const startDelim = delimiters[i]

    if (startDelim.marker !== 0x5e /* ^ */) {
      continue
    }

    // balance_pairs wrote the appropriate `end` pointer value here.
    // If it's still -1, there was a balancing problem,
    // and the delimiter can be ignored.
    if (startDelim.end === -1) {
      continue
    }

    const endDelim = delimiters[startDelim.end]

    // Rewrite the paired "text" tokens into open/close <small> tags.
    token = state.tokens[startDelim.token]
    token.type = "smalltext_open"
    token.tag = "small"
    token.nesting = 1
    token.markup = "^^"
    token.content = ""

    token = state.tokens[endDelim.token]
    token.type = "smalltext_close"
    token.tag = "small"
    token.nesting = -1
    token.markup = "^^"
    token.content = ""

    // Remember any lone caret sitting just before a closing tag.
    if (
      state.tokens[endDelim.token - 1].type === "text" &&
      state.tokens[endDelim.token - 1].content === "^"
    ) {
      loneMarkers.push(endDelim.token - 1)
    }
  }

  // If a marker sequence has an odd number of characters, it is split
  // like this: `^^^^^` -> `^` + `^^` + `^^`, leaving one marker at the
  // start of the sequence.
  //
  // So, we have to move all those markers after subsequent closing tags.
  //
  while (loneMarkers.length) {
    const i = loneMarkers.pop() || 0
    let j = i + 1

    while (j < state.tokens.length && state.tokens[j].type === "smalltext_close") {
      j++
    }

    j--

    if (i !== j) {
      token = state.tokens[j]
      state.tokens[j] = state.tokens[i]
      state.tokens[i] = token
    }
  }
}
```
The lone-marker handling is a point of interest. While a five- or seven-character sequence of carets is unlikely, it could still be matched with a different run of carets elsewhere in the line of text. Due to how tokenization runs, both the opening and closing sequences are split, leaving the lone caret at the start:

```
^^^^^^^hey this text would actually be small^^^^^^^
```

gets parsed somewhat like this:

```
^ ^^ ^^ ^^ hey this text would actually be small ^ ^^ ^^ ^^
| |  |                                           | |  |
| |  opening tag                                 | |  open and close
| open and close                                 | balanced closing tag
lone caret                                       lone caret
```
Because the very first caret in the opening sequence is not placed within the `<small>` tags, neither should the first caret in the closing sequence be. The end of the post-processing rule handles that edge case.
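Assuming the plugin is registered as sketched earlier, the expected rendering for the example above would look like this (hedged, mirroring how the core strikethrough rule behaves for long tilde runs):

```ts
md.renderInline("^^^^^^^hey this text would actually be small^^^^^^^")
// => "^<small><small><small>hey this text would actually be small</small></small></small>^"
```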
## Conclusion

That's everything! This rule is almost a verbatim copy of the strikethrough rule in the core library. If a full-on emphasis-style rule is desired, the source code isn't much longer, thanks in large part to the heavy lifting that `balance_pairs` accomplishes.
> [!CAUTION]
> If the plugin being developed is a "standalone" inline element without an open/close pair (think links `[text](url)` or images), the post-processing infrastructure can be safely ignored! Markdown parsing is complicated enough. Please don't introduce any unnecessary complexity!