Update CommonMark spec to 0.20

10 years ago · 7b961ee1ef
2 changed files with 802 additions and 597 deletions
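Two of the rules touched by this update are mechanical enough to sketch in code: tab expansion with a tab stop of 4, and replacement of `U+0000` with `U+FFFD`. The following is a minimal illustration only, not code from this repository; the function names `expand_tabs`, `replace_insecure`, and `preprocess` are hypothetical, and columns are assumed to be counted one per character.

```python
def expand_tabs(line: str, tab_stop: int = 4) -> str:
    """Expand each tab to spaces up to the next multiple of tab_stop.

    Assumption: every non-tab character advances the column by one.
    """
    out = []
    col = 0
    for ch in line:
        if ch == "\t":
            # Number of spaces needed to reach the next tab stop.
            width = tab_stop - (col % tab_stop)
            out.append(" " * width)
            col += width
        else:
            out.append(ch)
            col += 1
    return "".join(out)


def replace_insecure(text: str) -> str:
    """Replace U+0000 with the replacement character U+FFFD."""
    return text.replace("\x00", "\ufffd")


def preprocess(line: str) -> str:
    """Apply both steps to a single line (hypothetical helper)."""
    return expand_tabs(replace_insecure(line))
```

Applied to the tab example from the spec, `expand_tabs("\tfoo\tbaz\t\tbim")` yields four spaces, `foo`, one space, `baz`, five spaces, `bim`, which matches the expected code block content once the four-space indentation is removed.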
--- a/test/fixtures/commonmark/good.txt
+++ b/test/fixtures/commonmark/good.txt
--- a/test/fixtures/commonmark/spec.txt
+++ b/test/fixtures/commonmark/spec.txt
@@ -1,8 +1,8 @@
 ---
 title: CommonMark Spec
 author: John MacFarlane
-version: 0.19
-date: 2015-04-27
+version: 0.20
+date: 2015-06-08
 license: '[CC-BY-SA 4.0](http://creativecommons.org/licenses/by-sa/4.0/)'
 ...

@@ -212,12 +212,8 @@ to a certain encoding.
 A [line](@line) is a sequence of zero or more [character]s
 followed by a [line ending] or by the end of file.

-A [line ending](@line-ending) is, depending on the platform, a
-newline (`U+000A`), carriage return (`U+000D`), or
-carriage return + newline.
-
-For security reasons, a conforming parser must strip or replace the
-Unicode character `U+0000`.
+A [line ending](@line-ending) is a newline (`U+000A`), carriage return
+(`U+000D`), or carriage return + newline.

 A line containing no characters, or a line containing only spaces
 (`U+0020`) or tabs (`U+0009`), is called a [blank line](@blank-line).
@@ -239,7 +235,10 @@ carriage return (`U+000D`), newline (`U+000A`), or form feed
 [Unicode whitespace](@unicode-whitespace) is a sequence of one
 or more [unicode whitespace character]s.

-A [non-space character](@non-space-character) is anything but `U+0020`.
+A [space](@space) is `U+0020`.
+
+A [non-space character](@non-space-character) is any character
+that is not a [whitespace character].

 An [ASCII punctuation character](@ascii-punctuation-character)
 is `!`, `"`, `#`, `$`, `%`, `&`, `'`, `(`, `)`,
@@ -250,9 +249,10 @@ A [punctuation character](@punctuation-character) is an [ASCII
 punctuation character] or anything in
 the unicode classes `Pc`, `Pd`, `Pe`, `Pf`, `Pi`, `Po`, or `Ps`.

-## Tab expansion
+## Preprocessing

-Tabs in lines are expanded to spaces, with a tab stop of 4 characters:
+Tabs in lines are immediately expanded to [spaces][space], with a tab
+stop of 4 characters:

 .
 →foo→baz→→bim
@@ -270,14 +270,19 @@ Tabs in lines are expanded to spaces, with a tab stop of 4 characters:
 </code></pre>
 .

+## Insecure characters
+
+For security reasons, the Unicode character `U+0000` must be replaced
+with the replacement character (`U+FFFD`).
+
 # Blocks and inlines

 We can think of a document as a sequence of
-[blocks](@block)---structural
-elements like paragraphs, block quotations,
-lists, headers, rules, and code blocks.  Blocks can contain other
-blocks, or they can contain [inline](@inline) content:
-words, spaces, links, emphasized text, images, and inline code.
+[blocks](@block)---structural elements like paragraphs, block
+quotations, lists, headers, rules, and code blocks.  Some blocks (like
+block quotes and list items) contain other blocks; others (like
+headers and paragraphs) contain [inline](@inline) content---text,
+links, emphasized text, images, code, and so on.

 ## Precedence

@@ -528,12 +533,12 @@ consists of a string of characters, parsed as inline content, between an
 opening sequence of 1--6 unescaped `#` characters and an optional
 closing sequence of any number of `#` characters.  The opening sequence
 of `#` characters cannot be followed directly by a
-[non-space character].
-The optional closing sequence of `#`s must be preceded by a space and may be
-followed by spaces only.  The opening `#` character may be indented 0-3
-spaces.  The raw contents of the header are stripped of leading and
-trailing spaces before being parsed as inline content.  The header level
-is equal to the number of `#` characters in the opening sequence.
+[non-space character]. The optional closing sequence of `#`s must be
+preceded by a [space] and may be followed by spaces only.  The opening
+`#` character may be indented 0-3 spaces.  The raw contents of the
+header are stripped of leading and trailing spaces before being parsed
+as inline content.  The header level is equal to the number of `#`
+characters in the opening sequence.

 Simple headers:

@@ -561,16 +566,21 @@ More than six `#` characters is not a header:
 <p>####### foo</p>
 .

-A space is required between the `#` characters and the header's
-contents.  Note that many implementations currently do not require
-the space.  However, the space was required by the [original ATX
-implementation](http://www.aaronsw.com/2002/atx/atx.py), and it helps
-prevent things like the following from being parsed as headers:
+At least one space is required between the `#` characters and the
+header's contents, unless the header is empty.  Note that many
+implementations currently do not require the space.  However, the
+space was required by the
+[original ATX implementation](http://www.aaronsw.com/2002/atx/atx.py),
+and it helps prevent things like the following from being parsed as
+headers:

 .
 #5 bolt
+
+#foobar
 .
 <p>#5 bolt</p>
+<p>#foobar</p>
 .

 This is not a header, because the first `#` is escaped:
@@ -1024,7 +1034,41 @@ paragraph.)
 </code></pre>
 .

-The contents are literal text, and do not get parsed as Markdown:
+If there is any ambiguity between an interpretation of indentation
+as a code block and as indicating that material belongs to a [list
+item][list items], the list item interpretation takes precedence:
+
+.
+  - foo
+
+    bar
+.
+<ul>
+<li>
+<p>foo</p>
+<p>bar</p>
+</li>
+</ul>
+.
+
+.
+1.  foo
+
+    - bar
+.
+<ol>
+<li>
+<p>foo</p>
+<ul>
+<li>bar</li>
+</ul>
+</li>
+</ol>
+.
+
+
+The contents of a code block are literal text, and do not get parsed
+as Markdown:

 .
    <a/>
@@ -2325,9 +2369,16 @@ foo</p>
 </blockquote>
 .

-Laziness only applies to lines that are continuations of
-paragraphs. Lines containing characters or indentation that indicate
-block structure cannot be lazy.
+Laziness only applies to lines that would have been continuations of
+paragraphs had they been prepended with `>`.  For example, the
+`>` cannot be omitted in the second line of
+
+``` markdown
+> foo
+> ---
+```
+
+without changing the meaning:

 .
 > foo
@@ -2339,6 +2390,15 @@ block structure cannot be lazy.
 <hr />
 .

+Similarly, if we omit the `>` in the second line of
+
+``` markdown
+> - foo
+> - bar
+```
+
+then the block quote ends after the first line:
+
 .
 > - foo
 - bar
@@ -2353,6 +2413,9 @@ block structure cannot be lazy.
 </ul>
 .

+For the same reason, we can't omit the `>` in front of
+subsequent lines of an indented or fenced code block:
+
 .
 >     foo
    bar
@@ -3835,9 +3898,11 @@ item:
 - b
  - c
   - d
-  - e
- - f
- g
+    - e
+   - f
+  - g
+ - h
+- i
 .
 <ul>
 <li>a</li>
@@ -3847,9 +3912,31 @@ item:
 <li>e</li>
 <li>f</li>
 <li>g</li>
+<li>h</li>
+<li>i</li>
 </ul>
 .

+.
+1. a
+
+  2. b
+
+    3. c
+.
+<ol>
+<li>
+<p>a</p>
+</li>
+<li>
+<p>b</p>
+</li>
+<li>
+<p>c</p>
+</li>
+</ol>
+.
+
 This is a loose list, because there is a blank line between
 two of the list items:

@@ -4277,13 +4364,14 @@ corresponding codepoints.
 [Decimal entities](@decimal-entities)
 consist of `&#` + a string of 1--8 arabic digits + `;`. Again, these
 entities need to be recognised and transformed into their corresponding
-unicode codepoints. Invalid unicode codepoints will be written as the
-"unknown codepoint" character (`0xFFFD`)
+unicode codepoints. Invalid unicode codepoints will be replaced by
+the "unknown codepoint" character (`U+FFFD`).  For security reasons,
+the codepoint `U+0000` will also be replaced by `U+FFFD`.

 .
-&#35; &#1234; &#992; &#98765432;
+&#35; &#1234; &#992; &#98765432; &#0;
 .
-<p># Ӓ Ϡ �</p>
+<p># Ӓ Ϡ � �</p>
 .

 [Hexadecimal entities](@hexadecimal-entities)
@@ -5063,9 +5151,9 @@ both left- and right-flanking, because it is preceded by
 punctuation:

 .
-foo-_(bar)_
+foo-__(bar)__
 .
-<p>foo-<em>(bar)</em></p>
+<p>foo-<strong>(bar)</strong></p>
 .


@@ -5177,9 +5265,9 @@ both left- and right-flanking, because it is followed by
 punctuation:

 .
-_(bar)_.
+__(bar)__.
 .
-<p><em>(bar)</em>.</p>
+<p><strong>(bar)</strong>.</p>
 .

 Rule 9:
@@ -6086,6 +6174,7 @@ that [matches] a [link reference definition] elsewhere in the document.

 A [link label](@link-label)  begins with a left bracket (`[`) and ends
 with the first right bracket (`]`) that is not backslash-escaped.
+Between these brackets there must be at least one non-[whitespace character].
 Unescaped square bracket characters are not allowed in
 [link label]s.  A link label can have at most 999
 characters inside the square brackets.
@@ -6332,6 +6421,30 @@ backslash-escaped:
 <p><a href="/uri">foo</a></p>
 .

+A [link label] must contain at least one non-[whitespace character]:
+
+.
+[]
+
+[]: /uri
+.
+<p>[]</p>
+<p>[]: /uri</p>
+.
+
+.
+[
+ ]
+
+[
+ ]: /uri
+.
+<p>[
+]</p>
+<p>[
+]: /uri</p>
+.
+
 A [collapsed reference link](@collapsed-reference-link)
 consists of a [link label] that [matches] a
 [link reference definition] elsewhere in the