When `--validate-xml` has been given and XML::Simple does not appear
to be present but XML::Parser does, use it instead.
The output messages are slightly different on errors, but they're
still clear and validation still happens.
Even though XML::Parser is one of the possible backends for
XML::Simple, either one can be present without the other.
This change makes `--validate-xml` more widely available.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
A "backticks-delimited code block" must be at the left margin,
however, the entire block can be shifted 1, 2 or 3 characters to
the right using spaces. These are all the same:
```
line
```
```
line
```
```
line
```
```
line
```
When support for the slightly shifted code blocks was originally
added, the effect on tab expansion was overlooked.
The shifting in of the code block serves to make the raw markdown
source perhaps a tiny bit more readable, but the shifting in is NOT
a logical part of the lines in the code block themselves.
Therefore, remove any "shift in" spaces before expanding tabs within
the code block so that the result ends up looking exactly the same
no matter whether the code block has been shifted in at all or not.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
Introduce a new `--validate-xml` (and corresponding `--no-validate-xml`)
option that performs a simple XML validation of the output using
`XML::Simple` and fails with an error if any problems are found.
Because a new module (`XML::Simple`) is required to do this, it's
NOT the default option (in which case `XML::Simple` need not be
present).
When `--validate-xml` is combined with `--sanitize` (which is the
default) the output can be included in an XHTML page with confidence
it will not invalidate the page's XML.
Because the `--html4tags` option produces non-XML output, it is
incompatible with `--validate-xml` and if both are given an immediate
error will be reported.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
Unless the new `--no-sanitize` option is given, then inspect each
raw tag encountered in the input and "sanitize" it before outputting
it. The new `--sanitize` option is now the default.
As before, any tags not on the "approved" list have their "<" escaped
and become ordinary text in the output. With this commit, two new
tags are now on the "approved" list: `<map>` and `<area>`.
Each "approved" tag encountered has all of its attributes inspected
and any that are not on the "approved" list for that tag are expunged.
Any others that aren't in canonical form are corrected.
Empty tags are normalized as well (e.g. "<p/>" becomes "<p></p>"
but "<img ...>" becomes "<img ... />" etc.).
With this change, Markdown.pl output should be reasonably "safe"
for use as no code tags or attributes are permitted.
While client-side image maps are now allowed, server-side are not
(the "ismap" attribute will be silently expunged). Client-side
image maps should always be preferred to server-side anyway, that's
why they were created in the first place.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
There's no point to including an empty header row in the output
table. It looks ugly.
While the header row is syntactically required in order to recognize
the table markup, omit it from the output if all its column cells
are empty ignoring any whitespace.
This gives a more pleasing result. Additionally add an extra class
tag "...-table-nohdr" in addition to the usual "...-table" tag to
allow easy custom formatting of these headerless tables if desired.
Empty body rows are always preserved because they can always be
omitted without breaking the table syntax.
Update the docs to describe the new behavior.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
Prevent loss of backslash-escaping in tables by avoiding an effective
de-backslash operation that was incorrect. Now `\\\\` always ends up
producing two backslashes whether it's in a table or not.
Adjust the regexps so that the `|` in `\|` does not get mistaken for
the final `|` of a table row when that table row has omitted the
final trailing `|`.
Also silently handle a lone `\` at the end of a table row by pretending
it was doubled rather than breaking the table.
While in that area of the code, remove a line that computed a value
which was never used.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
The new `--wiki` option (with optional argument) specifies how to
transform [[wiki style links]] into URLs.
There are a veritable plethora of options available to affect
the transformation.
Absolute URL wiki style links continue to be recognized even without
the `--wiki` option.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
Move all the special cases to the top of the _ProcessWikiLink
function.
Add an exception to handle a fragment-only location.
All unhandled wiki style links now fall out the bottom of
the function.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
The standard explicitly prohibits nesting of "a" tags.
Prevent nesting from occuring when only "markdown" input
is present. If explicit "<a ...>...</a>" tags are present
in the input they will be (mostly) left alone even if
they've been incorrectly nested.
Avoid producing mojibake links in the case that the URL
itself appears to contain another URL. (The wayback
machine links often look like this.)
In essence, once a link (either an "a" or "img") tag has been
generated/processed, avoid processing it again in order to
make sure no accidental mojibake occurs.
One consequence of this change is that "Automatic Links"
that are NOT surrounded by '<' and '>' will now only be
recognized if they occur at the beginning of the input or
after a whitespace (a newline qualifies) character.
This also helps to eliminate unintended double linkification.
Furthermore, URLs containing peculiar characters in them (e.g.
single quote and/or double quote) should be far less
troublesome now as well.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
Before doing any other processing, eliminate any control
characters that might be present.
There should not be any, but just in case.
The only "control" characters kept are ASCII codes
9, 10, 12 and 13 (tab, nl, ff, cr).
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
Although ' is specified as part of the XML standard,
some older end user clients may not actually recognize it.
Use ' instead of ' to avoid any difficulty.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
The documentation claims that all of these work:
[1]: url1.txt "title 1"
[2]: url2.txt 'title 2'
[3]: url3.txt (title 3)
However, the single-quote version was not being accepted.
Update the regular expression to grok the single-quote
variant and require the correct matching closing quote
in order to match. This corrects the invalid acceptance
of mismatched quote titles such as "title) and (title".
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
With this source:
* A
* B
C
D
* E
* F
G
* H
I
The parser was getting confused about where each unordered list
actually ended when nesting the processed inner list inside the
outer list.
Address this by:
1) Giving "```" style code blocks their own hash just in case to
make sure they can never collide with html blocks.
2) Temporarily (the indentation gets removed before final output)
indent nested list tags by the current list nesting level to
ensure there's no confusion about where each list/sublist
starts and ends.
3) Make sure the list closing tag is followed by two newlines
rather than one to avoid potentially not applying markup to
the immediately following line.
Since a few patterns had minor adjustments for these changes, those
patterns also had a few unnecessarily capturing groups changed to
non-capturing groups in the hope that some miniscule performance
gain can be squeezed out with the change.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
Previously this:
>
This should be a block quote now
Would not get recognized as a blockquote, now it will.
Previously the lone ">" got left on the line all by
itself as though it had been escaped.
Clearly that was improper.
Alternatively, it could have been picked up as its very
own empty blockquote, but that seems like the less
desirable resolution for the issue.
A ">" at the beginning of a line always signals the beginning
of a blockquote (unless it's escaped) and that blockquote then
continues on until it encounters a blank line.
Therefore the new interpretation must be correct, the old
interpretation was clearly wrong and the "empty blockquote
of its own" interpretation is also clearly wrong since it's
not immediately followed by a blank line. But this:
>
This will not be a block quote
With the blank line inserted, the above ">" does really now
end up correctly in an "empty blockquote of its own".
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
When determining whether or not to add the "--imageroot" or
"--htmlroot" prefix to a relative link, ignore any query string
that may be present. The fragment (if present) was already
being ignored.
Allow URLs given in reference lines to be wrapped like so:
[1]: \
mZmYiIiHd3d2ZmZlVVVURERDMzMyIiIhEREQAAACwAAAAAFwAXAAAExxDISau9Mg\
She8DURhhHWRLDB26FkSjKqxxFqlbBWOwF4fOGgsCycRkInI+ocEAQNBNWq0caCJ\
i9aSqqGwwIL4MAsRATeMMMEykYHBLIt7DNHETrAPrBihVwDAh2ansBXygaAj5sa1\
x7iTUAKomEBU53B0hGVoVMTleEg0hkCD0DJAhwAlVcQT6nLwgHR1liUQNaqgkMDT\
NWXWkSbS6lZ0eKTUIWuTSbGzlNlkS3LSYksjtPK6YJCzEwNMAgbT9nKBwg6Onq6B\
EAOw== "title (100x100)"
This facilitates embedding small amounts of data directly in
the source without causing too much difficulty.
Update the documentation to describe this new feature.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
Allow links of the form [...](#...) to find themselves on the
page the same way links of the form [...] can.
Be flexible accepting either '-' or '_' in place of spaces
in the heading name since fragment names may not contain spaces.
Refactor the code that manufactures "img" and "a" tags to both
simplify the code and make sure that all href, src, alt and title
attributes are fully and properly "escaped".
In addition, if the "title" for an image ends with something that
looks like "(512x342)", "(?x342)" or "(512x?)" then strip that out
of the title and set the appropriate width and height attributes
on the manufactured "img" tag. For example something like this:
![Nice pic](pic.jpg "Nice (500x300)")
or this:
![Nice pic][1]
[1]: <pic.jpg> "Nice (500x300)"
now produces this:
<img src="pic.jpg" alt="Nice pic" width="500" height="300" title="Nice" />
Update the syntax doc to mention these additions.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
Instead of turning an empty URL into an href="" attribute that
effectively does nothing, change it into an href="#" attribute that
creates a link to the current page.
When adding a relative/image prefix leave fragment-only links
unmolested. They are meant to link somewhere on the current page
and must not be changed.
When inspecting the destination to determine whether to use the -i
prefix instead of the -r prefix when both are given, ignore any
trailing fragment. Fragments don't really make sense on image links
and should never actually be sent to the server anyway by a behaving
client, but match them properly in any case.
Also make sure that URLs only get a prefix added at most once.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
When a and img tags are generated using the normal Markdown syntax
any prefixes specified with the -i and -r options are inserted as
appropriate.
Extend this processing to explicit a and img tags as well.
This makes sense because they should be handled the same way the
Markdown syntax generated tags are for consistency.
It's still possible to "escape" from the prefixes by using an
explicit scheme+host+port or the commonly supported (but not a
standard) //+host+port mechanism.
And it only matters if prefixes have been set with the -i and/or
-r options (the default is no prefixes) anyway.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
The .svg/.svgz matching rule was matching .svz and .svgz by mistake.
Move the wayward '?' to the end so it matches .svg and .svgz as
originally intended.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
Use the actual XML comment rule for parsing XML comments.
The leading delimiter is fixed as "<!--" and the trailing delimiter
is fixed as "-->".
In between the leading and trailing delimiters any characters other
than a "-" may be used and a "-" may be used provided it's followed
immediately with a non-"-" character.
Now that the clear beginning and end of comments can be properly
identified, there no longer needs to be a blank line following the
comment -- the end delimiter serves quite unambiguously. Relax the
ending match to just be end of line or end of document.
This makes comments parse much more like they're expected to.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
Allow leading spaces before the backticks delimiters on the starting
and ending lines (up to one less than the indent width).
Then remove upto that number of leading spaces (based on the starting
backticks delimiter line) from each of the lines in the code block
itself.
This better matches how lax some other formatters are with backticks-
delimited code blocks parsing.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
Markup is not allowed inside attributes. Make sure that everything
that ends up in alt="..." and title="..." has be properly escaped
to prevent it from acquiring markup during later processing phases.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
Add support for basic tables.
Nested tables are not supported although tables themselves can
appear within lists and blockquotes and do work properly there.
The commonly used table syntax is recognized including the
left/right/center alignment indicators.
Inline markup within each column also works just fine.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
When dealing with program arguments "<dir>" is highly problematic.
Both "<dir>" and "<menu>" have long been deprecated and there are
other tags readily available for making similar lists that are not
deprecated (and do not require use of style sheets either).
Therefore treat "<dir>" and "<menu>" as literal text unless the
new "--deprecated" option is used.
Other "deprecated" tags continue to be recognized and passed through
as they generally do not have non-deprecated equivalents that do not
also require use of style attributes or style sheets in some fashion.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
While trying to keep all the various table-related tags together
is admirable, it makes it hard to be sure the tag is in the list
or not (an also looks bad compared to the other tags).
Therefore put the table-related tags into alphabetical order
just like the rest of them.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
Automatically encode the leading '<' of non-html tag names so they
do not confuse the HTML parser or produce invalid HTML output.
This requires embedding a list of known HTML tags (a list of over
50 is now included).
This will also cause some "unsafe" tags that were previously being
passed through to be escaped (such as "script", "style", "object",
"embed" etc.).
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
When support for additional list markers was added in
51f3d63833 (Markdown.pl: support more list markers, 2017-01-10, 1.1.0),
a bug was inadvertently introduced that could cause adjacent sibling
list items to only recognize the first as a list item and as a side-effect
prevent markup from being recognized in the second.
The problem occurred when the matching pattern was split to run in
progressive matching mode and resulted in the sibling list items match
not always being matched by the progressive list item pattern (extra
possible \n's were preventing a match).
Fix this by adding a '+' in the correct location in the progressive pattern.
The side-effect was caused because any "leftover" (of which there shouldn't
be any) was not being processed for markup.
As a precaution, run any leftover through the block gamut markup processor
just in case even though there should never be any leftover.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
Only treat (i), (v) and (x) as alpha if the previous
list marker was lower alpha (or upper alpha in the
case of (I), (V) and (X)).
Previously they were treated as alpha if the first
marker in the list was alpha, but if list marker
types were changed mid-list that could lead to
unexpected behavior.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
If the document contains footnote style links (e.g. [1],
[2], [3] ...) they look much better formatted so as to
retain the square brackets in the link text.
Do this for any footnote style link text consisting of
one to three digits.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
When processing list items, while changes to the list marker style
are allowed for "ol" lists (ignored for "ul" lists), switching
from "ol" to "ul" or vice-versa is not allowed mid-list.
When a numbered/lettered marker is seen while processing a "ul"
list it was simply treated as '*'. This always produced the
correct result since the actual marker does not matter for "ul"
lists.
However, when a '*', '+', or '-' was seen while processing a
"ol" list it was always treated as a '1.' marker. This is,
however, incorrect if the list is not using decimal numbering.
Instead treat a "ul" marker encountered during "ol" list
processing as a repeat of the last marker seen. The lazy
list numbering will kick in and bump it up by one while
retaining the correct list marker style.
The same treatment is also now given to "ol" markers encountered
during "ul" list processing since it's simpler to code that way
even though it doesn't make a difference in output in that case.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
White space in '#', '##', '###', '####', '#####' and '######'
headers is already being normalized (i.e. leading and trailing
whitespace is stripped off).
Do the same thing for '=', '-', and '~' headers and, in addition,
do that for link ids, title text and alt text and for them also
replace internal runs of whitespace with a single space.
This makes the output nicer and more consistent and avoids
subtle bugs due to accidental inclusion of an extra space.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
In b7f3fc1c (Markdown.pl: do not mishandle double list markers,
2017-01-09, markdown_1.1.0), an attempt was made to avoid
having something like "* 1. item" be misinterpreted as an
unordered list containing an ordered sublist.
However, the change made to fix the problem only introduced
another problem where lists were not always being recognized
when they should be.
Fix the fix (it did have a bit of a kludgely side to it) so
it works properly and is less kludgely.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
Backticks-delimited code blocks do not require a blank line
between them to be recognized (at least they're not supposed
to).
Recognize two such code blocks in a row by tweaking the
regex to use an assertion instead of an explicit match.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
When using the "implicit link name" shortcut (i.e. the
link name is the same as the link text), the trailing '[]'
is unsightly.
Allow the trailing '[]' to be omitted when the omission is
unambiguous. In other words, if there is no preceding or
following pair of square brackets the trailing '[]' can safely
be omitted.
For example:
See any of [link 1][] [link 2][] [lnik 3][].
The trailing '[]' MUST NOT be omitted in this case because the
result:
See any of [link 1] [link 2] [link 3]
would be misinterpreted. But, if they're separated with commas
or words instead like so:
See any of [link 1], [link 2] or [link 3].
then they cannot be misinterpreted and the trailing '[]' can be
safely omitted making for a much nicer looking document.
To go with this change the basics.md and syntax.md documents
have been modified to take advantage of these new semantics.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
Originally code blocks simply output this:
<pre><code>the code block text></code></pre>
However that sometimes led to unsatisfying formatting with
some browsers (especially text ones) and so that was ultimately
changed to this:
<div><dl></dl><pre><code>code block text</code></pre></div>
The div then additionally has a class to facilitate formatting
with a style sheet. The empty <dl></dl> causes agents that would
otherwise make a poor formatting choice to do "the right thing"(tm)
instead. However, the "<dl></dl>" kludge is unsatisfying for a
number of reasons.
Instead output this:
<div><pre></pre><pre><code>code block text</code></pre></div>
where the first div still has a class to facilitate formatting via
a style sheet, but the replacement "<pre></pre>" block has an
embedded style and is actually emitted like so:
<pre style="display:none"></pre>
While this is still a kludge, it's much more satisfying because:
1. The same element type is being used to force those recalictrant
text agents to do "the right thing"(tm).
2. The explict style="display:none" attribute completely protects
properly behaving agents from any unwanted side-effects from
the extra "<pre></pre>" tag pair.
Together with this change, the --stub stylesheet has also been
modified to use a more universally (i.e. page background is not
white) compatible styling for the code blocks themselves.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>