During initial processing, explict "block" tags are set aside to
avoid creating problems in the output later.
Adjust the matches to be case insensitive.
Also relax the extra-blank line before and after that only
prevents them being recognized where they need to be.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
Add new `--base` option that allows a prefix to be specified to be
added to all bare fragment-only URL links.
Use of this option may be required in order for intra-document
fragment links to function properly within a document that makes
use of the `<base>` tag.
Make sure explicitly specified fragment-only URLs (i.e. given in
verbatim `<a>` tags) get hooked up to the proper destination if
possible.
They obviously are trying to refer to something in the same document
so make sure they get the same treatment to hook them up.
Do the same for fragment-only links inside wiki `[[`...`]]` links.
And for both of these (explicit `<a>` tags and `[[`...`]]` links)
make sure the new bare fragment-only URL prefix gets added if given.
While in there, adjust whitespace to match coding convention for
this file where needed in the sections that have been changed.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
When processing something like this:
* first
> quoted
* second
It's imperative that the list tags and blockquote tags do not
become intermingled resulting in invalid output like so:
<p><ul>
<li>first</p>
<blockquote>
<p>quoted</li>
<li>second</li>
</ul></p>
</blockquote>
Instead, process the lists together with code blocks and
blockquotes so that cannot happen and instead we get the
correct output like so:
<ul>
<li>first
<blockquote>
<p>quoted</p>
</blockquote>
</li>
<li>second</li>
</ul>
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
Now that the recognition requirements for atx-style headers have
been tightened a bit, process them ahead of setext-style headers
to avoid having a horizontal rule immediately after one of the
atx-style headers being grabbed as a setext-style header.
Previously this:
##### My H5
---
Would end up as a single `<h2>##### My H5</h2>`. Now it will more
correctly become `<h5>My H5</h5><hr />` instead.
This has been a longstanding problem that goes at least as far
back as version 1.0.1 and probably further.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
Treat more than 6 consecutive '#'s as not a header.
Allow blank headers to be recognized which can be used for
spacers and/or formatting breaks.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
In b62cef825e (Markdown.pl: recognize top-level lists better,
2017-01-09, markdown_1.1.0), an attempt was made to recognize
obvious lists that were improperly being treated as wrapped paragraphs.
While that change offers a number of improvements (i.e. more lists
are recognized properly than were before), it does not go far enough.
Further enhance it to only require a single list marker to start a
list provided it's one of the unordered list markers. While it's
certainly possible that a lone "*", "+" or "-" got wrapped onto the
beginning of a line by itself, it's easy to correct; since lone
occurrences of those characters seems highly unlikely, choose the
list starting interpretation instead.
In addition, if the prior line ends with a colon (:) do not require
two markers to start the list, just one.
Furthermore, allow a single, optional, blank line between two markers
that do start a list.
With these changes, the vast majority of lists are recognized properly.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
Previously, to match setext-style headers, only the top
three levels of atx-style headers were hooked up with ids.
Change this and hook up h4, h5 and h6 atx-style headers using
the same rules as for the others.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
The language name specified for syntax highlighting generally has
to be composed of "word" characters (alphanumeric and "_") plus
"+", "-" and ".".
Allow a final trailing "#" on the language name so that c# and f#
can be used as language names.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
The $g_nested_brackets recursive regular expression is already being
used to match nested and balanced '['...']' sequences.
Introduce a $g_nested_parens recursive regular expression that
matches nested and balanced '('...')' sequences; use it to match
the parenthesized portion of `[...](...)` and `![...](...)` links.
This eliminates a number of previous issues with links that contained
embedded parentheses and non-reference image links nested within
non-reference non-image links.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
While a bit unusual, the reference id part of a reference link can
have nested '['...']' if wanted.
Therefore use $g_nested_brackets to match that side too.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
Revise _DeTab yet again to provide a minor boost.
This really does currently seem to be the fastest detab mechanism
available.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
The recursive regex `$g_nested_brackets` (see the source) correctly
matches properly balanced nested `[`...`]` text.
The normal anchor parsing already makes use of it.
Use it for the image parsing too so that this:
[![Alt](img.png)][link]
[link]:example.txt
Does not become the very broken:
<a href="img.png"><img src="example.txt" alt="Alt</a>" />
But instead becomes the correct:
<a href="example.txt"><img src="img.png" alt="Alt" /></a>
This problem has been present all the way back to Markdown.pl version
1.0.1 (2004-12-14) and likely earlier too.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
Avoid starting a new list when seeing only one single orphan
list item that appears to be either a year or an initial (only
UPPERCASE though).
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
When list items are using paragraphs, make sure they do not cuddle
up too closely to the surrounding closing "</li>" when they contain
nested lists. Also include a "hint" for text-only browsers that
super compact cuddling is not wanted at that spot.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
Be very careful to make sure that this:
open **FILE*** with *write*
Does not turn into the broken:
open <strong>FILE<em></strong> with *write</em>
But instead turns into the correct:
open <strong>FILE*</strong> with <em>write</em>
Handle italics inside bold by nesting a callout that finishes
up by "hiding" any leftover bold/italics markup characters.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
Try very much harder to find a match for explicit `#anchorhere`
links in the document.
The implicit link shortcut works so much better and is far less
error prone.
Nevertheless, attempt to connect the stray anchor links via
extraordinary means.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
This, for example:
[Looking at a*](#something) is *good*
Must not produce this broken output:
<a href="#something">Looking at a<em></a> is *good</em>
But instead this:
<a href="#something">Looking at a*</a> is <em>good</em>
Achieve this by making a special pass to handle bold, italic and
strikethrough on the anchor text and then "hiding" any remaining
markup characters that might be confusingly matched up with characters
outside the anchor text.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
Certain start tags (a, area, img, map) do not make sense unless
they have at least one attribute present.
If a completely attribute barren start tag for one of these elements
is found, treat it as invalid and escape the leading '<'.
This is an heuristic that shouldn't cause too many problems while
silently "correcting" incorrect input.
Either way (leaving the bare start tag with no attributes or escaping
it and potentially causing a fault as its end tag no longer has
anything to match up with), it's broken.
The question becomes then which breakage is more common in order
to handle that one in preference to the other.
With this change, the "it wasn't really a tag after all" situation
will now be considered more common than the "it was deliberatly an
invalid start tag with a matching end tag" situation.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
With `--sanitize`, minimized empty tags that should not be for XHTML
(e.g. "<p/>") have been automatically split into separate start and
end tags (e.g. "<p></p>").
Do the same in reverse for separate start and end tags that should
not be for XHTML (e.g. "<hr></hr>") and turn them into a single
minimized tag (e.g. "<hr />").
Additionally, when `--validate-xml-internal` is active, automatically
insert omitted optional end tags in (hopefully) the right places.
For example, "<ul><li>foo<li>bar</ul>" will automatically become
"<ul><li>foo</li><li>bar</li></ul>" thus making it valid XHTML.
When there are multiple errors reported (can only happen when there
are multiple opening tags missing their ending tags), report the
errors in reverse order (i.e. the first one reported will be the
largest line number) because that will often identify the source
of the trouble as the first error line due to the nature of tag
nesting.
Make a few related wordsmithing changes at the same time.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
Add missing info for tfoot and thead.
Add "summary" to attributes for table.
Add "col" to list of empty element tags.
Add some explanatory comments to hash tables.
Add more text to `--sanitize` help.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
Enhance the sanitation process slightly in order to perform simple
tag missing/mismatch validation. Almost all the work needed was
already being performed with the exception of keeping a tag stack.
Keep an active tag stack when `--validate-xml-internal` is active
and use it to find mismatched and/or missing open/closing tags.
Enable this by default unless `--no-sanitize` has been given (the
sanitize machinery does the validation and is required) or
`--validate-xml` or `--no-validate-xml` has been given explicitly.
Unlike the more comprehensive `--validate-xml`, this option operates
very quickly and does not require any additional XML modules to be
present. It's also compatible with `--html4tags`.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
Avoid using Pod::Usage unless it's actually needed.
Avoid using XML::Simple or XML::Parser without --validate-xml.
Also correct the sense of the MT tests while in there.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
With `--raw`, no actual Markdown processing takes place, but the
input will still be sanitized (by default) and may optionally also
have --html4tags or --validate-xml used on it too.
The output's line endings will be normalized and the encoding
converted to UTF-8.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
The old tab expansion code exhibits very poor performance when
passed the contents of entire files as input.
Mitigate this inefficiency by operating on one line at a time.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
When `--sanitize` is active (the default), tags have been "sanitized"
as they were encountered. Unfortunately, not all tags get "encountered"
by the sanitation section.
Pre-existing "block" tags in the input are squirreled away to prevent
unintentional formatting "accidents".
Such tags were evading the sanitation engineer.
Instead of "sterlizing" when the tags are encountered during normal
formatting processing, perform full sanitation sterlization (provided
`--sanitize` is active) on the final, fully-formatted output.
By waiting until the end, all tags will be fully sterilized (even
those produced by Markdown.pl itself), no tag shall escape.
If `--validate-xml` has been requested (it's off by default), that
will happen _after_ full sanitation.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
Rework the code that iterates through all the files given on
the command line and make sure any errors are reported with
a fatal result.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
Given a table like this:
| H1 | H2 | H3 |
|----|----|----|
| x1 | x2 | x3 |
| y1 |too long|y3|
| z1 | z2 | z3 |
The problematic second row can now be split across mulitple lines
like this:
| H1 | H2 | H3 |
|----|----|----|
| x1 | x2 | x3 |
| y1 |too | y3 |\
| |long| |
| z1 | z2 | z3 |
While the example is contrived, even with "sloppy" tables, having
the ability to merge row data like this usually avoids the need for
unsightly long lines when an exceptional cell overflows excessively.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
When `--validate-xml` has been given and XML::Simple does not appear
to be present but XML::Parser does, use it instead.
The output messages are slightly different on errors, but they're
still clear and validation still happens.
Even though XML::Parser is one of the possible backends for
XML::Simple, either one can be present without the other.
This change makes `--validate-xml` more widely available.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
A "backticks-delimited code block" must be at the left margin,
however, the entire block can be shifted 1, 2 or 3 characters to
the right using spaces. These are all the same:
```
line
```
```
line
```
```
line
```
```
line
```
When support for the slightly shifted code blocks was originally
added, the effect on tab expansion was overlooked.
The shifting in of the code block serves to make the raw markdown
source perhaps a tiny bit more readable, but the shifting in is NOT
a logical part of the lines in the code block themselves.
Therefore, remove any "shift in" spaces before expanding tabs within
the code block so that the result ends up looking exactly the same
no matter whether the code block has been shifted in at all or not.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
Introduce a new `--validate-xml` (and corresponding `--no-validate-xml`)
option that performs a simple XML validation of the output using
`XML::Simple` and fails with an error if any problems are found.
Because a new module (`XML::Simple`) is required to do this, it's
NOT the default option (in which case `XML::Simple` need not be
present).
When `--validate-xml` is combined with `--sanitize` (which is the
default) the output can be included in an XHTML page with confidence
it will not invalidate the page's XML.
Because the `--html4tags` option produces non-XML output, it is
incompatible with `--validate-xml` and if both are given an immediate
error will be reported.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
Unless the new `--no-sanitize` option is given, then inspect each
raw tag encountered in the input and "sanitize" it before outputting
it. The new `--sanitize` option is now the default.
As before, any tags not on the "approved" list have their "<" escaped
and become ordinary text in the output. With this commit, two new
tags are now on the "approved" list: `<map>` and `<area>`.
Each "approved" tag encountered has all of its attributes inspected
and any that are not on the "approved" list for that tag are expunged.
Any others that aren't in canonical form are corrected.
Empty tags are normalized as well (e.g. "<p/>" becomes "<p></p>"
but "<img ...>" becomes "<img ... />" etc.).
With this change, Markdown.pl output should be reasonably "safe"
for use as no code tags or attributes are permitted.
While client-side image maps are now allowed, server-side are not
(the "ismap" attribute will be silently expunged). Client-side
image maps should always be preferred to server-side anyway, that's
why they were created in the first place.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
There's no point to including an empty header row in the output
table. It looks ugly.
While the header row is syntactically required in order to recognize
the table markup, omit it from the output if all its column cells
are empty ignoring any whitespace.
This gives a more pleasing result. Additionally add an extra class
tag "...-table-nohdr" in addition to the usual "...-table" tag to
allow easy custom formatting of these headerless tables if desired.
Empty body rows are always preserved because they can always be
omitted without breaking the table syntax.
Update the docs to describe the new behavior.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
Prevent loss of backslash-escaping in tables by avoiding an effective
de-backslash operation that was incorrect. Now `\\\\` always ends up
producing two backslashes whether it's in a table or not.
Adjust the regexps so that the `|` in `\|` does not get mistaken for
the final `|` of a table row when that table row has omitted the
final trailing `|`.
Also silently handle a lone `\` at the end of a table row by pretending
it was doubled rather than breaking the table.
While in that area of the code, remove a line that computed a value
which was never used.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
The new `--wiki` option (with optional argument) specifies how to
transform [[wiki style links]] into URLs.
There are a veritable plethora of options available to affect
the transformation.
Absolute URL wiki style links continue to be recognized even without
the `--wiki` option.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
Move all the special cases to the top of the _ProcessWikiLink
function.
Add an exception to handle a fragment-only location.
All unhandled wiki style links now fall out the bottom of
the function.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
The standard explicitly prohibits nesting of "a" tags.
Prevent nesting from occuring when only "markdown" input
is present. If explicit "<a ...>...</a>" tags are present
in the input they will be (mostly) left alone even if
they've been incorrectly nested.
Avoid producing mojibake links in the case that the URL
itself appears to contain another URL. (The wayback
machine links often look like this.)
In essence, once a link (either an "a" or "img") tag has been
generated/processed, avoid processing it again in order to
make sure no accidental mojibake occurs.
One consequence of this change is that "Automatic Links"
that are NOT surrounded by '<' and '>' will now only be
recognized if they occur at the beginning of the input or
after a whitespace (a newline qualifies) character.
This also helps to eliminate unintended double linkification.
Furthermore, URLs containing peculiar characters in them (e.g.
single quote and/or double quote) should be far less
troublesome now as well.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
Before doing any other processing, eliminate any control
characters that might be present.
There should not be any, but just in case.
The only "control" characters kept are ASCII codes
9, 10, 12 and 13 (tab, nl, ff, cr).
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
Although ' is specified as part of the XML standard,
some older end user clients may not actually recognize it.
Use ' instead of ' to avoid any difficulty.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
The documentation claims that all of these work:
[1]: url1.txt "title 1"
[2]: url2.txt 'title 2'
[3]: url3.txt (title 3)
However, the single-quote version was not being accepted.
Update the regular expression to grok the single-quote
variant and require the correct matching closing quote
in order to match. This corrects the invalid acceptance
of mismatched quote titles such as "title) and (title".
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
With this source:
* A
* B
C
D
* E
* F
G
* H
I
The parser was getting confused about where each unordered list
actually ended when nesting the processed inner list inside the
outer list.
Address this by:
1) Giving "```" style code blocks their own hash just in case to
make sure they can never collide with html blocks.
2) Temporarily (the indentation gets removed before final output)
indent nested list tags by the current list nesting level to
ensure there's no confusion about where each list/sublist
starts and ends.
3) Make sure the list closing tag is followed by two newlines
rather than one to avoid potentially not applying markup to
the immediately following line.
Since a few patterns had minor adjustments for these changes, those
patterns also had a few unnecessarily capturing groups changed to
non-capturing groups in the hope that some miniscule performance
gain can be squeezed out with the change.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
Previously this:
>
This should be a block quote now
Would not get recognized as a blockquote, now it will.
Previously the lone ">" got left on the line all by
itself as though it had been escaped.
Clearly that was improper.
Alternatively, it could have been picked up as its very
own empty blockquote, but that seems like the less
desirable resolution for the issue.
A ">" at the beginning of a line always signals the beginning
of a blockquote (unless it's escaped) and that blockquote then
continues on until it encounters a blank line.
Therefore the new interpretation must be correct, the old
interpretation was clearly wrong and the "empty blockquote
of its own" interpretation is also clearly wrong since it's
not immediately followed by a blank line. But this:
>
This will not be a block quote
With the blank line inserted, the above ">" does really now
end up correctly in an "empty blockquote of its own".
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
When determining whether or not to add the "--imageroot" or
"--htmlroot" prefix to a relative link, ignore any query string
that may be present. The fragment (if present) was already
being ignored.
Allow URLs given in reference lines to be wrapped like so:
[1]: \
mZmYiIiHd3d2ZmZlVVVURERDMzMyIiIhEREQAAACwAAAAAFwAXAAAExxDISau9Mg\
She8DURhhHWRLDB26FkSjKqxxFqlbBWOwF4fOGgsCycRkInI+ocEAQNBNWq0caCJ\
i9aSqqGwwIL4MAsRATeMMMEykYHBLIt7DNHETrAPrBihVwDAh2ansBXygaAj5sa1\
x7iTUAKomEBU53B0hGVoVMTleEg0hkCD0DJAhwAlVcQT6nLwgHR1liUQNaqgkMDT\
NWXWkSbS6lZ0eKTUIWuTSbGzlNlkS3LSYksjtPK6YJCzEwNMAgbT9nKBwg6Onq6B\
EAOw== "title (100x100)"
This facilitates embedding small amounts of data directly in
the source without causing too much difficulty.
Update the documentation to describe this new feature.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
Allow links of the form [...](#...) to find themselves on the
page the same way links of the form [...] can.
Be flexible accepting either '-' or '_' in place of spaces
in the heading name since fragment names may not contain spaces.
Refactor the code that manufactures "img" and "a" tags to both
simplify the code and make sure that all href, src, alt and title
attributes are fully and properly "escaped".
In addition, if the "title" for an image ends with something that
looks like "(512x342)", "(?x342)" or "(512x?)" then strip that out
of the title and set the appropriate width and height attributes
on the manufactured "img" tag. For example something like this:
![Nice pic](pic.jpg "Nice (500x300)")
or this:
![Nice pic][1]
[1]: <pic.jpg> "Nice (500x300)"
now produces this:
<img src="pic.jpg" alt="Nice pic" width="500" height="300" title="Nice" />
Update the syntax doc to mention these additions.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
Instead of turning an empty URL into an href="" attribute that
effectively does nothing, change it into an href="#" attribute that
creates a link to the current page.
When adding a relative/image prefix leave fragment-only links
unmolested. They are meant to link somewhere on the current page
and must not be changed.
When inspecting the destination to determine whether to use the -i
prefix instead of the -r prefix when both are given, ignore any
trailing fragment. Fragments don't really make sense on image links
and should never actually be sent to the server anyway by a behaving
client, but match them properly in any case.
Also make sure that URLs only get a prefix added at most once.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>