When sanitizing an attribute with an implied value such
as "compact" or "checked", add the required space at the
end to avoid mashing up against any other attribute that
might be present.
For example, <ol compact start=10> now becomes the correct:
<ol compact="compact" start="10">
rather than the previously incorrect:
<ol compact="compact"start="10">
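As a rough illustration only (a Python sketch, not the Perl code
in Markdown.pl), the fix amounts to always separating sanitized
attributes with a space. The names sanitize_attrs, sanitize_tag and
IMPLIED_VALUE_ATTRS are hypothetical, and the attribute set is an
assumed subset.

```python
import re

# Hypothetical subset of attributes that HTML allows without a value.
IMPLIED_VALUE_ATTRS = {"compact", "checked", "noshade", "disabled"}

# An attribute name optionally followed by =value, where the value is
# double-quoted, single-quoted or bare.
ATTR_RE = re.compile(r"""([A-Za-z][\w:-]*)(?:\s*=\s*("[^"]*"|'[^']*'|[^\s>]+))?""")


def sanitize_attrs(raw):
    """Normalize attributes to name="value" form, separated by single spaces."""
    out = []
    for name, value in ATTR_RE.findall(raw):
        name = name.lower()
        if value == "":
            if name not in IMPLIED_VALUE_ATTRS:
                continue  # unknown valueless attribute: drop it
            value = name  # implied value: compact -> compact="compact"
        elif len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
            value = value[1:-1]  # strip the surrounding quotes
        out.append('%s="%s"' % (name, value.replace('"', "&quot;")))
    # Joining with a space is the whole point of the fix: an implied-value
    # attribute can no longer mash up against the attribute that follows it.
    return " ".join(out)


def sanitize_tag(tag_name, raw_attrs):
    """Rebuild an opening tag from its name and raw attribute text."""
    attrs = sanitize_attrs(raw_attrs)
    return "<%s%s>" % (tag_name, " " + attrs if attrs else "")
```

With this sketch, sanitize_tag("ol", "compact start=10") yields the
corrected form shown above.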
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
While other targets could, potentially, represent legitimate
issues for concern, opening a new window generally does not
since that's typically a readily available option in the user
agent anyway when choosing to follow any individual link.
While using target="_blank" does not really represent any
security issue, it may be an annoyance issue, but that's something
for the author to address, not the sanitizer.
Although rel="nofollow" is _not_ part of the HTML 4 standard,
it may be very useful to avoid "endorsing" sites that are being
linked to. Since it does not introduce any risk of scripting
issues or other hidden issues, go ahead and allow it too.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
To avoid conflicting (too much) with setext-style H3 headers
that are delimited with a line of tildes, require exactly three
tildes to introduce a tilde-delimited code block.
And, while in there, clean up the backticks-delimited code
blocks pattern a tiny amount and allow either kind of code block
to be closed by more than the number of opening delimiters in
addition to exactly the same number of opening delimiters.
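The matching rules can be sketched in Python as follows. This is
an illustration under assumptions, not the pattern actually used:
find_code_blocks is a hypothetical name, and treating "three or
more" backticks as a valid opener is an assumption of the sketch.

```python
import re

# Opening fence: exactly three tildes, or (assumed) three or more
# backticks, followed by an optional single word.
OPEN_RE = re.compile(r"^(~~~(?!~)|`{3,})[ \t]*(\S*)[ \t]*$")


def find_code_blocks(lines):
    """Return the body lines of each fenced code block found in lines."""
    blocks = []
    i = 0
    while i < len(lines):
        m = OPEN_RE.match(lines[i])
        if m:
            fence = m.group(1)
            # The closing fence uses the same character and may be as long
            # as the opening fence or longer.
            close_re = re.compile(
                r"^%s{%d,}[ \t]*$" % (re.escape(fence[0]), len(fence))
            )
            for j in range(i + 1, len(lines)):
                if close_re.match(lines[j]):
                    blocks.append(lines[i + 1:j])
                    i = j  # resume scanning after the closing fence
                    break
        i += 1
    return blocks
```

A line of four or more tildes never opens a block here, so it stays
available as a setext-style H3 underline.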
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
Adjust code to properly handle "empty" tags that are written as an open
plus closing tag but do not contain any whitespace in the opening tag.
The code already properly handles turning <hr noshade></hr> into
just <hr noshade="noshade" />, but it was failing to handle that
when the opening tag did not contain any whitespace such as <br></br>.
Adjust the code to return the proper value for the opening tag under
such a condition so that it's handled properly.
Previously a sequence such as <br></br> would fail as it would end
up being turned into <br /></br> which then fails XML validation.
Now it works properly and turns <br></br> into <br /> as it should
have been doing all along.
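A minimal Python sketch of the intended result (not the Perl
implementation; collapse_empty_pairs and the EMPTY_TAGS subset are
hypothetical, and attribute normalization is left out):

```python
import re

# Assumed subset of HTML "empty" elements.
EMPTY_TAGS = {"br", "hr", "img", "area", "col"}

# An opening tag, with or without attributes, immediately followed by
# its matching closing tag.
PAIR_RE = re.compile(
    r"<([A-Za-z][A-Za-z0-9]*)((?:\s[^<>]*?)?)\s*>\s*</\1\s*>",
    re.IGNORECASE,
)


def collapse_empty_pairs(html):
    """Turn <br></br> style pairs of empty tags into a single <br /> tag."""
    def repl(m):
        name = m.group(1).lower()
        if name not in EMPTY_TAGS:
            return m.group(0)  # not an empty element: leave the pair alone
        attrs = m.group(2).strip()
        return "<%s%s />" % (name, " " + attrs if attrs else "")
    return PAIR_RE.sub(repl, html)
```

The attribute group is optional, so the no-whitespace case
(<br></br>) takes the same path as the case with attributes.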
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
While closing tags are matched okay if they contain whitespace,
that whitespace was not being cleaned up in a comparable way to
the manner in which whitespace in an opening tag is being handled.
Make whitespace in closing tags be handled the same way as
whitespace in opening tags.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
With --strip-comments-lax even strictly invalid XML comments will
be stripped.
With --strip-comments-lax-only only strictly invalid XML comments
will be stripped.
Allowing strictly invalid XML comments to pass through to the output
would produce invalid XML.
By default such invalid comments end up having their leading '<'
escaped so that they become plain text in the output thereby avoiding
making it invalid XML.
However, if comments are being stripped out, there's no reason the
standard cannot be relaxed a little bit since the output will remain
valid XML as the comments will not be passed through to the output
in that case.
The two new options, --strip-comments-lax and --strip-comments-lax-only
provide a choice of behavior, strip all comments including the
strictly invalid ones, or just strip the strictly invalid ones.
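The difference between the modes can be sketched in Python. This
is an illustration only: strip_comments and its mode names are
hypothetical, "strict" stands in for stripping only valid comments,
and the escaping of invalid comments that are left behind is not
shown.

```python
import re

# Anything that merely looks like a comment: "<!--" up to the next "-->".
LAX_COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)


def is_strictly_valid(comment):
    """XML forbids "--" inside a comment body and a body ending in "-"."""
    body = comment[4:-3]
    return "--" not in body and not body.endswith("-")


def strip_comments(text, mode):
    """mode is one of "strict", "lax" or "lax-only" (hypothetical names)."""
    if mode not in ("strict", "lax", "lax-only"):
        raise ValueError("unknown mode: %r" % (mode,))

    def repl(m):
        valid = is_strictly_valid(m.group(0))
        if mode == "lax":
            return ""                          # strip every comment
        if mode == "strict":
            return "" if valid else m.group(0)  # strip valid ones only
        return m.group(0) if valid else ""      # lax-only: strip invalid ones
    return LAX_COMMENT_RE.sub(repl, text)
```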
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
A tag such as this:
<span style="
lots: of;
stuff: in;
here: now;
"></span>
Is perfectly valid. Add the missing "s" pattern match
qualifier to make sure such attribute values do not end up
getting mangled.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
For %cellhalign allow the overlooked 'char' and 'charoff' attributes.
For table allow the overlooked 'frame' and 'rules' attributes.
For table, tr, th and td allow the 'bgcolor' attribute.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
The line number mentioned in any error message gets generated by
counting from the beginning of the non-yaml output.
Of course, the final output will include any yaml table if generated.
Adjust the line number in any error messages by the number of lines
of preceding yaml table that will be included in the output.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
Unless the new, heavily discouraged, `--keep-named-character-entities`
option has been given, always convert known named character entities
to their equivalent numerical entity.
All strict XML validators will complain about anything other than
the required-by-XML five entities (&amp; &lt; &gt; &quot; &apos;)
unless an entity dictionary has been provided.
In addition, some older XHTML clients do not grok the &apos; entity.
Now only the universally supported four entities (&amp; &lt; &gt; &quot;)
will be preserved by default.
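A small Python sketch of the default behavior, using the HTML 4
entity table from the standard library. It is an illustration, not
the Perl code: numeric_entities and its keep_named parameter are
hypothetical, and emitting decimal rather than hexadecimal
references is an assumption.

```python
import re
from html.entities import name2codepoint

# The four named entities that are preserved by default.
PRESERVED = {"amp", "lt", "gt", "quot"}

ENTITY_RE = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def numeric_entities(text, keep_named=False):
    """Replace known named entities with numeric character references."""
    if keep_named:
        return text  # mirrors the discouraged keep-named option

    def repl(m):
        name = m.group(1)
        if name in PRESERVED:
            return m.group(0)
        if name == "apos":
            return "&#39;"  # apos is not part of the HTML 4 table
        codepoint = name2codepoint.get(name)
        if codepoint is None:
            return m.group(0)  # unknown name: leave it untouched
        return "&#%d;" % codepoint
    return ENTITY_RE.sub(repl, text)
```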
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
It can be very convenient to be able to wrap the contents
in its own output "<div>". Add an option to do that with
an underlying corresponding API option to match.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
There was absolutely no benefit to passing in an xmlcheck value
of 1 to the Markdown/ProcessRaw API. It was ignored and did NOT
result in any checking.
Change this so that any value other than a numeric 0 results
in XML checking when calling the API.
This makes the most sense and avoids creating obscure API bugs.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
Normally there's no point to a "<br />" tag at the end of a
paragraph as the end of the paragraph will force a break anyway,
unless that "br" tag contains a "clear='...'" attribute.
Make sure that 3 or more spaces at the end of a paragraph actually
turn into a "<br clear='all' />" tag but at the same time make 2
spaces at the end of a paragraph just go away as they serve no
purpose.
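The end-of-paragraph rule, sketched in Python for illustration
(finish_paragraph is a hypothetical name; breaks in the middle of
a paragraph are not covered here):

```python
def finish_paragraph(text):
    """Apply the trailing-space rule to the last line of a paragraph."""
    stripped = text.rstrip(" ")
    trailing = len(text) - len(stripped)
    if trailing >= 3:
        # Three or more trailing spaces request a clearing break.
        return stripped + "<br clear='all' />"
    # Fewer trailing spaces serve no purpose at the end of a paragraph,
    # so they are simply dropped.
    return stripped
```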
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
Add missing conjunction.
Update example of document that fails with --raw-html but
not --raw-xml. With the recent changes, the old example
no longer fails. Use a different example that still fails.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
Each H1, H2, ... H6 generated courtesy of markdown markup has an
implicit anchor assigned based on the content of the element.
For example:
# This is an _H1_ header
Strip any inline markup (in this case the '_'s) out before creating
the implicit anchor. With this change, the text used to generate
the anchor for the above is just "This is an H1 header".
There are a couple of additional places where text that might have
inline markup gets turned into an identifier (implicit reference
links such as [thing][] or [thing] and wiki links without an
explicit link destination such as [[thing]]). Perform the same
tag stripping for them too before trying to find the destination.
Many links that should have connected previously now do.
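As an illustration of the stripping step only (a Python sketch, not
the Perl code; anchor_text is a hypothetical name, and the later
conversion of the text into an actual id value is not shown), assume
the inline markup has already been rendered to tags:

```python
import re

TAG_RE = re.compile(r"<[^>]*>")


def anchor_text(rendered):
    """Drop inline tags and collapse whitespace before building an anchor."""
    text = TAG_RE.sub("", rendered)
    return re.sub(r"\s+", " ", text).strip()
```

The same helper would apply to the text of [thing][] and [[thing]]
style links before looking up their destination.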
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
Some @#%^@! are doing something like this:
```shell script
blah blah blah
```
That was not previously matching because only one optional "word"
was allowed trailing the opening "```" characters.
The single optional "word" is supposed to be a file extension type.
Clearly ".shell script" is _not_ a file extension!
Relax the rule somewhat. Multiple "words" are now allowed but only
the first will ever participate in choosing the syntax highlighting
(which currently never happens anyway).
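A Python sketch of the relaxed rule, for illustration only (the
pattern and the fence_language name are hypothetical, not what the
Perl code uses):

```python
import re

# Three or more backticks, an optional first word, then any number of
# additional words that are accepted but otherwise ignored.
FENCE_OPEN_RE = re.compile(r"^`{3,}[ \t]*([^\s`]*)(?:[ \t]+[^\s`]+)*[ \t]*$")


def fence_language(line):
    """Return the first word after an opening fence, or None if no fence."""
    m = FENCE_OPEN_RE.match(line)
    if m is None:
        return None
    return m.group(1)
```

Only the first word is captured, so "shell script" selects "shell".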
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
When running _HashHTMLBlocks, there's a step where we
"match any empty block tags that should have been paired."
Exclude "p" from that list. Given a document like this:
<p>
text
That isolated "p" was getting sequestered away into its own
blob resulting in an output document like this:
<p>
</p><p>text</p>
By removing "p" from the list of "empty block tags that should
have been paired," we get this output instead:
<p>
text</p>
A nice improvement.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
Although "thead" and "tfoot" do, indeed, have an optional closing
tag, neither "td", "th" nor "tr" will auto-close them.
Therefore remove "thead" and "tfoot" from the list of tags that
"td", "th" and "tr" will auto-close.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
The "bdo" (Bi-Directional Override) container element always requires
at least one attribute to be present for it to be valid. Specifically,
in this case, the "dir" attribute.
Add "bdo" to the `%taga1p` (TAGs requiring Attributes count of 1 Plus)
hash to reflect this. A bare "<bdo>" will now be passed through to
the output as "&lt;bdo>" when using the default options.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
Given an input document like this:
<div>
<p>
<pre>hi</pre>
</p>
</div>
It will validate just fine in `--raw-xml` mode. However, in normal
"html/xhtml" mode, the "pre" opening tag automatically closes the
currently open "p" tag leading to this:
<div>
<p>
</p><pre>hi</pre>
</p>
</div>
Without further intervention, the closing "p" tag that was already
there (just before the closing "div" tag), now has no matching open
"p" tag to close anymore -- the corresponding open tag is now the
open "div" section. Obviously the document fails to validate at
this point.
The naive fix simply has the closing tag that corresponds to the
opening tag that caused the "p" to be auto-closed to then automatically
re-open a "p" at that point producing this:
<div>
<p>
</p><pre>hi</pre><p>
</p>
</div>
While such a solution does work, it frequently ends up introducing
extra unwanted "p" sections.
Instead of reopening the "p" immediately upon seeing the closing
tag that matches the opening tag that auto-closed the "p", simply
set a "reopen p" flag.
When the "reopen p" flag is set and suitable conditions are met,
then go ahead and "reopen" a new "p" tag.
The exact conditions are a bit of a heuristic at the moment but
amount to clearing the "reopen p" flag when the next start tag is
seen and inserting a new "p" at that time only if the open tag is
a text level element opening tag.
Alternatively, if the "reopen p" flag is currently set and some
non-whitespace text shows up before seeing another open tag, re-open
a new "p" at that point (and clear the "reopen p" flag).
Finally, if the flag is currently set and a closing "p" tag appears,
just discard it and clear the "reopen p" flag. Essentially this
case has the effect of just moving the closing "p" tag.
With these changes, the troublesome document now produces this:
<div>
<p>
</p><pre>hi</pre>
</div>
An improvement on what came before. Some might argue that the empty
"p" section ought to simply be omitted entirely. Perhaps. But
there was an explicit open "p" tag in the text -- auto closing it
is one thing -- removing an explicit open tag entirely is something
else.
Additionally, since the validator validates in a "streamy" way,
that's much more difficult to accomplish since at the time the
initial opening "p" has been seen there's not yet any information
available about the fact it's about to be auto-closed while still
not containing any text and it therefore gets emitted to the output.
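The flag handling described above can be sketched as a small
streaming pass in Python. This is an illustration under assumptions,
not the validator itself: fix_paragraphs is a hypothetical name, and
BLOCK_TAGS and INLINE_TAGS are assumed subsets of the tags that
auto-close a "p" and of the text level elements.

```python
import re

BLOCK_TAGS = {"pre", "div", "ul", "ol", "table", "blockquote"}
INLINE_TAGS = {"a", "b", "code", "em", "i", "span", "strong"}

TOKEN_RE = re.compile(r"(</?[A-Za-z][A-Za-z0-9]*[^<>]*>)")
TAG_RE = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)")


def fix_paragraphs(html):
    out = []
    stack = []         # entries are [tag name, "this tag auto-closed a p"]
    reopen_p = False   # the "reopen p" flag
    # With one capture group, split() puts tags at the odd indexes.
    for index, token in enumerate(TOKEN_RE.split(html)):
        if index % 2 == 0:                        # plain text
            if reopen_p and token.strip():
                out.append("<p>")                 # non-whitespace text reopens
                stack.append(["p", False])
                reopen_p = False
            out.append(token)
            continue
        m = TAG_RE.match(token)
        closing, name = m.group(1) == "/", m.group(2).lower()
        if not closing:
            if reopen_p:
                reopen_p = False                  # any start tag clears the flag
                if name in INLINE_TAGS:
                    out.append("<p>")             # text level tag reopens
                    stack.append(["p", False])
            closed_p = False
            if name in BLOCK_TAGS and stack and stack[-1][0] == "p":
                stack.pop()
                out.append("</p>")                # auto-close the open p
                closed_p = True
            if not token.endswith("/>"):
                stack.append([name, closed_p])
            out.append(token)
        elif name == "p" and reopen_p:
            reopen_p = False                      # discard the stray closing p
        else:
            if stack and stack[-1][0] == name:
                reopen_p = stack.pop()[1]         # set flag if it auto-closed a p
            out.append(token)
    return "".join(out)
```

Because the pass only ever looks at the current token, it cannot go
back and drop the already emitted opening "p".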
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
When commit c86fea4089 ("Markdown: enhance link handling",
2019-10-20, markdown_1.1.8) did its thing,
a new global (%g_anchors_id) was introduced to keep track of all
the link ids being used/generated in order to better connect them
up to the links meant to target them.
Unfortunately, that hash was not getting cleared before processing
each new document. While this is mostly not a problem when running
from the command line since typically only one document ever gets
processed at once, if more than one document is processed at a time,
prior documents could affect the link fragment targets for subsequent
documents.
Correct the problem by properly resetting the global (along with
all the others that are also reset) before processing a new document.
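The failure mode is easy to reproduce in any language that keeps
per-document state in a module-level table. A Python sketch for
illustration (collect_anchors, _reset_globals and the id scheme
are all hypothetical and unrelated to the real id generation):

```python
import re

# Module-level state, standing in for the per-document global.
_anchor_ids = {}


def _reset_globals():
    """Forget everything learned from any previously processed document."""
    _anchor_ids.clear()


def collect_anchors(text):
    """Record an id for every "#" style header found in text."""
    _reset_globals()  # without this call, ids leak between documents
    for m in re.finditer(r"^#+[ \t]+(.+?)[ \t]*$", text, re.MULTILINE):
        title = m.group(1).lower()
        # Invented id scheme, used only to have something to store.
        _anchor_ids[title] = re.sub(r"\W+", "-", title).strip("-")
    return dict(_anchor_ids)
```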
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
The default YAML mode from the command line shows unknown
YAML options in a table prefixed to the output and applies
the ones it recognizes.
Make the API have the same default mode rather than silently
discarding unknown YAML options and ignoring known ones.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
If running as a plug in for either of the two original systems
that this was designed to "plug in" to, continue to use the
archaic, non-standard default for expansion width of physical
tabs. This setting does not affect the "indent level" width.
Otherwise, force the physical tab width expansion to default
to the expected and standard value.
This has been the behavior for some time already, except that
when "use"ing Markdown.pm and calling the API directly this
was being bypassed in favor of the old, archaic default.
With this change, the old, archaic default becomes isolated
to those two originally supported systems.
The setting can still, of course, be changed by using an option
to whatever is desired. The default though will now be more
sane for more clients.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
Replace 'require' with 'use' in a few places where it
should have been "used" in the first place.
Make sure the essential package variables are initialized
inside a BEGIN block.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
Given something that looks like this:
[1][]
[1]: https://example.com/
Ever since commit dfbf2b4e30
("Markdown.pl: retain square brackets around footnotes", 2017-01-19,
markdown_1.1.2), the link text has been rendered to include the
surrounding '[' and ']' because it just looks better that way and
produces a bigger link target.
Unfortunately that can result in the linked text being processed
again and producing a nested anchor which is not only invalid
according to the XHTML specification but is also the wrong rendering
for the input.
Deal with this by hiding the '[' and ']' characters inside link
text the same way other characters within the link text are already
being hidden.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
The actual anchor id values produced while processing a page
are not necessarily immediately obvious.
These implicit anchor id values are created for all markdown-
format H1-H6 headers by "processing" the text of the header.
Provide a new external function, ResolveFragment that can
hook up a fragment identifier to one of these automatically-
generated anchor id values by transforming it as needed.
The lookup table needed by ResolveFragment can be retrieved
after calling Markdown by first setting the 'anchors' key in
the passed in options HASH ref.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
Provide a new urlfunc hook that can inspect/change all urls that
are in "a" "href" attributes and "img" "src" attributes.
Make the new SplitURL and unescapeXML routines exportable (@EXPORT_OK)
and rename the old escape function to be escapeXML and make
it exportable (@EXPORT_OK) too.
Add some nice comments to each of the newly exportable functions.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
There are a few tags (e.g. `a`, `area`, `img`, `map`) that
require at least one attribute to be present in order to be
meaningful.
When these tags occur without any attributes they are treated
as non-tags and the leading `<` is escaped to `&lt;`.
This can only happen when sanitize mode is active.
Although already partially implemented, it was not documented
in the help.
Add discussion of this to the help and make the implementation
more robust to catch more of these tags.
This is not intended to be a perversely pedantic change, but
rather to allow such meaningless tags to be used as plain text
without the need for escaping. For example the text:
The <a><c><e> process ...
Can be used exactly as-is and all of the `<`s will automatically
be escaped to `&lt;` since none of them specify meaningful tags.
Of course, using the `--no-sanitize` option will disable this
behavior.
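For illustration, a Python sketch of the escaping rule (not the
sanitizer itself; escape_meaningless is a hypothetical name and
KNOWN_TAGS is an assumed, heavily abbreviated tag list):

```python
import re

# Tags that are meaningless without at least one attribute.
NEEDS_ATTR = {"a", "area", "img", "map"}

# Assumed, abbreviated list of tags the sanitizer would recognize.
KNOWN_TAGS = NEEDS_ATTR | {"b", "em", "i", "p", "span", "strong"}

# An opening tag that carries no attributes at all.
BARE_TAG_RE = re.compile(r"<([A-Za-z][A-Za-z0-9]*)\s*/?>")


def escape_meaningless(text):
    """Escape the leading < of tags that cannot be meaningful as written."""
    def repl(m):
        name = m.group(1).lower()
        if name in NEEDS_ATTR or name not in KNOWN_TAGS:
            return "&lt;" + m.group(0)[1:]
        return m.group(0)
    return BARE_TAG_RE.sub(repl, text)
```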
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
Take a hint from w3m and quietly fix up the six common entities
(among them &lt; &gt; &amp; &quot; &apos;) when they are missing their
trailing ';' provided whatever trailing character is there is not
alphanumeric, an equals sign or a semicolon.
Without this change the leading ampersand would have ended up being
escaped to &amp; in these cases, which is almost certainly incorrect.
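The rule maps naturally onto a negative lookahead. A Python sketch
for illustration (fix_entities is a hypothetical name; only the five
standard XML entity names are listed, and treating end of input like
a harmless trailing character is an assumption):

```python
import re

# An entity name that is NOT followed by an alphanumeric character, an
# equals sign or a semicolon is taken to be missing its semicolon.
MISSING_SEMI_RE = re.compile(r"&(lt|gt|amp|quot|apos)(?![A-Za-z0-9=;])")


def fix_entities(text):
    """Append the missing ';' to common entities that lack one."""
    return MISSING_SEMI_RE.sub(r"&\1;", text)
```

The equals sign exclusion keeps query strings such as "?a=1&amp=2"
from being rewritten.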
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
When sanitize is active (--sanitize, the default), make sure all
"&" issues are checked. This includes things like bare "&" that
should be "&amp;" but aren't. And it includes single/double
quote characters inside attribute values that should be encoded
and are not.
Since the internal validator requires the sanitize mode to be
active, this now makes sure that the internal validation mode
cannot pass through any invalid entity references to the output.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
At the top level of the document, the _HashHTMLBlocks function gets
called to sequester raw top-level html blocks from being processed.
As a result, anything in these top-level blocks escapes general
Markdown processing except that if XML validation has been enabled
(the default), the final result of processing does always pass
through a validation stage.
On the one hand that's good as it allows raw HTML in Markdown docs,
but on the other hand, some basic fix ups are not happening and that's
bad.
Rather than try and push all of the top-level raw HTML block content
through either _RunBlockGamut or _RunSpanGamut (thereby somewhat
defeating the point of allowing raw HTML top-level blocks in the first
place), use a compromise between the two extremes and push all the
text of raw HTML block content through just the _EncodeAmpsAndAngles
function.
This causes things like non-html-escaped ampersands (&) inside "href"
and "src" attributes to magically be transformed into "&amp;" and
at the same time any url adjustment options (i.e. -r, -i, -b, -a) to
be applied.
The result produces better and less surprising outcomes than before.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
The <ul> tag is just as much a block as the <ol> and <dl> tags.
Correct the omission by adding it to the tagblk hash.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
Although the <center>...</center> tag has been deprecated, it still
occurs in the wild.
Since it's equivalent to <div align="center">...</div> it needs to
be treated as a block level tag.
Add it to tagblk to make it so.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
While <dd>, <dt> and <li> all have "optional" closing tags, they
can all be contained within a table and as such must not close
the tags that define the content of the table itself.
Customize the tagacl list for these three to exclude the tags
that may contain table content to prevent their premature closing.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
Although GenerateStyleSheet did, in fact, accept a prefix argument
(properly defaulting if omitted), it was not using the passed in
prefix.
Correct that so the style sheet can be generated using any desired
prefix, but most helpfully using the `style_prefix` as passed in
to the Markdown function.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
Explain the syntax of the optional YAML front matter.
Include a few examples that demonstrate the known keys.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
This also adds support for the YAML front matter header_enum
option which if enabled has the same effect as --auto-number.
Only markdown format h1...h6 headers are numbered with --auto-number.
Any raw <h1>...<h6> contents are left unchanged.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
The <title> value comes from the first markdown markup "h1"
encountered or, if YAML processing is enabled, a "title"
setting if present which always takes precedence.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
Process any YAML front matter that may be present by default.
Provide copious options to control how any YAML front matter that
may be present will be handled including the ability to completely
disable YAML front matter processing altogether.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
There is no dingus to play with; stop talking about it.
Also make the "syntax page" link hook up properly.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
Make the --raw option an alias for --raw-xml and provide a
new --raw-html option.
Previously the --raw option always activated the auto-closing
and optional-closing tag semantics as indicated in the HTML
standard so that a valid XML document would be output.
Unfortunately, these semantics can result in valid XML documents
being rejected.
For example, "<p><pre></pre></p>" would be turned into
"<p></p><pre></pre></p>" because the standard specifies that
the opening "pre" tag automatically closes the open "p" tag.
Retain these auto-closing semantics under the new --raw-html
option while disabling them under the --raw-xml (aka --raw)
option.
This produces a less surprising outcome when valid XML is
provided as input while still providing access to the
auto-closing semantics (via --raw-html) if explicitly desired
when processing raw input.
The auto-closing semantics remain enabled (as before) for the
non-raw mode when using --validate-xml-internal (the default).
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>