There are a few tags (e.g. `a`, `area`, `img`, `map`) that
require at least one attribute to be present in order to be
meaningful.
When these tags occur without any attributes they are treated
as non-tags and the leading `<` is escaped to `&lt;`.
This can only happen when sanitize mode is active.
Although already partially implemented, it was not documented
in the help.
Add discussion of this to the help and make the implementation
more robust to catch more of these tags.
This is not intended to be a perversely pedantic change, but
rather to allow such meaningless tags to be used as plain text
without the need for escaping. For example the text:
The <a><c><e> process ...
Can be used exactly as-is and all of the `<`s will automatically
be escaped to `&lt;` since none of them specify meaningful tags.
Of course, using the `--no-sanitize` option will disable this
behavior.
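The rule described above can be sketched roughly as follows. This is an illustrative Python sketch, not the Markdown.pl implementation: the tag set contains only the four tags named as examples, and the `escape_bare_tags` name is hypothetical. Unknown tags such as `<c>` or `<e>` are not covered by this sketch.

```python
import re

# Assumption: only the four example tags from the message are listed here.
NEEDS_ATTRIBUTE = {"a", "area", "img", "map"}

# An opening tag with no attributes at all, e.g. <a> or <img />.
BARE_TAG = re.compile(r"<([A-Za-z][A-Za-z0-9]*)(\s*/?>)")

def escape_bare_tags(text):
    """Escape the leading '<' of attribute-less tags that need an attribute."""
    def fix(match):
        if match.group(1).lower() in NEEDS_ATTRIBUTE:
            return "&lt;" + match.group(1) + match.group(2)
        return match.group(0)  # any other tag is left untouched
    return BARE_TAG.sub(fix, text)
```

A tag that carries an attribute, such as `<a href="x">`, does not match the pattern and passes through unchanged.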
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
Take a hint from w3m and quietly fix up the six common entities
(including &lt; &gt; &amp; &quot; &apos;) when they are missing their
trailing ';' provided whatever trailing character is there is not
alphanumeric, an equals sign or a semicolon.
Without this change the leading ampersand would have ended up being
escaped to &amp; in these cases, which is almost certainly incorrect.
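A minimal Python sketch of this fix-up, assuming a regular-expression approach (the actual Markdown.pl code may differ). The entity list below holds only five names and is an assumption, as is the `fix_entities` name. Treating end-of-input like a non-alphanumeric trailing character is also an assumption.

```python
import re

# Assumption: five entity names; the message speaks of six.
ENTITY_MISSING_SEMI = re.compile(r"&(lt|gt|amp|quot|apos)(?![A-Za-z0-9=;])")

def fix_entities(text):
    """Append the missing ';' to a recognized entity name."""
    return ENTITY_MISSING_SEMI.sub(r"&\1;", text)
```

The negative lookahead leaves query strings like `?x=1&lt=2` alone, since the character after the name is an equals sign.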
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
When sanitize is active (--sanitize, the default), make sure all
"&" issues are checked. This includes things like bare "&" that
should be "&amp;" but aren't. And it includes single/double
quote characters inside attribute values that should be encoded
and are not.
Since the internal validator requires the sanitize mode to be
active, this now makes sure that the internal validation mode
cannot pass through any invalid entity references to the output.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
At the top level of the document, the _HashHTMLBlocks function gets
called to sequester raw top-level html blocks from being processed.
As a result, anything in these top-level blocks escapes general
Markdown processing except that if XML validation has been enabled
(the default), the final result of processing does always pass
through a validation stage.
On the one hand that's good as it allows raw HTML in Markdown docs,
but on the other hand, some basic fix ups are not happening and that's
bad.
Rather than try and push all of the top-level raw HTML block content
through either _RunBlockGamut or _RunSpanGamut (thereby somewhat
defeating the point of allowing raw HTML top-level blocks in the first
place), use a compromise between the two extremes and push all the
text of raw HTML block content through just the _EncodeAmpsAndAngles
function.
This causes things like non-html-escaped ampersands (&) inside "href"
and "src" attributes to magically be transformed into "&amp;" and
at the same time any url adjustment options (i.e. -r, -i, -b, -a) to
be applied.
The result produces better and less surprising outcomes than before.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
The <ul> tag is just as much a block as the <ol> and <dl> tags.
Correct the omission by adding it to the tagblk hash.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
Although the <center>...</center> tag has been deprecated, it still
occurs in the wild.
Since it's equivalent to <div align="center">...</div> it needs to
be treated as a block level tag.
Add it to tagblk to make it so.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
While <dd>, <dt> and <li> all have "optional" closing tags, they
can all be contained within a table.
And as such must not close the tags that define the content of
the table itself.
Customize the tagacl list for these three to exclude the tags
that may contain table content to prevent their premature closing.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
Although GenerateStyleSheet did, in fact, accept a prefix argument
(properly defaulting if omitted), it was not using the passed in
prefix.
Correct that so the style sheet can be generated using any desired
prefix, but most helpfully using the `style_prefix` as passed in
to the Markdown function.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
Explain the syntax of the optional YAML front matter.
Include a few examples that demonstrate the known keys.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
This also adds support for the YAML front matter header_enum
option which if enabled has the same effect as --auto-number.
Only markdown format h1...h6 headers are numbered with --auto-number.
Any raw <h1>...<h6> contents are left unchanged.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
The <title> value comes from the first markdown markup "h1"
encountered or, if YAML processing is enabled, a "title"
setting if present which always takes precedence.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
Process any YAML front matter that may be present by default.
Provide copious options to control how any YAML front matter that
may be present will be handled including the ability to completely
disable YAML front matter processing altogether.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
There is no dingus to play with; stop talking about it.
Also make the "syntax page" link hook up properly.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
Make the --raw option an alias for --raw-xml and provide a
new --raw-html option.
Previously the --raw option always activated the auto-closing
and optional-closing tag semantics as indicated in the HTML
standard so that a valid XML document would be output.
Unfortunately, these semantics can result in valid XML documents
being rejected.
For example, "<p><pre></pre></p>" would be turned into
"<p></p><pre></pre></p>" because the standard specifies that
the opening "pre" tag automatically closes the open "p" tag.
Retain these auto-closing semantics under the new --raw-html
option while disabling them under the --raw-xml (aka --raw)
option.
This produces a less surprising outcome when valid XML is
provided as input while still providing access to the
auto-closing semantics (via --raw-html) if explicitly desired
when processing raw input.
The auto-closing semantics remain enabled (as before) for the
non-raw mode when using --validate-xml-internal (the default).
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
When the --wiki option is active, recognize wiki-style image
links in the format:
[[link-to-image.png|align=left,alt=text]]
Where any "well-known" image suffix may be used in place of ".png"
and the "|..." part is optional but may specify any of the "width=",
"height=", "align=" or "alt=" keywords (provided alt= is always last).
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
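The wiki-style image link format described above could be parsed along these lines. This is a hedged Python sketch, not the Markdown.pl code: the suffix list and the `parse_wiki_image` name are hypothetical, and treating everything after "alt=" as the alt text is an assumption drawn from the "alt= is always last" requirement.

```python
import re

# Assumption: a small stand-in for the "well-known" image suffixes.
IMAGE_SUFFIXES = ("png", "gif", "jpg", "jpeg", "svg")

WIKI_IMAGE = re.compile(
    r"\[\[([^\]|]+\.(?:%s))(?:\|([^\]]*))?\]\]" % "|".join(IMAGE_SUFFIXES),
    re.IGNORECASE,
)

def parse_wiki_image(text):
    """Return a dict of image attributes, or None if not an image link."""
    match = WIKI_IMAGE.fullmatch(text.strip())
    if not match:
        return None
    result = {"src": match.group(1)}
    options = match.group(2) or ""
    # alt= comes last, so the rest of the option text is taken as the alt value.
    head, sep, alt = options.partition("alt=")
    if sep:
        result["alt"] = alt
    for item in head.split(","):
        key, eq, value = item.partition("=")
        if eq and key in ("width", "height", "align"):
            result[key] = value
    return result
```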
Allow spaces to be retained when generating wiki file names
by using the new "b" wiki sub-option.
Since spaces are always trimmed (leading and trailing removed
and runs of multiple replaced with a single) before processing
wiki links, multiple consecutive white space characters are
always collapsed to a single space in the final URL.
Since the retained spaces are subject to URL encoding, they
become "%20" in the final URL.
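The effect on a retained-space name can be illustrated with a short Python sketch. It only models the trimming and URL encoding described above; the `retained_space_name` function is hypothetical and the real code is Perl.

```python
import re
from urllib.parse import quote

def retained_space_name(name):
    """Trim, collapse whitespace runs, then URL-encode the result."""
    collapsed = re.sub(r"\s+", " ", name.strip())
    return quote(collapsed)  # each remaining space becomes %20
```

For example, `retained_space_name("  My   Wiki  Page ")` gives `My%20Wiki%20Page`.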
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
Given input like this:
hi<p>_</p>there
avoid leaving a dangling text blob outside of any "p" section
like this:
<p>hi</p><p>_</p>there
Instead, auto-open a new "p" section so the final text blob
ends up properly wrapped like so:
<p>hi</p><p>_</p><p>there</p>
This reflects the actual rendering behavior of the client
"user agent" (aka browser) which would end up supplying the
missing <p>...</p> wrapper in any case.
By doing this the output better reflects the way the markup
actually renders.
The heuristic used to auto-open the "p" section may not always
auto-open a "p" when it should, but it should never auto-open
a "p" when it shouldn't.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
Since each "paragraph" is wrapped between a "<p>" and "</p>"
this input:
<p>hi
<p>bye
has been producing this output:
<p></p><p>hi</p>
<p></p><p>bye</p>
Correct this so that if the leading "<p>" of the paragraph wrapper
is immediately auto-closed then it's simply discarded rather than
creating a bogus "<p></p>" section.
With this change the previous input now produces this output:
<p>hi</p>
<p>bye</p>
The bogus leading "<p></p>" sections have been omitted and the
output looks much nicer.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
When forming paragraphs, a $string is wrapped to become <p>$string</p>.
If the opening "<p>" ends up being auto-closed by markup within
$string, then either another "<p>" must be auto-opened or the closing
"</p>" of the wrapper must be silently dropped to avoid a validation
failure.
Figuring out exactly where to auto-open the "<p>" turns out to be
somewhat more difficult than just dropping the wrapper's "</p>".
For now just go ahead and drop the wrapper's closing "</p>" if the
wrapper's opening "<p>" has been auto-closed by the time the validator
encounters the wrapper's closing "</p>".
At the same time, make sure that all "optional closing tag" tags
that occur after the wrapper's opening "<p>" get closed immediately
upon encountering the wrapper's closing "</p>" (whether or not it
ultimately gets dropped).
With these changes, this input:
line<p>one
line<p>three
or this input:
line<p>one</p>
line<p>three</p>
produces this output:
<p>line</p><p>one</p>
<p>line</p><p>three</p>
While this input:
line<p>one</p>x1
line<p>three</p>x3
produces this output:
<p>line</p><p>one</p>x1
<p>line</p><p>three</p>x3
In this last example, the "x1" and "x3" text is left hanging outside
of a "p" section. The client "user agent" (aka browser) will end
up rendering these hanging "x1" and "x3" pieces of text in their
own "p" sections.
With these changes, simple markup that would previously have been
rejected for no apparent reason by the default `--validate-xml-internal`
parser while being accepted by the `--validate-xml` option becomes
acceptable to the `--validate-xml-internal` parser as well.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
With a minor enhancement to the support for specifying image
dimensions, images can now be "float"ed to the left or right
or even centered in their own block.
Add the ability to generate a <br clear="all" /> with 3 or
more spaces on the end of a line rather than a plain <br />
with only 2.
Document these additions as well.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
Allow wiki names to be "flatten"ed by replacing runs of one
(or more) "/" characters with "%2F" indicated by the new "%"
sub-option. Ultimately these "%2F" replacements become
"%252F" by the time the final URL is generated.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
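The double encoding mentioned above follows from URL-encoding a string that already contains "%2F". A small Python sketch of the idea, where the `flatten_wiki_name` name is hypothetical and the encoding step is modeled with `urllib.parse.quote`:

```python
import re
from urllib.parse import quote

def flatten_wiki_name(name):
    """Replace each run of '/' with '%2F', then URL-encode the result."""
    flattened = re.sub(r"/+", "%2F", name)
    return quote(flattened)  # the '%' itself is encoded, giving %252F
```

For example, `flatten_wiki_name("docs//api/index")` gives `docs%252Fapi%252Findex`.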
Provide a "wikifunc" 'CODE' ref hook capability to provide
for custom wiki link handling when "use"ing the Markdown module.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
With the `--keep-abs` option absolute path URLs will be preserved
into the output despite any -r/-i options that may be present.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
When stripping XML comments, if any XML comments are recognized as
a standalone block, strip that entire block when forming paragraphs
the final time.
This provides a much cleaner output as it results in many
superfluous blank lines being suppressed that the XML parser
would not otherwise remove when it strips out XML comments.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
When `--strip-comments` is active, if an XML comment is
immediately followed by optional spaces and/or tabs and
a newline, remove those along with the comment itself.
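As a rough illustration, the stripping rule can be written as a single pattern. This Python sketch assumes the XML specification's comment grammar (no "--" inside a comment); the `strip_comments` name is hypothetical and the actual Perl pattern may differ.

```python
import re

# A valid XML comment, optionally followed by spaces/tabs and one newline.
COMMENT_AND_EOL = re.compile(r"<!--(?:[^-]|-[^-])*-->(?:[ \t]*\n)?")

def strip_comments(text):
    """Remove XML comments along with any trailing blanks plus newline."""
    return COMMENT_AND_EOL.sub("", text)
```

A comment followed by other text on the same line is removed on its own, and the rest of the line is kept.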
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
While the default mode of Markdown.pl remains that of a command
line utility, it's fairly simple to "use Markdown" and call the
functions directly.
Explain this usage in the help and make sure all of the auxiliary
functions that might be used for this appear in @EXPORT_OK.
Include an example that simulates `Markdown.pl --stub --wiki`.
Add a symbolic link from Markdown.pm to Markdown.pl to go
along with the new example.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
Even though block tags such as "<p/>" should not appear in
valid XHTML documents, the internal validator (which is
enabled by default) will properly expand "<p/>" to "<p></p>".
However, the block formatting code fails to notice such
an empty tag block leading to it being wrapped in a spurious
"<p>...</p>" pair before it's expanded by the validation code.
Attempt to recognize some of these valid-for-xml-but-not-xhtml
blocks earlier to produce better output.
This is not a perfect fix, but it's an improvement.
It's really an odd edge case anyway that's unlikely to be
encountered very often.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
Move sanity checking of arguments to the Markdown and ProcessRaw
functions into a new _SanitizeOpts function.
Call the new _SanitizeOpts function from both Markdown and ProcessRaw.
Document all of the possible options in the _SanitizeOpts function.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
Create a new "SetWikiOpts" function that parses the
`--wiki=` option value into the appropriate internal
options settings.
Use the new "SetWikiOpts" function to parse the command
line `--wiki=...` option.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
Create a new "GenerateStyleSheet" function that returns
a copy of the internal fancy style sheet using the given
prefix as a prefix of all the CSS style names.
Use the new "GenerateStyleSheet" function to create the
style sheet as needed.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
When parsing a "checkbox" item or image dimensions, recognize
a U+00D7 Multiplication Sign character as equivalent to an "x".
The real "x" is preferred (and still recognized along with "X"),
but in the case where a U+00D7 (×) ends up in there, just go
with it and recognize it as the intent remains clear.
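A sketch of the idea for image dimensions, in Python. The WIDTHxHEIGHT shape and the `parse_dimensions` name are assumptions made for illustration; only the set of accepted separator characters comes from the text above.

```python
import re

# "x", "X" and U+00D7 MULTIPLICATION SIGN are all accepted as the separator.
DIMENSIONS = re.compile("(\\d+)[xX\u00d7](\\d+)")

def parse_dimensions(text):
    """Return (width, height) as integers, or None if not recognized."""
    match = DIMENSIONS.fullmatch(text)
    if not match:
        return None
    return int(match.group(1)), int(match.group(2))
```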
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
Add an explanation of XML comments for those who may not be familiar
with them including a link to the relevant specification, examples,
and exacting details about where they are and are not recognized.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
Combine adjacent (i.e. no separating blank line) standalone
XML comments into the same "block".
This is more efficient, better preserves the original comment
formatting and avoids an unfortunate side-effect that could
introduce unwanted extra paragraphs into the output.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
Make use of more of the Getopt::Long::GetOptions API capabilities
to avoid needing extra, awkward code checks.
With this change, options that support negation (e.g. "stylesheet")
or have variants (e.g. "validate-xml-internal") now work as intended
such that the last option given wins.
Additionally, help/version options are now handled immediately
when encountered.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
The XML standard section 2.5 is quite specific:
the string "--" (double-hyphen) MUST NOT occur within comments
In fact, xmllint will complain about any comments that
incorrectly contain an internal "--" sequence as they are
not valid XML.
Adjust the sanitation code to only pass through valid XML
comments using the same pattern that _HashHTMLBlocks uses
to recognize them.
With this change, invalid XML comments will be treated as
literal text by the sanitizer and have the initial "<" escaped
to &lt; thus rendering them as not a comment at all.
Also take this opportunity to correct the comments in the
_HashHTMLBlocks function from "HTML" to "XML" to reflect
what it actually matches.
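The comment rule can be sketched with the grammar from the XML specification. This Python sketch is illustrative only: the `sanitize_comment` name is hypothetical and the pattern is not necessarily identical to the one _HashHTMLBlocks uses.

```python
import re

# XML Comment production: '<!--' ((Char - '-') | ('-' (Char - '-')))* '-->'
VALID_XML_COMMENT = re.compile(r"<!--(?:[^-]|-[^-])*-->")

def sanitize_comment(candidate):
    """Pass valid comments through; escape the '<' of invalid ones."""
    if VALID_XML_COMMENT.fullmatch(candidate):
        return candidate
    return "&lt;" + candidate[1:]
```

A single internal hyphen is fine, but any internal "--" (including a "--->" ending) makes the candidate invalid.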
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
When using `--stub` and picking up the value of the first "H1" tag
to use as the title, remove markup (such as links, italic, bold,
etc.) from the value before using it.
Since the <title>...</title> value cannot contain links or other markup
this makes the displayed title look much better where such markup
is present in the original document.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
This works to hook up a fragment link to its section:
# Section 1
Link to [Top](#Section_1).
Make the same thing work when written like this:
# Section 1
Link to [Top][id].
[id]: #Section_1
Or even like this:
# Section 1
Link to [id].
[id]: #Section_1
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
A link reference may have the URL actually split onto the next line,
not just the title attribute.
Mention this in the syntax description for links.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
Any absolute path URLs (but not // ones) have the prefix prepended.
If that makes the resulting URL a fully absolute URL it will not
be processed by any --htmlroot and/or --imageroot options.
With this option, site-relative absolute path URLs can be re-written
so that the site is made explicit in order to support viewing on
a different site.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
The "s" option of the --wiki format strips the final extension
before applying the template.
Enhance the "s" option to optionally take a list of extensions
and to only strip the extension if it's one from the list.
Provide a "shortcut" extension that represents all known markdown
extensions.
Change the default --wiki format to now be "%{s(:md)}.html" instead
of the previous default which means it will no longer strip arbitrary
extensions, but only known markdown ones.
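The new default behavior can be approximated as follows. This Python sketch is an assumption-laden illustration: the extension set stands in for "all known markdown extensions" (the real list is not given here), and both function names are hypothetical.

```python
import posixpath

# Assumption: a stand-in for the known markdown extensions.
MARKDOWN_EXTS = {"md", "markdown"}

def strip_known_extension(name, extensions=MARKDOWN_EXTS):
    """Strip the final extension only if it is one from the list."""
    root, ext = posixpath.splitext(name)
    if ext[1:] in extensions:
        return root
    return name

def wiki_target(name):
    """Rough equivalent of the "%{s(:md)}.html" template."""
    return strip_known_extension(name) + ".html"
```

Under these assumptions, `wiki_target("notes.md")` gives `notes.html` while `wiki_target("archive.tar")` gives `archive.tar.html`.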
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
During initial processing, explicit "block" tags are set aside to
avoid creating problems in the output later.
Adjust the matches to be case insensitive.
Also relax the requirement for an extra blank line before and after,
which only prevents them from being recognized where they need to be.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
Add new `--base` option that allows a prefix to be specified to be
added to all bare fragment-only URL links.
Use of this option may be required in order for intra-document
fragment links to function properly within a document that makes
use of the `<base>` tag.
Make sure explicitly specified fragment-only URLs (i.e. given in
verbatim `<a>` tags) get hooked up to the proper destination if
possible.
They obviously are trying to refer to something in the same document
so make sure they get the same treatment to hook them up.
Do the same for fragment-only links inside wiki `[[`...`]]` links.
And for both of these (explicit `<a>` tags and `[[`...`]]` links)
make sure the new bare fragment-only URL prefix gets added if given.
While in there, adjust whitespace to match coding convention for
this file where needed in the sections that have been changed.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>