The line number mentioned in any error message is generated by
counting from the beginning of the non-YAML output.
The final output, however, will include the YAML table if one is
generated.
Adjust the line number in any error messages by the number of lines
of preceding YAML table that will be included in the output.
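As a rough illustration of the adjustment (a sketch in Python rather
than the project's Perl; the function and parameter names are
hypothetical), the reported line is simply offset by the height of
the YAML table:

```python
def adjust_error_line(line_in_body: int, yaml_table_lines: int) -> int:
    """Map a line number counted from the start of the non-YAML output
    to the matching line of the final output (YAML table included)."""
    return line_in_body + yaml_table_lines


# Hypothetical numbers: an error on body line 3 behind a 4-line YAML table.
print(adjust_error_line(3, 4))  # prints 7
```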
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
Unless the new, heavily discouraged, `--keep-named-character-entities`
option has been given, always convert known named character entities
to their equivalent numerical entity.
All strict XML validators will complain about anything other than
the required-by-XML five entities (&amp; &lt; &gt; &quot; &apos;)
unless an entity dictionary has been provided.
In addition, some older XHTML clients do not grok the &apos; entity.
Now only the universally supported four entities (&amp; &lt; &gt; &quot;)
will be preserved by default.
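The conversion can be sketched as follows. This is Python, not the
actual Perl implementation; the names are hypothetical, the use of
decimal (rather than hexadecimal) numeric entities is an assumption,
and Python's HTML 4 entity table stands in for whatever table of
"known" names the real code uses:

```python
import re
from html.entities import name2codepoint

# The four entities that stay as named references.
PRESERVED = {"amp", "lt", "gt", "quot"}
# name2codepoint covers the HTML 4 names only, so apos is added by hand.
CODEPOINTS = dict(name2codepoint, apos=39)

_ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def named_to_numeric(text):
    """Rewrite known named entities as decimal numeric entities."""
    def convert(match):
        name = match.group(1)
        if name in PRESERVED or name not in CODEPOINTS:
            return match.group(0)  # preserved or unknown: leave untouched
        return "&#%d;" % CODEPOINTS[name]
    return _ENTITY.sub(convert, text)


print(named_to_numeric("&copy; A &amp; B&apos;s&nbsp;x"))
# prints &#169; A &amp; B&#39;s&#160;x
```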
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
It can be very convenient to be able to wrap the contents
in its own output "<div>". Add an option to do that with
an underlying corresponding API option to match.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
There was absolutely no benefit to passing in an xmlcheck value
of 1 to the Markdown/ProcessRaw API. It was ignored and did NOT
result in any checking.
Change this so that any value other than a numeric 0 results
in XML checking when calling the API.
This makes the most sense and avoids creating obscure API bugs.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
Normally there's no point to a "<br />" tag at the end of a
paragraph as the end of the paragraph will force a break anyway,
unless that "br" tag contains a "clear='...'" attribute.
Make sure that 3 or more spaces at the end of a paragraph actually
turn into a "<br clear='all' />" tag but at the same time make 2
spaces at the end of a paragraph just go away as they serve no
purpose.
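A minimal sketch of the end-of-paragraph rule, in Python with a
hypothetical function name (the real code is Perl and works on the
whole document, not on one isolated paragraph string):

```python
def finish_paragraph(text: str) -> str:
    """Apply the trailing-space rule to the last line of a paragraph."""
    body = text.rstrip(" ")
    trailing = len(text) - len(body)
    if trailing >= 3:
        # 3 or more trailing spaces request a clearing break.
        return body + '<br clear="all" />'
    # Fewer trailing spaces would only yield a pointless plain break.
    return body


print(finish_paragraph("floated image caption   "))
# prints floated image caption<br clear="all" />
print(finish_paragraph("ordinary text  "))
# prints ordinary text
```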
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
Add missing conjunction.
Update example of document that fails with --raw-html but
not --raw-xml. With the recent changes, the old example
no longer fails. Use a different example that still fails.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
Each H1, H2, ... H6 generated courtesy of markdown markup has an
implicit anchor assigned based on the content of the element.
For example:
    # This is an _H1_ header
Strip any inline markup (in this case the '_'s) out before creating
the implicit anchor. With this change, the text used to generate
the anchor for the above is just "This is an H1 header".
There are a couple of additional places where text that might have
inline markup gets turned into an identifier (implicit reference
links such as [thing][] or [thing] and wiki links without an
explicit link destination such as [[thing]]). Perform the same
tag stripping for them too before trying to find the destination.
Many links that should have connected previously now do.
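The idea can be approximated in a few lines of Python. This is only
a sketch under simplifying assumptions: it removes paired emphasis
and code markers directly with a regular expression, whereas the
message above describes stripping tags, and the function name is
hypothetical:

```python
import re

# Paired inline markers: strong, emphasis and code spans.
_PAIRED = re.compile(r"(\*\*|__|\*|_|`)(.+?)\1")


def strip_inline_markup(text):
    """Return the text with paired inline markers removed."""
    previous = None
    while previous != text:  # repeat to unwrap nested markup
        previous = text
        text = _PAIRED.sub(r"\2", text)
    return text


print(strip_inline_markup("This is an _H1_ header"))
# prints This is an H1 header
```

The same helper could be applied to the text of [thing][] or
[[thing]] before looking up the destination.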
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
Some @#%^@! are doing something like this:
```shell script
blah blah blah
```
That was not previously matching because only one optional "word"
was allowed trailing the opening "```" characters.
The single optional "word" is supposed to be a file extension type.
Clearly ".shell script" is _not_ a file extension!
Relax the rule somewhat. Multiple "words" are now allowed but only
the first will ever participate in choosing the syntax highlighting
(which currently never happens anyway).
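A sketch of the relaxed rule in Python. The pattern and names here
are hypothetical and not the expression Markdown.pl actually uses;
they only show "any number of words, first one wins":

```python
import re

FENCE = "`" * 3  # the three-backtick fence marker

# Opening fence: the marker, then zero or more whitespace-separated words.
_OPEN = re.compile(r"^`{3}[ \t]*([^\s`]+)?(?:[ \t]+[^\s`]+)*[ \t]*$")


def opening_fence_word(line):
    """Return None if line is not an opening fence, otherwise the first
    trailing word ('' when there is none)."""
    match = _OPEN.match(line)
    if match is None:
        return None
    return match.group(1) or ""


print(opening_fence_word(FENCE + "shell script"))  # prints shell
```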
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
When running _HashHTMLBlocks, there's a step where we
"match any empty block tags that should have been paired."
Exclude "p" from that list. Given a document like this:
    <p>
    text
That isolated "p" was getting sequestered away into its own
blob resulting in an output document like this:
    <p>
    </p><p>text</p>
By removing "p" from the list of "empty block tags that should
have been paired," we get this output instead:
    <p>
    text</p>
A nice improvement.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
Although "thead" and "tfoot" do, indeed, have an optional closing
tag, neither "td", "th" nor "tr" will auto-close them.
Therefore remove "thead" and "tfoot" from the list of tags that
"td", "th" and "tr" will auto-close.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
The "bdo" (Bi-Directional Override) container element always requires
at least one attribute to be present for it to be valid. Specifically,
in this case, the "dir" attribute.
Add "bdo" to the `%taga1p` (TAGs requiring Attributes count of 1 Plus)
hash to reflect this. A bare "<bdo>" will now be passed through to
the output as "&lt;bdo>" when using the default options.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
Given an input document like this:
    <div>
    <p>
    <pre>hi</pre>
    </p>
    </div>
It will validate just fine in `--raw-xml` mode. However, in normal
"html/xhtml" mode, the "pre" opening tag automatically closes the
currently open "p" tag leading to this:
    <div>
    <p>
    </p><pre>hi</pre>
    </p>
    </div>
Without further intervention, the closing "p" tag that was already
there (just before the closing "div" tag), now has no matching open
"p" tag to close anymore -- the corresponding open tag is now the
open "div" section. Obviously the document fails to validate at
this point.
The naive fix simply has the closing tag that corresponds to the
opening tag that caused the "p" to be auto-closed to then automatically
re-open a "p" at that point producing this:
    <div>
    <p>
    </p><pre>hi</pre><p>
    </p>
    </div>
While such a solution does work, it frequently ends up introducing
extra unwanted "p" sections.
Instead of reopening the "p" immediately upon seeing the closing
tag that matches the opening tag that auto-closed the "p", simply
set a "reopen p" flag.
When the "reopen p" flag is set and suitable conditions are met,
then go ahead and "reopen" a new "p" tag.
The exact conditions are a bit of a heuristic at the moment but
amount to clearing the "reopen p" flag when the next start tag is
seen and inserting a new "p" at that time only if the open tag is
a text level element opening tag.
Alternatively, if the "reopen p" flag is currently set and some
non-whitespace text shows up before seeing another open tag, re-open
a new "p" at that point (and clear the "reopen p" flag).
Finally, if the flag is currently set and a closing "p" tag appears,
just discard it and clear the "reopen p" flag. Essentially this
case has the effect of just moving the closing "p" tag.
With these changes, the troublesome document now produces this:
    <div>
    <p>
    </p><pre>hi</pre>
    </div>
An improvement on what came before. Some might argue that the empty
"p" section ought to simply be omitted entirely. Perhaps. But
there was an explicit open "p" tag in the text -- auto closing it
is one thing -- removing an explicit open tag entirely is something
else.
Additionally, since the validator validates in a "streamy" way,
that's much more difficult to accomplish since at the time the
initial opening "p" has been seen there's not yet any information
available about the fact it's about to be auto-closed while still
not containing any text and it therefore gets emitted to the output.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
When commit c86fea4089 ("Markdown: enhance link handling",
2019-10-20, markdown_1.1.8) did its thing,
a new global (%g_anchors_id) was introduced to keep track of all
the link ids being used/generated in order to better connect them
up to the links meant to target them.
Unfortunately, that hash was not getting cleared before processing
each new document. While this is mostly not a problem when running
from the command line since typically only one document ever gets
processed at once, if more than one document is processed at a time,
prior documents could affect the link fragment targets for subsequent
documents.
Correct the problem by properly resetting the global (along with
all the others that are also reset) before processing a new document.
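The failure mode and the fix can be illustrated with a small Python
sketch. The module-level dictionary stands in for %g_anchors_id; the
id scheme (lower-casing, "-N" suffixes) and all names are
hypothetical and only serve to make the leak visible:

```python
# Module-level state standing in for the %g_anchors_id global.
_anchor_ids = {}


def _reset_globals():
    """Forget every anchor id recorded for the previous document."""
    _anchor_ids.clear()


def process_document(headers):
    """Return the anchor ids for one document's headers."""
    _reset_globals()  # without this, ids leak from one document to the next
    ids = []
    for header in headers:
        base = header.lower().replace(" ", "-")
        seen = _anchor_ids.get(base, 0)
        _anchor_ids[base] = seen + 1
        ids.append(base if seen == 0 else "%s-%d" % (base, seen))
    return ids


print(process_document(["Intro"]))  # prints ['intro']
print(process_document(["Intro"]))  # prints ['intro'] again
```

With the reset removed, the second call would return ['intro-1'],
which is the cross-document interference described above.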
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
The default YAML mode from the command line shows unknown
YAML options in a table prefixed to the output and applies
the ones it recognizes.
Make the API have the same default mode rather than silently
discarding unknown YAML options and ignoring known ones.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
If running as a plug in for either of the two original systems
that this was designed to "plug in" to, continue to use the
archaic, non-standard default for expansion width of physical
tabs. This setting does not affect the "indent level" width.
Otherwise, force the physical tab width expansion to default
to the expected and standard value.
This has been the behavior for some time already, except that
when "use"ing Markdown.pm and calling the API directly this
was being bypassed in favor of the old, archaic default.
With this change, the old, archaic default becomes isolated
to those two originally supported systems.
The setting can still, of course, be changed by using an option
to whatever is desired. The default though will now be more
sane for more clients.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
Replace 'require' with 'use' in a few places where it
should have been "used" in the first place.
Make sure the essential package variables are initialized
inside a BEGIN block.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
Given something that looks like this:
    [1][]
    [1]: https://example.com/
Ever since commit dfbf2b4e30
("Markdown.pl: retain square brackets around footnotes", 2017-01-19,
markdown_1.1.2), the link text has been rendered to include the
surrounding '[' and ']' because it just looks better that way and
produces a bigger link target.
Unfortunately that can result in the linked text being processed
again and producing a nested anchor which is not only invalid
according to the XHTML specification but is also the wrong rendering
for the input.
Deal with this by hiding the '[' and ']' characters inside link
text the same way other characters within the link text are already
being hidden.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
The actual anchor id values produced while processing a page
are not necessarily immediately obvious.
These implicit anchor id values are created for all markdown-
format H1-H6 headers by "processing" the text of the header.
Provide a new external function, ResolveFragment, that can
hook up a fragment identifier to one of these automatically-
generated anchor id values by transforming it as needed.
The lookup table needed by ResolveFragment can be retrieved
after calling Markdown by first setting the 'anchors' key in
the passed in options HASH ref.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
Provide a new urlfunc hook that can inspect/change all urls that
are in "a" "href" attributes and "img" "src" attributes.
Make the new SplitURL and unescapeXML routines exportable (@EXPORT_OK)
and rename the old escape function to be escapeXML and make
it exportable (@EXPORT_OK) too.
Add some nice comments to each of the newly exportable functions.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
There are a few tags (e.g. `a`, `area`, `img`, `map`) that
require at least one attribute to be present in order to be
meaningful.
When these tags occur without any attributes they are treated
as non-tags and the leading `<` is escaped to `&lt;`.
This can only happen when sanitize mode is active.
Although already partially implemented, it was not documented
in the help.
Add discussion of this to the help and make the implementation
more robust to catch more of these tags.
This is not intended to be a perversely pedantic change, but
rather to allow such meaningless tags to be used as plain text
without the need for escaping. For example the text:
    The <a><c><e> process ...
Can be used exactly as-is and all of the `<`s will automatically
be escaped to `&lt;` since none of them specify meaningful tags.
Of course, using the `--no-sanitize` option will disable this
behavior.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
Take a hint from w3m and quietly fix up the six common entities
&lt; &gt; &amp; &quot; &apos; &nbsp; when they are missing their
trailing ';' provided whatever trailing character is there is not
alphanumeric, an equals sign or a semicolon.
Without this change the leading ampersand would have ended up being
escaped to &amp; in these cases, which is almost certainly
incorrect.
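The fix-up amounts to a single substitution with a negative
lookahead. The following Python sketch is not the Perl source; the
function name is hypothetical and the exact set of entity names in
the pattern is an assumption of this example:

```python
import re

# Entity names to repair (assumed set) when not followed by an
# alphanumeric character, an equals sign or a semicolon.
_MISSING_SEMI = re.compile(r"&(lt|gt|amp|quot|apos|nbsp)(?![A-Za-z0-9=;])")


def fix_missing_semicolons(text):
    """Append the missing ';' to common entities that lack one."""
    return _MISSING_SEMI.sub(r"&\1;", text)


print(fix_missing_semicolons("if a &lt b &amp&amp c"))
# prints if a &lt; b &amp;&amp; c
print(fix_missing_semicolons("?x=1&amp=2 &lt; &ltx"))
# prints ?x=1&amp=2 &lt; &ltx
```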
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
When sanitize is active (--sanitize, the default), make sure all
"&" issues are checked. This includes things like bare "&" that
should be "&amp;" but aren't. And it includes single/double
quote characters inside attribute values that should be encoded
and are not.
Since the internal validator requires the sanitize mode to be
active, this now makes sure that the internal validation mode
cannot pass through any invalid entity references to the output.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
At the top level of the document, the _HashHTMLBlocks function gets
called to sequester raw top-level html blocks from being processed.
As a result, anything in these top-level blocks escapes general
Markdown processing except that if XML validation has been enabled
(the default), the final result of processing does always pass
through a validation stage.
On the one hand that's good as it allows raw HTML in Markdown docs,
but on the other hand, some basic fix ups are not happening and that's
bad.
Rather than try and push all of the top-level raw HTML block content
through either _RunBlockGamut or _RunSpanGamut (thereby somewhat
defeating the point of allowing raw HTML top-level blocks in the first
place), use a compromise between the two extremes and push all the
text of raw HTML block content through just the _EncodeAmpsAndAngles
function.
This causes things like non-html-escaped ampersands (&) inside "href"
and "src" attributes to magically be transformed into "&amp;" and
at the same time any url adjustment options (i.e. -r, -i, -b, -a) to
be applied.
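The ampersand part of that transformation can be sketched in Python
as below. This is an approximation with hypothetical names, not the
body of _EncodeAmpsAndAngles, and it leaves the url adjustment
options out entirely:

```python
import re

# An "&" that does not start a named, decimal or hexadecimal entity.
_BARE_AMP = re.compile(
    r"&(?!(?:#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);)")


def encode_bare_amps(text):
    """Escape ampersands that are not already part of an entity."""
    return _BARE_AMP.sub("&amp;", text)


print(encode_bare_amps('<a href="/q?a=1&b=2">AT&amp;T &#38; co</a>'))
# prints <a href="/q?a=1&amp;b=2">AT&amp;T &#38; co</a>
```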
The result produces better and less surprising outcomes than before.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
The <ul> tag is just as much a block as the <ol> and <dl> tags.
Correct the omission by adding it to the tagblk hash.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
Although the <center>...</center> tag has been deprecated, it still
occurs in the wild.
Since it's equivalent to <div align="center">...</div> it needs to
be treated as a block level tag.
Add it to tagblk to make it so.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
While <dd>, <dt> and <li> all have "optional" closing tags, they
can all be contained within a table.
And as such must not close the tags that define the content of
the table itself.
Customize the tagacl list for these three to exclude the tags
that may contain table content to prevent their premature closing.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
Although GenerateStyleSheet did, in fact, accept a prefix argument
(properly defaulting if omitted), it was not using the passed in
prefix.
Correct that so the style sheet can be generated using any desired
prefix, but most helpfully using the `style_prefix` as passed in
to the Markdown function.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
Explain the syntax of the optional YAML front matter.
Include a few examples that demonstrate the known keys.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
This also adds support for the YAML front matter header_enum
option which if enabled has the same effect as --auto-number.
Only markdown format h1...h6 headers are numbered with --auto-number.
Any raw <h1>...<h6> contents are left unchanged.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
The <title> value comes from the first markdown markup "h1"
encountered or, if YAML processing is enabled, a "title"
setting if present which always takes precedence.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
Process any YAML front matter that may be present by default.
Provide copious options to control how any YAML front matter that
may be present will be handled including the ability to completely
disable YAML front matter processing altogether.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
There is no dingus to play with; stop talking about it.
Also make the "syntax page" link hook up properly.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
Make the --raw option an alias for --raw-xml and provide a
new --raw-html option.
Previously the --raw option always activated the auto-closing
and optional-closing tag semantics as indicated in the HTML
standard so that a valid XML document would be output.
Unfortunately, these semantics can result in valid XML documents
being rejected.
For example, "<p><pre></pre></p>" would be turned into
"<p></p><pre></pre></p>" because the standard specifies that
the opening "pre" tag automatically closes the open "p" tag.
Retain these auto-closing semantics under the new --raw-html
option while disabling them under the --raw-xml (aka --raw)
option.
This produces a less surprising outcome when valid XML is
provided as input while still providing access to the
auto-closing semantics (via --raw-html) if explicitly desired
when processing raw input.
The auto-closing semantics remain enabled (as before) for the
non-raw mode when using --validate-xml-internal (the default).
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
When the --wiki option is active, recognize wiki-style image
links in the format:
    [[link-to-image.png|align=left,alt=text]]
Where any "well-known" image suffix may be used in place of ".png"
and the "|..." part is optional but may specify any of the "width=",
"height=", "align=" or "alt=" keywords (provided alt= is always last).
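A parsing sketch for this link format, in Python. The suffix list,
the function name and the returned structure are all assumptions
made for illustration; only the overall syntax comes from the
description above:

```python
import re

# Assumed list of "well-known" image suffixes; the real list may differ.
_SUFFIXES = r"(?:png|gif|jpe?g|svg)"
_IMAGE_LINK = re.compile(
    r"\[\[([^\]|]+\." + _SUFFIXES + r")(?:\|([^\]]*))?\]\]", re.IGNORECASE)


def parse_wiki_image(link):
    """Return (src, attrs) for a wiki-style image link, or None."""
    match = _IMAGE_LINK.fullmatch(link)
    if match is None:
        return None
    src, options = match.group(1), match.group(2) or ""
    attrs = {}
    # alt= must come last, so everything after it is alt text,
    # commas included.
    head, found, alt = options.partition("alt=")
    if found:
        attrs["alt"] = alt
    for item in head.split(","):
        key, eq, value = item.partition("=")
        if eq and key in ("width", "height", "align"):
            attrs[key] = value
    return src, attrs


print(parse_wiki_image("[[link-to-image.png|align=left,alt=text]]"))
```

Requiring alt= to be last is what lets the alt text contain commas
without any extra quoting.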
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
Allow spaces to be retained when generating wiki file names
by using the new "b" wiki sub-option.
Since spaces are always trimmed (leading and trailing removed
and runs of multiple replaced with a single) before processing
wiki links, multiple consecutive white space characters are
always collapsed to a single space in the final URL.
Since the retained spaces are subject to URL encoding, they
become "%20" in the final URL.
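The combined effect of trimming and URL encoding looks roughly like
this in Python (hypothetical function name; the real code is Perl
and does more than this when building a wiki URL):

```python
import re
from urllib.parse import quote


def wiki_name_with_spaces(name):
    """Trim, collapse runs of white space, then URL-encode the result."""
    collapsed = re.sub(r"\s+", " ", name.strip())
    return quote(collapsed, safe="")


print(wiki_name_with_spaces("  My   Wiki  Page "))
# prints My%20Wiki%20Page
```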
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
Given input like this:
    hi<p>_</p>there
avoid leaving a dangling text blob outside of any "p" section
like this:
    <p>hi</p><p>_</p>there
Instead, auto-open a new "p" section so the final text blob
ends up properly wrapped like so:
    <p>hi</p><p>_</p><p>there</p>
This reflects the actual rendering behavior of the client
"user agent" (aka browser) which would end up supplying the
missing <p>...</p> wrapper in any case.
By doing this the output better reflects the way the markup
actually renders.
The heuristic used to auto-open the "p" section may not always
auto-open a "p" when it should, but it should never auto-open
a "p" when it shouldn't.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
Since each "paragraph" is wrapped between a "<p>" and "</p>"
this input:
    <p>hi
    <p>bye
has been producing this output:
    <p></p><p>hi</p>
    <p></p><p>bye</p>
Correct this so that if the leading "<p>" of the paragraph wrapper
is immediately auto-closed then it's simply discarded rather than
creating a bogus "<p></p>" section.
With this change the previous input now produces this output:
    <p>hi</p>
    <p>bye</p>
The bogus leading "<p></p>" sections have been omitted and the
output looks much nicer.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
When forming paragraphs, a $string is wrapped to become <p>$string</p>.
If the opening "<p>" ends up being auto-closed by markup within
$string, then either another "<p>" must be auto-opened or the closing
"</p>" of the wrapper must be silently dropped to avoid a validation
failure.
Figuring out exactly where to auto-open the "<p>" turns out to be
somewhat more difficult than just dropping the wrapper's "</p>".
For now just go ahead and drop the wrapper's closing "</p>" if the
wrapper's opening "<p>" has been auto-closed by the time the validator
encounters the wrapper's closing "</p>".
At the same time, make sure that all "optional closing tag" tags
that occur after the wrapper's opening "<p>" get closed immediately
upon encountering the wrapper's closing "</p>" (whether or not it
ultimately gets dropped).
With these changes, this input:
    line<p>one
    line<p>three
or this input:
    line<p>one</p>
    line<p>three</p>
produces this output:
    <p>line</p><p>one</p>
    <p>line</p><p>three</p>
While this input:
    line<p>one</p>x1
    line<p>three</p>x3
produces this output:
    <p>line</p><p>one</p>x1
    <p>line</p><p>three</p>x3
In this last example, the "x1" and "x3" text is left hanging outside
of a "p" section. The client "user agent" (aka browser) will end
up rendering these hanging "x1" and "x3" pieces of text in their
own "p" sections.
With these changes, simple markup that would previously have been
rejected for no apparent reason by the default `--validate-xml-internal`
parser while being accepted by the `--validate-xml` option becomes
acceptable to the `--validate-xml-internal` parser as well.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
With a minor enhancement to the support for specifying image
dimensions, images can now be "float"ed to the left or right
or even centered in their own block.
Add the ability to generate a <br clear="all" /> with 3 or
more spaces on the end of a line rather than a plain <br />
with only 2.
Document these additions as well.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
Allow wiki names to be "flatten"ed by replacing runs of one
(or more) "/" characters with "%2F" when requested via the new "%"
sub-option. Ultimately these "%2F" replacements become
"%252F" by the time the final URL is generated.
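The double encoding falls out naturally once the flattened name goes
through ordinary URL encoding, since the "%" itself gets escaped. A
Python sketch with a hypothetical function name:

```python
import re
from urllib.parse import quote


def flatten_wiki_name(name):
    """Replace each run of "/" with "%2F", then URL-encode the name."""
    flattened = re.sub(r"/+", "%2F", name)
    # URL encoding escapes the "%" of "%2F", giving "%252F".
    return quote(flattened, safe="")


print(flatten_wiki_name("docs//intro"))  # prints docs%252Fintro
```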
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
Provide a "wikifunc" 'CODE' ref hook capability to provide
for custom wiki link handling when "use"ing the Markdown module.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
With the `--keep-abs` option absolute path URLs will be preserved
into the output despite any -r/-i options that may be present.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
When stripping XML comments, if any XML comments are recognized as
a standalone block, strip that entire block when forming paragraphs
the final time.
This provides a much cleaner output as it results in many
superfluous blank lines being suppressed that the XML parser
would not otherwise remove when it strips out XML comments.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
When `--strip-comments` is active, if an XML comment is
immediately followed by optional spaces and/or tabs and
a newline, remove those along with the comment itself.
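In regular-expression terms the rule is "comment, then optionally
spaces/tabs and one newline". A Python sketch (hypothetical names;
the real code is Perl and is tied into the XML validator):

```python
import re

# An XML comment plus any spaces/tabs and newline that directly follow it.
_COMMENT = re.compile(r"<!--.*?-->(?:[ \t]*\n)?", re.DOTALL)


def strip_comments(text):
    """Remove XML comments together with an immediately following
    run of spaces/tabs that ends in a newline."""
    return _COMMENT.sub("", text)


print(repr(strip_comments("a\n<!-- note -->  \nb\n")))  # prints 'a\nb\n'
```

A comment in the middle of a line is removed on its own, since no
newline follows it directly.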
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
While the default mode of Markdown.pl remains that of a command
line utility, it's fairly simple to "use Markdown" and call the
functions directly.
Explain this usage in the help and make sure all of the auxiliary
functions that might be used for this appear in @EXPORT_OK.
Include an example that simulates `Markdown.pl --stub --wiki`.
Add a symbolic link from Markdown.pm to Markdown.pl to go
along with the new example.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>