Instead of turning an empty URL into an href="" attribute that
effectively does nothing, change it into an href="#" attribute that
creates a link to the current page.
When adding a relative/image prefix leave fragment-only links
unmolested. They are meant to link somewhere on the current page
and must not be changed.
When inspecting the destination to determine whether to use the -i
prefix instead of the -r prefix when both are given, ignore any
trailing fragment. Fragments don't really make sense on image links
and should never actually be sent to the server by a well-behaved
client anyway, but match them properly in any case.
Also make sure that URLs only get a prefix added at most once.
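The rules above can be sketched roughly as follows. This is a Python
illustration, not the Perl code in Markdown.pl; the function names, the
image extension list and the "already starts with the prefix" check are
all assumptions made for the example.

```python
import re

# Hypothetical list of image extensions, for illustration only.
IMAGE_EXT = re.compile(r"\.(?:png|gif|jpe?g|svgz?)$", re.IGNORECASE)

def has_scheme_or_host(url):
    # An explicit scheme or a leading // "escapes" the prefixes.
    return re.match(r"(?:[A-Za-z][A-Za-z0-9+.-]*:|//)", url) is not None

def prefix_url(url, rel_prefix="", img_prefix=""):
    if url == "":
        return "#"                  # empty URL links to the current page
    if url.startswith("#"):
        return url                  # fragment-only links stay untouched
    if has_scheme_or_host(url):
        return url
    path = url.split("#", 1)[0]     # ignore any trailing fragment
    if img_prefix and IMAGE_EXT.search(path):
        prefix = img_prefix
    else:
        prefix = rel_prefix
    if prefix and url.startswith(prefix):
        return url                  # assumed guard: never prefix twice
    return prefix + url
```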
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
When a and img tags are generated using the normal Markdown syntax
any prefixes specified with the -i and -r options are inserted as
appropriate.
Extend this processing to explicit a and img tags as well.
This makes sense because, for consistency, they should be handled
the same way as the tags generated from the Markdown syntax.
It's still possible to "escape" from the prefixes by using an
explicit scheme+host+port or the commonly supported (but not a
standard) //+host+port mechanism.
And it only matters if prefixes have been set with the -i and/or
-r options (the default is no prefixes) anyway.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
The .svg/.svgz matching rule was matching .svz and .svgz by mistake.
Move the wayward '?' to the end so it matches .svg and .svgz as
originally intended.
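The effect of the misplaced '?' can be seen with a pair of simplified
patterns. These are written in Python for illustration and are assumed
stand-ins, not the actual rule from Markdown.pl.

```python
import re

# '?' after the 'g' makes the 'g' optional: matches .svz and .svgz
wrong = re.compile(r"\.svg?z$")
# '?' after the 'z' makes the 'z' optional: matches .svg and .svgz
fixed = re.compile(r"\.svgz?$")

for name in ("a.svg", "a.svgz", "a.svz"):
    print(name, bool(wrong.search(name)), bool(fixed.search(name)))
```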
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
Use the actual XML comment rule for parsing XML comments.
The leading delimiter is fixed as "<!--" and the trailing delimiter
is fixed as "-->".
In between the leading and trailing delimiters any characters other
than a "-" may be used and a "-" may be used provided it's followed
immediately with a non-"-" character.
Now that the clear beginning and end of comments can be properly
identified, there no longer needs to be a blank line following the
comment -- the end delimiter serves quite unambiguously. Relax the
ending match to just be end of line or end of document.
This makes comments parse much more like they're expected to.
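The comment rule described above can be written as a single regular
expression. The following is a Python sketch of that rule only; the
helper name is made up and the pattern actually used in Markdown.pl may
be structured differently.

```python
import re

# "<!--", then any run of non-"-" characters or a "-" followed
# immediately by a non-"-" character, then "-->".
XML_COMMENT = re.compile(r"<!--(?:[^-]|-[^-])*-->")

def is_comment(text):
    return XML_COMMENT.fullmatch(text) is not None
```

With this rule "<!-- a - b -->" is a comment but "<!-- a -- b -->"
is not.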
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
Allow leading spaces before the backticks delimiters on the starting
and ending lines (up to one less than the indent width).
Then remove up to that number of leading spaces (based on the starting
backticks delimiter line) from each of the lines in the code block
itself.
This better matches how lax some other formatters are with backticks-
delimited code blocks parsing.
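A rough Python sketch of the indentation handling follows. It assumes
an indent width of 4 and a delimiter of three or more backticks, and
the function and pattern names are invented for the example; it is not
the Markdown.pl implementation.

```python
import re

INDENT_WIDTH = 4  # assumed indent width

# Up to INDENT_WIDTH - 1 leading spaces are allowed before the backticks.
OPEN = re.compile(r"^( {0,%d})`{3,}[ ]*\S*[ ]*$" % (INDENT_WIDTH - 1))
CLOSE = re.compile(r"^ {0,%d}`{3,}[ ]*$" % (INDENT_WIDTH - 1))

def extract_fenced(lines):
    """Return the body of the first backticks-delimited block, or None."""
    for i, line in enumerate(lines):
        m = OPEN.match(line)
        if not m:
            continue
        indent = len(m.group(1))  # taken from the starting delimiter line
        body = []
        for inner in lines[i + 1:]:
            if CLOSE.match(inner):
                return body
            # Remove up to `indent` leading spaces, never more.
            leading = len(inner) - len(inner.lstrip(" "))
            body.append(inner[min(indent, leading):])
        return None  # no closing delimiter found
    return None
```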
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
Markup is not allowed inside attributes. Make sure that everything
that ends up in alt="..." and title="..." has been properly escaped
to prevent it from acquiring markup during later processing phases.
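As an illustration, attribute escaping along these lines might look
like the following Python sketch. The helper name is hypothetical, and
encoding '*' and '_' as numeric character references is an assumption
about how later Markdown phases could be kept from seeing them.

```python
import html

def escape_attr(value):
    # Escape &, <, > and quotes so the value is safe inside alt="..."
    # or title="...".
    escaped = html.escape(value, quote=True)
    # Assumed extra step: hide characters that later phases would
    # otherwise treat as emphasis markup.
    return escaped.replace("*", "&#42;").replace("_", "&#95;")
```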
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
Add support for basic tables.
Nested tables are not supported although tables themselves can
appear within lists and blockquotes and do work properly there.
The commonly used table syntax is recognized including the
left/right/center alignment indicators.
Inline markup within each column also works just fine.
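For reference, the alignment indicators in the commonly used pipe
table syntax sit on the separator row and can be read as in this Python
sketch. The function name is invented, and the exact set of variations
accepted by Markdown.pl is not spelled out here, so treat this as the
general convention rather than the implementation.

```python
import re

def column_alignments(separator_row):
    """Map a row such as '| :--- | :---: | ---: |' to alignments."""
    aligns = []
    for cell in separator_row.strip().strip("|").split("|"):
        cell = cell.strip()
        if not re.fullmatch(r":?-+:?", cell):
            raise ValueError("not a separator cell: %r" % cell)
        left, right = cell.startswith(":"), cell.endswith(":")
        if left and right:
            aligns.append("center")
        elif right:
            aligns.append("right")
        elif left:
            aligns.append("left")
        else:
            aligns.append(None)  # no explicit alignment
    return aligns
```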
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
When dealing with program arguments "<dir>" is highly problematic.
Both "<dir>" and "<menu>" have long been deprecated and there are
other tags readily available for making similar lists that are not
deprecated (and do not require use of style sheets either).
Therefore treat "<dir>" and "<menu>" as literal text unless the
new "--deprecated" option is used.
Other "deprecated" tags continue to be recognized and passed through
as they generally do not have non-deprecated equivalents that do not
also require use of style attributes or style sheets in some fashion.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
While trying to keep all the various table-related tags together
is admirable, it makes it hard to be sure the tag is in the list
or not (and also looks bad compared to the other tags).
Therefore put the table-related tags into alphabetical order
just like the rest of them.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
Automatically encode the leading '<' of non-html tag names so they
do not confuse the HTML parser or produce invalid HTML output.
This requires embedding a list of known HTML tags (a list of over
50 is now included).
This will also cause some "unsafe" tags that were previously being
passed through to be escaped (such as "script", "style", "object",
"embed" etc.).
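The idea can be sketched in Python as below. The tag set shown is a
tiny made-up subset for illustration (the real list has over 50
entries) and the pattern is an assumption, not the one in Markdown.pl.

```python
import re

# Tiny illustrative subset of the known HTML tag list.
KNOWN_TAGS = {"a", "b", "br", "code", "div", "em", "i", "img", "li",
              "ol", "p", "pre", "span", "strong", "table", "ul"}

TAG_START = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)(?=[\s/>])")

def encode_unknown_tags(text):
    def repl(m):
        if m.group(2).lower() in KNOWN_TAGS:
            return m.group(0)               # known tag: pass through
        # Unknown tag name: encode only the leading '<'.
        return "&lt;" + m.group(1) + m.group(2)
    return TAG_START.sub(repl, text)
```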
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
When support for additional list markers was added in
51f3d63833 (Markdown.pl: support more list markers, 2017-01-10, 1.1.0),
a bug was inadvertently introduced that could cause adjacent sibling
list items to only recognize the first as a list item and as a side-effect
prevent markup from being recognized in the second.
The problem occurred when the matching pattern was split to run in
progressive matching mode. As a result, adjacent sibling list items
were not always matched by the progressive list item pattern (extra
possible \n's were preventing a match).
Fix this by adding a '+' in the correct location in the progressive pattern.
The side-effect was caused because any "leftover" (of which there shouldn't
be any) was not being processed for markup.
As a precaution, run any leftover through the block gamut markup processor
just in case even though there should never be any leftover.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
Only treat (i), (v) and (x) as alpha if the previous
list marker was lower alpha (or upper alpha in the
case of (I), (V) and (X)).
Previously they were treated as alpha if the first
marker in the list was alpha, but if list marker
types were changed mid-list that could lead to
unexpected behavior.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
If the document contains footnote style links (e.g. [1],
[2], [3] ...) they look much better formatted so as to
retain the square brackets in the link text.
Do this for any footnote style link text consisting of
one to three digits.
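In Python terms the rule amounts to something like the following
sketch. The function name and the shape of the generated tag are
illustrative assumptions only.

```python
import re

FOOTNOTE_TEXT = re.compile(r"[0-9]{1,3}")

def link_html(text, url):
    # Keep the square brackets when the link text is one to three digits.
    shown = "[" + text + "]" if FOOTNOTE_TEXT.fullmatch(text) else text
    return '<a href="%s">%s</a>' % (url, shown)
```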
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
When processing list items, while changes to the list marker style
are allowed for "ol" lists (ignored for "ul" lists), switching
from "ol" to "ul" or vice-versa is not allowed mid-list.
When a numbered/lettered marker is seen while processing a "ul"
list it was simply treated as '*'. This always produced the
correct result since the actual marker does not matter for "ul"
lists.
However, when a '*', '+', or '-' was seen while processing a
"ol" list it was always treated as a '1.' marker. This is,
however, incorrect if the list is not using decimal numbering.
Instead treat a "ul" marker encountered during "ol" list
processing as a repeat of the last marker seen. The lazy
list numbering will kick in and bump it up by one while
retaining the correct list marker style.
The same treatment is also now given to "ol" markers encountered
during "ul" list processing since it's simpler to code that way
even though it doesn't make a difference in output in that case.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
White space in '#', '##', '###', '####', '#####' and '######'
headers is already being normalized (i.e. leading and trailing
whitespace is stripped off).
Do the same thing for '=', '-', and '~' headers and, in addition,
do that for link ids, title text and alt text and for them also
replace internal runs of whitespace with a single space.
This makes the output nicer and more consistent and avoids
subtle bugs due to accidental inclusion of an extra space.
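The two levels of normalization can be illustrated with a short Python
sketch; the helper names are made up for the example.

```python
def normalize_header(text):
    # Headers: strip leading and trailing whitespace only.
    return text.strip()

def normalize_id_or_attr(text):
    # Link ids, title text and alt text: also collapse internal runs
    # of whitespace into a single space.
    return " ".join(text.split())
```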
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
In b7f3fc1c (Markdown.pl: do not mishandle double list markers,
2017-01-09, markdown_1.1.0), an attempt was made to avoid
having something like "* 1. item" be misinterpreted as an
unordered list containing an ordered sublist.
However, the change made to fix the problem only introduced
another problem where lists were not always being recognized
when they should be.
Fix the fix (it did have a bit of a kludgy side to it) so
it works properly and is less kludgy.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
Backticks-delimited code blocks do not require a blank line
between them to be recognized (at least they're not supposed
to).
Recognize two such code blocks in a row by tweaking the
regex to use an assertion instead of an explicit match.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
When using the "implicit link name" shortcut (i.e. the
link name is the same as the link text), the trailing '[]'
is unsightly.
Allow the trailing '[]' to be omitted when the omission is
unambiguous. In other words, if there is no preceding or
following pair of square brackets the trailing '[]' can safely
be omitted.
For example:
See any of [link 1][] [link 2][] [link 3][].
The trailing '[]' MUST NOT be omitted in this case because the
result:
See any of [link 1] [link 2] [link 3]
would be misinterpreted. But, if they're separated with commas
or words instead like so:
See any of [link 1], [link 2] or [link 3].
then they cannot be misinterpreted and the trailing '[]' can be
safely omitted making for a much nicer looking document.
To go with this change the basics.md and syntax.md documents
have been modified to take advantage of these new semantics.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
Originally code blocks simply output this:
<pre><code>code block text</code></pre>
However that sometimes led to unsatisfying formatting with
some browsers (especially text ones) and so that was ultimately
changed to this:
<div><dl></dl><pre><code>code block text</code></pre></div>
The div then additionally has a class to facilitate formatting
with a style sheet. The empty <dl></dl> causes agents that would
otherwise make a poor formatting choice to do "the right thing"(tm)
instead. However, the "<dl></dl>" kludge is unsatisfying for a
number of reasons.
Instead output this:
<div><pre></pre><pre><code>code block text</code></pre></div>
where the div still has a class to facilitate formatting via
a style sheet, but the replacement "<pre></pre>" block has an
embedded style and is actually emitted like so:
<pre style="display:none"></pre>
While this is still a kludge, it's much more satisfying because:
1. The same element type is being used to force those recalcitrant
text agents to do "the right thing"(tm).
2. The explicit style="display:none" attribute completely protects
properly behaving agents from any unwanted side-effects from
the extra "<pre></pre>" tag pair.
Together with this change, the --stub stylesheet has also been
modified to use styling for the code blocks themselves that is more
universally compatible (e.g. when the page background is not white).
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
If the input is UTF-8 then lowercase greek letters may
be used for list "numbering" of <ol> lists.
If the style sheet is not included or the result is
displayed by something that does not support style
sheets they will show as lower-alpha instead.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
Auto-detect input format of either ISO-8859-1 (interpreted as
per the HTML 5 specification) or UTF-8 and always write UTF-8
to the output.
As a result of this change at least Perl 5.8.0 is now required.
The stub document now includes a charset (both meta tags).
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
The double bracket links are pervasive in source documents.
Recognize them and process them. At this point only links
that reference an absolute URL are recognized and turned
into clickable links.
Everything else is passed on through unchanged.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
When the --stub option is used, the output is wrapped in a full
HTML document stub.
This makes it much easier to test and validate the output even
if ultimately it will not be used with the --stub option.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
Upper and lower case letters and roman numerals may
now be used as well as a ')' instead of a '.' to
terminate the marker.
The style sheet must be included for the ')' to
show otherwise it will display as a '.' instead.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
Unordered list items that begin with '[ ]' or '[x]' and
a space will be formatted to display a fancy checkbox
(and checkmark for "x") when the fancy style sheet is
included in the output.
Without the fancy stylesheet everything will still look
fine, but the fancy stuff won't be there.
A new "--show-stylesheet" option is added to show the
style sheet at the beginning of the output. When combined
with no arguments and redirecting standard input to
/dev/null it can be used to show just the style sheet.
And, since we're adding a stylesheet, add an item to
make the "code" blocks look a bit nicer too.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
Since almost the first thing Markdown.pl does is expand
tabs it's silly to have all these patterns with \t in them.
There are only two places where \t in patterns makes sense:
1. The _Detab function that's expanding them
2. The _HashBTCodeBlocks function that's called before _Detab
Therefore purge all the other \t patterns and text that talks
about tabs. A few other minor regex optimizations were made
at the same time in the affected regexes as obvious efficiencies.
This has resulted in another very very very tiny speed boost.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
Clean up the formatting in basics and syntax to make it
more readable as a text document.
This is now possible by making use of the automatic
anchors for top-level headers and '~~~~~' style h3's.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
Long documents often need to link within themselves in order to
provide a convenient table of contents section.
To facilitate this, all setext-style and atx-style headers defined
at the top-level (i.e. they start at the left margin) now have
automatic anchors added to them and link definitions added for
them provided there is not already a link definition with the
same id present.
These can be easily targeted using the "implicit link name"
shortcut (e.g. [Foo][]).
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
This example:
* a
    + 1. x
    + 2. y
* c
Should format as one outer "ul" with the first item having
a second inner "ul". There should not be any "ol" lists in
the formatted result at all.
Correct the code so that it no longer thinks "+ 1. x" both
starts a list and includes a sublist.
Now it only starts a list where the first item just happens to
have content that closely resembles a list marker.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
The regular, indented-by-four-spaces, code blocks do not nest nor
should they. But they were nesting if they were located inside a
list. Fix this by hashifying them and not unhashifying them until
the very end.
Also there's a kludge in the code that says:
# Turn double returns into triple returns, so that we can make a
# paragraph for the last item in a list, if necessary
Unfortunately that perverts blank lines inside a code block.
Fix this by changing the perversion so that it accomplishes the
same thing but has an exact inverse and apply that inverse before
formatting code blocks.
Code blocks inside lists should now format correctly (and this
does fix the example in the README that was previously formatted
incorrectly).
Finally, a code block at the very beginning of the file preceded
by a single blank line would not have been recognized (but if it
were preceded by none or two or more it would have). Now it will
be recognized properly.
And one more thing. Since we're in there tweaking code blocks,
wrap the output in a <div>...</div> section and insert a null
<dl></dl> right after the opening <div> tag. This makes sure
the displayed code block will not end up getting mashed up
against something it shouldn't be mashed up against.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
The change in version 1.0.1 to fix "a bug where lines in the
middle of hard-wrapped paragraphs, which lines look like the
start of a list item, would accidentally trigger the creation
of a list," broke things like this:
For example:
* broken
* microphone
That will now be recognized as a list again. The heuristic is
now when two lines in a row start with the same type of list
marker then recognize that as a list even when the first item
doesn't appear to start its own paragraph.
Additionally if the second line is at the next indent level
then the two lines may have different kinds of list markers
and still be recognized.
All previously recognized lists are still recognized.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
The --tabwidth=<num> option only affects the width to which
tabs are expanded. It does NOT affect the number of spaces
required to start a new indent level. That remains set at 4
no matter what value is used for the --tabwidth=<num> option.
With this change it's now, finally, possible to have proper
tab expansion without breaking the "4 spaces per indent level"
rule.
Note that backticks-delimited code blocks will always expand
their tabs to 8-character tab stop positions no matter what
value is used for the --tabwidth=<num> option.
With this change the default expansion width for tabs when
Markdown.pl is run from the command line is now 8.
When used as a module the default is still 4, but that's
easily changed by passing in a suitable option.
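Expanding tabs to tab stop positions, as opposed to replacing each tab
with a fixed number of spaces, works like this Python sketch. The
function name echoes _Detab, but this is an illustration and not the
Perl code.

```python
def detab(line, tab_width=8):
    out = []
    col = 0
    for ch in line:
        if ch == "\t":
            # Pad out to the next multiple of tab_width.
            pad = tab_width - (col % tab_width)
            out.append(" " * pad)
            col += pad
        else:
            out.append(ch)
            col += 1
    return "".join(out)
```

The tab width only changes where the tab stops fall; the number of
spaces needed to start a new indent level is a separate matter.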
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
Move one-time initialization into BEGIN blocks.
Avoid running qr(...) more than once on expressions that
do not change (actually Perl should mostly already do this).
Get rid of the kludgy check for command-line and move all
that code into a new _main function and call it only when
being run from the command line.
This seems to have resulted in a very very very tiny speed
boost as well.
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
There's not a lot to work with in the way of speeding
things up. However, after timing a few different changes
there were some minor speed ups to be had.
In particular, md5_hex is no longer used in favor
of a global hash table instead.
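The general idea of swapping a digest for a table lookup can be
sketched in Python as follows. The token format and function names are
purely hypothetical; how Markdown.pl actually keys its table is not
described here.

```python
_blocks = {}  # global table: placeholder token -> protected text

def hash_block(text):
    # A counter-based token is cheaper to produce than an MD5 digest
    # and is unique as long as the table only grows.
    key = "\x02%d\x03" % len(_blocks)
    _blocks[key] = text
    return key

def unhash(text):
    # Put the protected text back in place of its token.
    for key, value in _blocks.items():
        text = text.replace(key, value)
    return text
```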
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>
Obscuring email addresses is all very nice, but
outputting a different document every time for the
same input is not.
It screws up caching and last modified checks and
is a very bad thing to do.
Instead continue to obscure email addresses, but
arrange for the same obscurity to be used when the
same input file is processed repeatedly by Markdown.pl
on the same machine.
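One way to get repeatable obscurity is to seed the random choices from
something stable. The Python sketch below seeds from the address
itself, which is an assumption made for illustration; the mechanism
Markdown.pl uses is not described here, and the function name and
probabilities are made up.

```python
import hashlib
import random

def obscure_email(addr):
    # Derive the seed from the input so repeated runs agree.
    digest = hashlib.sha256(addr.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    out = []
    for ch in addr:
        r = rng.random()
        if ch == "@" or r < 0.45:
            out.append("&#%d;" % ord(ch))    # decimal reference
        elif r < 0.9:
            out.append("&#x%X;" % ord(ch))   # hexadecimal reference
        else:
            out.append(ch)                   # left as-is
    return "".join(out)
```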
Signed-off-by: Kyle J. McKay <mackyle@gmail.com>