Add TarsParser: RegularParser's output at RegexParser's speed#123
rhukster wants to merge 2 commits
Hello @rhukster, thank you for the PR! Just a small heads-up: I'll take a proper look at the PR tomorrow. In the meantime, could you please explain where the name "Tars" comes from? There are also a few single-character variables; do you have anything against making them more descriptive?
TarsParser lexes every shortcode tag (opening and closing) in a single PCRE pass, then resolves nesting with a linear stack pass. This pairs RegexParser-class scanning speed with RegularParser-grade robustness:

- the lexer understands quoted values and escapes, so an unterminated quote like [a k="v] correctly fails to lex instead of inventing a bogus parameter
- nesting, mismatched closing tags and open-only shortcodes resolve exactly like the default RegularParser
- pure-ASCII fast path for offsets, deferred parameter parsing for absorbed nodes, and an O(n) absorption pass (no O(n^2) ancestor walk)

Verified byte-identical to RegularParser across 2M+ differential fuzz inputs, and 6.5-9.1x faster than RegularParser (2.7-6.1x faster than FastParser) on representative content. Throws on PCRE failure rather than silently returning no shortcodes. Psalm-clean at errorLevel 1.
Thanks @thunderer! No objection at all, I just pushed a commit making the variables descriptive (the single-char regex-token locals became $openTag/$closeTag/$marker/$separator/$delimiter, loop indices and match-column arrays got real names too, and I followed the $space / word-token style from your RegexBuilderUtility). I folded it into the original commit so it stays a clean two-commit diff.

As for the name: [your call, e.g. "it's a nod to TARS from Interstellar, fast and to the point"]. If you would prefer another name, I'm fine with that too; it was just my 'prototype development' name.
Summary
This adds TarsParser, a fourth parser that produces exactly the same result as RegularParser, including proper nesting and invalid syntax detection, but does the work in a single PCRE pass plus a flat stack instead of a recursive token parser. The goal was RegularParser's correctness at close to RegexParser's speed and memory.
How it works
One `preg_match_all` lexes every individual tag, opening and closing, in a single C-level pass. The regex understands quoted values and escapes, so a broken tag like `[a k="v]` fails to lex instead of inventing a parameter. A linear stack pass then resolves nesting, mismatched closing tags, and open-only shortcodes. There is no full token array, no recursion, and no content backreference.

A few implementation notes:

- … `[a k="v]` and friends correct.
- `[foo.bar]` is rejected wholesale rather than read as `foo` plus a stray parameter.

Comparison to the existing parsers

… `testIssue77` and `testIssue119`, and an unterminated quote like `[a k="v]` makes it emit a bogus parameter. TarsParser is as fast or faster on prose and gets those cases right.

The honest tradeoff: on parameter-heavy or deeply nested input, RegexParser can still be faster, because its backreference swallows nested blocks as opaque content and never looks inside. TarsParser lexes every tag because correct nesting needs it.
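To make the two passes described under "How it works" concrete, here is a minimal sketch of the lex-then-stack idea. It is not the code in this PR: the function name `sketchParse`, the simplified regex, and the flat result shape are assumptions made for illustration. Unlike the real parser it returns nested tags as separate entries, and it scans the stack for each closing tag, so it does not reproduce the O(n) absorption pass.

```php
<?php
declare(strict_types=1);

/**
 * Illustrative sketch only (hypothetical name and result shape).
 * Pass 1 lexes all tags with one preg_match_all call; pass 2 pairs them with a stack.
 *
 * @return list<array{name: string, content: ?string}>
 */
function sketchParse(string $text): array
{
    // A parameter chunk is built from complete quoted strings (with backslash
    // escapes) or single characters that are not quotes, brackets, or spaces.
    // An unterminated quote such as [a k="v] therefore cannot match at all.
    $pattern = '~\[(?<close>/)?(?<name>[A-Za-z][\w-]*)'
        . '(?<params>(?:\s+(?:"(?:[^"\\\\]|\\\\.)*"|[^"\[\]\s])+)*)'
        . '\s*\]~s';

    $count = preg_match_all($pattern, $text, $matches, PREG_SET_ORDER | PREG_OFFSET_CAPTURE);
    if ($count === false) {
        // Surface PCRE failures instead of silently reporting "no shortcodes".
        throw new RuntimeException('PCRE failure: ' . preg_last_error_msg());
    }

    $stack = [];  // open tags: [name, byte offset just past the opening tag, index into $result]
    $result = [];

    foreach ($matches as $match) {
        [$tag, $offset] = $match[0];
        $name = $match['name'][0];

        if ($match['close'][0] !== '/') {
            // Opening tag: record it as open-only until a matching closer appears.
            $result[] = ['name' => $name, 'content' => null];
            $stack[] = [$name, $offset + strlen($tag), count($result) - 1];
            continue;
        }

        // Closing tag: pair it with the nearest open tag of the same name.
        for ($i = count($stack) - 1; $i >= 0; $i--) {
            if ($stack[$i][0] === $name) {
                [, $contentStart, $index] = $stack[$i];
                $result[$index]['content'] = substr($text, $contentStart, $offset - $contentStart);
                // Tags opened after the matched one stay open-only and leave the stack.
                array_splice($stack, $i);
                break;
            }
        }
        // A closer with no matching opener falls through and is ignored.
    }

    return $result;
}
```

In this sketch offsets are byte offsets throughout, which keeps `substr` consistent with `PREG_OFFSET_CAPTURE`; converting them to character offsets for multibyte input is a separate step that the sketch leaves out.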
Benchmarks
Measured on PHP 8.5. Time is the mean of many parses (3000 for the small corpora, fewer for the two large ones). Peak memory is for a single parse, captured with `memory_reset_peak_usage()`. Corpora:

- … gallery/img)

Time per parse (microseconds, lower is better)
Peak memory per parse (lower is better)
TarsParser vs RegularParser (the parser it matches byte for byte)
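For anyone who wants to reproduce measurements of this kind, a harness along the following lines would follow the method described above (mean time over many parses, peak memory for one parse via `memory_reset_peak_usage()`). It is a sketch, not the script used for these numbers: the function name `measureParser`, the baseline subtraction, and the callable-based interface are assumptions.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical benchmark helper. Requires PHP 8.2+ for memory_reset_peak_usage().
 *
 * @param callable(string): mixed $parse any function that parses the given text
 * @return array{meanMicroseconds: float, peakBytes: int}
 */
function measureParser(callable $parse, string $text, int $iterations): array
{
    // Peak memory for a single parse: reset the peak counter, parse once,
    // then report how far the peak rose above the starting usage.
    memory_reset_peak_usage();
    $baseline = memory_get_usage();
    $parse($text);
    $peakBytes = memory_get_peak_usage() - $baseline;

    // Mean wall-clock time over many parses, using the monotonic nanosecond clock.
    $start = hrtime(true);
    for ($i = 0; $i < $iterations; $i++) {
        $parse($text);
    }
    $elapsedNanoseconds = hrtime(true) - $start;

    return [
        'meanMicroseconds' => $elapsedNanoseconds / ($iterations * 1000.0),
        'peakBytes' => $peakBytes,
    ];
}
```

Each parser would be passed in wrapped in a closure, for example `fn (string $t) => $parser->parse($t)`, with the iteration count lowered for the large corpora.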
On the 1 MB document, RegularParser needs 361 MB and 295 ms; TarsParser does the same work, with identical output, in 58 MB and 57 ms, landing right on top of RegexParser for both. The result-object cost is shared (both RegexParser and TarsParser sit near 59 MB there because 26,000 `ParsedShortcode` objects dominate), so RegularParser's extra 300 MB is purely the retained token array. The two spots where TarsParser uses more memory than RegexParser are the deeply nested ones, which is the cost of lexing every tag rather than swallowing inner blocks as opaque content. It is still about 3x leaner than RegularParser there.

Robustness
- … `ParserTest` data provider and `testInstances` so it runs against every existing case alongside the other parsers.
- … `testIssue77` and `testIssue119` to assert TarsParser matches RegularParser on the backtracking cases.
- … `[/0]` closing tag is ignored, since the closing name passes through your `if(!$closingName = ...)` check and `'0'` is falsy in PHP. I matched that on purpose rather than "fixing" it, so the output stays identical.

Safety and compatibility
- … `preg_last_error()`) instead of silently returning no shortcodes, same as RegularParser.
- … `SyntaxInterface` like RegularParser and RegexParser.

What's in this PR
- `src/Parser/TarsParser.php`: the parser.
- `tests/ParserTest.php`: TarsParser added to the data provider, `testInstances`, and the issue77 / issue119 parity checks.
- `README.md`: one factual bullet in the existing parser list, and "three" to "four".

I left benchmark numbers out of the README on purpose. They live here so you can decide whether any of it belongs in the docs. Happy to adjust naming, wording, or scope however you prefer.
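As a footnote to the `[/0]` point under Robustness, the snippet below shows the PHP behaviour involved. The helper `closingNameIsUsable` is hypothetical and only mirrors the shape of the quoted `if(!$closingName = ...)` check; it is not taken from the library.

```php
<?php
declare(strict_types=1);

/** Hypothetical helper mirroring the shape of the quoted check; not library code. */
function closingNameIsUsable(string $rawName): bool
{
    // Assignment inside the condition, followed by a truthiness test on the result.
    if (!$closingName = $rawName) {
        return false; // reached for '' and also for the string '0'
    }

    return true;
}

var_dump(closingNameIsUsable('0'));  // bool(false): the string '0' is falsy
var_dump(closingNameIsUsable('00')); // bool(true)
var_dump(closingNameIsUsable('a'));  // bool(true)
```

A strict comparison such as `$closingName === ''` would treat `'0'` as a valid name, which is exactly the change avoided here to keep output identical to RegularParser.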