Rendering phpBB BBCode in Go: Part 2

In Part 1, we talked mostly about the basic premise of how BBCode rendering works in phpBB. We also clumsily figured out how phpBB 3.2 translates BBCode from the legacy storage format to the new storage format based on s9e TextFormatter.

In this part, I plan to explore how phpBB uses TextFormatter, and how it fits in with the various ways BBCode is defined in phpBB.

Spoiler alert: we're not going to be writing any code. We still have a lot to learn before we can really start on that.

In Pursuit of Performance

phpBB and TextFormatter do some very interesting things to maximize rendering performance. TextFormatter's intermediate XML format enables a few different things, but most importantly, a post is translated to the XML format only once, then translated to HTML many times, each time it is displayed. Therefore, the rendering step needs to be very quick. Presumably, XML was used because it is quick to parse using a native-code XML parser.

The XML intermediate format stays the same no matter which renderer you use. However, as I mentioned in the previous article, there are two different renderers: one that generates an XSLT stylesheet and uses that, and one that generates PHP code.

phpBB uses the one that generates PHP code. There are probably a couple of reasons for this. One, it is probably more flexible. Two, it might very well perform faster than using an XSLT renderer.

phpBB invokes the Configurator, TextFormatter dumps the generated PHP into the phpBB cache folder, and the board then loads it from there. This would explain why the board literally cannot load if the cache directory is unwritable.

Let's take a look at BBCode templates and how they relate to the output.

PHP? XSLT.

I continued to explore what exactly TextFormatter was doing, and I was quite surprised. The native template language for TextFormatter is actually XSLT, and the templates for markup are written in it. It's not quite that cut and dried, though. If a tag is simple enough to be expressed with plain HTML, you can write it that way and it will be automatically translated to XSLT. Plugins also seem to be able to provide their own template behaviors. The BBCode plugin provides a template language compatible with phpBB custom BBCode, which is awfully convenient! In any case, the canonical language for defining tags is XSLT. By the time we get to the renderer generator, we're always dealing with XSLT templates.

So, uhh, effectively, the PHP Renderer code is just an XSLT JIT. Pretty fascinating.

What do we do with this knowledge? Well, we still just want to render text - we don't really need to build all of the extensibility that TextFormatter provides. So, what we really need is to generate an XSLT stylesheet and use it to process the XML from the database. Which is, to be honest, oddly straightforward.

There's still one other snag, though. To deal with custom BBCode, we have to write our own code to convert the BBCode templates into XSLT templates.

How BBCode Templates Become XSLT

Inside the BBCode plugin of TextFormatter is a file called BBCodeMonkey.php. I have no idea what the name signifies, but this file appears to control the behavior of BBCode templates. Following the call stack, you can find that the BBCode to XSLT conversion is based on simple regular expression replacement.

#\\{(?:[A-Z]+[A-Z_0-9]*|@[-\\w]+)\\}#

The BBCode template regular expression.

The BBCode template regular expression is run inside both text nodes and attribute nodes. When a match is found, it is replaced with a node chosen by the callback function in BBCodeMonkey.php. There's some good news here: the process looks fairly straightforward, and a pattern can only be replaced with a single XSLT node. In particular, you can replace the pattern with an xsl:value-of node, an xsl:apply-templates node, or a text node.

I'm not sure where a text node might be useful, but the BBCode callback never uses it - it only generates xsl:value-of and xsl:apply-templates nodes. So that's all we really need to be concerned with.
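To get a feel for the matching step, here is a minimal Go sketch that only finds the tokens in a template. It is not TextFormatter's code: tokenRE and findTokens are names I made up, and I'm assuming the "#" delimiters and doubled backslashes in the pattern above are just PHP quoting, so they are dropped here.

```go
package main

import (
	"fmt"
	"regexp"
)

// tokenRE is the BBCode template pattern quoted above, minus the PHP "#"
// delimiters and with the doubled backslashes collapsed (an assumption
// about how the pattern is quoted in the PHP source).
var tokenRE = regexp.MustCompile(`\{(?:[A-Z]+[A-Z_0-9]*|@[-\w]+)\}`)

// findTokens returns the name of every token in tmpl, in order of
// appearance, with the surrounding braces stripped.
func findTokens(tmpl string) []string {
	var names []string
	for _, m := range tokenRE.FindAllString(tmpl, -1) {
		names = append(names, m[1:len(m)-1])
	}
	return names
}

func main() {
	tmpl := `<div style="text-align: {SIMPLETEXT};">{TEXT}</div>`
	fmt.Println(findTokens(tmpl))
}
```

Deciding what each token should turn into is a separate problem, since the token names alone don't say which attribute they refer to.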

The BBCode template stuff is looking pretty manageable. However, the story isn't quite so simple. We're lacking something important: the attribute mapping.

When we parse BBCodes, we use the BBCode definition. This is used to translate the BBCode into the intermediate XML form we see in the database. When BBCodes are rendered, we take the intermediate XML form and run the XSLT templates on them. So it seems like we can just ignore the BBCode definition and generate an XSLT template, right? Unfortunately, no.

This is because some of the information we need is stuck in the BBCode definition.

Take a look at what happens in this example.

Input:

[align=center]Test[/align]

Intermediate form:

<ALIGN align="center">Test</ALIGN>

Output:

<div style="text-align: center;">Test</div>

This, in itself, is pretty simple. The BBCode definition and template are likewise pretty simple:

Definition: [align={SIMPLETEXT}]{TEXT}[/align]

Template: <div style="text-align: {SIMPLETEXT};">{TEXT}</div>
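To make the last step concrete, here is a rough Go sketch that walks the intermediate form with encoding/xml and produces the output above by hand. This is a stand-in for running an XSLT template, not how TextFormatter does it. The function name renderAlign is hypothetical, and it only understands the simplified ALIGN form shown in this example.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"html"
	"io"
	"strings"
)

// renderAlign converts the intermediate form of the align BBCode to HTML.
// It hardcodes what the template says: the "align" attribute fills in the
// style, and the element's content goes inside the div.
func renderAlign(src string) (string, error) {
	dec := xml.NewDecoder(strings.NewReader(src))
	var out strings.Builder
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			return out.String(), nil
		}
		if err != nil {
			return "", err
		}
		switch t := tok.(type) {
		case xml.StartElement:
			if t.Name.Local != "ALIGN" {
				return "", fmt.Errorf("unsupported tag %q", t.Name.Local)
			}
			align := ""
			for _, a := range t.Attr {
				if a.Name.Local == "align" {
					align = a.Value
				}
			}
			// HTML-escape the value; this sketch does no CSS validation.
			fmt.Fprintf(&out, `<div style="text-align: %s;">`, html.EscapeString(align))
		case xml.EndElement:
			out.WriteString("</div>")
		case xml.CharData:
			out.WriteString(html.EscapeString(string(t)))
		}
	}
}

func main() {
	out, err := renderAlign(`<ALIGN align="center">Test</ALIGN>`)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(out)
}
```

Notice that the sketch had to be told that the value lives in an attribute named "align". Nothing in the template itself says so.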

Is the problem obvious yet? Take a look at the intermediate form. How do we know what attributes the tokens {SIMPLETEXT} and {TEXT} refer to from the intermediate form? The answer lies in the BBCode definition. We don't need to refer to the original BBCode text, but when building the BBCode template, we do need to look at the BBCode definition.

How do we do that?

Regular Expressions, Again

I actually really like the TextFormatter library, but if I had one criticism, it would be this: There are too many damn regular expressions!

Parsing the BBCode definition is done by a rather long function that goes through an awful lot of trouble. It seems like we have to reimplement most of this in order to implement custom BBCode in its entirety.

And wow, there's a lot going on here. The seemingly-humble BBCode syntax actually supports quite a lot of rather arcane functionality. The first thing the parser does is translate hashmaps and... embedded regular expressions... to Base64, so that the regular expression based parser won't trip over them. We're matching a regular expression using a regular expression!

Here's an example of one of the more complicated BBCode definitions, called "usage" strings in TextFormatter parlance.

[LIST type={HASHMAP=1:decimal,a:lower-alpha,A:upper-alpha,i:lower-roman,I:upper-roman;optional;postFilter=#simpletext} start={UINT;optional} #createChild=LI]{TEXT}[/LIST]

Let's start from the top. The hashmap's functionality is now pretty clear: it just maps input values to output values. So, if I enter type=1, it will become type=decimal. The postFilter option seems to run the attribute value through a filter, maybe to help prevent XSS. How about #createChild=LI? Not nearly as bad as it sounds: it merely signifies that in WYSIWYG editing, after creating a LIST BBCode, an LI BBCode should be created. The tag for this BBCode is [*], which will be familiar to phpBB users.
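As a rough illustration of the hashmap part, here is a Go sketch that splits the inside of a token into its pieces and builds the lookup table. The names splitToken and parseHashmap are mine, not TextFormatter's, and the sketch assumes that keys and values never contain ",", ":" or ";".

```go
package main

import (
	"fmt"
	"strings"
)

// splitToken splits the inside of a {...} token into the filter name, the
// filter's argument (empty if there is none), and any trailing options.
func splitToken(body string) (name, arg string, opts []string) {
	parts := strings.Split(body, ";")
	name, arg, _ = strings.Cut(parts[0], "=")
	return name, arg, parts[1:]
}

// parseHashmap turns "1:decimal,a:lower-alpha" into a Go map.
func parseHashmap(spec string) (map[string]string, error) {
	m := make(map[string]string)
	for _, pair := range strings.Split(spec, ",") {
		k, v, ok := strings.Cut(pair, ":")
		if !ok {
			return nil, fmt.Errorf("malformed pair %q", pair)
		}
		m[k] = v
	}
	return m, nil
}

func main() {
	body := "HASHMAP=1:decimal,a:lower-alpha,A:upper-alpha,i:lower-roman,I:upper-roman;optional;postFilter=#simpletext"
	name, arg, opts := splitToken(body)
	m, err := parseHashmap(arg)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(name, opts, m["1"])
}
```

A regular expression argument containing ";" would break this naive split, which is presumably the kind of problem the Base64 step sidesteps.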

Just to see what it looks like, here's an example of a BBCode using the regular expression syntax:

[MAGNET={REGEXP=/^magnet:/;useContent}]{TEXT}[/MAGNET]

In this case, it's merely validating that the argument matches the regular expression, but then the argument is used verbatim.
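The validation itself could be sketched in Go like this. The helper compilePHPRegexp is hypothetical. It assumes the pattern is wrapped in a single-character delimiter, PHP style, and it passes any trailing flags straight to Go's regexp package, which uses RE2 syntax rather than PCRE. Patterns or flags that Go doesn't support simply come back as an error.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// compilePHPRegexp converts a delimited pattern such as "/^magnet:/" or
// "/^magnet:/i" into a Go regexp. Flags after the closing delimiter are
// turned into an inline flag group and left for regexp.Compile to accept
// or reject.
func compilePHPRegexp(p string) (*regexp.Regexp, error) {
	if len(p) < 2 {
		return nil, fmt.Errorf("pattern %q is too short", p)
	}
	end := strings.LastIndexByte(p, p[0])
	if end == 0 {
		return nil, fmt.Errorf("pattern %q has no closing delimiter", p)
	}
	body, flags := p[1:end], p[end+1:]
	if flags != "" {
		body = "(?" + flags + ")" + body
	}
	return regexp.Compile(body)
}

func main() {
	re, err := compilePHPRegexp("/^magnet:/")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, v := range []string{"magnet:?xt=urn:btih:abc", "http://example.com/"} {
		fmt.Println(v, re.MatchString(v))
	}
}
```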

Of course, it can get more complicated than that. Regular expression matches can also contribute to the attributes in some cases. So it seems we're actually going to have to do most of the work of parsing the BBCode definitions.
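For the very simplest definitions, like the align one from earlier, the mapping can be sketched in a few lines of Go. This is nowhere near what TextFormatter's parser handles: it only accepts the exact shape [tag={TOKEN}]{TOKEN}[/tag] and rejects everything else. The tokenMap type and parseDefinition are hypothetical names, and the uppercase element name and lowercase attribute name are inferred from the align example rather than from TextFormatter's source.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// tokenMap records what each token in a BBCode definition refers to.
type tokenMap struct {
	Tag     string            // element name in the intermediate form, e.g. "ALIGN"
	Attrs   map[string]string // token name -> attribute name
	Content string            // token that stands for the tag's content
}

// defRE matches only the simplest definition shape: [tag={TOKEN}]{TOKEN}[/tag].
var defRE = regexp.MustCompile(`^\[(\w+)=\{([A-Z]+[A-Z_0-9]*)\}\]\{([A-Z]+[A-Z_0-9]*)\}\[/(\w+)\]$`)

// parseDefinition extracts the token mapping from a simple definition.
// The default attribute is assumed to be named after the tag, as in the
// align example's intermediate form.
func parseDefinition(def string) (*tokenMap, error) {
	m := defRE.FindStringSubmatch(def)
	if m == nil {
		return nil, fmt.Errorf("unsupported definition %q", def)
	}
	if !strings.EqualFold(m[1], m[4]) {
		return nil, fmt.Errorf("closing tag %q does not match %q", m[4], m[1])
	}
	return &tokenMap{
		Tag:     strings.ToUpper(m[1]),
		Attrs:   map[string]string{m[2]: strings.ToLower(m[1])},
		Content: m[3],
	}, nil
}

func main() {
	tm, err := parseDefinition("[align={SIMPLETEXT}]{TEXT}[/align]")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%s: %v, content token %s\n", tm.Tag, tm.Attrs, tm.Content)
}
```

With a mapping like this, a template token such as {SIMPLETEXT} can be tied back to the align attribute, and {TEXT} to the element's content. Everything beyond this shape (named attributes, options, hashmaps, regular expressions) is where the real work is.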

Stepping Back

So, it seems I have underestimated how much work is involved in going directly from BBCode definitions and templates to rendering the storage format. I would still like to tackle the problem eventually, for fun if nothing else, but right now I am dealing with a single forum with a limited amount of BBCode. I could create an XSLT stylesheet that works for the forum database I'm working on and call it a day.

Well, I could. But it's getting too late, so I'll have to do it another day :)