Recently, I had a project that involved an old phpBB database, and long story short, I found myself wanting to parse phpBB's internal BBCode in Go. It can't be that bad, right?
Surveying the Situation
Before digging in, I decided to do some code reconnaissance. I Git-cloned the latest phpBB and dug in. My first approach was simply to search around for files that were related to BBCode, but the system has a lot of twists and turns, so I instead followed the I/O and looked at what happened with the post_text field coming from the database.
What I found was... interesting. The database in question ran the latest version, phpBB 3.2, and it turns out that in 3.2 the phpBB developers revamped the BBCode system: new posts use the s9e TextFormatter library.
Overview of TextFormatter
At first glance, it's hard to glean much, other than that this library was clearly built with a good grasp of the problem domain. You can see that it optimizes for the common cases of internet forums (such as the very common situation where no formatting is needed at all). I'm still a bit confused about what precisely is going on with the 'Renderer' abstraction, and it looks like phpBB may be using TextFormatter as more of a framework than a library in some ways.
Nonetheless, I can see that it uses an XML-based intermediate language. At least in syntax; there's no schemas or XML namespaces in use here, which I'm more thankful for than regretful of. When parsing, it immediately replaces all of the s, i, and e tags with nothing, effectively removing them, using this regex:
My initial guess was that 's' and 'e' stood for start and end, and contained the actual BBCode tags that were parsed out. This turned out to be accurate, but this far in I still do not know what 'i' is for, nor do I understand exactly how all of this fits together.
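To make the stripping step concrete, here is a rough Go sketch of the idea. The pattern is my own guess for illustration and is not copied from TextFormatter, and the sample XML is made up to resemble a stored post. Go's regexp package has no backreferences, so this version does not check that the closing tag matches the opening one.

```go
package main

import (
	"fmt"
	"regexp"
)

// markerRe matches a whole <s>, <i>, or <e> element whose body contains no
// further markup. This pattern is an assumption for illustration only.
var markerRe = regexp.MustCompile(`<[sie]>[^<]*</[sie]>`)

// stripMarkers removes the marker elements and leaves everything else intact.
func stripMarkers(xml string) string {
	return markerRe.ReplaceAllString(xml, "")
}

func main() {
	// Hypothetical stored form of "[b]bold[/b] text".
	in := `<r><B><s>[b]</s>bold<e>[/b]</e></B> text</r>`
	fmt.Println(stripMarkers(in)) // prints <r><B>bold</B> text</r>
}
```

What is left after stripping is just the structural tags, which is the part a renderer cares about.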
Configurators and RendererGenerators
It's starting to become clearer that phpBB is using TextFormatter like more of a framework than a library because TextFormatter is more like a framework than a library. Digging deeper underneath the hood, you will find that TextFormatter comes with an elaborate and flexible configuration system that lets you control the rendering. The renderers bundled are "PHP" and "XSLT," but if you try reading the source code for them, you might find yourself woefully confused. This is because you don't actually use them. You generate renderers and use those. These classes are just helpers for your generated renderers.
So what does that mean? Why is there a renderer called "PHP?" Well, because it's based on generating PHP code. The XSLT generator is based on generating an XSLT stylesheet. The XML that stores the text in the database is quietly clever, because it keeps our data in a representation that is cheap to render and easy to convert back to user input. You could cache the rendered HTML instead, but then the entire cache would need to be cleared every time a BBCode definition changed, which clearly is not ideal, especially for huge forums.
A quick aside: so far most of the code I've read of TextFormatter is surprisingly clean and elegant. Will PHP ever have an ES6 moment?
I didn't figure out what was going on with 'i' tags when looking through the renderer, so I decided to read through the parser. It turns out that an 'i' tag holds text from the original input that is meant to be ignored entirely, likely including things like HTML comments (which TextFormatter supports as an option).
Now I have a general idea of what I need to do for TextFormatter. I'll need to dig in deeper for the actual implementation. But for now, I need to look back at the old phpBB BBCode formatting engine. To do that, I needed to turn my attention to a file called bbcode.php.
Overview of phpBB's Legacy BBCode Engine
You might be wondering why I have to bother with this if we're using a modern version of phpBB. Well, it looks like I have no choice. Looking at generate_text_for_display, it seems like the old code is still active, which leads me to believe that old BBCode is not converted. I doubt that converting it myself would cause much trouble, but I decided to push forward on the path of supporting both engines regardless, simply because it seems like a fun thing to do.
phpBB's Legacy BBCode Engine: Database Layer
First thing we should do is take a look at what the database looks like on one of these legacy items. I learned that TextFormatter wraps its parsed text in either <r> or <t> tags depending on if the text was rich text or plaintext, so it's easy to tell if we're dealing with a TextFormatter row. Meanwhile, any table in phpBB that contains a BBCode field always comes with two additional fields: a BBCode "UID" field, and a BBCode "Bitfield" field.
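A minimal Go sketch of that check might look like the following. The type and function names are my own, and it assumes the wrapper tag always appears bare at the very start of the field, which matches what I saw in my database but may not hold everywhere.

```go
package main

import (
	"fmt"
	"strings"
)

// postFormat says which engine produced a stored post_text value.
type postFormat int

const (
	formatLegacy postFormat = iota // no TextFormatter wrapper found
	formatPlain                    // wrapped in <t>: plaintext
	formatRich                     // wrapped in <r>: rich text
)

// detectFormat classifies a row by its leading wrapper tag.
func detectFormat(postText string) postFormat {
	switch {
	case strings.HasPrefix(postText, "<r>"):
		return formatRich
	case strings.HasPrefix(postText, "<t>"):
		return formatPlain
	default:
		return formatLegacy
	}
}

func main() {
	fmt.Println(detectFormat("<t>hello</t>") == formatPlain) // prints true
}
```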
First wrinkle: It turns out that phpBB actually did do something to every post. All of the post_text entries in my database started with either <r> or <t>. This seems to suggest that all of the posts were converted, but I looked and it seems like the ones with <t> might be legacy, so I gave this a whirl:
SELECT bbcode_bitfield, bbcode_uid, post_text FROM phpbb_posts WHERE post_text NOT LIKE "<r>%" AND bbcode_bitfield <> "" ORDER BY post_id DESC LIMIT 10;
Alright. Now we can take a look. After a while of futzing around, I found a pretty good, minimal specimen for examination:
+-----------------+------------+---------------------------------------+
| bbcode_bitfield | bbcode_uid | post_text                             |
+-----------------+------------+---------------------------------------+
| AAI=            | 3dy74kdk   | <t>[youtube]dQw4w9WgXcQ[/youtube]</t> |
+-----------------+------------+---------------------------------------+
Hmm. Quick observations:
- bbcode_bitfield is obviously encoded with base64.
- bbcode_uid is probably pseudo-random, but its use is unclear.
- The text stored in post_text looks pretty close to what is entered into the form. I don't see the bbcode_uid anywhere.
I did some Google searches and it looks pretty likely that bbcode_uid is meant to be appended onto the BBCode like this:
...but I found no such examples in my database at a cursory glance, so I'm very confused at the moment.
Turns out the bitfield is parsed via code in functions_content.php, which contains a class called... bitfield. At this point I decided to cleanroom it since I was pretty confident in what it was actually doing, and implemented a subset of the class in Go. After a table test revealed a couple of simple mistakes, I had the most important functions of the bitfield class working in Go.
I will let you in on a little secret: I'm pretty sure I didn't really need to do that.
What's in the bitfield?
There are some things I do know about phpBB's BBCode parser and renderer. phpBB's BBCode parser identifies each BBCode with a simple numeric ID. The default BBCodes take up the first 12 or so IDs, while the rest are stored in a table called bbcodes. I immediately went to confirm my suspicions, and... Well, I came to be quite surprised.
So, that bitfield? It has one bit flipped. Bit 14. And well, there is no bbcodes row for 14. It's simply not there. And indeed, if I go and locate that post... there's no video displayed. Just the text. What happened?
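The arithmetic is easy to check by hand, or with a few lines of Go. This is a standalone sketch and not my actual bitfield package. It assumes bits are numbered from the most significant bit of each byte, which is at least consistent with the sample row: "AAI=" decodes to the bytes 0x00 0x02, and the only set bit lands at index 14.

```go
package main

import (
	"encoding/base64"
	"fmt"
)

// setBits decodes a base64 bitfield and returns the indexes of all set bits.
// Assumption: bit n lives in byte n/8, counted from the most significant bit.
func setBits(encoded string) ([]int, error) {
	data, err := base64.StdEncoding.DecodeString(encoded)
	if err != nil {
		return nil, err
	}
	var bits []int
	for i, b := range data {
		for j := 0; j < 8; j++ {
			if b&(byte(0x80)>>uint(j)) != 0 {
				bits = append(bits, i*8+j)
			}
		}
	}
	return bits, nil
}

func main() {
	bits, err := setBits("AAI=")
	if err != nil {
		panic(err)
	}
	fmt.Println(bits) // prints [14]
}
```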
I'll tell you what happened: the unfettered sledgehammer of Progress! It turns out the old [youtube] BBCode on this forum was deleted when the forum moved to the MediaEmbed plugin. Existing posts were not converted. So now... yeah.
Does this mean we're barking up the wrong tree? Not entirely. I dug up some old backups and found that yes, this was BBCode 14.
So what did we learn here?
- Always check your assumptions. Discovering this wrinkle earlier might have saved hours of misdirected debugging effort.
- People do not care about backwards compatibility :)
So what is the bitfield really? It simply tells the parser which BBCodes the text contains. That way, it can render only those BBCodes, saving on precious regular expression executions. This is obviously not the world's greatest performance optimization, but it absolutely beats executing every regular expression on every rendering of every body of text.
If I knew all of this, why did I bother implementing the bitfield if it's not really that useful?
Well, it has a secret secondary purpose. If you add a new BBCode to the forum, it will do absolutely nothing to existing posts. This is because none of the existing posts have the bit set for this BBCode. Therefore, if you want to accurately portray a phpBB post, you need to keep this wrinkle in mind during rendering. It's not very important, but it's important to me.
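Here is a small sketch of how a renderer could honor that wrinkle. The helper names and the list of known IDs are hypothetical, and the MSB-first bit order is an assumption carried over from how the sample bitfield decodes.

```go
package main

import "fmt"

// bitSet reports whether bit n is set, using MSB-first order (assumed).
func bitSet(data []byte, n int) bool {
	i := n / 8
	if n < 0 || i >= len(data) {
		return false
	}
	return data[i]&(byte(0x80)>>uint(n%8)) != 0
}

// activeBBCodes returns the IDs from known that are flagged in the post's
// bitfield. BBCodes added after the post was written are never flagged,
// so they are skipped even if their tags appear in the text.
func activeBBCodes(bitfield []byte, known []int) []int {
	var active []int
	for _, id := range known {
		if bitSet(bitfield, id) {
			active = append(active, id)
		}
	}
	return active
}

func main() {
	bitfield := []byte{0x00, 0x02} // only bit 14 set
	known := []int{1, 14, 15}      // hypothetical IDs; 15 was "added later"
	fmt.Println(activeBBCodes(bitfield, known)) // prints [14]
}
```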
Plus, honestly, it was pretty easy. My bitfield implementation is a little under 75 lines, is well-tested, and was pleasant to write. How often do you get an excuse to write code like that?
phpBB's Legacy BBCode Engine: Rendering
At this point, I've determined what the BBCode UID is actually for... It's another optimization!
It looks like the purpose of the BBCode UID is to act as a sort of sentinel. The UID is added to BBCode tags via a regular expression replacement, called the first pass, so that in the second pass the BBCode can be replaced more easily, using a simple string replacement.
A bit ugly, but not really anything out of the ordinary. Anyone familiar with the boundary strings in multipart form encoding will recognize this approach.
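To illustrate the two passes, here is a toy Go version that handles a single [b] tag. The "[b:uid]" layout, the <strong> output, and the function names are all assumptions for the sake of the example, not something lifted from phpBB's source. A real first pass would presumably also check that tags are properly paired, which this sketch skips.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// tagRe matches an opening or closing [b] tag. Only this one tag is handled.
var tagRe = regexp.MustCompile(`\[(/?)b\]`)

// firstPass stamps the UID onto each tag, as would happen at submit time.
// The UID is assumed to be alphanumeric, so it is safe in the template.
func firstPass(text, uid string) string {
	return tagRe.ReplaceAllString(text, "[${1}b:"+uid+"]")
}

// secondPass swaps the stamped tags for HTML using plain string replacement.
// Tags without the right UID are left alone.
func secondPass(text, uid string) string {
	r := strings.NewReplacer(
		"[b:"+uid+"]", "<strong>",
		"[/b:"+uid+"]", "</strong>",
	)
	return r.Replace(text)
}

func main() {
	stored := firstPass("[b]hi[/b]", "3dy74kdk")
	fmt.Println(stored)                         // prints [b:3dy74kdk]hi[/b:3dy74kdk]
	fmt.Println(secondPass(stored, "3dy74kdk")) // prints <strong>hi</strong>
}
```

The expensive matching happens once at submit time, and every later render only needs exact string matches.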
Now the burning question: What exactly the hell is going on with BBCode UID? Why can't I find it in my database?
Finally, a moment of zen. phpBB did upgrade all of my posts. The UID is gone because when it reparsed all of the posts, the [youtube] BBCode had already been deleted, so it treated the tag as plaintext. What it didn't do was clear the BBCode bitfield, which is likely ignored by TextFormatter, since it uses a proper XML parser.
So I don't need to build both text renderers. But hey, I probably can, and I probably will. Just not right now.
The remainder of what I found was simple, although the code is certainly not as easy of a read as TextFormatter was. As I said before, there are two passes. We only need to be concerned with the second pass, since the first pass happens when the user submits the form.
The second pass is simply a set of string replacements. The string replacements can come from a few different places, and you have to substitute in the BBCode UID from the database to do them. It looks like the default BBCode behavior can be overridden by the theme, and custom BBCodes keep their first and second passes in the database.
It makes sense that custom BBCodes would have their first and second passes stored in the database, but it's worth noting that this is hidden from the user: you simply enter a single template, and the passes are generated from it. My guess is that the passes are likely not used anymore, and instead the user-entered template is handed to TextFormatter after some processing.
Alright. I finally have a basic idea of what's going on here. I can temporarily drop work on the oldschool phpBB BBCode parser, and start looking at what TextFormatter is doing.
Alright. After all of that, we have... a bitfield implementation.
However, this post is getting long and I am growing tired, so I think I will cut it here. In the next part, I plan to further dissect and reimplement the rendering portion of TextFormatter. To do this properly, we have to figure out exactly how phpBB uses TextFormatter, and figure out exactly how accurate our implementation can be. (Crossing my fingers for 'quite.')