You're right of course.
It's helpful that on that site a well-informed reviewer has tested both a traditional head and a new tech head using the same method. I think under those circumstances you can use GN to compare. He also handily compares the manufacturer-stated GN with his tested GN. Different methods are going to give different results, but I think you can piece it together. What we need is an equivalent to the Wilhelm institute for lights. It doesn't really matter what the metric is - just that the test is standard. We can all dream, right?
For me it's interesting that, across a large part of the power range, the t.5 time for an IGBT head is much the same as its t.1 time - which I guess is obvious when you think about it.
Anyway, this isn't helping the OP pick a light.
This isn't going to help the OP to pick a light either, but...

There is a sort of unofficial standard for testing the guide numbers of lights. AFAIK it is used by all reputable makers of studio flash, but not by makers of hotshoe flashguns, who tend to exaggerate performance by using the maximum zoom setting, nor by makers of the cheapest lights, who IMO just guess at the figures...
It's a simple test rig: the flash tube is 2m from the meter, the room is large enough that light reflected from the walls and ceiling makes no significant difference, and the flash is fired 10 times at full power, in case of a rogue reading. Someone taking readings in their living room, with a low white ceiling and white walls, will get much higher figures than this rig would give.
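For anyone who wants to turn meter readings from a rig like that into a guide number, here is a rough sketch. It relies on the usual relationship GN = distance x f-number at the meter's ISO setting. The function name, the made-up readings and the use of the median to shrug off a rogue pop are my own assumptions, not part of any standard.

```python
from statistics import median

def guide_number(f_stop_readings, distance_m=2.0):
    """Guide number (in metres) from repeated meter readings at a fixed distance."""
    # Take the median of the pops so that one rogue reading cannot skew the
    # result (an assumption; the rig only calls for firing 10 times).
    typical_f_stop = median(f_stop_readings)
    # GN = flash-to-meter distance x f-number, at the meter's ISO setting.
    return distance_m * typical_f_stop

# Made-up figures: nine consistent pops at f/16 and one rogue reading at f/22.
readings = [16.0] * 9 + [22.0]
print(guide_number(readings))  # 32.0
```

So a steady f/16 at 2m works out to a GN of 32 (metres), and the single f/22 pop is ignored.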
It sort of works but it has its limitations because the reflector fitted for the tests makes a HUGE difference to the result and, because of the different fittings, it's impossible to standardise on the reflector, even if it suited the manufacturer to do so...
Of course, it should be a standard reflector with an angle of 55 degrees, but even if the reflectors are all the same shape and design, the reflectivity can vary a lot. For example, the Bowens standard reflector has a highly polished surface which produces a very high reading compared to the matt surface of the Lencarta one, and the Elinchrom standard reflector is similar to Lencarta. Years ago, some manufacturers produced optional white reflectors (Strobex and Courtney spring to mind, although there may have been others) and their white versions produced about 2 stops less light than their metal finish reflectors. There seems to be a trend for some manufacturers to inflate their figures by supplying highly polished reflectors, but these produce such harsh light that they need some diffusion to make it usable - which reduces the guide number very substantially.
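To put a number on what losing light does to a guide number, here is a quick sketch. It assumes only the standard relationship that GN scales with the square root of the light output, so each stop lost divides the GN by about 1.4. The function name and the starting GN of 32 are just for illustration.

```python
def gn_after_light_loss(gn, stops_lost):
    """Guide number remaining after losing a given number of stops of light."""
    # Each stop halves the light; GN scales with the square root of the light,
    # so each stop lost divides the GN by sqrt(2).
    return gn * 2 ** (-stops_lost / 2)

# Illustrative only: a head rated at GN 32 with a reflector that costs 2 stops.
print(gn_after_light_loss(32.0, 2))  # 16.0
```

In other words, a 2 stop difference between reflectors halves the guide number, which is why the reflector used for the test matters so much.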
So what's the answer?
Well, of course one answer is to test without the reflector fitted at all. In theory this would create a level playing field, but in practice the figures would be misunderstood by many people, and anyway there are some makes that have a built-in 'mini reflector' just behind the flash tube...
Richard has a method that seems very valid to me: when he tests lighting for Advanced Photographer, he fits an umbrella-style softbox to every light. This seems to me to be the best leveller there is at the moment, because it works with every flash head.