Per a request, I created a new #Wikipedia database report that sorts featured articles using their "prose size":

Previously there was a report that sorted them by the size of their wikitext markup:

That measure was skewed toward articles citing online sources (longer reference markup) over those citing books (shorter).

(Yes, I demoted Taylor Swift from first to 254th 😢)

Prose size is usually measured by a JavaScript gadget:

I ported the gadget to #Rust and published it as the wikipedia_prosesize crate. Source:

Using the #Parsoid HTML, it looks for text that ends up in <p> tags, and then subtracts the reference numbers, [citation needed] templates and a few other things.
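
To make the idea concrete, here is a rough sketch of the same approach in Python, using only the standard library. It is not the wikipedia_prosesize code: the class names in `EXCLUDED_CLASSES` are my assumption about how reference markers and inline cleanup templates are tagged, `ProseSizer` and `prose_size` are names invented for this example, and it counts characters, which may not match how the gadget or the crate measure size.

```python
from html.parser import HTMLParser

# Assumption: class names that mark inline elements which are not prose.
EXCLUDED_CLASSES = {"reference", "Inline-Template"}
# Void elements never get a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"br", "img", "hr", "wbr", "link", "meta", "input"}


class ProseSizer(HTMLParser):
    """Sum the length of text inside <p>, skipping excluded elements."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.p_depth = 0     # > 0 while inside a <p>
        self.skip_depth = 0  # > 0 while inside an excluded element
        self.size = 0

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            # Already inside an excluded element: just track nesting.
            self.skip_depth += 1
            return
        classes = set((dict(attrs).get("class") or "").split())
        if self.p_depth and classes & EXCLUDED_CLASSES:
            self.skip_depth = 1
        elif tag == "p":
            self.p_depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1
        elif tag == "p" and self.p_depth:
            self.p_depth -= 1

    def handle_data(self, data):
        if self.p_depth and not self.skip_depth:
            self.size += len(data)


def prose_size(html: str) -> int:
    """Return the number of prose characters in well-formed HTML."""
    parser = ProseSizer()
    parser.feed(html)
    parser.close()
    return parser.size


sample = (
    '<p>Hello world.<sup class="mw-ref reference">'
    '<a href="#cite_note-1"><span>[1]</span></a></sup></p>'
    "<table><tr><td>Not prose</td></tr></table>"
    '<p>Bye.<sup class="Inline-Template">[<i>citation needed</i>]</sup></p>'
)
print(prose_size(sample))  # 16
```

The table cell, the "[1]" marker and the "[citation needed]" tag are all ignored, so only "Hello world." and "Bye." are counted. The sketch relies on the input being well-formed, with every `<p>` explicitly closed.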

#MediaWiki #mwbot-rs


@legoktm So I got curious and compared them: prose size and wordcount line up pretty nicely around an average ratio (which makes sense) but wow, the skew for markup size is quite something.


@generalising ooh, very cool! I think most of the difference is that references and media are excluded, which take up a sizable amount of markup and are text shown to users, but not considered "prose".

I'm going to package this up as a web API on Toolforge later today to make it easier to use for non-Rust folks if you want to try it on other sets of articles.

@legoktm poking at the very extreme cases, it looks like these are also picking up heavy use of text in notes or in tables - which I guess is unusual enough not to skew things too much by comparison to reference markup. Least "efficient" is which definitely goes all-in on the table approach!

The tool sounds fun - will have a think about how it could be used.
