floss.social is one of the many independent Mastodon servers you can use to participate in the fediverse.
For people who care about, support, and build Free, Libre, and Open Source Software (FLOSS).

alcinnz

Thanks to @bleakgrey (and I think I recall someone else being involved), a new Odysseus release is coming out soon with support for a "reader mode".

I find it ridiculous that I feel the need to support this feature; it's saying "webdevs are doing such a poor job that I need to offer to clear away their mess!"

In celebration of this I will describe how this code (the same as used in Firefox and Pocket) works.

@bleakgrey When a page loads, this code injects some JavaScript to check if it's probably "readerable".

That is, it looks through any visible <p>s (not in an <li>), <pre>s, or <br>-containing <div>s with more than 140 characters, discards some depending on class names, and sums the square roots of the remaining character counts.

If that sum is high enough, it sends a message to the UI telling it to show the button offering a reader mode.
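
To make that check concrete, here is a minimal Python sketch (the real code is JavaScript working on the DOM). It assumes the caller has already collected the character counts of the surviving blocks. Two details are my assumptions rather than something stated above: that the square root is taken over the characters beyond the 140 minimum, and that the cutoff score is 20. The function name is made up for illustration.

```python
import math

MIN_CONTENT_LENGTH = 140  # from the thread: blocks need more than 140 characters
MIN_SCORE = 20.0          # assumption: the thread does not state the cutoff


def is_probably_readerable(block_lengths,
                           min_length=MIN_CONTENT_LENGTH,
                           min_score=MIN_SCORE):
    """Guess whether a page is worth offering reader mode for.

    block_lengths: character counts of the visible candidate blocks
    that survived the class-name filter.
    """
    score = 0.0
    for length in block_lengths:
        if length <= min_length:
            continue  # too short to count as content
        # Assumption: only the characters beyond the minimum are counted.
        score += math.sqrt(length - min_length)
        if score > min_score:
            return True  # enough evidence, stop early
    return False
```

The square root means one very long block helps less than several medium-sized ones, which fits pages made of many real paragraphs.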

@bleakgrey When you click that button, there are three layers to the JavaScript run in the page in order to remove its junk.

The first layer uses document.write() to drop the existing markup from the page. Then, in addition to the page's text, it adds back in its extracted title and byline. It also computes an estimated reading time at 200 words per minute, before removing any attribute besides "src" and "href" and annotating the page with a theme class.
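
The two mechanical bits of that layer are easy to sketch in Python. This is only an illustration: the function names are hypothetical, splitting on whitespace to count words and rounding up to whole minutes are my assumptions, and attributes are modelled as a plain dict rather than DOM attributes.

```python
import math


def estimated_reading_minutes(text, words_per_minute=200):
    """Estimate reading time, rounded up to whole minutes (assumed rounding)."""
    word_count = len(text.split())  # assumption: whitespace-separated words
    return math.ceil(word_count / words_per_minute)


def strip_attributes(attributes, keep=("src", "href")):
    """Return a copy of an element's attributes with everything but `keep` dropped."""
    return {name: value for name, value in attributes.items() if name in keep}
```

For example, strip_attributes({"href": "/a", "onclick": "track()"}) keeps only the href, so the cleaned page can no longer carry inline styles or event handlers.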

@bleakgrey The next layer might (if configured to do so) consider giving up if there are too many elements on the page.

Then it removes all <script>, <noscript>, and <style> tags, before replacing "chains" of <br>s with a <p> containing its subsequent siblings and replacing all <font>s with <span>s.
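
The <br> rewriting is the least obvious step, so here is a rough Python sketch of it. It models a parent's children as a flat list of strings instead of real DOM nodes, and it assumes a "chain" means two or more <br>s in a row (whitespace between them is not modelled). The names BR and replace_br_chains are hypothetical.

```python
BR = "<br>"  # stand-in for a <br> element in this simplified model


def replace_br_chains(siblings):
    """Turn runs of 2+ BRs into paragraph breaks.

    siblings: list of strings, where BR marks a <br> and anything else
    is treated as an opaque inline node. Returns a new list in which
    each chain is replaced by ("p", [nodes that followed the chain]).
    """
    out = []
    i = 0
    n = len(siblings)
    while i < n:
        # Measure the run of consecutive BRs starting at i.
        j = i
        while j < n and siblings[j] == BR:
            j += 1
        if j - i < 2:
            # Not a chain: keep the node (text or a lone BR) unchanged.
            out.append(siblings[i])
            i += 1
            continue
        # A chain: collect what follows until the next chain or the end.
        paragraph = []
        i = j
        while i < n:
            if siblings[i] == BR and i + 1 < n and siblings[i + 1] == BR:
                break
            paragraph.append(siblings[i])
            i += 1
        if paragraph:
            out.append(("p", paragraph))
    return out
```

A lone <br> inside a paragraph survives as a line break, while content before the first chain is left where it was.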

From there it examines the metatags for useful information (filling in any missing excerpts), and after layer 3 it (unnecessarily here) makes links absolute and removes any classes.

Layer 3 considers elements which:
* are not marked hidden
* don't look like a byline/the author name (those are rendered separately)
* (optionally) pass a check on the absence/presence of certain classes, unless they're in a <table>
* are a <section>, <h2+>, <p>, <td>, or <pre>
* or are inline-containing <div>s (rewritten to <p>s)
* or are <div>s containing a single <p> (as on mobile.slate.com)
* and have more than 25 characters
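
The filter above can be sketched roughly as follows, in Python rather than the real DOM-walking JavaScript. The Element class is a hypothetical stand-in for a DOM node, the set of block-level tags is my own guess, and the optional class-name check is left out entirely.

```python
from dataclasses import dataclass, field

SCORABLE_TAGS = {"section", "h2", "h3", "h4", "h5", "h6", "p", "td", "pre"}
# Assumption: which child tags count as "block" (i.e. not inline) content.
BLOCK_TAGS = {"div", "p", "ul", "ol", "table", "pre", "blockquote", "section"}
MIN_TEXT_LENGTH = 25  # from the thread: more than 25 characters


@dataclass
class Element:
    """Hypothetical, much-simplified stand-in for a DOM element."""
    tag: str
    text: str
    hidden: bool = False
    looks_like_byline: bool = False
    child_tags: list = field(default_factory=list)


def is_candidate(el):
    """Apply the layer 3 filter (minus the optional class-name check)."""
    if el.hidden or el.looks_like_byline:
        return False
    if el.tag == "div":
        only_inline = not any(tag in BLOCK_TAGS for tag in el.child_tags)
        single_p = el.child_tags == ["p"]
        if not (only_inline or single_p):
            return False
    elif el.tag not in SCORABLE_TAGS:
        return False
    return len(el.text.strip()) > MIN_TEXT_LENGTH
```

Usage: is_candidate(Element("p", "x" * 30)) passes, while the same text inside a hidden element or a <nav> does not.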

From there it scores each of those elements by number of:
* paragraphs
* commas
* characters by the hundred (up to 3)
These scores also count towards the parent elements, but scaled down by depth (especially beyond depth 2).
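
As a rough Python sketch of that scoring: one point per paragraph, one per comma, and one per hundred characters up to 3. The exact dividers used when passing the score up to ancestors are not given above; the ones here (1 for the parent, 2 for the grandparent, 3 times the level beyond that) are what I recall from Mozilla's Readability and should be treated as an assumption, as should the function names.

```python
def paragraph_score(text):
    """Score one candidate element from its text."""
    score = 1                        # one point for being a paragraph at all
    score += text.count(",")         # one point per comma
    score += min(len(text) // 100, 3)  # one point per 100 characters, capped at 3
    return score


def ancestor_divider(level):
    """Assumed scaling: level 0 is the parent, 1 the grandparent, and so on."""
    if level == 0:
        return 1
    if level == 1:
        return 2
    return level * 3


def add_to_ancestors(totals, ancestors, score):
    """Credit `score` to each ancestor (nearest first), scaled down by depth."""
    for level, name in enumerate(ancestors):
        totals[name] = totals.get(name, 0) + score / ancestor_divider(level)
    return totals
```

With these dividers a score of 6 gives the parent 6, the grandparent 3, and the great-grandparent only 1, which is the sharp drop beyond depth 2.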

From there it scales these scores by how much of the text is in links (a likely indicator of navigation) and tracks only the top 5 candidates.
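
A small Python sketch of that last step, assuming the scaling is "multiply by the fraction of text that is not inside links" (the thread does not spell out the formula). Candidates are modelled as plain tuples and the function names are hypothetical.

```python
def link_density(text_length, link_text_length):
    """Fraction of an element's text that sits inside links (0.0 to 1.0)."""
    if text_length == 0:
        return 0.0
    return link_text_length / text_length


def scaled_score(score, text_length, link_text_length):
    """Assumed scaling: keep only the share of the score not backed by link text."""
    return score * (1 - link_density(text_length, link_text_length))


def top_candidates(candidates, limit=5):
    """Rank candidates by scaled score and keep the best `limit`.

    candidates: (name, raw_score, text_length, link_text_length) tuples.
    """
    scored = [(name, scaled_score(score, text_len, link_len))
              for name, score, text_len, link_len in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:limit]
```

An element whose text is entirely links, such as a navigation bar, ends up with a score of zero no matter how much text it holds.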

Then it looks to see if it captures more useful text by looking at ancestors and/or siblings.