How it worked
I made Auger back in early 2021, and it solved a very specific problem for me. I wanted a single page that aggregated posts from the Merveilles webring. Not a web app, not a service, just something that ran once a day and produced HTML. I wanted something static.
I did not know Python when I started. I had never written SQL. I had never used Docker. I did not understand what asynchronous meant, and I did not really understand what lists or comprehensions were doing beyond "this seems to work." Auger exists entirely because I was willing to ask questions of people who were much more experienced than me, and to brute-force my way through ignorance.
The solution I landed on was a pipeline of scripts, each responsible for one pass over the data.
- First, pull the RSS feeds and extract titles and links.
- Then, fetch article pages to extract metadata like hostnames and favicons.
- Then, go back and fill in publish dates.
- Then, clean up the database by deleting broken or duplicate rows.
- Finally, render static HTML.
Each step was its own script, run sequentially by cron.
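The cron side was nothing more than a handful of entries run back to back. The times, paths, and the render.py name below are placeholders, not the real crontab, but the shape was roughly this:

# run once a day, spaced out so each step finishes before the next begins
0  4 * * * python3 /srv/auger/pull.py
20 4 * * * python3 /srv/auger/metadata.py
40 4 * * * python3 /srv/auger/date.py
50 4 * * * python3 /srv/auger/matchdrop.py
0  5 * * * python3 /srv/auger/render.py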
pull.py was the entry point. It scraped RSS feeds and inserted article titles and URLs into Postgres.
The XML parsing logic here is basic. RSS and Atom feeds are wildly inconsistent, so I treated parsing as a series of attempts instead of a single pass. I used try/except blocks as flow control (a pattern apparently used and encouraged by the Python core developers themselves). If one namespace or structure failed, I tried another.
try:
    links = [x for x in root if x.tag.split("}")[1] in ("entry", "item")]
except IndexError:
    links = [x for x in root[0] if x.tag in ("entry", "item")]
Every successful parse resulted in an immediate INSERT followed by a commit.
cur.execute(
    "INSERT INTO posts (article_title, article_url) VALUES (%s, %s)",
    (title[0], link_url[0])
)
conn.commit()
This is extremely inefficient. I did not know about transactions yet. I did not know about batching. I only knew that committing immediately made the data show up.
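For contrast, here is a minimal sketch of what batching the same inserts could look like with psycopg2, committing once per feed instead of once per row. The connection string and the rows variable are stand-ins, not Auger's actual code:

import psycopg2
from psycopg2.extras import execute_values

conn = psycopg2.connect("dbname=auger")  # placeholder connection details
rows = [("Some post title", "https://example.com/post")]  # parsed (title, url) pairs

with conn, conn.cursor() as cur:
    # one INSERT statement for the whole batch, one commit when the
    # `with conn` block exits successfully
    execute_values(
        cur,
        "INSERT INTO posts (article_title, article_url) VALUES %s",
        rows,
    )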
metadata.py was my solution to attaching site information after the fact. Once I had article URLs, I wanted hostnames and favicons.
Instead of representing this as structured data in Python, I wrote a SQL function that stripped URLs down to protocol and host, then passed that list back into Python.
CREATE OR REPLACE FUNCTION public.posts(_url text)
RETURNS text AS $$
    SELECT string_agg(token, '' ORDER BY alias DESC)
    FROM ts_debug(_url) q
    WHERE q.alias IN ('protocol', 'host');
$$ LANGUAGE sql;
This is a wild approach. It technically works, but it's doing URL parsing by leaning on Postgres full text search rather than an actual URL parser.
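For comparison, the same protocol-plus-host extraction in plain Python is a few lines with the standard library's URL parser. This is how I would do it today; site_root is just a name I am using for illustration:

from urllib.parse import urlsplit

def site_root(url: str) -> str:
    # keep only the scheme and the host, e.g. "https://example.com"
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}"

site_root("https://example.com/posts/42")  # -> "https://example.com"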
BeautifulSoup was then used to scrape favicons from each host. If nothing was found, I defaulted to /favicon.ico.
if root.find("link", attrs={"rel": "icon"}):
    favicon_path = ...
else:
    favicon_path = "/favicon.ico"
This script took several minutes to run. I knew it was hacky even then. What I didn't understand yet was how much work I was duplicating by storing host and favicon data on every post row instead of normalizing it.
date.py repeated the same feed scraping process, but only to extract dates. It tried multiple tag names and fell back aggressively, because feeds disagree about where and how dates are represented.
published_date = ...
updated_date = ...
pub_date = ...
If the date was invalid, I shoved in a sentinel value.
if pub_date[0] == "Invalid Date":
    pub_date = ['0001-01-01']
I wanted something in the field so I could sort. Cleaning it up later felt easier than blocking on correctness.
When things still went sideways, I deleted rows outright.
DELETE FROM posts WHERE article_date IS NULL;
DELETE FROM posts WHERE article_url IS NULL;
Pagination
Once the database had enough posts, one giant HTML page became annoying. It was slow to load, required a long scroll, and stopped feeling like a readable feed. I needed a way to split the output into multiple pages while still using the same template.
The approach sounded straightforward: pull everything from Postgres, chunk it into fixed-size pages, then render one HTML file per chunk using Jinja2.
Pagination only works if the ordering is stable, so I pushed that responsibility into SQL.
q_select = '''
    SELECT
        article_title,
        article_url,
        to_char(article_date, 'DD Mon YYYY'),
        article_favicon,
        article_host
    FROM posts
    ORDER BY article_date DESC;
'''
Two things mattered here.
First, I sorted in SQL instead of Python so every page would be deterministic. Second, I formatted the date in SQL using to_char so the template wouldn't have to deal with date logic at all.
Then I pulled everything into memory at once.
cur.execute(q_select)
origin_data = [cur.fetchall()]
I wrapped fetchall() in a list and then immediately unpacked it later. It works, but it shows that I was still feeling out how lists actually behave.
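In hindsight, fetchall() already returns a list of row tuples, so the extra wrapping and unpacking could simply be dropped:

cur.execute(q_select)
origin_data = cur.fetchall()  # already a list of (title, url, date, favicon, host) rows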
In the end, I needed help with the core pagination logic. I wasn't quite sure where to take it, so I asked a friend with a lot more experience than me. His solution was this:
entries_per_page = 100
entries = [
    origin_data[page_offset:page_offset + entries_per_page]
    for page_offset in range(0, len(origin_data), entries_per_page)
]
What this does is simple: walk the dataset in steps of 100 and slice out a page each time. Take a big list, cut it into smaller lists, loop over them.
I generated page numbers separately so the template could render navigation links.
pnum = []
for page_number, page_entries in enumerate(entries):
    pnum.append(str(page_number))
This gave the template a list of all available pages. What it did not give it was the current page number. It also used zero-based numbering.
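If I were doing it again, I would build the page list with one-based numbers and pass the current page alongside it. A sketch; the pages and current names are hypothetical template variables, not what Auger actually used:

page_numbers = [str(i + 1) for i in range(len(entries))]

for page_index, page_entries in enumerate(entries):
    html = template.render(
        data=page_entries,
        pages=page_numbers,           # every page, numbered from 1
        current=str(page_index + 1),  # the page being rendered right now
    )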
Writing the HTML files
This is where the problems start to show.
for page_number, page_entries in enumerate(entries):
    with open(filename, 'w+') as fw:
        fw.write(template.render(data=page_entries, pg=pnum))
    fw.close()

    fw = open('page/%s.html' % page_number, 'w')
    fw.write(template.render(data=page_entries, pg=pnum))
For every page, I wrote two HTML files.
One was hardcoded to page/end.html and got overwritten on every loop iteration. Only the last write survived. That file effectively acted as "the final page," but nothing in the code ever said that explicitly.
Its meaning depended entirely on the fact that the loop ran in order and that the last iteration happened to write last. If the loop order changed, if pages were skipped, or if rendering was ever refactored, it would break.
The second file was the actual paginated output: page/0.html, page/1.html, and so on.
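A small restructuring would have made the intent explicit: render the numbered pages in the loop, then write end.html exactly once from the last chunk. A sketch, keeping the same template variables:

for page_index, page_entries in enumerate(entries):
    with open('page/%s.html' % page_index, 'w') as fw:
        fw.write(template.render(data=page_entries, pg=pnum))

# "the final page" is now stated directly instead of relying on loop order
with open('page/end.html', 'w') as fw:
    fw.write(template.render(data=entries[-1], pg=pnum))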
The biggest lesson Auger taught me is that you need to stop bad data as it comes in, not clean it up later.
A lot of the code in Auger exists purely to compensate for earlier mistakes. Scripts like matchdrop.py are there because the ingest path happily accepted missing URLs, missing dates, and duplicates. Instead of preventing those cases at insert time, I let them through and tried to repair the damage afterward.
I learned that the SQL schema should do more of the work. If a value should never be null, the database should enforce it. If two rows should never represent the same article, the database should make that impossible. Most of my cleanup logic only existed because those rules weren't there.
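On the original single-table schema, that would have been a couple of constraints rather than a cleanup script. A sketch, using the same column names:

ALTER TABLE posts
    ALTER COLUMN article_url SET NOT NULL,
    ALTER COLUMN article_date SET NOT NULL;

ALTER TABLE posts
    ADD CONSTRAINT posts_article_url_key UNIQUE (article_url);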
How I would improve it now
The database would do more of the work. Posts and sites would be separate tables. Hostnames and favicons would be stored once instead of copied onto every row. Uniqueness and required fields would be enforced by the schema instead of patched later by cleanup scripts. Rerunning the pipeline would update existing rows instead of relying on deletes and reinserts to correct bad data.
The pipeline would be safe to rerun. If it crashed halfway through, running it again would pick up where it left off instead of leaving the database in a half-broken state.
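A rough sketch of what that normalized schema and an idempotent insert might look like; the table and column names here are mine, not what Auger actually shipped:

CREATE TABLE sites (
    site_id  serial PRIMARY KEY,
    host     text NOT NULL UNIQUE,
    favicon  text
);

CREATE TABLE posts (
    post_id       serial PRIMARY KEY,
    site_id       integer NOT NULL REFERENCES sites (site_id),
    article_title text NOT NULL,
    article_url   text NOT NULL UNIQUE,
    article_date  date NOT NULL
);

-- rerunning the pipeline updates rows in place instead of duplicating them
INSERT INTO posts (site_id, article_title, article_url, article_date)
VALUES (%s, %s, %s, %s)
ON CONFLICT (article_url) DO UPDATE
SET article_title = EXCLUDED.article_title,
    article_date  = EXCLUDED.article_date;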
Auger is not good code. But it is honest. It reflects exactly what I knew at the time, and it is the reason I understand these systems now.
I'm proud of it anyway.
Thanks
Auger was not a solo effort.
Thomasarus handled the CSS and overall styling for the site. The visual side of the project is almost entirely his work, and it did a lot to make Auger feel finished instead of like a raw data dump.
Som taught me how to use Jinja2. Learning how data gets passed into templates and turned into real HTML pages was a huge shift in how I thought about building things. I still use it today.
Wally got me started in the first place, set up the entry point for asyncio, introduced me to list comprehensions, and wrote the core pagination code. Having someone show me how to start made the project feel possible and much less overwhelming.
And Lykkin, for his guidance throughout the project, for helping me understand what problems I was actually trying to solve, and for keeping things simple and lighthearted while being patient with me.
23.03.21 00:51 Lykkin:
Aleph do something real stupid
like
have a cron job that takes dumps of your db and throws it into a git repo, commits it, and pushes it up to github
then when you go to build your containers it checks out the repo and loads the dump
how many tables do you have?
I wouldn't have started my journey without them, and I certainly wouldn't have finished this project without their guidance.