Reddit’s API is effectively dead for archival. Third-party apps are gone. Reddit has threatened to cut off access to the Pushshift dataset multiple times. But 3.28TB of Reddit history exists as a torrent right now, and I built a tool to turn it into something you can browse on your own hardware.

The key point: This doesn’t touch Reddit’s servers. Ever. Download the Pushshift dataset, run my tool locally, get a fully browsable archive. Works on an air-gapped machine. Works on a Raspberry Pi serving your LAN. Works on a USB drive you hand to someone.

What it does: Takes compressed data dumps from Reddit (.zst), Voat (SQL), and Ruqqus (.7z) and generates static HTML. No JavaScript, no external requests, no tracking. Open index.html and browse. Want search? Run the optional Docker stack with PostgreSQL – still entirely on your machine.
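To make that pipeline concrete, here is a minimal sketch of the dump-to-HTML step. It is an illustration, not the tool's actual code. It assumes the dump has already been decompressed into newline-delimited JSON (the layout Pushshift dumps use) and that each submission record carries `title`, `author`, and `selftext` fields. The names `iter_records`, `render_post`, and `PAGE` are hypothetical.

```python
import html
import io
import json
from typing import Iterable, Iterator

# Hypothetical page template: plain HTML, no scripts, no remote assets.
PAGE = """<!DOCTYPE html>
<html><head><meta charset="utf-8"><title>{title}</title></head>
<body><h1>{title}</h1><p>by {author}</p><div>{body}</div></body></html>
"""


def iter_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per JSON line, skipping blank or malformed lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # one damaged line should not abort the whole run
        if isinstance(record, dict):
            yield record


def render_post(record: dict) -> str:
    """Render one submission as a self-contained static page."""
    return PAGE.format(
        title=html.escape(str(record.get("title", "(untitled)"))),
        author=html.escape(str(record.get("author", "[deleted]"))),
        body=html.escape(str(record.get("selftext", ""))),
    )


if __name__ == "__main__":
    # Assumption: in real use, the lines come from a streaming zstd reader
    # over the dump file rather than from an in-memory string.
    sample = io.StringIO(
        '{"title": "Hello <b>", "author": "alice", "selftext": "hi"}\nnot json\n'
    )
    for rec in iter_records(sample):
        print(render_post(rec))
```

Escaping every field with `html.escape` matters here because the archive is made of untrusted user content; without it, a post body could inject markup into the generated pages.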

API & AI Integration: Full REST API with 30+ endpoints – posts, comments, users, subreddits, full-text search, aggregations. Also ships with an MCP server (29 tools) so you can query your archive directly from AI tools.

Self-hosting options:

  • USB drive / local folder (just open the HTML files)
  • Home server on your LAN
  • Tor hidden service (2 commands, no port forwarding needed)
  • VPS with HTTPS
  • GitHub Pages for small archives

Why this matters: Once you have the data, you own it. No API keys, no rate limits, no ToS changes can take it away.

Scale: Tens of millions of posts per instance. PostgreSQL backend keeps memory constant regardless of dataset size. For the full 2.38B post dataset, run multiple instances by topic.
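Constant memory usually comes down to streaming records in fixed-size batches instead of loading a whole dump into RAM. The sketch below shows that pattern under stated assumptions: `batched` and `load` are hypothetical names, and `insert_batch` stands in for whatever bulk-insert call the database layer provides. It is not taken from the project's code.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List


def batched(records: Iterable[dict], size: int) -> Iterator[List[dict]]:
    """Yield lists of at most `size` records without materializing the input."""
    if size < 1:
        raise ValueError("size must be at least 1")
    it = iter(records)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def load(
    records: Iterable[dict],
    insert_batch: Callable[[List[dict]], None],
    size: int = 10_000,
) -> int:
    """Feed records to `insert_batch` chunk by chunk; return the record count.

    Only one chunk is held in memory at a time, so peak usage depends on
    `size`, not on how large the dump is.
    """
    total = 0
    for chunk in batched(records, size):
        insert_batch(chunk)
        total += len(chunk)
    return total
```

Passing a generator as `records` keeps the whole path lazy: nothing is read from the source until the next batch is requested.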

How I built it: Python, PostgreSQL, Jinja2 templates, Docker. Used Claude Code throughout as an experiment in AI-assisted development. Learned that the workflow is “trust but verify” – it accelerates the boring parts but you still own the architecture.

Live demo: https://online-archives.github.io/redd-archiver-example/

GitHub: https://github.com/19-84/redd-archiver (Public Domain)

Pushshift torrent: https://academictorrents.com/details/1614740ac8c94505e4ecb9d88be8bed7b6afddd4

  • SteveCC@lemmy.world
    2 months ago

    Wow, great idea. So much useful information and discussion that users have contributed. Looking forward to checking this out.

  • frongt@lemmy.zip
    2 months ago

    And only a 3.28 TB database? Oh, because it’s compressed. Includes comments too, though.

      • muusemuuse@sh.itjust.works
        2 months ago

        If only I had the space and bandwidth. I would host a mirror via Lemmy and drag the traffic away.

        Actually, isn’t there a way to decentralize this so that it can be accessed from regular browsers on the internet? Live content here, archive everywhere.

        • psycotica0@lemmy.ca
          2 months ago

          Someone could format it into essentially static pages and publish it on IPFS. That would probably be the easiest “decentralized hosting” method that remains browsable

  • Tanis Nikana@lemmy.world
    2 months ago

    Reddit is hot stinky garbage but can be useful for stuff like technical support and home maintenance.

    Voat and Ruqqus are straight-up misinformation and fascist propaganda, and if you excise them from your data set, your data will dramatically improve.

  • 19-84@lemmy.dbzer0.com (OP)
    2 months ago

    PLEASE SHARE ON REDDIT!!! I have never had a reddit account and they will NOT let me post about this!!

    • Bazell@lemmy.zip
      2 months ago

      We can’t share this on Reddit, but we can share it on other platforms. Basically, what you have done is scrape tons of data for AI training, something like “create your own AI Redditor”. Greedy Reddit management will dislike it very much, even if you tell them that this is for cultural heritage. Your work is great anyway. Sadly, I do not have enough free space to download and store all this data.

  • breakingcups@lemmy.world
    2 months ago

    Just so you’re aware, it is very noticeable that you also used AI to help write this post and its use of language can throw a lot of people off.

    Not to detract from your project, which looks cool!

        • idealism_nearby@lemmy.world
          2 months ago

          Would love to see you learn an entire foreign language just so you are able to communicate with the world without being laughed at by people as hostile as yourself.

          • potustheplant@feddit.nl
            2 months ago

            They said it wasn’t their “first” language, which leads me to believe that they do speak English. If that’s the case, then they are indeed kind of lazy. There have already been studies on the impact of AI when used for communication, and the results are not positive.

            This isn’t something I’d personally point out and criticize, just something I wouldn’t do personally. Take the time to express your own ideas in your own words. The long term cost is higher than the short term gains.

            • lad@programming.dev
              2 months ago

              I have A1 and A2 levels in a couple of non-first languages. Technically I can speak those; realistically I don’t, and I will not be able to communicate anything more complex than ‘here, take a look’.

              So I don’t agree with your absolutistic stance

              • potustheplant@feddit.nl
                2 months ago

                There’s nothing “absolutistic” about my “stance”. If you’re rusty using a language, you won’t get better if someone else does the homework for you. Make an effort, make mistakes, write in a way that sounds weird, who cares. But practice. If you only take the easy way out, that’ll be your only option in the future.

                Although, like I already said, that’s MY way of thinking about it. If you want to use AI to write your stuff, you do you. It doesn’t negate the fact that, while it’s not “wrong”, it’s the lazy (or minimum effort) option. Don’t know why it bothers you so much.

            • rumba@lemmy.zip
              2 months ago

              Hey I drove to the library, picked up all these things you needed, got dinner here ya go, free!

              You drove? man that’s lazy…

              He used AI to clean up translation and save time after he spent a fuck ton of time curating and delivering us a helpful product. Calling him out as lazy is an awful take.

              • potustheplant@feddit.nl
                2 months ago

                First, that’s an awful analogy.

                Second, you’re assuming (for some unknown reason) that they “cleaned up” the “translation” using ai. You have literally no idea exactly how they wrote the post. It’s kinda weird to make up a random scenario but ok.

                Third, no, it’s not an awful take. You can code something that requires a ton of effort but write awful documentation. One thing does not make the other impossible.

                Fourth, I already explained that there have been studies that concluded that using AI to write stuff for you has a negative impact on your communication skills. This is not an opinion or me being ungrateful or whatever. I was just sharing information.

                • rumba@lemmy.zip
                  2 months ago

                  If that documentation was awful, I’d REALLY like to see your take on NixOS :)

              • 19-84@lemmy.dbzer0.com (OP)
                2 months ago

                there are the so-called activists that complain a lot, then there are the activists that deliver projects and code… enough said

                • potustheplant@feddit.nl
                  2 months ago

                  “Activists”? What are you even talking about?

                  Regardless, I specifically said that what you did wasn’t wrong or anything like that. I simply think that it’s going to do you more harm than good in the long run. You’re free to do whatever you want though, obviously.

                  Another piece of advice. When someone simply shares an opinion, don’t get instantly butthurt over nothing. Otherwise this might as well be reddit.

        • MadMonkey@lemmy.world
          2 months ago

          Brush, you do not seem like a nice person to be around.

          Spread love and kindness, not hate.

          I hope you have a better rest of your day.

      • Melvin_Ferd@lemmy.world
        2 months ago

        You’re awesome. AI is fun and there’s nothing wrong with using it, especially how you did. Lemmy was hit hard with AI hate propaganda. China probably trying to stop its growth and development in other countries or some stupid shit like that. But you’re good. Fuck them

        • rumba@lemmy.zip
          2 months ago

          Yup, if there was ever a decent use for AI, this is it. Lemmy can (and will) hate the shit out of it, but it took a little burden off the shoulders of someone doing us a great service.

    • Gerudo@lemmy.zip
      2 months ago

      Say what you will about Reddit, but there is tons of information on that platform that’s not available anywhere else.

      • UnderpantsWeevil@lemmy.world
        2 months ago

        :-/

        You can definitely mine a bit of gold out of that pile of turds. But you could also go to the library and receive a much higher ratio of signal to noise.

        • pixeltree@lemmy.blahaj.zone
          2 months ago

          This one specific bug in this one niche library has probably not been written about in a book, and even if it has I doubt that book is in my local library, and even if it is I doubt I can fucking find it

          • mirisgaiss@lemmy.world
            2 months ago

            obscure problems almost always have reddit comments as search results, and there’s no forums or blogs with any of it anymore. be nice to have around solely for that… though I’m sure if shit like /pics or whatever else was removed it could get significantly smaller…

      • irmadlad@lemmy.world
        2 months ago

        I use Reddit for reference through RedLib. I could see how having an on-premise repository would be helpful. How many subs were scraped in this 3.28 TB backup? The reason I ask: I’d have little interest in, say, News or Politics, but there are some good subs that deal with Linux, networking, selfhosting, and some old subs I used to help moderate like r/degoogle, r/deAmazon, etc.

        • 19-84@lemmy.dbzer0.com (OP)
          2 months ago

          the torrent has data for the top 40,000 subs on reddit. thanks to watchful1 splitting the data by subreddit, you can download only the subreddit you want from the torrent 🙂

    • 19-84@lemmy.dbzer0.com (OP)
      2 months ago

      redarc uses reactjs to serve the web app; redd-archiver uses a hybrid architecture that combines static page generation with postgres search via flask. it is more like a hybrid static site generator with web app capabilities through docker and flask. the static pages with sorted indexes can be viewed offline and served on hosts like github and codeberg pages.

        • 19-84@lemmy.dbzer0.com (OP)
          2 months ago

          redd-archiver will take up more disk space because the database exists along with the static html

    • muusemuuse@sh.itjust.works
      2 months ago

      You know what would be a good way to do it? Take all that content and throw it on a federated service like ours. Publicly visible. No bullshit. And no reason to visit Reddit to get that content. Take their traffic away.

        • muusemuuse@sh.itjust.works
          2 months ago

          What would they say? It’s information that’s freely available, no payment required, no accounts needed to simply read it, no copyrights. Where’s the legal issue in hosting a duplicate of the content?

          • limelight79@lemmy.world
            2 months ago

            It might fall under the same concept that recipes do - you can’t copyright a recipe, but a collection of recipes (such as a book) is copyrightable.

            In any case, they have a lot more money to pay lawyers than you or I do, I’ll bet, so even if you are right, that doesn’t mean you’ll have the money to actually win.

          • El Barto@lemmy.world
            2 months ago

            Oh I agree with you, friend. The problem is that they’ll say that they’re losing ad revenue. So they’ll try and sue, even if they’re in the wrong.

  • a1studmuffin@aussie.zone
    2 months ago

    This seems especially handy for anyone who wants a snapshot of Reddit from pre-enshittification and AI era, where content was more authentic and less driven by bots and commercial manipulation of opinion. Just choose the cutoff date you want and stick with that dataset.
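
Picking a cutoff like this can be sketched as a filter over the dump records. This is a minimal illustration, assuming each record carries a `created_utc` Unix timestamp (stored as either a number or a numeric string); the function name `before_cutoff` and the example date are hypothetical.

```python
from datetime import datetime, timezone
from typing import Iterable, Iterator


def before_cutoff(records: Iterable[dict], cutoff: datetime) -> Iterator[dict]:
    """Yield only the records created strictly before `cutoff`.

    `cutoff` should be timezone-aware; a naive datetime would be
    interpreted in the local timezone by `timestamp()`.
    """
    cutoff_ts = cutoff.timestamp()
    for record in records:
        try:
            created = float(record["created_utc"])
        except (KeyError, TypeError, ValueError):
            continue  # skip records with a missing or unparseable timestamp
        if created < cutoff_ts:
            yield record


if __name__ == "__main__":
    # Arbitrary example date; choose whatever cutoff suits your archive.
    cutoff = datetime(2023, 7, 1, tzinfo=timezone.utc)
    posts = [
        {"id": "a", "created_utc": 1600000000},
        {"id": "b", "created_utc": "1700000000"},
    ]
    print([p["id"] for p in before_cutoff(posts, cutoff)])
```

Because the filter is a generator, it can sit in front of any later processing step without holding the dump in memory.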