I’ve been thinking about adding this to my “Fuck it, I’ll do it myself” / SHTF pile. I have a spare 10-15 GB for a good selection of basic articles (across the sciences, history, pop culture trivia, etc.).

https://get.kiwix.org/en/solutions/hotspots/content-bundles/

https://get.kiwix.org/en/solutions/hotspots/imager-service/

There’s something inherently cool about having Wikipedia in a box (yes, you’d likely need to refresh it once a year), but I’ve never heard of anyone actually self-hosting a Kiwix instance.

  • surfrock66@lemmy.world
    23 hours ago

    Yes, and I actually use it to train a local LLM so I’m not hammering the internet. I have a ton of storage and like to keep my kids in the sandbox, so we have Wikipedia, Project Gutenberg, Khan Academy, and a bunch of others, all hosted behind an Apache reverse proxy that uses mellon, so there’s LDAP auth.

    • Domi@lemmy.secnd.me
      8 hours ago

      Do you actually train the LLM or use RAG? I have been looking for a local LLM + Wikipedia RAG solution for a while now.

      For now I just have kiwix-serve + SearXNG doing a simple search, but the Kiwix search is… questionable.
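For anyone wiring up the same kiwix-serve + search setup, here is a minimal sketch of building a full-text search URL against a local kiwix-serve instance. The host, port, and book name are placeholders, and the parameter names (`books.name`, `pattern`, `pageLength`) are as I understand them from the kiwix-serve documentation; check them against the version you run before relying on this.

```python
from urllib.parse import urlencode


def kiwix_search_url(base_url, book, query, page_length=5):
    """Build a kiwix-serve full-text search URL.

    Assumption: the server exposes /search and accepts the
    books.name, pattern and pageLength query parameters.
    """
    params = {
        "books.name": book,         # name of the ZIM book to search (placeholder)
        "pattern": query,           # the search terms
        "pageLength": page_length,  # how many results to ask for
    }
    return f"{base_url.rstrip('/')}/search?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical local instance and book name.
    print(kiwix_search_url("http://localhost:8080", "wikipedia_en_all", "solar panel"))
```

The URL can then be fetched with `urllib.request.urlopen` or handed to whatever front end is already doing the searching.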

      • SuspciousCarrot78@lemmy.world (OP)
        5 hours ago

        Somewhere in my documents, I have a scoped ticket for how to use Kiwix as the source the LLM pulls information from directly, so it can populate its answer organically and respond naturally to the question at hand without word-vomiting a complete wiki entry. The last time I looked, you could query the Kiwix DB directly without going through the search engine.

        I can dig that up for you if it still exists; it’s actually why I’m looking at kiwix (back burner project for now but the spirit moved me).
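Until that ticket turns up, here is a rough sketch of the "don't word-vomit the whole entry" part. It is not the scoped design mentioned above, just one way to do it under stated assumptions: the article HTML has already been fetched (from kiwix-serve or a ZIM reader), only `<p>` text is worth keeping, and a character budget is a good enough stand-in for a token budget. `ParagraphText`, `article_excerpt`, and `build_prompt` are hypothetical names.

```python
from html.parser import HTMLParser


class ParagraphText(HTMLParser):
    """Collect the plain text of each <p> element, ignoring script/style."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._in_p = False
        self._skip = 0  # > 0 while inside <script> or <style>
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "p":
            self._in_p = True
            self._buf = []

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1
        elif tag == "p" and self._in_p:
            self._in_p = False
            text = " ".join("".join(self._buf).split())  # collapse whitespace
            if text:
                self.paragraphs.append(text)

    def handle_data(self, data):
        if self._in_p and not self._skip:
            self._buf.append(data)


def article_excerpt(html, max_chars=1200):
    """Return leading paragraphs of an article, staying within max_chars."""
    parser = ParagraphText()
    parser.feed(html)
    parser.close()
    picked, used = [], 0
    for para in parser.paragraphs:
        if used + len(para) > max_chars:
            break
        picked.append(para)
        used += len(para)
    return "\n\n".join(picked)


def build_prompt(question, excerpt):
    """Wrap the excerpt as context so the model answers instead of reciting."""
    return (
        "Answer the question using only the context below, in your own words.\n\n"
        f"Context:\n{excerpt}\n\nQuestion: {question}"
    )
```

Taking only the leading paragraphs is a crude choice; swapping in a relevance ranking over `parser.paragraphs` would get closer to real RAG.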

        PS: You’re aware of LLM-wiki? That might suit your purposes better, if your corpus is bespoke and updating. Works nicely.

        https://tinyurl.com/llmwiki

    • SuspciousCarrot78@lemmy.world (OP)
      23 hours ago

      That was actually my immediate thought. I already use Wikipedia as a trusted source for my LLM, but I would prefer to self-host and not hammer them.

      130GB to fit the entirety of Wikipedia is basically nothing, and I’m mildly embarrassed not to have done it already.

      • surfrock66@lemmy.world
        23 hours ago

        I also try to participate in some of the farms, running zimit and mwoffliner to help make more archives. Feels like I’m helping.