this post was submitted on 07 Sep 2023
Asklemmy
It's never going to be all knowledge, since a lot of stuff is just lost or was never recorded. A ton of stuff (like this thread) is probably low on the priority list for recording as well. But the closest you'd probably get to a full catalog of human knowledge (at least text-based) is the huge data sets of nearly all text data on the internet used for training LLMs. I wouldn't be surprised if there are soon ones that include video and pictures as well, since newer AI models are starting to be able to interpret those too.
I believe this is one of those data sets: https://github.com/yaodongC/awesome-instruction-dataset
Edit: here's a big data set that made up a large part of GPT-3's training data: https://commoncrawl.org/