this post was submitted on 04 Sep 2024
65 points (90.1% liked)

Ask Lemmy

27027 readers
671 users here now

A Fediverse community for open-ended, thought provoking questions

Please don't post about US Politics. If you need to do this, try !politicaldiscussion@lemmy.world


Rules: (interactive)


1) Be nice and; have funDoxxing, trolling, sealioning, racism, and toxicity are not welcomed in AskLemmy. Remember what your mother said: if you can't say something nice, don't say anything at all. In addition, the site-wide Lemmy.world terms of service also apply here. Please familiarize yourself with them


2) All posts must end with a '?'This is sort of like Jeopardy. Please phrase all post titles in the form of a proper question ending with ?


3) No spamPlease do not flood the community with nonsense. Actual suspected spammers will be banned on site. No astroturfing.


4) NSFW is okay, within reasonJust remember to tag posts with either a content warning or a [NSFW] tag. Overtly sexual posts are not allowed, please direct them to either !asklemmyafterdark@lemmy.world or !asklemmynsfw@lemmynsfw.com. NSFW comments should be restricted to posts tagged [NSFW].


5) This is not a support community.
It is not a place for 'how do I?', type questions. If you have any questions regarding the site itself or would like to report a community, please direct them to Lemmy.world Support or email info@lemmy.world. For other questions check our partnered communities list, or use the search function.


Reminder: The terms of service apply here too.

Partnered Communities:

Tech Support

No Stupid Questions

You Should Know

Reddit

Jokes

Ask Ouija


Logo design credit goes to: tubbadu


founded 1 year ago
MODERATORS
 

Obviously there's not a lot of love for OpenAI and other corporate API generative AI here, but how does the community feel about self hosted models? Especially stuff like the Linux Foundation's Open Model Initiative?

I feel like a lot of people just don't know there are Apache/CC-BY-NC licensed "AI" they can run on sane desktops, right now, that are incredible. I'm thinking of the most recent Command-R, specifically. I can run it on one GPU, and it blows expensive API models away, and it's mine to use.

And there are efforts to kill the power cost of inference and training with stuff like matrix-multiplication free models, open source and legally licensed datasets, cheap training... and OpenAI and such want to shut down all of this because it breaks their monopoly, where they can just outspend everyone scaling , stealiing data and destroying the planet. And it's actually a threat to them.

Again, I feel like corporate social media vs fediverse is a good anology, where one is kinda destroying the planet and the other, while still niche, problematic and a WIP, kills a lot of the downsides.

you are viewing a single comment's thread
view the rest of the comments
[–] tkw8@lemm.ee 27 points 2 months ago (2 children)

I think it’s amazing. I’m running Ollama with a bunch of open-source llms. You’re right. It’s so good. The problem is keeping up to date on what the newest development is.

The pace of progress is so fast and it’s really difficult to know what the cool kids are experimenting with this moment.

[–] brucethemoose@lemmy.world 15 points 2 months ago* (last edited 2 months ago) (1 children)

Oh, and if your hardware is AMD or Nvidia, you should really give exllama a shot.

If it's Apple, you should investigate kobold.cpp and more "nitty gritty" llama.cpp backends.

I have largely negative feelings towards ollama for a lot of reasons, but one of them is that it hides a lot of the knobs to get the absolute best out of LLMs, and understand how they work.

[–] tkw8@lemm.ee 6 points 2 months ago (1 children)

I’m running Nvidia on Ubuntu. I’ll give exllama a shot.

[–] brucethemoose@lemmy.world 7 points 2 months ago* (last edited 2 months ago) (1 children)

I'd recommend TabbyAPI with your favorite frontend, anything that works with OpenAI.

Or exui (which is what I tend to use) but is a bit more manual. text-gen-web-ui has better samplers, but its IMO more clanky and crufty, and really slow at long context.

Also, uh, you'll have to be careful about picking a model, you have to fit it to your GPU instead of letting ollama do it for you. I view this as a positive, as it forces you to search more a more optimal fit.

[–] tkw8@lemm.ee 5 points 2 months ago (1 children)

I manually specify what models to pull. I’m not running anything too crazy. My largest model is gemma27B. But I’ve worked with dolphin-mistral which was fun.

[–] brucethemoose@lemmy.world 6 points 2 months ago (1 children)

If you have a 24GB card, just go straight to the most recent Command R, a 3.75bpw-4bpw quantization. It's incredible, and you can do the full 131K context on a 24GB GPU easy.

Gemma 27B Is actually quite good, but "narrow." Its super low context and seems to be hyper optimized for short chatbot-arena style questions.

[–] tkw8@lemm.ee 4 points 2 months ago* (last edited 2 months ago) (1 children)

Gemma 27B Is actually quite good, but "narrow." Its super low context and seems to be hyper optimized for short chatbot-arena style questions.

This is the stuff I love to know so thanks for sharing. I will be pulling Command R tomorrow.

[–] brucethemoose@lemmy.world 3 points 2 months ago

Good! So Command-R excels at "RAG" style tasks like asking questions about a huge document, continuing a long story or so on. You should also read up on its super intricate system prompt format, which can steer it quite well.

I dunno about code, I tend to use Mistral Code 22B (or deepseek v2 API) for that.

I am happy to ramble on about this stuff, just ask.

[–] brucethemoose@lemmy.world 11 points 2 months ago (1 children)

Honestly a big problem is that the community for filtering the news has "collapsed."

The only reasonable congregation was basically /r/localllama, and due to a number of factors (including, apparently, a Reddit bug that was driving away traffic according to a mod), and its shrunken a ton.

Twitter, linkedin, youtube and such are awful and full of straight up lies. Huggingface is just impossible to navigate and filter. There are a few niche aggregators, but they come and go.

Hence I was hoping lemmy would grow its existing ML communities, but most of lemmy seems broadly anti AI, even anti open source AI, hence this post to get a feel if that's true.

[–] tkw8@lemm.ee 4 points 2 months ago (1 children)

I read localllama through redlib but I don’t contribute. I am not technical enough to contribute and I don’t understand the math.

I have been looking at YouTube for some videos to try to explain it, but I haven’t found anything that is in the sweet spot between “video for non-technical people” and “video for people with PhD and quantum physics”

[–] brucethemoose@lemmy.world 3 points 2 months ago* (last edited 2 months ago) (1 children)

It's a giant mess. Even the technical vidoes tend to be theoretical, and are either obsolete or do nothing to help you actually run them.

I would know nothing if I hadn't been following the community since the Pygmalion/ESRGAN days

[–] Bob_Robertson_IX@lemmy.world 2 points 2 months ago (3 children)

I've spent the past 2 years looking for the open source AI community, but haven't really found it. I've tinkered with Stable Diffusion and Ollama and I want to learn more, but haven't found the right places online yet.

[–] brucethemoose@lemmy.world 8 points 2 months ago

I'll give you one hint, a lot of the community is locked away in various Discords.

This is one of the many reasons I hate Discord.

[–] brucethemoose@lemmy.world 3 points 2 months ago (1 children)

And just to be more helpful, I can point you in the right direction depending on your hardware.

[–] Bob_Robertson_IX@lemmy.world 1 points 2 months ago (1 children)

Yeah, I hate Discord too but that has been the best place I've found the best information, but even then it doesn't really feel like a community.

I'm running on an Apple M1 at the moment, likely to upgrade to an M4 when it is released.

[–] brucethemoose@lemmy.world 1 points 2 months ago* (last edited 2 months ago) (1 children)

What RAM capacity?

Honestly, if LLMs are your focus, you should just upgrade to a used M2 Max (or Ultra) when the M4 comes out, lol. Basically the only thing that matters is RAM capacity and bandwidth, and the M2 is just going to be faster and better than a similarly priced M4.

Or better yet, upgrade to and AMD Strix Halo. This will buy you into linux and the cuda ecosystem (through AMD rocm), which is going to open a lot of doors and save headaches (while admittedly creating other headaches).

[–] Bob_Robertson_IX@lemmy.world 1 points 2 months ago (1 children)

Honestly I've mostly been playing around with image generators and learning how to write a good prompt. But LLMs are where I see real value and would love to learn more about self-hosting one, and custom training it.

I find it interesting that the M2 is better for LLM than the M4 will be... what's the reason for this?

[–] brucethemoose@lemmy.world 1 points 2 months ago* (last edited 2 months ago)

RAM capacity and bandwidth.

That basically the only two things that matter for local LLM performance, as it has to read the entire model from memory for every token (aka half word). And for the same money, a "higher end" M2 (like an M2 Max or Ultra) will just have more of it than the equivalent cost M3 or (probably) M4.

[–] sunzu2@thebrainbin.org 1 points 2 months ago

Hate to suggested it but have you checked reddit localllama?