this post was submitted on 29 Oct 2023
94 points (93.5% liked)

Programming

17157 readers
238 users here now

Welcome to the main community in programming.dev! Feel free to post anything relating to programming here!

Cross posting is strongly encouraged in the instance. If you feel your post or another person's post makes sense in another community cross post into it.

Hope you enjoy the instance!

Rules

Rules

  • Follow the programming.dev instance rules
  • Keep content related to programming in some way
  • If you're posting long videos try to add in some form of tldr for those who don't want to watch videos

Wormhole

Follow the wormhole through a path of communities !webdev@programming.dev



founded 1 year ago
MODERATORS
 

Got myself a few months ago into the optimization rabbit hole as I had a slow quant finance library to take care of, and for now my most successful optimizations are using local memory allocators (see my C++ post, I also played with mimalloc which helped but custom local memory allocators are even better) and rethinking class layouts in a more "data-oriented" way (mostly going from array-of-structs to struct-of-arrays layouts whenever it's more advantageous to do so, see for example this talk).

What are some of your preferred optimizations that yielded sizeable gains in speed and/or memory usage? I realize that many optimizations aren't necessarily specific to any given language so I'm asking in !programming@programming.dev.

you are viewing a single comment's thread
view the rest of the comments
[–] Die4Ever@programming.dev 23 points 11 months ago* (last edited 11 months ago)

ok maybe not my biggest or cleanest optimization, but an interesting one

I made the WCS Predictor for StarCraft 2 eSports, which would simulate the whole year of tournaments to figure out the chances for each player qualifying for the world championship (the top 16 players by WCS Points), and it would also show the events that would help/hurt the players. It was a big monte carlo simulation that would run millions of iterations. (My memory is fuzzy here because this was 2013 to 2014)

Originally I did a data-based approach where each Tournament object had a vector of Round objects (like round of 32, or semifinals, etc) which had a vector of Match objects (like Soulkey vs Innovation, or a 4 player group). The Round class had a property for best_of (best of 3, best of 5, etc). This was really inflexible cause I couldn't properly simulate complex tournament formats that didn't fit my data, and it required a lot of if statements that could cause branch prediction misses.

A tournament definition looked something like this:

older code example

		Round ro32("ro32");
		ro32.best_of=3;
		ro32.points_for_3rd=150+25;
		ro32.points_for_4th=100+25;
		Match ro32GroupA;

		ro32.matches.push_back(ro32GroupA);
		ro32.matches.push_back(ro32GroupB);
		ro32.matches.push_back(ro32GroupC);
		ro32.matches.push_back(ro32GroupD);
		ro32.matches.push_back(ro32GroupE);
		ro32.matches.push_back(ro32GroupF);
		ro32.matches.push_back(ro32GroupG);
		ro32.matches.push_back(ro32GroupH);

		GSL.rounds.push_back( ro32 );

When I rewrote it for the following year, I ditched the data-driven approach to try to get more efficiency and flexibility, but I definitely didn't want virtual functions either, so I decided to use templates instead and a custom class for each tournament. Now creating a tournament looked like this:

newer code example


template
class Round : public RoundBase
{
public:
	MatchType matches[num_matches];
// ... more stuff ...
};

class CodeSBase : public TournamentBase
{
public:
	Round<32, SwissGroup, 8, RandomAdvancement<16, 16>, true > ro32;
	Round<16, SwissGroup, 4, A1vsB2<8, 8>, true > ro16;
	Round<8, SingleMatch, 4, StraightAdvancement<4, 4>, true > quarterfinals;
	Round<4, SingleMatch, 2, StraightAdvancement<2, 2>, true > semifinals;
	Round<2, SingleMatch, 1, StraightAdvancement<1, 1>, true > finals;

	void init(vector &prev_matches, vector &upcoming_matches)
	{
		quarterfinals.best_of = 5;
		semifinals.best_of = 7;
		finals.best_of = 7;
		ro32.points_for_placing[3] = 100;
		ro32.points_for_placing[2] = 150;
		ro16.points_for_placing[3] = 300;
		ro16.points_for_placing[2] = 400;
		quarterfinals.points_for_placing[1] = 600;
		semifinals.points_for_placing[1] = 900;
		finals.points_for_placing[1] = 1250;
		finals.points_for_placing[0] = 2000;
		finals.match_placing_to_tournament_placing[0] = 1;
	}


	void predict(Simulation &sim, array &top8, array &bottom24, array &finalists, Rand64 &rng)
	{
		RandomAdvancement<16, 16> Ro32toRo16;
		A1vsB2<8, 8> Ro16toRo8;
		StraightAdvancement<4, 4> Ro8toRo4;
		StraightAdvancement<2, 2> Ro4toRo2;
		StraightAdvancement<1, 1> finalsadv;

		ro32.predict(sim, t_id, Ro32toRo16, rng);
		ro16.AcceptAdvancements(Ro32toRo16.advancing_players);
		ro16.predict(sim, t_id, Ro16toRo8, rng);
		quarterfinals.AcceptAdvancements(Ro16toRo8.advancing_players);
		quarterfinals.predict(sim, t_id, Ro8toRo4, rng);
		semifinals.AcceptAdvancements(Ro8toRo4.advancing_players);
		semifinals.predict(sim, t_id, Ro4toRo2, rng);
		finals.AcceptAdvancements(Ro4toRo2.advancing_players);
		finals.predict(sim, t_id, finalsadv, rng);

		top8 = Ro16toRo8.advancing_players;
		for (uint i = 0; i<8; i++) {
			bottom24[i] = Ro32toRo16.falling_players[i * 2];
			bottom24[i + 8] = Ro32toRo16.falling_players[i * 2 + 1];
			bottom24[i + 16] = Ro16toRo8.falling_players[i];
		}
		finalists[0] = finalsadv.advancing_players[0];
		finalists[1] = finalsadv.falling_players[0];
	}

All data had very good locality using arrays instead of vectors, everything could be inlined and branch predicted and prefecthed, complex tournament formats could be built even properly handling high placements from one tournament granting you seeding into a different tournament. I think I gained like 3x to 5x speedboost from this alone, I also made it multithreaded and work in batches which improved performance further, which allowed me to get updated results more quickly and with a higher number of samples. I also made it so it could output the results and then keep processing more batches of samples to refine the numbers from there. I wouldn't normally suggest these kinds of optimizations but it's a very unusual program to have such a wide hotloop

The website is gone, but here's a working archive of it (which I didn't know existed until writing this post, I thought there were only broken archives)

archive links, me reminiscing on old times, get ready for stats and graphs overload

home page: https://web.archive.org/web/20150822091605/http://sc2.4ever.tv:80/

a tournament page: https://web.archive.org/web/20160323082041/http://sc2.4ever.tv/?page=tournament&tid=27

a player's page: https://web.archive.org/web/20161129233856/http://sc2.4ever.tv/?pid=73

a general checkup page: https://web.archive.org/web/20161130055346/http://sc2.4ever.tv/?page=checkup

a page for players who must win tournaments to qualify: https://web.archive.org/web/20161129235740/http://sc2.4ever.tv/?page=must_wins

page showing the simulations history: https://web.archive.org/web/20161130055230/http://sc2.4ever.tv/?page=simulations

FAQ page: https://web.archive.org/web/20160323073344/http://sc2.4ever.tv/?page=faq

the fantasy league WCS Wars: https://web.archive.org/web/20161130055233/http://sc2.4ever.tv/?page=gamehome

my WCS Wars user page https://web.archive.org/web/20160704040152/http://sc2.4ever.tv/?page=user&uid=1