Vinnl on June 13: By running the yes command twice via Terminal under Mac OS X, users were able to max out their computer's CPU, and thus see whether the failure was heat related.
Ahh, interesting, maybe he did have a legitimate reason for it. I always assumed it was busywork to make the customer feel better. Now if I could just prove to Apple that my computer randomly shuts down all the time. Mine just freezes. For reference, I had that exact same issue on my MBA; it went on for years, many times while watching YouTube or doing something GarageBand related. The issue has now gone away completely. BenjiWiebe on June 14:
Funny, I've seen that multiple times on my homebuilt Fedora system. You know the system keeps logs, right? Open the console. Hm, my old Dellbuntu laptop does the same. It's been dropped a few times in its 10 years of service, and now it sometimes shuts down if I tap it too brusquely, regardless of the temperature. The BIOS will report it overheated, and I don't know what's going on.
With DDR memory it should be faster than that. Why is it being limited to main memory speed? Surely the yes program, the fragments of the OS being used, and the program reading the data all fit within the L2 cache?
Each byte of data passing through pv is copied at least twice: once from yes into the kernel's pipe buffer, and once from the pipe buffer out into pv (splice can avoid some of that copying). This is also the reason why further "optimization" of this program in assembly was a fool's errand: the bulk of the CPU load is in the kernel. The actual memory traffic, once you include the OS copying, is either 2 or 4 times the quoted speed depending on splice usage, so we're either at theoretical main memory speeds or at double main memory speeds.
Intuitively, I'd still have expected it to be a larger multiple. L2 cache and copying "y" bytes have very little to do with this; I suspect that if you could produce high-granularity timings, almost all of the time would turn out to be syscall overhead. Many, many years ago I was working on the Zeus web server, and we went to surprising lengths to avoid syscalls for performance. Yes, syscalls have overhead, but you make that overhead negligible by calling read with a large size.
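To make the buffering point concrete, here is a minimal sketch of the idea in C. This is an assumption about the general technique, not GNU's actual implementation; the 8 KiB size is just a plausible choice matching numbers mentioned elsewhere in the thread.

    /* Minimal sketch: amortize per-syscall cost by writing a large block of
     * "y\n" per write() call instead of one line per call. */
    #include <string.h>
    #include <unistd.h>

    int main(void) {
        static char buf[8192];
        for (size_t i = 0; i + 2 <= sizeof buf; i += 2)
            memcpy(buf + i, "y\n", 2);          /* fill the buffer once */
        for (;;) {
            if (write(STDOUT_FILENO, buf, sizeof buf) < 0)
                return 1;                       /* e.g. EPIPE when the reader exits */
        }
    }

With a block this size, write runs once per roughly 4096 lines instead of once per line, so the fixed per-syscall cost nearly vanishes from the profile.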
Binding both processes to the same core is definitely a performance hit, though. There might be some gain from a shared cache, but more is lost through the lack of parallelism.
How do you know that the dataset fits in L2? If the pages are not "recycled" through an LRU scheme for allocation, the destination changes every time and the L2 cache is constantly thrashed.
I only learned of pv from this article, so I can't speak to its buffering. I would guess that the kernel tries to re-use recently freed pages to minimise cache thrashing. But anyway, on the yes side, the program isn't re-allocating its 8 KiB buffer after every write, so there's a lot of data being re-read from the same memory location.
In general, only the CPU itself sees the L2 cache. Anything you see on another device (screen, disk, NIC, etc.) has been flushed out of the cache.
Sure, but this is a pipe between two tiny processes, hopefully with very little else being run on the computer at the time; otherwise all bets are off for any benchmarking.
I am probably way behind the current state of CPUs, judging by the downvotes I got, so if you are saying the data can be written to a device without leaving the CPU, I will just concede my ignorance. Don't fret about the downvotes, these magic internet points aren't redeemable anywhere :) It's completely possible for the data to not all leave the CPU.
Then the CPU has no need to send the original page out to main memory. If the data remained in the issuing CPU's cache, I fail to see how the destination device could possibly see it. Writing to memory is the only way that (a) guarantees the data is available elsewhere and (b) is the fastest.
You could make yes faster with the tee syscall. Keep duplicating data from the same fd_in; tee doesn't actually copy the data, so it becomes entirely zero-copy.
On the other hand, the current code is perfectly portable; tee(2) is a Linux-only syscall. Actually, yes should use vmsplice, and pv on the other side of the pipe should use splice. His code could be optimized further by having only one large element in the iovec rather than many small ones.
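For reference, a hedged sketch of the vmsplice idea (Linux-only, and not what GNU yes actually does): instead of copying the buffer's contents into the pipe, the kernel is handed references to the user pages.

    /* Sketch of a vmsplice-based yes. Only valid when stdout really is a pipe. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <string.h>
    #include <sys/uio.h>
    #include <unistd.h>

    int main(void) {
        static char buf[1 << 16];
        for (size_t i = 0; i + 2 <= sizeof buf; i += 2)
            memcpy(buf + i, "y\n", 2);
        struct iovec iov = { .iov_base = buf, .iov_len = sizeof buf };
        for (;;) {
            /* Hand the kernel references to our pages instead of copying them. */
            ssize_t n = vmsplice(STDOUT_FILENO, &iov, 1, 0);
            if (n < 0)
                return 1;   /* stdout is not a pipe, or the reader went away */
            /* A robust version would also handle short transfers here. */
        }
    }

Because the buffer never changes after it is filled, the usual caveat about modifying pages the pipe still references doesn't bite in this sketch.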
Correct, I misread the code. Good catch, I just noticed that it does. AstralStorm on June 13: It still gets piped and hits the same performance. It's about 2x faster in a quick test here. Only copying on the read side? Out of interest, could you please post the code? FreeBSD's yes has just been updated because of this. It looks like that version may drop some data if you get a short write, which is possible when writing to pipes etc.
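The short-write concern is easy to state in code. Here is a hedged sketch of a helper (hypothetical name, not the FreeBSD patch itself) that loops over the remainder instead of dropping it:

    /* write() on a pipe may accept fewer bytes than asked for, so a correct
     * yes must retry with the remainder rather than dropping it. */
    #include <unistd.h>

    static int write_all(int fd, const char *buf, size_t len) {
        while (len > 0) {
            ssize_t n = write(fd, buf, len);
            if (n < 0)
                return -1;          /* caller decides how to report the error */
            buf += n;               /* advance past what the kernel accepted */
            len -= (size_t)n;
        }
        return 0;
    }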
It's come to my attention that lower numbers are not better here. I have filed a bug; we'll get to the bottom of this shortly. I was not going to post this because Hacker News has this ethic of niceness; perhaps we should have a discussion about that. I'm not sure it's a good thing, but I'm not in charge here. The GNU guys never really stepped up to being kernel people. Bitch at me all you want, they didn't get there. It's a funny comment, especially coming from reddit. Anyway, the comment you're quoting is just a shallow jab that belittles the GNU developers' work without contributing anything new or meaningful.
It's telling that you had to spend two paragraphs justifying cross-posting it here. You say that and then immediately become an example of what he's talking about. This "shallow jab that contributes nothing new or meaningful" is, in some circles, known as a "joke." Not really -- I didn't downvote that comment. And I still dispute the notion that negativity is rare or always shunned on this board: to the contrary, it's so commonplace that an actual rule [0] had to be added to try to sway things in the other direction.
Jokes have their place, but bringing up the failure of Hurd in every GNU-related post is banal. And saying they "never really stepped up" to your level as a mighty kernel developer, as if the people who brought us glibc and coreutils lack an understanding of OS internals, just seems rude and curiously out of touch. So negativity here is not rare; when people don't like something they are quick to jump on it. What I was trying to get at is this: if you care about your upvotes, Hacker News promotes a sort of hive mind.
Which is somewhat like "say only nice things unless you are clearly swatting down something that is obviously wrong". Which is mostly fine, fantastic in fact. I'm fine with it; Hacker News is really pleasant because of the generally high quality of the posts and especially the quality of the comments. I'd much rather have it be this way than a free-for-all; those go bad pretty fast. So I'm for the hive mind. I was just pointing out that you can't make jokes and expect to be upvoted.
The joke wasn't banal at all, IMO. GNU has done a lot of good; I've been there since the beginning and paid attention along the way. They have also been pretty self-serving with their choices: every project has to sign away its rights, and then GNU takes full credit for the project even though they had nothing to do with it other than the naming, GNU troff for example. Given their tendency to take credit for stuff they didn't do, and their claim that they can do an OS when they clearly can't, that joke is funny as heck.
If you don't get that, sorry, you haven't been paying attention. The quality of the posts in the area of programming, especially systems programming, is spotty. That last one I just don't get, but whatever, the good stuff is good. Years ago I read about a similar experiment on maximum CPU data flow. A guy was testing how much data his CPU could push through in a second. He was writing it in C, using some Linux-specific optimizations, optimizing the code for CPU caches, and using some magical C vector types that are optimized for that purpose.
He got some help from someone working at Google. I tried to find that post but never succeeded. Does anyone here know it? AstralStorm on June 13:
It does not do that on modern Linux, especially with the -ck patchset. Good to know; I think the last time I tried it was on a RHEL 5 or RHEL 6 variant. However, as avip found out below, it does still render OS X useless within less than a minute, at least on my MBP. It only eats 2 GiB and then crashes bash with "bash: xrealloc: cannot allocate bytes". ZenoArrow on June 13: This will attempt to open a child shell process with whatever the output of yes is, interpreted as a shell script.
But before that, the parent shell has to buffer the output until EOF, that is, until the OOM killer takes notice and shuts it down. The whole thing will probably take a few seconds to a minute, depending on how much free RAM there is versus how fast it can be filled. Then yes, it brings the machine to its knees. A single pegged CPU is fine nowadays; it will only be slowish if you run it with no nice or priority adjustment. The RAM side can be tamed by preventing Linux's heuristic overcommit via the sysctl vm.overcommit_memory. It may also push everything in memory out to swap in the process, which is the real speed killer. Make sure to have a sane memory limit set in PAM and switch vm.overcommit_memory away from its heuristic default.
I'm guessing it tries to read the output and return it. Since yes doesn't actually terminate, it's going to build a massive temporary variable to store its continuous output. Well, thanks for killing my Mac. People are trying to work here, you know. You took a command from a comment which said they use it to "bring their system to its knees", ran it, and then complained? I thought that's what the internet is for.
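For the curious, here is a rough model in C of what the shell has to do for $(yes). It is an assumption about the mechanism, not bash's actual code: read the child's stdout into a growing buffer until EOF, which with yes never arrives.

    #include <stdlib.h>
    #include <unistd.h>

    /* Read everything from fd into a growing heap buffer until EOF. */
    char *slurp_fd(int fd, size_t *out_len) {
        size_t cap = 1 << 16, len = 0;
        char *buf = malloc(cap);
        ssize_t n;
        while (buf && (n = read(fd, buf + len, cap - len)) > 0) {
            len += (size_t)n;
            if (len == cap)
                buf = realloc(buf, cap *= 2);   /* doubles forever with yes;
                                                   error handling elided */
        }
        *out_len = len;
        return buf;                             /* NULL once allocation fails */
    }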
And the question is: do we need yes to be this optimized? Not complaining, I like this kind of analysis, but it seems you won't be limited, in a shell script, by the speed at which you can push y's. That kind of question doesn't make sense for open source code. Somebody wanted to optimize yes, so they did. There doesn't need to be a good reason, just like there doesn't need to be a good reason for a person to read a certain book or watch a certain movie, other than that they want to do it.
The question was "do we need it?" There isn't a group of people obligated to work on the GNU utilities, and there's no central project manager ordering people around and telling them what changes to make. Declaring that yes is generally fast enough already doesn't imply that it was fast enough for the person who spent their time optimizing it. Somebody needed, or just wanted, yes to be really fast, so they did the work and submitted the changes back.
No, but there is review and approval of patches, the most frequent contributors know where the project is going, and there is bug tracking. If someone just makes it faster without major benefits, that patch is likely to be rejected. But I'm almost certain the GNU project is not being overwhelmed with changes to the core utilities. Everything else being equal (code quality, readability, test coverage, etc.), why not take the improvement? JdeBP on June 14: It makes sense for any kind of code. This is software engineering.
Thinking about the tradeoffs and whether they are appropriate or warranted is bread and butter. It very much makes sense to ask whether all of this is merely optimizing for a benchmark at the expense of other factors that might turn out to be more important, such as long term maintainability, portability, behaviour in some common cases which are significantly unlike the benchmark, and so forth. Several people have made some of these points here and in other discussions; and not only do they make sense, they are an important part of the discipline.
Is the optimized implementation documented well enough that other people coming to it anew can understand it in another 10 years' time?
How many of the magic constants have to be tweaked as the operating system evolves? Is it clear to maintenance programmers how they need to be tweaked? Does optimizing for the benchmark skew the implementation too far away from a common case where one only wants a few "y"s, because, say, the program being told "y" only asks 17 questions, with the result that every invocation of the yes program rapidly generates KiBs or even MiBs of output that sits unused in pipe buffers in kernel space, only to be thrown away?
Does it make more sense to put buffer size optimizations in the C library where all can benefit? Will we optimize yes for GNU Hurd, too, in the same codebase? How much conditional compilation will we end up with? If the C library is improved to do better in the future, is it a problem that the optimized for benchmark program no longer automatically gains from it?
How much of a problem is it that a GNU program is now not portable to systems other than Linux, and is locked into one operating system? And what about other benchmarks? Many have noted that this benchmark pretty much relies on the fact that the output of the yes program is simply thrown away, and no real work is being done by anything else on the system. What about a benchmark where it is? What about a benchmark where it is important that yes issues a system call per line, yielding the CPU to other processes so that they can process that line before yes bothers to print the next?
What about a benchmark that measures kernel memory usage and treats lower as better? What about low impact? Yes, it's fun to focus narrowly and optimize yes so that it generates as much output as it can for one specific toy usage.
But in engineering one has to ask and think about so much more. This can be extended to any input given to the program, since yes is defined to take an argument from the command line to print out instead of "y". A detail you seem to be missing is that you are not limited to a shell script.
The shell sets up the pipeline, but the members of the pipeline can be written in arbitrary languages, with each stdout linked to the next process's stdin. As a result you can process very large volumes of data and consume (not waste, consume) significant system resources to perform your processing. The real yes accepts a string to print, so you can have it spit out a full "yes" or "Y" rather than a hardcoded "y".
Have a few pre-written arrays for the common cases: "y", "Y", "n", "N", etc. Those are the fast cases, or the benchmark-optimized cases, like what Volkswagen did. Have another pre-allocated static array to fill in for other input. Maybe we can turn this into a verb? I'm so going to use this as soon as possible. Incidentally, that word works even better in German, where every verb must end in "-en". I am pretty sure software cheating on benchmarks precedes VW.
Samsung a few years ago? Probably not the first either. I think video card manufacturers were the first to cheat on benchmarks.
That's how you get sued. Let me check that I have correctly understood the scenario you're anticipating. 1. Someone uses "volkswagen" as a verb meaning "cheat in benchmarks". 2. Volkswagen takes them to court, on the basis that it is slanderous or libellous to associate Volkswagen with cheating in benchmarks.
3. Counsel for the defence reminds the court that Volkswagen were in the news for a protracted period for benchmark-rigging that saw them hit with a multi-billion-dollar fine, slashed a third off their stock price, saw their CEO resign, led to there being a "Volkswagen emissions scandal" page on Wikipedia, etc.
4. Volkswagen wins the case. How exactly does step 4 go? By Volkswagen arguing that this is industry-standard behaviour, and that the entire media campaign is already libel and slander. In the same vein as Adobe suing me for photoshopping a picture, or Alphabet suing me for googling my ex-girlfriends' names? If VW can realistically show that it is industry-standard behaviour (which is pretty obvious), then they might even have a chance to win.
I'll call you when they sue. It's been said that "truth is the best defense against libel". A reasonably held opinion would be the second best. Realistically, not mentioning anyone or anything is the only complete defense; the truth still comes close. I do not think Volkswagen will try to sue you for adding a new phrase to Urban Dictionary ;-) That's exactly what the OP did. No, the OP mallocs an 8k buffer and then fills it as part of main.
See the "fifth iteration" done in assembly. Why is it so slow compared to the post in the macbook air. It isn't! Are you trolling? The commenter obviously meant the macOS version of "yes". The VM subsystem is much slower. Probably alignment issues: stuff on the stack might be aligned by default, in contrast to malloc'd memory.
Not to mention the pipe buffer size is small on OS X. Someone on June 13: With that malloc overhead, I expect GNU yes to be slower when only a few bytes are read from it. So, what's the distribution of bytes read for runs of yes? What effect does that have on the size of the binary? I suspect it wouldn't go up much, given that you can lose the dynamic linking information, though that may mean having to make direct syscalls, too. Unless you are statically linking, one malloc doesn't significantly affect your startup time.
When only a few y's are read, time is going to be dominated by ld.so, the dynamic linker. Sean on June 13: Measurements are really noisy, but I seem to get significantly better numbers than that when I use splice on a few pre-generated pages of file data instead.
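A hedged sketch of that splice-from-a-file approach (Linux-only; the file size and layout are arbitrary choices, and it assumes stdout is a pipe, e.g. ./a.out | pv > /dev/null):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/types.h>
    #include <unistd.h>

    enum { PAGE = 4096, PAGES = 4, FILE_LEN = PAGE * PAGES };

    int main(void) {
        /* Pre-generate a small file full of "y\n" lines. */
        char page[PAGE];
        for (size_t i = 0; i + 2 <= sizeof page; i += 2)
            memcpy(page + i, "y\n", 2);
        FILE *f = tmpfile();
        if (!f)
            return 1;
        for (int i = 0; i < PAGES; i++)
            fwrite(page, 1, sizeof page, f);
        fflush(f);

        /* Move the file's page-cache pages straight into the pipe, repeatedly. */
        int fd = fileno(f);
        loff_t off = 0;
        for (;;) {
            ssize_t n = splice(fd, &off, STDOUT_FILENO, NULL,
                               FILE_LEN - (size_t)off, 0);
            if (n <= 0)
                return 1;            /* stdout not a pipe, reader gone, etc. */
            if (off == FILE_LEN)
                off = 0;             /* wrap around and re-send the same data */
        }
    }

The point is that after the initial fill, user space never touches the bytes again; the kernel hands the file's cached pages to the pipe directly.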
Yes, splice can bypass the pipe buffer in some cases. I thought this was a fascinating read, but it left a serious question lingering in my mind, which is a little out of scope for the article, but I hope someone here can address it: why did the GNU developers go to such lengths to optimize the yes program? It's a tiny, simple shell utility that is mostly used for allowing developers to lazily "y" their way through confirm prompts thrown out by other shell scripts.
The stated use case for the perf improvement was: "yes(1) may be used to generate repeating patterns of text for test inputs etc."
I've personally used it for generating repeating text and for filling disks in system testing, so I appreciate it being faster at those tasks. But doesn't this make the typical use case (just a few "yes"es needed) slower, since it first has to fill a buffer? I would write the buffer out each time it gets enlarged, in order to improve startup speed. Also: the reddit program has a bug if the size of the buffer is not a multiple of the input text size.
And it's filling the buffer by appending one copy at a time, instead of copying the buffer onto itself to double it, which would reduce the number of loop iterations at the cost of slightly more complicated math. If "only a few yes's are needed", then the slowdown to produce them will be inconsequential, whether they still fill a buffer in this case or not.
If your only overhead is filling an 8K buffer, I don't think your user is going to care. Taking one microsecond instead of one nanosecond doesn't matter all that much when you're going to lose way more than that in pipes, the kernel, the program you're piping into, and so on.
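To make the doubling idea from the comments above concrete, here is a hypothetical sketch (not the GNU or reddit code) that builds the buffer from the command-line argument, doubles the filled region with memcpy, and writes along the way so the first lines appear immediately. Because it only ever writes whole-line multiples, the "buffer size not a multiple of the line length" bug mentioned above can't occur.

    #include <string.h>
    #include <unistd.h>

    int main(int argc, char **argv) {
        static char buf[8192];
        const char *word = (argc > 1) ? argv[1] : "y";
        size_t line = strlen(word) + 1;              /* the word plus '\n' */
        if (line > sizeof buf)
            return 1;                                /* keep the sketch simple */
        memcpy(buf, word, line - 1);
        buf[line - 1] = '\n';

        size_t filled = line;
        (void)write(STDOUT_FILENO, buf, filled);     /* first line goes out immediately */
        while (filled <= sizeof buf - filled) {
            memcpy(buf + filled, buf, filled);       /* double the filled region */
            filled *= 2;
            (void)write(STDOUT_FILENO, buf, filled); /* 2, 4, 8, ... lines while filling */
        }
        for (;;)                                     /* steady state: whole-line multiples only */
            if (write(STDOUT_FILENO, buf, filled) < 0)
                return 1;
    }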
But what's the use case for a large volume of continuous output? It feels like we're optimizing for the wrong use case. One could put a write in the memcpy loop, so that first it writes one copy of the string, then two, then 4, 8 etc. But as usually, I'm counting only the size of the. And there's no initialized static data; all the constants are immediate.
Without aligning the stack by 64 or 4k, sub esp, edx / mov edi, esp would be only 4 bytes, and safe in a Linux x32 executable (32-bit pointers in 64-bit mode). That's for the efficient syscall instruction, instead of the slow int 0x80 or the cumbersome sysenter of 32-bit mode.
Otherwise it's break-even versus sub rsp, rdx. So if not counting the BSS bothers you, save a byte and use the stack for buffers up to almost 8 MiB with the default ulimit -s. We fill a buffer that's 4x larger than we actually pass to write; rep stosd needs a repeat count in dword chunks, and it's convenient to use the same number as a byte count for the system call.
NASM source which assembles to this answer's machine code. It works on TIO; it doesn't depend on the output being a regular file or anything. BTS r32, imm8 can create any power of 2. That would have kept the ratio of filled buffer to used buffer at 1:1 instead of 4:1, so I should have done that to save a few page faults at startup, but I'm not redoing the benchmarking now.
A static buffer makes alignment to a cache line or even to the page size cost no extra instructions; having the stack only 16-byte aligned was a slowdown. I recorded times, in comments on my original code, for aligning the stack pointer by 64 bytes or not after reserving space for a buffer.
Measured with perf stat. For other benchmarking purposes (with a smaller 1 GiB tmpfs), it was convenient to use a loop that exited when write failed with -ENOSPC instead of looping until killed; I used that as the bottom of the loop (a C sketch of such a loop appears below). I tested using tmpfs because I don't want to wear out my SSD by repeatedly testing this, and because keeping up with a consumer-grade SSD is trivial with any reasonable buffer size. The goal was to minimize L2 cache misses from re-reading too large a buffer every time.
And also to minimize the time spent filling a buffer before even starting to make system calls. The default dirty writeback timeout is 500 centisecs (powertop suggests raising that to 15 seconds), so that's not going to come into play, just the dirty high-water mark. So what really matters is getting to the high-water mark ASAP for a physical disk, and potentially pushing as many dirty pages into the pagecache as possible within the 1-second window, to finish writeback after yes dies.
So this 1-second test probably depends on how much RAM your machine has (and even how much free RAM), if you're using a physical disk instead of tmpfs.
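As promised above, here is a C rendition of the stop-on-ENOSPC loop; the answer's actual loop is in assembly, so treat this as a sketch of its shape rather than a translation.

    #include <errno.h>
    #include <unistd.h>

    /* Keep writing the same buffer until the target filesystem is full. */
    static void write_until_full(int fd, const char *buf, size_t len) {
        for (;;) {
            ssize_t n = write(fd, buf, len);
            if (n < 0) {
                if (errno == ENOSPC)
                    return;          /* filesystem full: stop cleanly */
                if (errno == EINTR)
                    continue;        /* interrupted: retry */
                return;              /* any other error: give up */
            }
            /* Short writes are harmless here: the content is uniform,
             * so we simply send the buffer again next iteration. */
        }
    }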
Makes some sense: clearing is touching a cold page, while the copy is copying over a just-cleared buffer, presumably hitting in L2 cache for both source and destination (or L1d for the destination if it clears 4k at a time). And that happens inefficiently, zeroing before copying. So clearly hugepages are worth it overall: managing memory in 4 kiB chunks obviously costs a lot more than in 2 MiB chunks. Ryzen has 512 kiB L2 caches.
I ran a warm-up run right before the main test to make sure the CPU speed was at full before the timed 1 second started (using timeout 1s), and saw up to 7.x at the high end. Up-arrow to recall that string of commands a few times and take the best case.
Assume that lower values didn't ramp the CPU speed up right away, spent some time allocating hugepages, or had other spurious interference. I intentionally limited myself to looking at only 2 significant figures of file size, letting ls do the rounding.
I did see 6.x a few times. Maybe some lucky early arrangement of hugepages that didn't last into a stable state? Or perhaps it was lower startup overhead: my statically linked executable makes literally no system calls before write, just a BSS pagefault or two, versus
GNU yes, which is dynamically linked and has more overhead before it gets going, plus two chances for iTLB or i-cache misses. GNU yes's buffer is 64-byte (cache-line) aligned but not 4k-page aligned. Being at an odd multiple of 64 bytes is not ideal for the L2 spatial prefetcher, which tries to complete 128-byte-aligned pairs of lines, but that's probably insignificant; 8 kiB is small enough that we get lots of L1d hits and very few L2 misses overall, including both user and kernel space.
The fall-off at the high end of the size range is, I think, due to cache and TLB misses. You can run perf stat -d instead of time to see LLC-misses, and LLC-loads is an indication of L2 load misses. Indeed, small buffers get mostly L2 hits while larger buffers get some misses. Linux's time accounting is also somewhat granular, I think.
I'm hoping they just reserve space and then copy, so a second call can reserve more space while an earlier one is still copying. But making a clone system call would probably take significantly more code. With significant time spent in the kernel and not bottlenecked on DRAM bandwidth, multiple threads could still help. Kernel CPU time shifted: less time spent copying, more time spent just clearing. Source on Try it online! As expected, it doesn't come close to paying for itself in user-space code size to create the array of (ptr, length) pairs, although it was possible with a loop around 2 push instructions (4 bytes), with extra instructions outside the loop needed to get the args into place.
There's still overhead per iovec: a vector of 2k buffers is slower than 20x 8k buffers. The sweet spot is around 4k or 8k buffers, with only very minor gains from using a bigger vector (20x 8k seems enough). Edit: saved 25 bytes thanks to ceilingcat by assuming 4-byte wide characters. This will fill your disk unless you kill it with a different signal, or set a quota or some other size limit.
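For anyone who wants to play with the "big vector" idea from C rather than assembly, here is a hedged sketch using writev (the answer itself uses vmsplice; the counts and sizes here are arbitrary choices):

    /* One syscall submits many buffer descriptors: 20 iovecs that all point
     * at the same 8 KiB block, so about 160 KiB goes out per call. */
    #include <string.h>
    #include <sys/uio.h>
    #include <unistd.h>

    int main(void) {
        static char block[8192];
        for (size_t i = 0; i + 2 <= sizeof block; i += 2)
            memcpy(block + i, "y\n", 2);

        struct iovec iov[20];
        for (int i = 0; i < 20; i++) {
            iov[i].iov_base = block;        /* same data, listed 20 times */
            iov[i].iov_len = sizeof block;
        }
        for (;;)
            if (writev(STDOUT_FILENO, iov, 20) < 0)
                return 1;
    }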
You can't try it online! It may be faster on other systems if I increase the buffer size to 1 megabyte, but there's no noticeable difference on mine. This really should be an average over several runs. An implementation that coalesces multiple output instructions; thanks for the list of BF constants. Clearly not the winner, only for reference; not bad, btw :) Might be faster to use FPUT?? It produces a very underwhelming result. Based on performance on TIO for most challenges I did, I assume the legacy version will score much better.
I will leave just one of these four combinations, the one with the largest output after OP's local test. This is on my ancient tablet; probably better on my laptop, but that's far away right now (on holiday!). Fastest yes in the west.
Lowest score wins. Edit: I will run all code on my computer; please show compilation commands and any setup required.
Specs, in case you want to optimize for them or something: OS: Arch Linux (kernel 5.x). Asad-ullah Khan
If all entries were run on the same host, then it could be an interesting comparison. Assuming default Linux tuning settings, some significant fraction of your free RAM can be used for dirty buffers holding pages of the file that haven't made it to disk yet, but have been successfully "written" before yes is killed. Also, be careful when benchmarking: amount of free RAM could affect your measured result!
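One way to act on that caveat, sketched here as an assumption rather than a challenge requirement: fsync the output file before stopping the clock, so the timing includes getting the data to the device rather than just into the page cache.

    #include <unistd.h>

    /* Finish a timed run: force dirty pages out to the device, then close. */
    int finish_benchmark(int fd) {
        if (fsync(fd) != 0)     /* wait for the data to reach the device */
            return -1;
        return close(fd);
    }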
Self-contained testing script: this script loop-mounts a freshly created filesystem from a memory-backed image file and runs the program there. Anders Kaseorg. Unfortunately, I think this breaks the rules, but I definitely was not clear enough. Does this output to stdout, like the yes program does? I'll give it a shot in the morning.
Those are the errors; I'm not sure of the reason.