Propose (Load,Store)WithOffset #28
Would it be useful to allow offsets that weren't literals as well?
Good question. The worry is that we'd impose extra work for some OS/platform/impl-strategy. The good thing about a constant offset is that, when using dynamic branching, we can always just do 1 bounds check (against $(heapLength - imm)) and this should be GVNable (by taking max(imm), subject to certain dominance requirements). But, if the offset is dynamic, I don't see how we'd get around, in general, needing an additional jump-if-overflow (after the add) followed by the normal bounds check.
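As an illustration of the two check shapes being contrasted here, a minimal C++ sketch; the `heap`/`heapLength` globals and `trapOutOfBounds` are stand-ins for illustration, not any engine's actual API:

```cpp
#include <cstdint>
#include <cstdlib>
#include <cstring>

static uint8_t* heap = nullptr;    // base of the linear memory (illustrative)
static uint32_t heapLength = 0;    // current heap size in bytes

[[noreturn]] static void trapOutOfBounds() { std::abort(); }  // stand-in for the throw path

// Constant immediate offset: one compare against a limit derived from the heap
// size and the immediate; the limit is shared by every dominated access with
// the same or smaller immediate, so it is GVN-able/hoistable.
static uint32_t loadWithConstOffset(uint32_t base) {
    constexpr uint32_t imm = 16;                      // compile-time immediate
    int64_t limit = int64_t(heapLength) - imm - int64_t(sizeof(uint32_t));
    if (int64_t(base) > limit)                        // single bounds check
        trapOutOfBounds();
    uint32_t value;
    std::memcpy(&value, heap + base + imm, sizeof(value));
    return value;
}

// Dynamic offset: base + offset can wrap at 32 bits, so the engine needs a
// jump-if-overflow (here, a widening add) before the ordinary bounds check.
static uint32_t loadWithDynOffset(uint32_t base, uint32_t offset) {
    uint64_t ea = uint64_t(base) + uint64_t(offset);  // widened: cannot wrap
    if (ea + sizeof(uint32_t) > heapLength)
        trapOutOfBounds();
    uint32_t value;
    std::memcpy(&value, heap + ea, sizeof(value));
    return value;
}
```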
Nice.
W.r.t. combining and hoisting bounds checks: we explored many such tricks in TurboFan. It gets really tricky if you want to guarantee the out-of-bounds error occurs at exactly the right instruction. You also don't want to generate errors early (i.e. outside of loops) if the program would not actually have executed an out-of-bounds access.
Yes, those have been concerns for us as well and that specifically was the motivation for the "cannot execute module after OOB" proposal. That way, you can hoist the bounds check caring only about dominance and external visibility (via calls to imports); once the OOB is detected, no one can see exactly where it happened or what had already executed. For the loop case, the "loop rotation" technique Dan mentioned in #27 would allow the bounds check to be pulled out of the loop but still be guarded by the condition.
The check could still be hoisted above a write to the buffer, which would be visible to the outside.
By "poisoning the heap" (from V1.md#heap), the idea is that any external JS ArrayBuffer aliasing the heap would also be subsequently unavailable (detached, which is what is proposed to happen upon resize as well).
I would also hold off on this change if it's just a size saving, and aim towards automated compression on real-world code.
I'm not sure if you read most of the comments, but it's specifically not motivated by size.
Isn't that the same as `reg = mov imm ; load var + reg`?
I'm not sure what you're asking. The way LoadWithOffset(base, offset) differs from Load(Add(base, Imm(offset))) is that the latter can wrap around (when base is out of bounds) to be in bounds, so you can't simply throw if 'base' is out of bounds.
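A concrete instance of the wrap-around being described, assuming a 32-bit index; the numbers are hypothetical:

```cpp
#include <cstdint>

// base is far out of bounds, the immediate offset is small.
constexpr uint32_t base   = 0xFFFFFFF0u;
constexpr uint32_t offset = 0x20u;

// Load(Add(base, Imm(offset))): the add wraps mod 2^32 and lands back in
// bounds, so a check on the wrapped address alone would not trap.
static_assert(base + offset == 0x10u, "add wraps around to a small index");

// LoadWithOffset(base, offset): the effective address is defined without
// wrap-around (conceptually the full-width sum), so the access is out of
// bounds and the engine can simply throw when base itself is out of bounds.
static_assert(uint64_t(base) + offset == 0x100000010ull, "no wrap-around");
```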
Hmm, that's an interesting point: overflow on the int is defined, but at the C++ level it certainly isn't defined for addressing purposes.
We explored bounds check combining extensively in TurboFan in the early days. There be dragons here if you don't have a backup path to run the "slow version" up to the point where the actual OOB access occurs. For example, if you try to lift bounds checks out of branches to a dominator (with loop invariant code motion just being a special case of this), you can actually introduce checks that fail earlier that would not necessarily have failed in the execution (in particular because the program had a branch that guarded that OOB access).

This is easy to see if you have accesses i+1 and i+2, where i+1 is in bounds and i+2 is out of bounds. Then combining the check for both i+1 and i+2 is tricky; if you check i+2 is in bounds, it'll fail--but where? At the access to i+1? That's surprising, especially if the program never executes the i+2 access.

The way you'd do this with deoptimization is to do the check but just deopt if i+2 is out of bounds, then run slow code up to the point where the i+2 access happens, if at all. If you have such a sophisticated system I don't see that having special instructions really makes it easier.
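A sketch of that hazard, with `checkBounds`/`load`/`use` as hypothetical stand-ins for the engine's checked heap accesses:

```cpp
#include <cstdint>
#include <cstdlib>

// Hypothetical stand-ins, not a real engine API:
static uint8_t  heapData[16];
static uint32_t heapSize = sizeof(heapData);

static void checkBounds(uint32_t index) { if (index >= heapSize) std::abort(); }
static uint32_t load(uint32_t index)    { return heapData[index]; }
static void use(uint32_t)               {}

// Original program: the i+2 access sits behind a branch and may never run.
static void original(uint32_t i, bool takeBranch) {
    checkBounds(i + 1);
    use(load(i + 1));               // in bounds
    if (takeBranch) {
        checkBounds(i + 2);
        use(load(i + 2));           // out of bounds, but only if the branch is taken
    }
}

// Naively combined check: testing i+2 up front makes executions that never
// take the branch fail anyway, and at a surprising place.
static void combined(uint32_t i, bool takeBranch) {
    checkBounds(i + 2);             // fails here; which access does the error belong to?
    use(load(i + 1));
    if (takeBranch)
        use(load(i + 2));
}
```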
So I've been thinking about a combination of approaches for just that use case you mentioned: on 32-bit, you can have an extra guard page such that, as long as the base is in-bounds, base+offset (where offset < guard page length) will either be in-bounds or fault. Thus, you could hoist the bounds check (assuming we have the no-overflow property of loadWithOffset!) in the case you talked about. That being said, I am hoping that, even without these advanced techniques, just optimizing straight-line (load p; load p+1; load p+2; load p+3) and loop (where we can hoist into the loop-rotated head) cases will remove the vast majority of bounds checks (at least dynamically executed ones). We plan to do more experimentation to measure this % to help make the case for deterministic OOB. |
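A sketch of the loop case, where the check moves into the loop-rotated head. The helper names are hypothetical, and doing the one early check is only acceptable under something like the poisoning/deterministic-OOB semantics discussed above (the last index might otherwise never have been reached):

```cpp
#include <cstdint>
#include <cstdlib>

// Hypothetical stand-ins for checked/unchecked heap loads:
static uint8_t  heapData[64];
static uint32_t heapSize = sizeof(heapData);

static void checkBounds(uint32_t index)       { if (index >= heapSize) std::abort(); }
static uint32_t loadUnchecked(uint32_t index) { return heapData[index]; }

// Naive form: one bounds check per iteration.
static uint32_t sumNaive(uint32_t base, uint32_t n) {
    uint32_t sum = 0;
    for (uint32_t i = 0; i < n; i++) {
        checkBounds(base + i);
        sum += loadUnchecked(base + i);
    }
    return sum;
}

// Loop-rotated form: the check for the last index moves into the rotated head,
// still guarded by the loop condition, so it runs at most once and only when
// the body would actually execute (32-bit overflow ignored for brevity).
static uint32_t sumHoisted(uint32_t base, uint32_t n) {
    uint32_t sum = 0;
    if (n > 0) {                        // rotated head: the loop-entry guard
        checkBounds(base + n - 1);      // covers every earlier index as well
        for (uint32_t i = 0; i < n; i++)
            sum += loadUnchecked(base + i);
    }
    return sum;
}
```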
Tried multiple versions of that, with and without faulting. With special […]
Well, I'm saying the guard page is PROT_NONE so you fault and the signal handler does The Right Thing. (This is something we already do for asm.js but it would be way easier with throw-on-OOB semantics: the signal handler just moves $pc to the throw trampoline.)
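A minimal sketch of that guard-region setup on a POSIX system; `GUARD_BYTES` and the allocation shape are assumptions for illustration, and the SIGSEGV handler that actually redirects the PC to the throw trampoline (after verifying the faulting PC is inside wasm code) is omitted:

```cpp
#include <sys/mman.h>
#include <cstddef>
#include <cstdint>

constexpr size_t GUARD_BYTES = 4096;  // must be >= the max immediate offset, page-sized

// Reserve heap + guard contiguously; heapLength is assumed page-aligned.
uint8_t* allocateHeapWithGuard(size_t heapLength) {
    void* p = mmap(nullptr, heapLength + GUARD_BYTES,
                   PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return nullptr;
    // Make the tail inaccessible: any access with an in-bounds base and an
    // immediate offset < GUARD_BYTES is either in bounds or faults here,
    // where the signal handler can move the PC to the throw trampoline.
    if (mprotect(static_cast<uint8_t*>(p) + heapLength, GUARD_BYTES, PROT_NONE) != 0) {
        munmap(p, heapLength + GUARD_BYTES);
        return nullptr;
    }
    return static_cast<uint8_t*>(p);
}
```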
NaCl uses guard pages on x86-32, x86-64, ARM and MIPS, exactly for the purpose of preventing out-of-bounds reg+imm accesses. The reason NaCl can do it easily is because it's in a separate process. I'm wary of designing the spec with such implementation requirements. I think we should leave this as implementation-defined behavior, and try to converge when we have better experience.
We actually implemented guard pages with combining bounds checks almost […]
@jfbastien Of course the implementation would be an implementation-defined behavior :) Also, having a separate process is (demonstrably) not a requirement for using signal handlers.

@titzer I'd be interested to hear about the security concerns. One real one we take seriously is not accidentally squelching crash reports. Also, if the guard pages are only hit when there is a real OOB (at which point you jump to throw; asm.js doesn't have this property, it might wrap back around to be in-bounds, hence this proposal), I don't see how there could be a performance cliff for non-OOB code.

So bringing the conversation back from possible optimization techniques to the main PR: I think loadWithOffset() should help regardless of the impl technique used. Any concerns on this point?
Separate process isn't an issue with signal handling; I was mentioning it because I worry about address space availability and fragmentation when in-process.
We had a couple of ideas on what to do when the signal handler was hit. Both were premised on the idea that we wanted to continue back to the code that faulted after fixing things. The easiest one was to look up a handler in a table based on the PC of the fault and just return to there. The handler would materialize either 0, NaN, undefined, or whatever was necessary for the out-of-bounds read. The other option was to have metadata that basically encoded which value to materialize and where. The signal handler would interpret that and simply jump over the faulting instruction. The security team's concerns centered around attacks on either the metadata table or other data in the system that could trick the handler into jumping into attacker-controlled code (or using ROP on some juicy code already in Chrome).
@jfbastien We were talking about an extra guard page (a few KB) which doesn't significantly change address space availability. @titzer Odin currently does the latter (fills in 0/NaN and jumps to the next pc). Anyhow, it sounds like these security concerns wouldn't be an issue with WebAssembly since, with the loadWithOffset() semantics, the signal handler always sets PC to a fixed (throw trampoline) address. This whole operation can be (is, in Odin) guarded on the PC being inside WebAssembly code and at a known heap access.
Luke: it does on x86-64 :-)
@jfbastien I think we must be talking about different things. I'm talking about a single extra guard region (say 4kb) tacked onto the end of the big heap used on 32-bit platforms that allows the hoisting of dynamic bounds checks. Given heaps in the 64-512mb range, this is insignificant. I'm not talking about the "map 4gb" trick for 64-bit.
Understood. I'm just not sure where we'd limit the redzone size. I want to make sure we don't assume a certain implementation in the spec. What's the maximum immediate size, if any? What do we do if a bigger one is provided?
FWIW, we (WebKit) are unlikely to have signal handling magic - mainly because we live in everyone’s apps as a framework, and signal handler registration isn’t really invisible to a client app. The client app could have its own signal handler tricks, and it may not be doing chaining properly. So, it would probably break apps if we did signal tricks. Therefore, I’m in favor of any spec feature that lets a signal-magic-free system on mainstream hardware get good performance. This makes me like (Load,Store)WithOffset and heap poisoning, though I haven’t thought about either of them very deeply. -Filip
@jfbastien The spec would not talk about redzone size and would not, e.g., limit the immediate size other than to something reasonably large (e.g., int32).

@pizlonator Totally agreed; we definitely want something that runs well on platforms where signal tricks aren't possible, don't work, don't make sense, etc. It still requires proof, but I am quite hopeful that (Load,Store)WithOffset and heap poisoning will allow aggressive-enough hoisting that deterministic halt-on-oob semantics won't be significantly slower than the non-deterministic semantics we were discussing in #23.
I want to think about this poisoning more. Have you written this up in any more detail? My initial take is:

- Option #1: Speculatively hoist checks to a dominating but not control-flow-equivalent position and recompile if that hurts.
- Option #2: Hoist checks to a control-flow-equivalent dominating position and use OSR exit to get to the real point of failure.
- Option #3: Hoist checks to a control-flow-equivalent dominating position and claim in the webasm spec that this is OK.

I’m assuming you’re trying to tackle #3. I’m assuming that #1 and #2 shouldn’t be required by the spec. -Filip
@pizlonator I haven't written it up; it's actually a recent idea @sunfishcode proposed when we were thinking about the same bounds-check-hoisting scenarios that it sounds like we've all been thinking about. You're right, option 3 is the goal and we'd like to avoid 1 and 2.
Something else to consider: we could further reduce the burden on webasm implementations by allowing the webasm generator to perform check hoisting. Imagine if webasm had a checkAddress(base, offset) operation. In my strawman, this operation would be required to do a bounds check, and throw/abort/whatever at that point, if it fails. load/store would still semantically require bounds checks, but with the caveat that checkAddress(base, offset1) subsumes a dominated checkAddress(base, offset2) if offset2 <= offset1. Then, eliminating the checks in the webasm implementation would just be a matter of doing CSE. This certainly increases the amount of work that the webasm generator must do, but maybe it’s simpler for the spec? -Filip
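A sketch of how that subsumption rule would let a plain CSE/redundancy pass drop the later checks; the signatures below are just the strawman's, not a real API:

```cpp
#include <cstdint>

// Strawman operations (hypothetical signatures):
void checkAddress(uint32_t base, uint32_t offset);        // traps on OOB at this point
uint32_t loadWithOffset(uint32_t base, uint32_t offset);  // still semantically checked

uint32_t sumThree(uint32_t p) {
    checkAddress(p, 8);                  // dominating check with the largest offset
    uint32_t a = loadWithOffset(p, 0);
    checkAddress(p, 4);                  // subsumed by checkAddress(p, 8): CSE can drop it
    uint32_t b = loadWithOffset(p, 4);
    checkAddress(p, 8);                  // identical to the first: CSE can drop it
    uint32_t c = loadWithOffset(p, 8);
    return a + b + c;
}
```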
#3 refers to Luke’s vague poisoning proposal. It requires control-flow-equivalence. So, we wouldn’t pull a check out of a branch, and your problem doesn’t manifest. -Filip
That's an interesting option and one we've discussed (without conclusion) before. I guess the big question for me is whether there are any big opportunities this opens up where LLVM knows something (from C++ semantics) that allows it to hoist the checkAddress() up higher than the wasm backend could (esp. with poisoning/WithOffset semantics). One case where the wasm backend is limited is the control-flow-equivalence/domination requirement, so that leads to the question of whether the C++ spec buys us any UB we can leverage here. It seems like, since you can write […]

One concern is that this depends on the wasm code generator having a pretty precise model of the CSE capabilities of the wasm engine. Otherwise, the code generator could be generating checkAddress()s that weren't being eliminated and thus creating more work. Also, some wasm compilers (e.g., the baseline compiler we might use for fast startup) won't have any CSE, so each checkAddress() would have a cost.

One variation of this idea and possible fix we've considered is having checkAddress() return some "validated pointer" type that is consumed by loads/stores and is part of the type system. The advantage here is that you could have a guarantee that even a dumb (read: baseline) wasm compiler would be able to elide the bounds checks. It would also allow doing some potentially fancy hoisting (inter-procedural, but not through the heap, since you can't trust anything coming out of the heap).

Another concern is that, in cases where we are using signal magic, the checkAddress would actually be imposing additional requirements, likely requiring a spurious load (to trigger the fault). If code generators generated tons of checkAddress, this could end up hurting.
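A rough sketch of the validated-pointer variation; the struct and helper names are purely illustrative (in wasm this would be a type-system construct, not a runtime value like this):

```cpp
#include <cstdint>

// Purely illustrative: a token proving the bounds check already happened.
struct ValidatedPtr {
    uint32_t base;
    uint32_t maxOffset;   // offsets up to this value are known to be in bounds
};

ValidatedPtr checkAddress(uint32_t base, uint32_t maxOffset);  // traps on OOB
uint32_t loadValidated(ValidatedPtr p, uint32_t offset);       // no check needed

// Even a baseline compiler with no CSE can see from the types that the two
// loads need no further bounds checks.
uint32_t sumPair(uint32_t base) {
    ValidatedPtr p = checkAddress(base, 4);
    return loadValidated(p, 0) + loadValidated(p, 4);
}
```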
For error reporting and debugging modes, moving checks above stores, or […] The only advantage that I see with (Load|Store)WithOffset is guaranteeing […] In general I think we should avoid surprising behaviors for small gains in […]
@titzer Agreed poisoning is mostly orthogonal to *WithOffset. What is the surprising behavior you're referring to? Having the heap (really, the module) not be accessible after an OOB? This could be considered (rightly, I think) a security measure as the module may be in a corrupt state making it vulnerable to XSS-type attacks.
Yup, all good points. -Filip
I think I agree with you. My reading of what you wrote sounds like what I was trying to say earlier. The question is: should the spec allow some fudge or features to make some check hoisting easier if the implementation doesn’t use versioning or deopt? My understanding of poisoning is that it is something that the spec would have to say things about, to loosen semantics to allow a check to float above a store. I’m trying to decide if I like poisoning being in the spec and to do that, I’d like to hear more about the specifics. -Filip
Yes, poisoning would definitely need to be in the spec if it were accepted. Experimentation is needed (microbenchmarks being totally different than large C++ apps), so I'm really just interested in putting the ideas out until the implementations get to a state of maturity where we can experiment directly. Back to the PR: as far as I can tell, everyone seems positive on overflow-free (Load|Store)WithOffset, so I'll merge that. Happy to continue discussing the broader issue of bounds checking, though, here or in a new issue.
This is a proposal for discussion, relevant to the separate heap/OOB discussion in #23.
A minor side benefit is that, because most loads/stores have a non-zero immediate offset (based on a trivial pattern match), the polyfill prototype shows a 4% size win from this opcode (compared to Load(Add(index, Lit(offset)))).
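For illustration, a sketch of that trivial pattern match over a hypothetical, heavily simplified expression representation (none of these types come from the actual polyfill prototype):

```cpp
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical, heavily simplified expression tree.
enum class Op { Load, LoadWithOffset, Add, Lit, Other };

struct Expr {
    Op op = Op::Other;
    uint32_t literal = 0;   // valid when op == Op::Lit
    uint32_t offset = 0;    // valid when op == Op::LoadWithOffset
    std::vector<std::shared_ptr<Expr>> children;
};

// The trivial pattern match: rewrite Load(Add(index, Lit(offset))) into
// LoadWithOffset(index, offset); leave every other load alone.
std::shared_ptr<Expr> foldOffset(const std::shared_ptr<Expr>& e) {
    if (e->op == Op::Load && e->children.size() == 1) {
        const auto& addr = e->children[0];
        if (addr->op == Op::Add && addr->children.size() == 2 &&
            addr->children[1]->op == Op::Lit) {
            auto result = std::make_shared<Expr>();
            result->op = Op::LoadWithOffset;
            result->offset = addr->children[1]->literal;
            result->children.push_back(addr->children[0]);  // the index expression
            return result;
        }
    }
    return e;
}
```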