-
Notifications
You must be signed in to change notification settings - Fork 18k
cmd/link: R_PPC64_REL24 relocation truncation to fit errors when building Kubernetes on ppc64le #15823
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
I think the cleanest solution is to make cmd/link split the text section
into suitably sized smaller ones and let external linker do the job (but
for internal linking, our linker also need to generate stubs). Another
possibility is that we always generate the longer form of branch/call
instructions at compile time, and rewrite them into short branch/call and
nops during linking when we know the distance is not too far away.
I agree similar issue will eventually hit other RISC architectures because
as far as I know, none of them support huge branches.
|
Minux, I think it is better to have a max text section size and then split it into multiple .o files if it gets to that limit. Perhaps the limit could be based on GOARCH. I think there are other advantages to avoiding excessively large .o files and text sections. I can look into that. |
@minux I've been looking at this to try and understand how the go.o file gets generated. I'm not sure how best to decide when to generate different .o files. Should each new package just create a new goX.o file? Naming them go1.o, go2.o, etc.? Or are there reasons to divide them up differently. A comment about the internal linker. Personally, I don't think a golang only app will get that big. All the real world Go applications that I know of link against external libraries or packages of some kind so will require external linking. Also a question about what gets generated. I've been looking at what gets put into the go.o file today. I see a lot of functions in the objdump of go.o with names like <type..eq .something> or <type..init something>. What are those for? |
I believe the right solution is to create multiple text sections within the go.o file, instead of multiple .o files. There seems to be some support in place in golang to do that. I was told by my Power linker expert that the GNU linker should be able to handle this. I will try it out. |
I see this was marked for the Go 1.8 milestone, but is a serious bug that can happen in Go 1.7 and does happen in Go 1.6. I have a fix for it, does the fact that it is marked as Go 1.8 mean that I can't submit it? |
You can still submit it. I marked it 1.8 because it's not a regression and because it sounds like the fix is complicated. If you have a safe fix, we can still get it in for 1.7. |
CL https://golang.org/cl/23963 mentions this issue. |
I have found 2 problems with the golang processing for elf sections that are related to my change. It's taken me a while to debug it because I'm not familiar with it yet. When multiple text sections are created then the syms for the functions found in the sections other than the first will have their Value set to the offset relative to the start of the section they are in. So if there is a function in the second text section, its Value is set to the offset from the start of that section and since the shndx is set correctly, the linker relocates it according to its containing section correctly. However I have found at least 2 other places where golang's elf processing assumes that the Value in a sym is relative to the start of the first text section and that causes errors. In the first case, there is a structure being created into rodata somewhere (not documented, not named the same as where it is used at least not that I could find) for the ifn and tfn fields of the method structure in runtime/type.go, for some of the go.itab functions and probably others. These values were filled in incorrectly for functions in text sections other than the first text section. If someone could point me to where the method's fields are being filled into the rodata section then I can try to find a way to make that work. The other problem occurs when generating the findfunctab because the code assumes that all the function syms have Value fields that are ascending, and again if there are multiple text sections the functions in the second text section will have offsets relative to the start of the second text section. As a result the functab is not correct and look up fails. I think I have some ideas on how to make that one work. |
The data structure you mention has some documentation in reflect/type.go. It is generated in cmd/compile/internal/gc/reflect.go. When a concrete type has methods, the reflect.rtype is followed by a reflect.uncommonType which has an offset to an array of reflect.method structures, which contain the text offsets you mention. The compiler creates the text offsets by generating an R_METHODOFF relocation, which cmd/link turns into a R_ADDROFF relocation and then turns into a section offset. For the reflect.method structure, this R_METHODOFF is generated in dextratypeData. (In particular, see dmethodptrOffLSym.) |
Based on my debugging it looks like the ifn and tfn addresses in this table for methods in the second text section are being generated as if the section offsets were relative to the first text section instead of the second. |
There are problems with two elf tables in output sections related to this change. The method offset tables are generated with absolute offsets determined at compile time, assuming that all text will appear consecutively. But if there are multiple text sections, the second one will not appear right after the first, there will be tables inserted by the linker to handle the long call offset instructions. I believe the method table could be set up with offsets that are relocated so that the values are correct (although not sure what the relocation would be.) The second problem is with the pclntab. It contains "entries" for the functions in the program. The entries in the table are relocated addresses but the buckets that are generated to look up those entries were created based on absolute function addresses. As in the first problem, it assumed that all the text is consecutive so that means the buckets used for looking up functions in text sections after the first are not correct. I think the problem is somewhat easier to solve because worst case it could just do some searching if the lookup PC didn't provide the right function address the first time. |
@ianlancetaylor What is the backport policy on a change like this. Must it go upstream before it can be backported to the Go 1.6 branch? The reason I'm asking is that the latest additions to my fix (the risky ones?) are needed because of the new method offset tables (int32) in Go 1.7 that depend on absolute offsets of methods rather than relocated addresses like they were in Go 1.6. My previous patch mostly worked for Go 1.6, but needs the minor addition of more searching in functab. |
We do normally require that a CL be accepted on tip before it be backported. And, we don't currently plan to make a new 1.6 release anyhow. And, it would be odd and unfortunate if programs worked with 1.6 but broke with 1.7. |
Do you have any thoughts on why this problem wouldn't also occur on arm64? It is my understanding that they have a similar call instruction with a 24 bit field (shifted by 2 to get 26 bit range) for the offset of the method being called. |
I have a another proposal for Go 1.8 and beyond, and
it's not limited to ppc64 port. It should help with all the
RISC ports where jump/call instructions can't cover
the entire 2GB/4GB space.
In compiler, we don't assemble the entire function into
a single blob of instructions, and instead, we assemble
each basic block separately, and we let the linker
assemble the exit instruction of each basic block based
on the actual distance.
The other benefit of this is that, we can make the linker
reorder the basic blocks and possibly do basic block
cache alignment.
In the new SSA backend, this is probably fairly easy
to do. If not, I think we can make cmd/internal/obj do
this by tracing the instructions. I think we already have
the code for some RISC architectures (but the code
is probably disabled and lost during the c2go translation?)
|
If there is a cleaner, more general solution, I'm all for it. With this suggestion, how would the linker decide when to create a long or short call? As it is generating code/text and writing it out for the final text section, it still won't know the offset for calls to future functions in this text section, which also could be at offsets that are too large. Or will you always generate long calls if the total text size is over a certain threshold and the call target offset is unknown? |
There are multiple ways to handle external targets. For example, we can put
a list of near jumps to external code at the end of the text segment and
make all internal call sites jump there. Or, we can generate per function
.text sections.
I think the first solution is also a simpler solution for ppc64 Go 1.7?
|
If you put the list of near jumps after a text section that exceeds the limit, then how can the internal call sites branch there? The offset would be too big. If we try to generate per function .text sections, we'd hit the same problem with the method offset table as happens with my patch. Because the linker would fix the long calls and add jump tables between sections, causing the offsets in the method offset table to be wrong. |
Liblink used to be able to choose appropriate instruction to jump further.
That capability must be lost in the transition to generate object code for
each function in the compiler. Not able to support large pure Go binary
seems a general problem to me.
However, if each function is not too big, we should be able to insert
trampolines after each function to resolve all internal jumps.
|
Right. The function offset in method table fundamentally assume all Go code
is in the same section. I think the Go symbol table and module data
(pcdata) also make such assumption.
Splitting into multiple .text sections must address those assumptions in
runtime first.
That's why I'm suggesting not to split the .text section.
|
Based on the comments from Brad and Ian in #16356 I didn't think any patch for this problem in Go 1.7 was going to be accepted. Did I misunderstand that? So I don't understand why you are trying to find a solution for it there. As far as the long term fix, I think that changing the generated code is as risky or more than my original fix. The section split can only happen on ppc64le now so any of your concerns about what might go wrong if the sections are split can't happen on other platforms. We have built and tested programs where the sections are split and I've inspected the symbols and they have correct addresses even if they are in the additional text sections. I'm not sure what your concerns are about the pcdata. I think it is best to generate correct elf sections in a way so that the GNU linker can solve this problem since it already knows how instead of reinventing another solution for it. I have some ideas on how to improve the way my patch is using the method offset table, but I want to do some experiments before posting anything. As far as the same problem in a pure Go binary, the solution/suggested workaround could be to require that they use external linking in that situation. |
@laboger I encounter similar linker relocation problems on ppc64le machine based on
|
@jianzhangbjz I don't think this is the same error as what's discussed in this issue. Your error message says there is an undefined symbol. This problem identified in this issue happens when the symbol exists but the offset to that symbol exceeds the limit. First fix the problem with the missing symbol and see if it still happens. |
What has to happen to move forward with this issue and patch? I have tried some experimentation with modifying the call sequences. I still think there is risk in making those changes too. I don't have enough detail from @minux previous proposals to understand fully what he is suggesting and if he is the one who is planning to implement them. If his plan is to change the code sequences at assembly time then that means all imported packages would need changing too so those calls are fix and thus all imported packages need rebuilding. If there are concerns with my patch related to the way sections are generated I can clean it up and/or make it more specific to ppc64le only if I know what areas are of concern. Please let me know how to proceed because this is a big issue since it prevents several Kubernetes binaries from building and we need to get a fix into 1.8. |
CL https://golang.org/cl/27790 mentions this issue. |
@jianzhangbjz The problem you reported earlier was found to be related to the level of binutils. The error occurs on binutils 2.24 but not on binutils 2.25. If you can build with binutils 2.25 the error should go away. |
@crawshaw suggested I add a testcase for this scenario. It have a prototype that generates the test, which took a long time, and then it took over 45 minutes to compile and link it. It needs to be cleaned up a bit before adding to the CL but wanted to know if this is too long to use for a testcase. This would test that the linker tables were created and inserted correctly but wouldn't stress the number of sections since only a few are generated. |
I'm sure @bradfitz would prefer the 45 minute test case. That will give him a nice challenge for #17104. Just kidding, 45 minutes is clearly too long. I also don't think you should do heavy modifications to the linker for your CL just for testing. So probably we should move ahead without a test case. @crawshaw, opinions? |
Definitely too long to be on by default. But you can |
45 minutes is never going to be run, so it's not much use. But I'm surprised it takes so long. Does the kubernetes binary that started all this take that long? If not, can it be simplified into a test case? |
I'll try some more experiments to see if I can get a testcase with a big text section that builds faster. I just redid the full Kubernetes build (which builds 25 binaries, with 3 that exceed the text size limit) and it took 48 minutes for all of them. Either there's something odd in my testcase or it's because my test is a single huge .go source file. With the Kub binaries they build from multiple .go files and include large packages (.a files) that were compiled in an earlier separate step. I like Brad's idea of having the test out there but skipping it by default. |
Ooops there is probably some parallelism in the Kubernetes build so I will need to measure the build times individually to really know how they compare. |
You want a big text section right?
Slowest part is 1e6 echos. If you do it with a single open and a write buffer in a go program the test should be decently fast. (You may still want to guard it with |
I am working on the latest suggestion for a test. It takes only a few minutes to build when done this way. It does reproduce the problem and identified a bug that only happens with really big functions. So I am in the process of fixing that and retesting it all and will update the CL once it is ready. |
Please answer these questions before submitting your issue. Thanks!
go version
)?go 1.6.1
go env
)?Ubuntu 16.04
If possible, provide a recipe for reproducing the error.
A complete runnable program is good.
A link on play.golang.org is best.
Trying to build kubernetes upstream with installed go 1.6.1 on Ubuntu 16.04.
git clone https://github.com/kubernetes/kubernetes.git
cd kubernetes
modify the hack/lib/golang.sh file to remove the 2 comments that are preventing ppc64le from being used.
make
A clean build
A failed build with linker errors due to the extremely large size of the intermediate go.o file that is being generated when building the hyperkube binary. The text section of go.o is too large for the GNU linker on ppc64le to handle, specifically with respect to branch offsets within the code.
From the output with -v option set for ldflags:
host link: "powerpc64le-linux-gnu-gcc" "-m64" "-gdwarf-2" "-o" "/tmp/go-build669045650/k8s.io/kubernetes/cmd/hyperkube/_obj/exe/a.out" "-rdynamic" "/tmp/go-link-267127109/go.o" "/tmp/go-link-267127109/000000.o" "/tmp/go-link-267127109/000001.o" "/tmp/go-link-267127109/000002.o" "/tmp/go-link-267127109/000003.o" "/tmp/go-link-267127109/000004.o" "/tmp/go-link-267127109/000005.o" "-g" "-O2" "-g" "-O2" "-g" "-O2" "-g" "-O2" "-g" "-O2" "-lpthread" "-g" "-O2"
/usr/lib/go-1.6/pkg/tool/linux_ppc64le/link: running powerpc64le-linux-gnu-gcc failed: exit status 1
/tmp/go-link-267127109/go.o: In function
k8s.io/kubernetes/vendor/github.com/fsouza/go-dockerclient.(*eventMonitoringState).monitorEvents': /home/boger/kubtest/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/github.com/fsouza/go-dockerclient/event.go:214:(.text+0x200798c): relocation truncated to fit: R_PPC64_REL24 (stub) against
runtime.ifaceeq'/tmp/go-link-267127109/go.o: In function
k8s.io/kubernetes/vendor/github.com/fsouza/go-dockerclient.headersWithAuth': /home/boger/kubtest/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/github.com/fsouza/go-dockerclient/image.go:550:(.text+0x200dc10): relocation truncated to fit: R_PPC64_REL24 (stub) against
runtime.makemap'/tmp/go-link-267127109/go.o: In function
k8s.io/kubernetes/vendor/github.com/fsouza/go-dockerclient.tlsDialWithDialer': /home/boger/kubtest/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/github.com/fsouza/go-dockerclient/tls.go:50:(.text+0x20115dc): relocation truncated to fit: R_PPC64_REL24 (stub) against
runtime.makechan'/home/boger/kubtest/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/github.com/fsouza/go-dockerclient/tls.go:85:(.text+0x20119ec): relocation truncated to fit: R_PPC64_REL24 (stub) against
runtime.chanrecv1' /tmp/go-link-267127109/go.o: In function
k8s.io/kubernetes/vendor/github.com/fsouza/go-dockerclient.(*Client).ListVolumes':/home/boger/kubtest/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/github.com/fsouza/go-dockerclient/volume.go:46:(.text+0x2011c78): relocation truncated to fit: R_PPC64_REL24 (stub) against `runtime.makemap'
..... more
Interestingly, if I try to add the -tmpdir option to -ldflags so I can look at the intermediate .o files, I get this:
host link: "powerpc64le-linux-gnu-gcc" "-m64" "-gdwarf-2" "-o" "/tmp/go-build496546066/k8s.io/kubernetes/cmd/hyperkube/_obj/exe/a.out" "-rdynamic" "/home/boger/gotmp/go.o" "/home/boger/gotmp/000000.o" "/home/boger/gotmp/000001.o" "/home/boger/gotmp/000002.o" "/home/boger/gotmp/000003.o" "/home/boger/gotmp/000004.o" "/home/boger/gotmp/000005.o" "-g" "-O2" "-g" "-O2" "-g" "-O2" "-g" "-O2" "-g" "-O2" "-lpthread" "-g" "-O2"
/home/boger/golang/go1.6/go/pkg/tool/linux_ppc64le/link: running powerpc64le-linux-gnu-gcc failed: exit status 1
/usr/bin/ld: /home/boger/gotmp/go.o: invalid string offset 1869033317 >= 7130106 for section `"���'
/home/boger/gotmp/go.o: error adding symbols: No error
collect2: error: ld returned 1 exit status
I was hoping that the -buildshared option would help this problem but I ran into other problems when I did that and opened another issue when using that option.
This is a serious problem for us because it prevents upstream Kubernetes from building on ppc64le.
Our Power linker expert says that the go.o file that is generated has a single text section that is too large. Usually, when the branch offset could be too long, stubs are inserted to handle it but in this case the text section is too long for the stubs to work.
From the linker output when it fails:
.text 0x0000000010001950 0x107b7c0 /tmp/go-link-220343705/go.o
0x000000001006a938 _cgo_reginit
0x000000001006d4c0 _cgo_topofstack
0x000000001006d898 main
0x0000000010a6b490 _cgo_panic
0x0000000010a6b4c8 crosscall2
From the last commit where it linked successfully:
.text 0x0000000010001740 0xe68d58 /tmp/go-link-262454621/go.o
0x000000001006a8a8 _cgo_reginit
0x000000001006d430 _cgo_topofstack
0x000000001006d808 main
0x00000000102d2c40 _cgo_panic
0x00000000102d2c78 crosscall2
So for now we are the first platform to hit this problem, but as applications keep adding more and more packages eventually limits could be hit on other platforms too. It is not very efficient for the linker to have to handle such large sections either.
FYI... I did try go 1.6.2 and go built from master with similar results. Right now the size of the text section in the go.o file is living on the edge so depending on what is done by a commit, the link will sometimes work.
I suspect resolving this is not easy, so any packaging suggestions or other ideas on how to reduce the text size would help too. The last commit to successfully build was 997d55d.
The text was updated successfully, but these errors were encountered: