Skip to content

include: net: Fix incorrect casting of timeval on zsock_timeval. #29962

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

kkasperczyk-no
Copy link
Collaborator

@kkasperczyk-no kkasperczyk-no commented Nov 12, 2020

Currently zsock_timeval implementation uses long type for both
tv_sec and tv_usec fields. In the select method timeval type is
directly casted on zsock_timeval, but in some cases (e.g. when
using CONFIG_POSIX_API and some specific libc version) both types
have fields of different sizes, what may lead to assigning them
incorrect values.
Using types declared by the used libc in the CONFIG_POSIX_API=y
case will let to keep timeval and zsock_timeval types compatible.

Signed-off-by: Kamil Kasperczyk [email protected]

@github-actions github-actions bot added the area: API Changes to public APIs label Nov 12, 2020
@kkasperczyk-no kkasperczyk-no changed the title [include/posix] Fix incorrect casting of timeval on zsock_timeval. include: posix: Fix incorrect casting of timeval on zsock_timeval. Nov 12, 2020
@pfalcon
Copy link
Collaborator

pfalcon commented Nov 12, 2020

When the original code was written, there were assumptions that: a) Zephyr is intended to be small embedded RTOS, so unneeded code should be avoided; b) the layout of zsock_timeval was specifically made to match layout of Newlib's (as bundled with Zephyr SDK) struct timeval.

@kkasperczyk-no
Copy link
Collaborator Author

When the original code was written, there were assumptions that: a) Zephyr is intended to be small embedded RTOS, so unneeded code should be avoided; b) the layout of zsock_timeval was specifically made to match layout of Newlib's (as bundled with Zephyr SDK) struct timeval.

Maybe I will explain the reason why I proposed this change, to clarify situation. I met with the problem that using CONFIG_POSIX_API=y leads to the situation when timeval has tv_sec field of 8 B size and tv_usec of 4 B size, while for zsock_timeval both of them are 4 B. Because of the direct casting of one type to another, field values are not assigned in the expected way and select method is not working properly (time spent on sleeping is different from the expected). You wrote that there were few assumptions made, so is it kind of desired operation? If solution that I proposed is wrong, could you suggest how can I fix my problem?

@pfalcon
Copy link
Collaborator

pfalcon commented Nov 12, 2020

I met with the problem that

Perhaps, the most useful information here would be: in what config you met that situation (what libc, etc.)

I don't say that this change is wrong, just that the original idea was to support CONFIG_POSIX_API with just Newlib, as otherwise there're just too many combinations (often of unknowns) to support. Of course, if people want to support more combos, why not.

@pfalcon
Copy link
Collaborator

pfalcon commented Nov 12, 2020

If solution that I proposed is wrong, could you suggest how can I fix my problem?

Again, not saying it's wrong, but another option would be, for the case of CONFIG_POSIX_API, to rely on the declaration of struct timeval of the underlying libc, and make zsock_timeval a #define to it. (Of course, also need to handle case of !CONFIG_POSIX_API).

@Damian-Nordic
Copy link
Collaborator

@pfalcon
timeval from newlib is defines as:

struct timeval {
	time_t		tv_sec;		/* seconds */
	suseconds_t	tv_usec;	/* and microseconds */
};

where, by default, time_t == _TIME_T which in turn is defined as follows:

#if defined(_USE_LONG_TIME_T) || __LONG_MAX__ > 0x7fffffffL
#define	_TIME_T_ long
#else
#define	_TIME_T_ __int_least64_t
#endif

Thus sizeof(time_t) == 8 unless you override the config.

zsock_timeval is, on the other hand, defined as:

struct zsock_timeval {
	/* Using longs, as many (?) implementations seem to use it. */
	long tv_sec;
	long tv_usec;
};

Hence if you say that both structures should match in ARM+newlib scenario it's simply not the case. Should we then fix zsock_timeval to use time_t and get definition of that type from libc headers?

@pfalcon
Copy link
Collaborator

pfalcon commented Nov 12, 2020

@Damian-Nordic: So, newlib was upgraded over this time too.

Anyway, the commit message here said "wrongly assumes", and I proceeded to explain that it actually was the "baseline assumption", which is of course holds only under specific setup.

How to fix it - add layers of marshalling like this patch does, or uphold the original idea of avoiding that whenever possible - is up to new maintainers of the POSIX subsys (when I was one, I clearly leaned on the side to avoid adding extra layers, whenever possible. But the lack of consensus and clear direction how to do things for the subsys is why I resigned.)

@kkasperczyk-no
Copy link
Collaborator Author

Ok then, @aescolar could you comment on this PR, as I found that you are POSIX maintainer?

@pfalcon
Copy link
Collaborator

pfalcon commented Nov 12, 2020

Ok then, @aescolar could you comment on this PR, as I found that you are POSIX maintainer?

He's maintainer of native_posix port of Zephyr. See e.g. #13054

@pfalcon
Copy link
Collaborator

pfalcon commented Nov 12, 2020

@kkasperczyk-no, @Damian-Nordic: Generally, this patch should be fine. I'm sure it will be approved. I apologize for not doing that myself, as it literally goes sideways with the direction I tried to move the POSIX subsys in, and spent a lot of effort on that. (But neither can I encourage you to go the same way, as I myself very well know that it's full of obstacles and stumble points, vs just applying local workarounds like this.)

@aescolar
Copy link
Member

Ok then, @aescolar could you comment on this PR, as I found that you are POSIX maintainer?

He's maintainer of native_posix port of Zephyr. See e.g. #13054

Correct, they are not the same thing :)

@Damian-Nordic
Copy link
Collaborator

@pfalcon That's fine. We appreciate your comments and we don't want to negatively impact the code quality, but we may not understand all the design principles of the POSIX layer, so it's hard for us to make the decision which way to go.

Suppose that we want to follow the rule that all Zephyr structures match libc counterparts. Why isn't that relationship expressed explicitly in the code? I can see a lot of structures are assumed to be the same, but this assumption doesn't seem to be checked anywhere. Is it because we don't want inner Zephyr layers depend on the POSIX layer (i.e. we want the Socket layer to stay the same regardless of how it is used)? If so, isn't the marshaling approach that we proposed more natural?

@pfalcon
Copy link
Collaborator

pfalcon commented Nov 12, 2020

Suppose that we want to follow the rule that all Zephyr structures match libc counterparts. Why isn't that relationship expressed explicitly in the code?

Because the whole POSIX subsys is "experimental" and many things about it are "RFC". There were various proposals, e.g. #17706, #13787, #13054, but I never received enough feedback on them.

Is it because we don't want inner Zephyr layers depend on the POSIX layer (i.e. we want the Socket layer to stay the same regardless of how it is used)?

Yes, that's kind of how it all started - POSIX subsys is just a layer on top of native Zephyr API. But as I delved into the subsys (and I inherited it from other folks, and trust me, it was in much worse shape then), and thought how to optimize various parts of POSIX API implementation, it became clear to me that to achieve that, Zephyr kernel would need to be tweaked to be more POSIX-friendly. But whether that should be done or not depends on what place POSIX support has in Zephyr, and even that was never answered well and clear.

(i.e. we want the Socket layer to stay the same regardless of how it is used)?

In Socket layer you also can see traces of its evolution - it started as experimental layer on top of "native networking" API, and the first main goal was to show concerned people that it's not big bloat and a very viable alternative to the adhoc API. It won and became the main Zephyr networking API, but some its initial traits (including small code, etc. overhead) are there. E.g., by all means, CONFIG_NET_SOCKETS_POSIX_NAMES is a chore, and if someone says "It's a chore, let's drop it", I wouldn't argue. Myself, I'm not the best person to do that, given how much effort I spent to implement and maintain it that way already (so, it's easier for me to apply yet another workaround when it bites).

If so, isn't the marshaling approach that we proposed more natural?

Maybe it is, but: a) it requires stack allocation for extra struct copy; b) requires dummy code to move values around. Obviously, for one case, it's nothing, but as you understand, we discuss the general approach. And my approach had always been "Nobody's going to optimize it later, so think about "small" things (multiplied in dozens/hundreds) right away."

@kkasperczyk-no
Copy link
Collaborator Author

for the case of CONFIG_POSIX_API, to rely on the declaration of struct timeval of the underlying libc, and make zsock_timeval a #define to it. (Of course, also need to handle case of !CONFIG_POSIX_API).

@pfalcon I tried to follow your suggestions and not make local fix. Could you verify if it's what you meant?

@kkasperczyk-no kkasperczyk-no changed the title include: posix: Fix incorrect casting of timeval on zsock_timeval. include: net: Fix incorrect casting of timeval on zsock_timeval. Nov 13, 2020
Copy link
Member

@jukkar jukkar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@pfalcon
Copy link
Collaborator

pfalcon commented Nov 16, 2020

@kkasperczyk-no:

@pfalcon I tried to follow your suggestions and not make local fix. Could you verify if it's what you meant?

I appreciate giving it a try! However, I meant more along the lines of:

#ifdef CONFIG_POSIX_API
// For POSIX API, we rely on the underlying libc definition
#define zsock_timeval timeval
#else
... (original zsock_timeval declaration as it stands now)
#endif

Can you give that a quick try to see if it works (it should, but that's the magic of POSIX subsys that even seemingly simple changes may lead to unobvious effects) ?

@kkasperczyk-no kkasperczyk-no marked this pull request as draft November 17, 2020 12:03
@kkasperczyk-no kkasperczyk-no force-pushed the timeval_fix_pr_public branch 2 times, most recently from f319ece to 74d7714 Compare November 18, 2020 07:56
Currently zsock_timeval implementation uses long type for both
tv_sec and tv_usec fields. In the select method timeval type is
directly casted on zsock_timeval, but in some cases (e.g. when
using CONFIG_POSIX_API and some specific libc version) both types
have fields of different sizes, what may lead to assigning them
incorrect values.
Using types declared by the used libc in the CONFIG_POSIX_API=y
case will let to keep timeval and zsock_timeval types compatible.

Signed-off-by: Kamil Kasperczyk <[email protected]>
@kkasperczyk-no kkasperczyk-no marked this pull request as ready for review November 18, 2020 13:46
@kkasperczyk-no
Copy link
Collaborator Author

kkasperczyk-no commented Nov 18, 2020

@kkasperczyk-no:

@pfalcon I tried to follow your suggestions and not make local fix. Could you verify if it's what you meant?

I appreciate giving it a try! However, I meant more along the lines of:

#ifdef CONFIG_POSIX_API
// For POSIX API, we rely on the underlying libc definition
#define zsock_timeval timeval
#else
... (original zsock_timeval declaration as it stands now)
#endif

Can you give that a quick try to see if it works (it should, but that's the magic of POSIX subsys that even seemingly simple changes may lead to unobvious effects) ?

@pfalcon sorry for slow responding, but I had some other work to do. I tried to introduce solution proposed by you and it partially worked, but there was a problem with lack od timeval declaration for older Newlib version. I declared timeval structure for the older Newlib as a workaround, because including <sys/time.h>, where this workaround is already done didn't help in every case. The problem I had was "No such file or directory <sys/time.h>" compilation error in CI for some samples (e.g. tests/posix/eventfd and tests/posix/eventfd_basic) and platforms (e.g. qemu_cortex_r5). It was quite weird, because when I compiled it locally everything was fine, so I guess that the success depends on the including order inside of the Zephyr. Current solution seems to work and build on every platform properly.

@pfalcon
Copy link
Collaborator

pfalcon commented Nov 19, 2020

@kkasperczyk-no

sorry for slow responding, but I had some other work to do

I once again appreciate giving a try to an alternative solution. In all fairness, I didn't expect you to, or I would have said "please spend nor more than half an hour of it, if it didn't work out, can go with a simpler workaround". And well, now I see for yourself how it may be hard (in the sense of "tangled") to get POSIX subsys and libc integration "right".

Looks good (as an input for next stage, where we should find better ways to deal with it), thanks!

@jukkar jukkar merged commit 414ba6b into zephyrproject-rtos:master Nov 19, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: API Changes to public APIs area: Networking
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants