Proposal for lists of strings #268
This looks very similar to the implementation of iso_varying_string by Rich Townsend. |
Really? I should have a look at that then. I thought that was about
supporting strings of variable length, whereas I was thinking of managing a
list of strings of varying lengths. But if you are right, then this becomes
superfluous :).
|
No, sorry, I was too hasty. You are right. It supports strings of variable length as a container of allocatable character arrays, but it doesn't implement arrays of strings of varying length. |
No problem - that happens to all of us :).
BTW, I realised that it might be good to include a "split" method as well:
split a string into substrings based on separator characters, but I should
have a look at the discussion on split(). I have been postponing that ...
|
I think that this addition would really be amazing.
Comments
Can you be a bit more specific? What do you mean by that?
IMHO if
I don't understand why not. I think that it would be quite convenient and I do not see any potential pitfall lurking.
Additional methods
I would find it very useful to be able to run the following methods on lists of strings.
Additional operators
Input/output
I admit that I have no exact idea here, but I would like access to the lists to be as simple as possible for input and output, so I'll throw in the topic for potential discussion.
I hope that I am not repeating something that has already been proposed or discussed during the monthly call (apologies in that case): unfortunately, I have not yet had the chance to catch up with the recording. Am I putting too much on the plate? 😄 |
What I meant by "not storing data with the string" is: the module I propose only stores strings, it does not function as a dictionary to associate arbitrary data with a string. Such functionality would be welcome too, but it would complicate matters a bit. I first want this one sorted out. I will have a look at your suggestions. A first comment: I do not know if Fortran allows overloading of // - I have never seen it myself. The specification as I posted it is far from complete, I definitely acknowledge that :). I did sit down yesterday to get started with a proof-of-concept implementation - currently two methods: insert and get. They work nicely. |
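It does in fact! A small module that defines operator(//) for two integers is enough to check it.
I can't find it right now, but there is a Fortran library for dictionaries somewhere on GitHub that uses overloading of // for building key//value pairs. In issue #69 (string handling routines), I believe there were even some suggestions to use this to directly concatenate strings and numerical values (e.g. '2 + 3 ='//5).
In general I like the proposed API, and the idea of infinitely long lists also sounds like it might fit Fortran well. One question concerning this: would the list have a size method?
I also see some overlap with the split discussion in #241, particularly the comment from @esterjo.
|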
Yes, the iso_varying_string module by Rich Townsend also overloads the (//) operator, for three different cases: a varying string with another varying string, and attaching a character to a varying string from the left and from the right. |
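To make the three cases concrete, here is a minimal sketch of how such an overload can look. It is not the code from iso_varying_string: the type vstring, its component chars and the procedure names are all made up for this illustration, and the sketch assumes that every operand already holds an allocated value.

```fortran
module vstring_sketch
    implicit none
    private
    public :: vstring, operator(//)

    ! Hypothetical variable-length string type (not the iso_varying_string one)
    type :: vstring
        character(len=:), allocatable :: chars
    end type vstring

    ! One generic operator, three specific procedures
    interface operator(//)
        module procedure concat_vs_vs, concat_ch_vs, concat_vs_ch
    end interface

contains

    ! vstring // vstring
    pure function concat_vs_vs(a, b) result(c)
        type(vstring), intent(in) :: a, b
        type(vstring) :: c
        c%chars = a%chars // b%chars
    end function concat_vs_vs

    ! character // vstring (attach from the left)
    pure function concat_ch_vs(a, b) result(c)
        character(len=*), intent(in) :: a
        type(vstring), intent(in) :: b
        type(vstring) :: c
        c%chars = a // b%chars
    end function concat_ch_vs

    ! vstring // character (attach from the right)
    pure function concat_vs_ch(a, b) result(c)
        type(vstring), intent(in) :: a
        character(len=*), intent(in) :: b
        type(vstring) :: c
        c%chars = a%chars // b
    end function concat_vs_ch

end module vstring_sketch

program vstring_demo
    use vstring_sketch
    implicit none
    type(vstring) :: a, b, s

    a%chars = 'list'
    b%chars = 'strings'
    s = a // ' of ' // b      ! vstring // character, then vstring // vstring
    print '(a)', s%chars
    s = '[' // s // ']'       ! character // vstring, then vstring // character
    print '(a)', s%chars
end program vstring_demo
```

Inside the module the intrinsic // still applies to plain character operands, so the specific procedures can use it without recursion.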
That is an alternative implementation I had not thought about. The nice
thing about it is that the memory remains contiguous, although if you do
enough manipulations you will of course get memory fragmentation at some
point. Well worth checking out :).
On Fri 18 Dec 2020 at 10:51, Ivan Pribec wrote:
> … I will have a look at your suggestions. A first comment: I do not know if Fortran allows overloading of // - I have never seen it myself.
It does in fact! You can run this small example to check it:
```fortran
module test
  implicit none
  interface operator(//)
    module procedure sum_ints
  end interface
contains
  pure function sum_ints(a,b) result(c)
    integer, intent(in) :: a, b
    integer :: c
    c = a + b
  end function
end module

program use_test
  use test
  print *, " 2 + 3 = ", 2//3
end program
```
I can't find it right now, but there is a Fortran library for dictionaries somewhere on GitHub that uses overloading of // for building key//value pairs. In issue #69 (string handling routines), I believe there were even some suggestions to use this to directly concatenate strings and numerical values (e.g. '2 + 3 ='//5).
In general I like the proposed API, and the idea of infinitely long lists also sounds like it might fit Fortran well. One question concerning this: would the list have a size method?
I also see some overlap with the split discussion in #241, particularly the comment from @esterjo <https://github.com/esterjo>:
> How about this:
> There should be a string_array type. This type:
> - Would hold all the "string" elements of the array by storing them side by side in one contiguous character string, called "data".
> - It would also store an index array (possibly empty) whose nth element marks the end of the nth string in the "data".
> - A string type would be an instance of this where this index array is empty.
> - Making use of this index array would allow for a function to return the nth string in the "data".
>
> Inheriting from this string_array would be a split_string type:
> - It would be no different, but its index array would simply be the locations of delimiters in the contiguous character string.
> - By making use of this index array this child type would have a function to return the nth token (perhaps overloading the one used by the parent type), which would just be the nth string in the "data" without the first character.
> - Perhaps it holds an array of delimiters.
>
> Some things I like about this kind of implementation:
> - It reduces memory fragmentation because you do not allocate multiple chunks of heap for each string in the array, while also avoiding padding short strings like a simple character array of fixed size elements.
> - Splitting a string type does not reallocate its character vector, but simply modifies the index array.
> - Extra capacity can be added to the end of the "data" array to allow for fast addition of small string_arrays to the end.
> - If the index array has 2 columns, then the string elements can be accessed as if they are laid out in a matrix.
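The contiguous layout quoted above can be sketched in a few lines. This is only an illustration of the idea, not code from #241 or from any pull request: the names string_array_sketch, buffer (the "data" string), ends, push and nth are assumptions chosen for the example, and the extra-capacity and split_string parts are left out.

```fortran
module string_array_sketch
    implicit none
    private
    public :: string_array, push, nth

    type :: string_array
        character(len=:), allocatable :: buffer   ! all strings, stored side by side
        integer, allocatable :: ends(:)           ! ends(n): position of the last character of string n
    end type string_array

contains

    ! Append one string to the end of the buffer and record where it ends
    subroutine push(arr, s)
        type(string_array), intent(inout) :: arr
        character(len=*), intent(in) :: s

        if (.not. allocated(arr%buffer)) then
            arr%buffer = s
            arr%ends = [len(s)]
        else
            arr%buffer = arr%buffer // s
            arr%ends = [arr%ends, len(arr%buffer)]
        end if
    end subroutine push

    ! Return the nth string, or an empty string if n is out of range
    function nth(arr, n) result(s)
        type(string_array), intent(in) :: arr
        integer, intent(in) :: n
        character(len=:), allocatable :: s
        integer :: first

        s = ''
        if (.not. allocated(arr%ends)) return
        if (n < 1 .or. n > size(arr%ends)) return
        first = 1
        if (n > 1) first = arr%ends(n - 1) + 1
        s = arr%buffer(first:arr%ends(n))
    end function nth

end module string_array_sketch

program string_array_demo
    use string_array_sketch
    implicit none
    type(string_array) :: arr
    integer :: i

    call push(arr, 'insert')
    call push(arr, 'get')
    call push(arr, 'a longer string')
    do i = 1, size(arr%ends)
        print '(i0, a, a)', i, ': ', nth(arr, i)
    end do
    print '(a, a)', 'buffer: ', arr%buffer
end program string_array_demo
```

Appending and reading are cheap in this layout, while inserting in the middle would require shifting the buffer and updating every later entry of ends.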
|
It can be overloaded like any other operator (e.g. ==).
Brilliant! Why don't you share this with a (draft?) pull request so it can be discussed and reviewed by our knowledgeable community? |
Well, I discovered a small issue after I added another test and fixed it
this morning. I will put it in a pull request later today.
|
Just created a pull request for this.
|
Referring to my comment from #241 that @ivan-pi mentioned, I should have also added that sorting may mean sorting the index array and setting a "sorted" flag to true, with no need to reallocate and reorder the data elements. However, it's probably best to actually reallocate and sort the data, so that scanning through it in sorted order will not cause you to jump around memory randomly.
But yes @arjenmarkus, insertion and modification of elements would be inefficient and would require reallocating at least part of the data for each insertion.
Anecdotally, in my day to day programming I typically use arrays of strings in R (usually for data science and statistical modeling). I rarely modify the elements of a character vector or insert elements after I have generated the array. Instead, querying, appending, sorting, getting unique values, and splitting are by far the most common operations. Given that, I will say that I'm a big fan of having two or more implementations optimized for different uses. |
So there may be a "sort_view" subroutine that only sorts the index array, and a "sort_inplace" subroutine that actually sorts the character data. |
This inspired me to implement a new method: sort that returns a sorted
list. The implementation uses an index array that is the item that is
passed to the sorting algorithm. This way, the number of allocations is
kept low. (The sorting algorithm may not be the most efficient possible,
but I like it for its conciseness and functional programming style)
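A minimal sketch of the "sort the index array, not the strings" idea follows. It is not the routine from the pull request: the function name sorted_order is hypothetical, the strings are held in an ordinary fixed-length character array for brevity, and a plain insertion sort is used in place of whatever algorithm the implementation actually uses.

```fortran
program sort_view_demo
    implicit none
    character(len=6) :: words(4) = [character(len=6) :: 'pear', 'apple', 'fig', 'banana']
    integer :: order(4)
    integer :: i

    order = sorted_order(words)

    ! Visit the strings in sorted order; words itself is never moved
    print '(a)', (trim(words(order(i))), i = 1, size(words))

contains

    ! Return a permutation that lists the strings in ascending lexical order
    pure function sorted_order(strings) result(order)
        character(len=*), intent(in) :: strings(:)
        integer :: order(size(strings))
        integer :: i, j, key

        order = [(i, i = 1, size(strings))]

        ! Insertion sort on the index array only
        do i = 2, size(strings)
            key = order(i)
            j = i - 1
            do while (j >= 1)
                ! Separate test: Fortran does not guarantee short-circuit evaluation
                if (.not. llt(strings(key), strings(order(j)))) exit
                order(j + 1) = order(j)
                j = j - 1
            end do
            order(j + 1) = key
        end do
    end function sorted_order

end program sort_view_demo
```

Only integers are moved during the sort, so no string is reallocated; the trade-off mentioned earlier in the thread is that reading the strings in sorted order then jumps around in memory.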
|
I just committed preliminary documentation and a more complete implementation. There are quite a few things that ought to be added (and possibly improved) but it is getting into shape, IMHO. |
Thanks @arjenmarkus! I hope I'll be able to test it a little bit at some point. Concerning the sorting routines, should not these be a more generic part of |
Thank you @arjenmarkus for the implementation. I will try to have a look at it next week.
@epagone I think that sorting algorithms are definitely in the scope of |
As @jvdp1 has mentioned already, I think we all agree sorting routines are in the scope of stdlib. But I also think we should not shy away from providing specific sort methods which are private to the module in development (in this case list of strings). This could be either due to using a specialized algorithm more suitable than a generic one, or simply as a temporary building block. |
@ivan-pi I understand and agree with the rationale and the general principle. However, is this the case? I am far from an expert of sorting algorithms, but to my eyes the one in the relevant PR does not seem to be tailored to list of strings (except for the data type). |
I am not an expert on sorting algorithms either. I've tried to compile a list of available routines in #98. I've had a look at the implementation from @arjen. If we were to generalize the sorting routine and move it out of the string list module, we would need to:
|
I see this as an advantage to keep the ball rolling also on the front of
I don't know if this would be a lot of work but to me it does not seem super-complicated. |
I am making progress with the implementation of a better way to specify the indices and en passant with the various versions to be supported. This made me think of some design questions for which I would like some advice:
Of course, negative indices may be not so useful in the first place and I should not put too much effort in that feature. If so, the alternative of flagging an error is probably best. What do you think? |
This is the behavior in Python: calling a.insert(-10, 4) on the list [1, 2, 3] gives [4, 1, 2, 3].
It looks like it prepends it at the beginning? Notice that a negative index wraps around from the right when you do |
Hm, negative indices start at the end and when you get outside the list the
item is either inserted at the start (negative indices) or at the end.
Well, that is an acceptable model, as well, I suppose :). Perhaps even more
than trying to accommodate for inserting in arbitrary places. I can
certainly do that.
On Mon 25 Jan 2021 at 23:21, Ondřej Čertík wrote:
This is the behavior in Python:
```python
>>> a = [1, 2, 3]
>>> a[2]
3
>>> a[0]
1
>>> a[-1]
3
>>> a.insert(-10, 4)
>>> a
[4, 1, 2, 3]
```
It looks like it prepends it at the beginning?
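For comparison, the same clamping rule can be written down in Fortran. This is a sketch under stated assumptions, not a proposal for the stdlib API: the subroutine name insert_clamped is hypothetical, the list is a plain integer array to mirror the Python session, and idx deliberately follows Python's convention (0-based, negative values count from the end) rather than Fortran's usual 1-based indexing.

```fortran
program clamp_demo
    implicit none
    integer, allocatable :: a(:)

    a = [1, 2, 3]
    call insert_clamped(a, -10, 4)    ! mirrors a.insert(-10, 4): ends up at the front
    print '(*(i0, 1x))', a
    call insert_clamped(a, 100, 5)    ! far beyond the end: appended
    print '(*(i0, 1x))', a
    call insert_clamped(a, -1, 6)     ! before the last element
    print '(*(i0, 1x))', a

contains

    ! Insert item so that it ends up at Python-style position idx,
    ! clamping idx to the valid range instead of raising an error.
    subroutine insert_clamped(list, idx, item)
        integer, allocatable, intent(inout) :: list(:)
        integer, intent(in) :: idx, item
        integer :: n, pos

        n = size(list)
        pos = idx
        if (pos < 0) pos = pos + n      ! negative index: count from the end
        pos = max(0, min(pos, n))       ! clamp to the range 0..n
        ! pos elements stay in front of the new item
        list = [list(1:pos), item, list(pos + 1:n)]
    end subroutine insert_clamped

end program clamp_demo
```

The three calls leave the array as 4 1 2 3, then 4 1 2 3 5, then 4 1 2 3 6 5, which matches what Python's list.insert does for the same arguments.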
|
This comment discusses the design on which PR #470 is built.
Explanation of forward and backward indexes: After inserting an element at
Inserting an element at HEAD is like inserting at
A backward index is like the forward index of the reversed form of the list.
I am looking at the insert API from a different perspective: instead of talking about whether it will insert an element before or after the i-th index, I would rather say that after insertion is completed the
If you want to insert an element
Currently there is NO function in the PR that converts an integer index to a forward index or a backward index [not to be confused with fidx and bidx of the PR] |
Last Tuesday, during the December call, it was mentioned that support for a list of strings might be useful. This is a straightforward proposal for such a feature in the standard library.
Fortran has supported variable-length strings since the 2003 standard, but it does not have a native type to handle collections of strings of different lengths. Such collections are quite useful though and the language allows us to define a derived type that can handle such collections.
This proposal considers the features that would be useful for this derived type, a list of strings. It does not prescribe the methods for implementing the type. Given the ease with which arrays are handled in Fortran, it is possible to use allocatable arrays, but a linked list may prove to be more efficient, at least in certain usage scenarios. Therefore the proposal concentrates on the usage.
A further limitation of the proposal: there is no provision for nested strings or for storing data with the string.
Methods for the list of strings:
- Method `insert` - insert a new string after the given index.
  Special index values: `head` and `end`, where `head` means insert the new string before the first one and `end` means insert it after the last one. Arithmetic with `end` is possible: `end-1` means insert before the last element and so on.
  Note: besides a string you can also insert another list or an ordinary array of strings (in the latter case all inserted strings will be of the same length as the array).
  Convenience method: `append` - same as `insert` with index = `end`.
  Indices beyond the bounds of the list: should they cause an error? Or should they be interpreted as insert at the start or at the end, as if `head` or `end` were given? Or should the list simply grow to that length? This potentially causes holes.
- Method `delete` - delete a single string or a range of strings from the list. `head` and `end` have similar meanings as with `insert`.
- Method `replace` - replace a string with a new value at the indicated location.
  Note: should we support replacement with multiple strings? My initial take at this is: no, if you want to do that, it can be done with a combination of `insert` and `delete`. The `replace` method is a convenient way to deal with individual strings.
- Method `get` - get one element from the list and return it as a string.
- Method `range` - get all strings in a given range and return them as a new list.
- Method `index` - find the index of the first occurrence of a string in the list. Returns 0 (zero) if there is no such string. May also look for the last.
- Method `index_sub` - similar to `index`, but rather than the entire string, a matching substring is sought.
- Method `destroy` - destroy the list.
- Assignment: you can use lists in the context `list1 = list2`, where list1 gets a copy of the entire list of strings.
Some methods will be subroutines, others will be functions. Special attention should be paid to error conditions. A guiding principle: no surprises. The philosophy might be: the list is - conceptually - infinitely long and if an element has not been set, then it is an empty string (a string of length 0).
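As a rough illustration of these semantics, here is a minimal sketch of a list type with insert, append and get, built on an allocatable array of wrapped strings. The proposal deliberately leaves the implementation open, so everything here is an assumption made for the example: the names string_list_sketch, string_item and length, the use of idx = 0 in place of head, and the choice to clamp out-of-range indices (one of the options raised above). Reading an element that was never set returns an empty string, following the "conceptually infinite list" idea.

```fortran
module string_list_sketch
    implicit none
    private
    public :: string_list

    ! Wrapper so that every element can have its own length
    type :: string_item
        character(len=:), allocatable :: value
    end type string_item

    type :: string_list
        private
        type(string_item), allocatable :: items(:)
    contains
        procedure :: length
        procedure :: insert
        procedure :: append
        procedure :: get
    end type string_list

contains

    ! Number of strings that have actually been stored
    pure integer function length(list)
        class(string_list), intent(in) :: list
        length = 0
        if (allocated(list%items)) length = size(list%items)
    end function length

    ! Insert new_string after position idx; idx = 0 plays the role of "head".
    ! Out-of-range indices are clamped to the start or the end of the list.
    subroutine insert(list, idx, new_string)
        class(string_list), intent(inout) :: list
        integer, intent(in) :: idx
        character(len=*), intent(in) :: new_string
        type(string_item), allocatable :: tmp(:)
        integer :: n, pos

        n = list%length()
        pos = max(0, min(idx, n))
        allocate(tmp(n + 1))
        if (pos > 0) tmp(1:pos) = list%items(1:pos)
        tmp(pos + 1)%value = new_string
        if (pos < n) tmp(pos + 2:n + 1) = list%items(pos + 1:n)
        call move_alloc(tmp, list%items)
    end subroutine insert

    ! Same as insert with index = end
    subroutine append(list, new_string)
        class(string_list), intent(inout) :: list
        character(len=*), intent(in) :: new_string
        call list%insert(list%length(), new_string)
    end subroutine append

    ! Return element idx, or an empty string if it has not been set
    function get(list, idx) result(string)
        class(string_list), intent(in) :: list
        integer, intent(in) :: idx
        character(len=:), allocatable :: string

        if (idx >= 1 .and. idx <= list%length()) then
            string = list%items(idx)%value
        else
            string = ''
        end if
    end function get

end module string_list_sketch

program string_list_demo
    use string_list_sketch
    implicit none
    type(string_list) :: list
    integer :: i

    call list%append('beta')
    call list%insert(0, 'alpha')     ! insert at the head
    call list%append('gamma')
    do i = 1, list%length() + 1      ! the last index lies beyond the end
        print '(i0, a, a, a)', i, ': "', list%get(i), '"'
    end do
end program string_list_demo
```

Every insert copies the whole array in this sketch, which keeps the code short; a linked list or spare capacity would avoid that cost, which is exactly the kind of trade-off the proposal leaves to the implementation.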