Skip to content

Proposal for lists of strings #268

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
arjenmarkus opened this issue Dec 17, 2020 · 26 comments
Open

Proposal for lists of strings #268

arjenmarkus opened this issue Dec 17, 2020 · 26 comments
Labels
topic: strings String processing

Comments

@arjenmarkus
Copy link
Member

Last tuesday, during the december call, it was mentioned that support for a list of strings might be useful. This is a straightforward proposal for such a feature in the standard library.

Fortran has supported variable-length strings since the 2003 standard, but it does not have a native type to handle collections of strings of different lengths. Such collections are quite useful though and the language allows us to define a derived type that can handle such collections.

This proposal considers the features that would be useful for this derived type, a list of strings. It does not prescribe the methods for implementing the type. Given the ease by which arrays are handled in Fortran, it is possible to use allocatable arrays, but perhaps a linked list may prove to be more efficient, at least in certain scenarios of use. Therefore the proposal concentrates on the usage.

A further limitation of the proposal: there is no provision for nested strings or for storing data with the string.

Methods for the list of strings:

  • Method insert - insert a new string after the given index.
    Special index values: head and end, where head means insert the new string before the first one and end means insert it after the last one. Arithmetic with end is possible: end-1 means insert before the last element and so on.

    Note: besides a string you can also insert another list or an ordinary array of strings (in the latter case all inserted strings will be of the same length as the array)

    Convenience method:
    append - same as insert with index = end

    Indices beyond the bounds of the list:
    Should they cause an error? Or should they be interpreted as insert at the start or at the end as if head or end were given? Or should the list simply grow to that length? This potentially causes holes.

  • Method delete - delete a single string or a range of strings from the list.
    head and end have similar meanings as with insert.

  • Method replace - replace a string with a new value at the indicated location.

    Note: should we support replacement with multiple strings? My initial take at this is: no, if you want to do that, it can be done with a combination of insert and delete. The replace method is a convenient way to deal with individual strings.

  • Method: get - get one element from the list and return it as a string

  • Method: range - get all strings in a given range and return them as a new list.

  • Method: index - find the index of the first occurrence of a string in the list. Returns 0 (zero) if there is no such string. May also look for the last.

  • Method: index_sub - similar to index but rather than the entire string, a matching substring is sought for.

  • Method: destroy - destroy the list

  • Assignment: you can use lists in the context:

  list1 = list2

where list1 gets a copy of the entire list of strings.

Some methods will be subroutines, others will be functions. Special attention should be paid to error conditions. A guiding principle: no surprises. The philosophy might be: the list is - conceptually - infinitely long and if an element has not been set, then it is an empty string (a string of length 0).

@Romendakil
Copy link

This looks very similar to the implementation of iso_varying_string by Rich Townsend.

@arjenmarkus
Copy link
Member Author

arjenmarkus commented Dec 17, 2020 via email

@Romendakil
Copy link

Really? I should have a look at that then. I thought that was about supporting strings of variable length, whereas I was thinking of managing a list of strings of varying lengths. But if you are right, then this becomes superfluous :). Op do 17 dec. 2020 om 12:33 schreef Jürgen Reuter <[email protected]

No, sorry, I was too hasty. You are right. I supports strings of variable length as a container of allocatable character arrays. But it doesn't implement arrays of strings of varying length.

@arjenmarkus
Copy link
Member Author

arjenmarkus commented Dec 17, 2020 via email

@epagone
Copy link

epagone commented Dec 17, 2020

I think that this addition would really be amazing.

Comments

there is no provision ... for storing data with the string.

Can you bit more specific? What do you mean by that?

Indices beyond the bounds of the list:
Should they cause an error? Or should they be interpreted as insert at the start or at the end as if head or end were given? Or should the list simply grow to that length? This potentially causes holes.

IMHO if append is provided, out of bounds indices should cause an error.

Note: should we support replacement with multiple strings? My initial take at this is: no

I don't understand why not. I think that it would be quite convenient and I do not see the potential pitfall lurking.

Additional methods

I would find very useful to run on lists of strings the following methods.

  • to_number: convert numerical values and return arrays of reals and/or (when applicable) integers and (optionally) their location in the list. I understand it might have a complicated API but it would be a killer feature for me.
  • unique: return a list of unique strings (with optional case insensitivity, I would vote for the default to be case sensitive).
  • upper, lower, capitalise. camelcase, ... (see StringiFor for more): those can be implemented in the first instance to individual strings and then applied to the entire list. Hopefully their names are self-explanatory.

Additional operators

  • // would concatenate lists of strings
  • compare lists with ==

Input/output

I admit that I have no exact idea here, but I would like to make the access to the lists as simple as possible for input and output. So I'll throw in the topic for potential discussion.

I hope that I am not repeating something that has been already proposed or discussed during the monthly call (apologies in that case): unfortunately, I didn't have still the chance to catch-up with the recording.

Am I putting too much stuff in the plate? 😄

@arjenmarkus
Copy link
Member Author

What I meant by "not storing data with the string" is: the module I propose only stores strings, it does not function as a dictionary to associate arbitrary data with a string. Such functionality would be welcome too, but it would complicate matters a bit. I first want this one sorted out.

I will have a look at your suggestions. A first comment: I do not know if Fortran allows overloading of // - I have never seen it myself. The specification as I posted it is far from complete, I definitely acknowledge that :).

I did sit down yesterday to get started with a proof-of-concept implementation - currently two methods: insert and get. They work nicely.

@ivan-pi
Copy link
Member

ivan-pi commented Dec 18, 2020

I will have a look at your suggestions. A first comment: I do not know if Fortran allows overloading of // - I have never seen it myself.

It does in fact! You can run this small example to check it:

module test
  implicit none
  interface operator(//)
      module procedure sum_ints
  end interface
contains
  pure function sum_ints(a,b) result(c)
    integer, intent(in) :: a, b
    integer :: c
    c = a + b
  end function
end module

program use_test
  use test
  print *, " 2 + 3 = ", 2//3
end program

I can't find it right now, but there is a Fortran library for dictionaries somewhere on GitHub, that uses overloading of // for building key//value for pairs. In the issue #69 (string handling routines), I believe there were even some suggestions to use this to directly concatenate strings and numerical values (e.g. '2 + 3 ='//5).

In general I like the proposed the API, the idea of an infinitely long lists also sounds like it might fit to Fortran well. One question concerning this is would the list have a size method?

I also see some overlap with the split discussion in #241. Particularly the comment from @esterjo:

How about this:

There should be a string_array type. This type:

  • Would hold all the "string" elements of the array by storing them side by side in one contiguous character string, called "data".
  • It would also store an index array (possibly empty) whose nth element marks the end of the nth string in the "data".
  • A string type would be an instance of this where this index array is empty
  • Making use of this index array would allow for a function to return the nth string in the "data"

Inheriting form this string_array would be a split_string type:

  • It would be no different, but it's index array would simply be the locations of delimiters in the contiguous character string
  • By making use of this index array this child-type would have a function to return the nth token (perhaps overloading the one used by the parent type), which would just be the nth string in the "data" without the first character
  • Perhaps it holds an array of delimiters

Some things I like about this kind of implementation:

  • it reduces memory fragmentation because you do not allocate multiple chunks of heap for each string in the array, while also avoiding padding short strings like a simple character array of fixed size elements.
  • splitting a string type does not reallocate it's character vector, but simply modifies the index array
  • extra capacity can be added to the end of the "data" array to allow for fast addition of small string_arrays to the end.
  • if the index array has 2 columns, then the string elements can be accessed as if they are laid out in a matrix

@Romendakil
Copy link

Romendakil commented Dec 18, 2020

What I meant by "not storing data with the string" is: the module I propose only stores strings, it does not function as a dictionary to associate arbitrary data with a string. Such functionality would be welcome too, but it would complicate matters a bit. I first want this one sorted out.

I will have a look at your suggestions. A first comment: I do not know if Fortran allows overloading of // - I have never seen it myself. The specification as I posted it is far from complete, I definitely acknowledge that :).

I did sit down yesterday to get started with a proof-of-concept implementation - currently two methods: insert and get. They work nicely.

Yes, also the iso_varying_string module by Rich Townsend uses overloading of the (//) operator:

  interface operator(//)
     module procedure op_concat_VS_VS
     module procedure op_concat_CH_VS
     module procedure op_concat_VS_CH
  end interface operator(//)

for the three different cases of varying string with itself, and attaching a character to varying string from left and right.

@arjenmarkus
Copy link
Member Author

arjenmarkus commented Dec 18, 2020 via email

@epagone
Copy link

epagone commented Dec 18, 2020

A first comment: I do not know if Fortran allows overloading of // - I have never seen it myself.

It can be overloaded like any other operator (e.g. ==).

I did sit down yesterday to get started with a proof-of-concept implementation - currently two methods: insert and get. They work nicely.

Brilliant! Why don't you share this with a (draft?) pull request so it can be discussed and reviewed by our knowledgeable community?

@arjenmarkus
Copy link
Member Author

arjenmarkus commented Dec 18, 2020 via email

@arjenmarkus
Copy link
Member Author

arjenmarkus commented Dec 18, 2020 via email

@esterjo
Copy link

esterjo commented Dec 18, 2020

Referring to my comment that @ivan-pi mentioned from #241

I should have also added that sorting may mean sorting the index array, and setting a "sorted" flag to true. No need to reallocate and reorder the data elements. However, it's probably best to just actually reallocate and sort the data, so that scanning through it in sorted order will not cause you do jump around memory randomly.

But yes @arjenmarkus, insertion and modification of elements would be inefficient and would require reallocating at least part of the data for each insertion. Anecdotally, in my day to day programming I typically use arrays of strings in R (usually for data science and statistical modeling). I rarely modify the elements of a character vector or insert elements after I have generated the array. Instead, querying, appending, sorting, getting unique values, and splitting are by far the most common operations. Given that, I will say that I'm a big fan of having two or more implementations optimized for different uses.

@esterjo
Copy link

esterjo commented Dec 18, 2020

So there may be a "sort_view" method subroutine to only sort the index array, and a "sort_inplace" subroutine to actually sort the character data

@arjenmarkus
Copy link
Member Author

arjenmarkus commented Dec 20, 2020 via email

@arjenmarkus
Copy link
Member Author

I just committed preliminary documentation and a more complete implementation. There are quite a few things that ought to be added (and possibly improved) but it is getting into shape, IMHO.

@epagone
Copy link

epagone commented Dec 31, 2020

Thanks @arjenmarkus! I hope I'll be able to test it a little bit at some point.

Concerning the sorting routines, should not these be a more generic part of stdlib and then applied to this case as only one of the possible applications? What do you think @jvdp1 @certik @milancurcic?

@jvdp1
Copy link
Member

jvdp1 commented Dec 31, 2020

Thank you @arjenmarkus for the implementation. I will try to have a look at it next week.

Concerning the sorting routines, should not these be a more generic part of stdlib and then applied to this case as only one of the possible applications? What do you think @jvdp1 @certik @milancurcic?

@epagone I think that sorting algoritghms are definetely in the scope of stdlib, and are needed at several places (e.g., for sparse matrices, computation of median,...). Therefore sorting procedures should have their own module. See some discussions around sorting algorithms and implementations in #98.

@ivan-pi
Copy link
Member

ivan-pi commented Jan 8, 2021

Thanks @arjenmarkus! I hope I'll be able to test it a little bit at some point.

Concerning the sorting routines, should not these be a more generic part of stdlib and then applied to this case as only one of the possible applications? What do you think @jvdp1 @certik @milancurcic?

As @jvdp1 has mentioned already, I think we all agree sorting routines are in the scope of stdlib. But I also think we should not shy away from providing specific sort methods which are private to the module in development (in this case list of strings). This could be either due to using a specialized algorithm more suitable than a generic one, or simply as a temporary building block.

@epagone
Copy link

epagone commented Jan 8, 2021

As @jvdp1 has mentioned already, I think we all agree sorting routines are in the scope of stdlib. But I also think we should not shy away from providing specific sort methods which are private to the module in development (in this case list of strings). This could be either due to using a specialized algorithm more suitable than a generic one, or simply as a temporary building block.

@ivan-pi I understand and agree with the rationale and the general principle. However, is this the case? I am far from an expert of sorting algorithms, but to my eyes the one in the relevant PR does not seem to be tailored to list of strings (except for the data type).

@ivan-pi
Copy link
Member

ivan-pi commented Jan 8, 2021

I am not an expert for sorting algorithms either. I've tried to compile a list of available routines in #98. I've had a look at the implementation from @arjen. If we were to generalize the sorting routine and move it out of the string list module, we would need to:

  1. Agree to introduce string_type as the suggested high-level (OO or functional) string type (similar to varying_string or the many string types defined by others; see String handling routines #69)
  2. Provide a generic argsort routine which takes an array of type(string_type) and returns the indices that would sort an array.

@epagone
Copy link

epagone commented Jan 8, 2021

I am not an expert for sorting algorithms either. I've tried to compile a list of available routines in #98. I've had a look at the implementation from @arjen. If we were to generalize the sorting routine and move it out of the string list module, we would need to:

1. Agree to introduce `string_type` as the suggested OO string type (similar to `varying_string` or the many string types defined by others; see #69)

I see this as an advantage to keep the ball rolling also on the front of stdlib string types. We can always roll back at a later time if, for some reason, this implementation is unsuitable: we are still in "experimental mode".

2. Provide a generic `argsort` routine which takes an array of  `type(string_type)` and returns the indices that would sort an array.

I don't know if this would be a lot of work but to me it does not seem super-complicated.

@arjenmarkus
Copy link
Member Author

I am making progress with the implementation of a better way to specify the indices and en passant with the various versions to be supported. This made me think of some design questions for which I would like some advice:

  • I want to avoid error conditions, so whatever index you give, beyond the end OR beyond the beginning it always succeeds (modulo memory issues). An alternative is to set an error flag when especially negative indices are passed in.
  • Handling negative indices (to the "left" of the head of the list) requires a choice: I first thought of handling that as an implicit shift to the right of all other entries, so that the most negative index becomes 1. and all the other entries simply shift. But that does not feel good. An alternative could be to simply extend the list to the left, so that
    call list%insert( 1, "A")
    call list%insert( -10, 'B' )
    write(*,*) list%get(-10)
    produces "B".
  • Inserting a string at position n will shift all strings with index >n to the right. But if we implement negative indices like above, it might make sense to have an operation to shift them to the left as well.

Of course, negative indices may be not so useful in the first place and I should not put too much effort in that feature. If so, the alternative of flagging an error is probably best.

What do you think?

@certik
Copy link
Member

certik commented Jan 25, 2021

This is the behavior on Python:

>>> a = [1, 2, 3]
>>> a[2]
3
>>> a[0]
1
>>> a[-1]
3
>>> a.insert(-10, 4)
>>> a
[4, 1, 2, 3]

It looks like it prepends it at the beginning? Notice that a negative index wraparounds from the right when you do a[-1].

@arjenmarkus
Copy link
Member Author

arjenmarkus commented Jan 27, 2021 via email

@awvwgk awvwgk added the topic: utilities containers, strings, files, OS/environment integration, unit testing, assertions, logging, ... label Feb 3, 2021
@aman-godara aman-godara mentioned this issue Jul 23, 2021
10 tasks
@aman-godara
Copy link
Member

This comment is aimed to discuss the design around which PR #470 is built upon:

Explanation of forward and backward indexes:
forward index: fidx
backward index: bidx

After inserting an element at fidx( x ), the added element will appear at index fidx( x ) i.e. at xth position when counted from left to right.
After inserting an element at bidx( y ), the added element will appear at index bidx( y ) i.e. yth position when counted from right to left.
inserting at fidx( 1 ) is equivalent to inserting at bidx( len + 1 )
inserting at bidx( 1 ) is equivalent to inserting at fidx( len + 1 )

Inserting an element at HEAD is like inserting at fidx( 1 ) and inserting an element at TAIL is like inserting at bidx( 1 )

backward index is like the forward index of the reversed form of the list.

I am looking at insert API with a different perspective i.e. instead of talking about whether it will insert an element before i-th index or after i-th index I rather say that after insertion is completed the newly_inserted_element will be found at i-th index. What are your opinions on this way of looking at insert( i )?


index type ... --- A B .. Y Z --- ...
integer ... --- 1 2 .. len - 1 len --- ...
forward ... fidx( 0 ) fidx( 1 ) fidx( 2 ) .. fidx( len - 1 ) fidx( len ) fidx( len + 1 ) ...
backward ... bidx( len + 1 ) bidx( len ) bidx( len - 1 ) .. bidx( 2 ) bidx( 1 ) bidx( 0 ) ...

If you want to insert an element new_element before integer index q use the forward index of integer index q. If you want to insert an element new_element after integer index q use the backward index of integer index q.

Currently there is NO function in the PR which converts integer index to forward index or backward index [not to be confused with fidx and bidx of PR]

@awvwgk awvwgk added topic: strings String processing and removed topic: utilities containers, strings, files, OS/environment integration, unit testing, assertions, logging, ... labels Sep 18, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
topic: strings String processing
Projects
None yet
Development

No branches or pull requests

9 participants