Skip to content

string operations are quite inefficient #3294

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
Vincent-Belliard opened this issue Aug 28, 2012 · 2 comments
Closed

string operations are quite inefficient #3294

Vincent-Belliard opened this issue Aug 28, 2012 · 2 comments

Comments

@Vincent-Belliard
Copy link
Contributor

A week ago, I discovered Rust. The project is really interesting and is close to my vision of computer languages. I really love the enum/match mechanism and the memory model with shared and unique boxes. Everything looks consistent but the strings. The mechanism with three kinds of strings is not obvious to use but most of all, the generated llvm is quite inefficient.

For example, if you make the following example:

struct Example {
    let mut a: ~str;
    new() {
        self.a = ~"";
    }
    fn set_a(new_a: ~str) {
        self.a = copy new_a;
    }
    fn cmp_a(ref_a: ~str) -> bool {
        self.a == ref_a
    }
}

The constructor will make three calls (upcall_str_new_uniq, glue_grop and llvm.memmove).
The set_a function will make three calls (exchange_malloc, llvm.memmove and glue_drop).
The cmp_a function just do one mandatory call (upcall_cmp_type) but if you compare with a constant, you add a call to upcall_str_new_uniq and a call to glue_free.

Strings operations are very common and a good compiler should optimize theses operations so Rust must do that.

For several years now, I make a language which share a lot of things with Rust. I had the same strings problems that Rust have and I think I found something quite efficient.

I use reference counted objects to hold strings (the objects have a links_count, a string size (in bytes), a string size (in characters) and the characters (in utf8). A static string has a links count of -1 which is never increment nor decrement. The strings are shared between thread (but, like Rust, I don't share objects between threads). I use the cmpxchg llvm function to increment or decrement links to be thread safe.

With the same example:

class Example
    {
    public field a: String: default_value = "" ;
    public constructor(self):
        {
        }
    public method set_a(self, new_a: String):
        {
        a = new_a ;
        }
    public method cmp_a(self, ref_a: String) returns Boolean:
        {
        if (a == ref_a) return with true;
        return with false;
        }
    }

I have the following LLVM code:

define void @constructor__Example(%class.Example* %_self)
    {
    %1 = getelementptr inbounds %class.Example* %_self, i32 0, i32 0
    store %string_content* bitcast ({ i64, i64, i64, [1 x i8] }* @str_8 to %string_content*), %string_content** %1
    ret void
    }
define void @__Example.set__a.String(%class.Example* %_self, %string_content* %_new__a)
    {
    %1 = getelementptr inbounds %class.Example* %_self, i32 0, i32 0
    %2 = load %string_content** %1
    %3 = getelementptr inbounds %string_content* %_new__a, i32 0, i32 0
    %4 = load i64* %3
    %5 = icmp eq i64 %4, -1
 ; if static don't increment
    br i1 %5, label %11, label %6
; do the increment until it has not be smashed by another thread
; 6:
    %7 = phi i64 [ %4, %0 ], [ %9, %6 ]
    %8 = add i64 %7, 1
    %9 = cmpxchg i64* %3, i64 %7, i64 %8 monotonic
    %10 = icmp eq i64 %7, %9
    br i1 %10, label %11, label %6
; 11:
    %12 = load i64* %3
    %13 = getelementptr inbounds %string_content* %2, i32 0, i32 0
    %14 = load i64* %13
    %15 = icmp eq i64 %14, -1
; if static don't decrement
    br i1 %15, label %24, label %16
 ; do the decrement until is has not be smashed by another thread
; 16:
    %17 = phi i64 [ %14, %11 ], [ %19, %16 ]
    %18 = sub i64 %17, 1
    %19 = cmpxchg i64* %13, i64 %17, i64 %18 monotonic
    %20 = icmp eq i64 %17, %19
    br i1 %20, label %21, label %16
; 21:
    %22 = icmp eq i64 %18, 0
; if we assigned zero to links count, we free the string
    br i1 %22, label %23, label %24
; 23:
    call void @free_string_content(%string_content* %2)
    br label %24
; 24:
    store %string_content* %_new__a, %string_content** %1
    br label %25
; 25:
    ret void
    }
define i1 @__Example.cmp__a.String(%class.Example* %_self, %string_content* %_ref__a)
    {
    %1 = getelementptr inbounds %class.Example* %_self, i32 0, i32 0
    %2 = load %string_content** %1
    %3 = call i1 @equals_string_contents(%string_content* %2, %string_content* %_ref__a)
    br i1 %3, label %4, label %5
; 4:
    ret i1 1
; 5:
    br label %6
; 6:
    ret i1 0
    }

In the constructor, the field is initialized with a static string. The compiler makes an optimization and doesn't generate the increment (which does nothing for a static string).

I think that the way I manage strings with Entity could be very useful for Rust. If it would be decided that it must be done for Rust, I would do the development (or provide some help) with pleasure.

You can have a look at the Entity language (http://code.google.com/p/entity-language/). The final compiler with llvm generation is very young and only a few things are managed by this compiler.

@brson
Copy link
Contributor

brson commented Aug 28, 2012

Thanks for the analysis and description of how Entity handles strings.

String efficiency has long been a problem in Rust because they have been unique types, ~str. Unique things tend to get copied often, but are safe to send between tasks. Rust now has a garbage collected (actually reference counted) string type called @str that should behave more like you expect, but the libraries haven't caught up to the language, so @str is rarely used.

I suggest that the fix for this problem is to modify core::str to support @str in more places. In particular, none of the functions that create new strings can yet create @str. @nikomatsakis has some ideas for how to improve core here.

Atomic reference counting in strings doesn't fit so well with Rust's memory model, and shouldn't be needed.

@pcwalton
Copy link
Contributor

You should use &str, @str, or core::arc<~str> here. Closing.

bors pushed a commit to rust-lang-ci/rust that referenced this issue May 15, 2021
change new line point in the case of no args
RalfJung pushed a commit to RalfJung/rust that referenced this issue Feb 17, 2024
jaisnan added a commit to jaisnan/rust-dev that referenced this issue Jul 29, 2024
Related change:
- rust-lang@24e41f1d13


Resolves rust-lang#3294

By submitting this pull request, I confirm that my contribution is made
under the terms of the Apache 2.0 and MIT licenses.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants