string operations are quite inefficient #3294

Vincent-Belliard · 2012-08-28T15:38:42Z

A week ago, I discovered Rust. The project is really interesting and is close to my vision of computer languages. I really love the enum/match mechanism and the memory model with shared and unique boxes. Everything looks consistent but the strings. The mechanism with three kinds of strings is not obvious to use but most of all, the generated llvm is quite inefficient.

For example, if you make the following example:

struct Example {
    let mut a: ~str;
    new() {
        self.a = ~"";
    }
    fn set_a(new_a: ~str) {
        self.a = copy new_a;
    }
    fn cmp_a(ref_a: ~str) -> bool {
        self.a == ref_a
    }
}

The constructor will make three calls (upcall_str_new_uniq, glue_grop and llvm.memmove).
The set_a function will make three calls (exchange_malloc, llvm.memmove and glue_drop).
The cmp_a function just do one mandatory call (upcall_cmp_type) but if you compare with a constant, you add a call to upcall_str_new_uniq and a call to glue_free.

Strings operations are very common and a good compiler should optimize theses operations so Rust must do that.

For several years now, I make a language which share a lot of things with Rust. I had the same strings problems that Rust have and I think I found something quite efficient.

I use reference counted objects to hold strings (the objects have a links_count, a string size (in bytes), a string size (in characters) and the characters (in utf8). A static string has a links count of -1 which is never increment nor decrement. The strings are shared between thread (but, like Rust, I don't share objects between threads). I use the cmpxchg llvm function to increment or decrement links to be thread safe.

With the same example:

class Example
    {
    public field a: String: default_value = "" ;
    public constructor(self):
        {
        }
    public method set_a(self, new_a: String):
        {
        a = new_a ;
        }
    public method cmp_a(self, ref_a: String) returns Boolean:
        {
        if (a == ref_a) return with true;
        return with false;
        }
    }

I have the following LLVM code:

define void @constructor__Example(%class.Example* %_self)
    {
    %1 = getelementptr inbounds %class.Example* %_self, i32 0, i32 0
    store %string_content* bitcast ({ i64, i64, i64, [1 x i8] }* @str_8 to %string_content*), %string_content** %1
    ret void
    }
define void @__Example.set__a.String(%class.Example* %_self, %string_content* %_new__a)
    {
    %1 = getelementptr inbounds %class.Example* %_self, i32 0, i32 0
    %2 = load %string_content** %1
    %3 = getelementptr inbounds %string_content* %_new__a, i32 0, i32 0
    %4 = load i64* %3
    %5 = icmp eq i64 %4, -1
 ; if static don't increment
    br i1 %5, label %11, label %6
; do the increment until it has not be smashed by another thread
; 6:
    %7 = phi i64 [ %4, %0 ], [ %9, %6 ]
    %8 = add i64 %7, 1
    %9 = cmpxchg i64* %3, i64 %7, i64 %8 monotonic
    %10 = icmp eq i64 %7, %9
    br i1 %10, label %11, label %6
; 11:
    %12 = load i64* %3
    %13 = getelementptr inbounds %string_content* %2, i32 0, i32 0
    %14 = load i64* %13
    %15 = icmp eq i64 %14, -1
; if static don't decrement
    br i1 %15, label %24, label %16
 ; do the decrement until is has not be smashed by another thread
; 16:
    %17 = phi i64 [ %14, %11 ], [ %19, %16 ]
    %18 = sub i64 %17, 1
    %19 = cmpxchg i64* %13, i64 %17, i64 %18 monotonic
    %20 = icmp eq i64 %17, %19
    br i1 %20, label %21, label %16
; 21:
    %22 = icmp eq i64 %18, 0
; if we assigned zero to links count, we free the string
    br i1 %22, label %23, label %24
; 23:
    call void @free_string_content(%string_content* %2)
    br label %24
; 24:
    store %string_content* %_new__a, %string_content** %1
    br label %25
; 25:
    ret void
    }
define i1 @__Example.cmp__a.String(%class.Example* %_self, %string_content* %_ref__a)
    {
    %1 = getelementptr inbounds %class.Example* %_self, i32 0, i32 0
    %2 = load %string_content** %1
    %3 = call i1 @equals_string_contents(%string_content* %2, %string_content* %_ref__a)
    br i1 %3, label %4, label %5
; 4:
    ret i1 1
; 5:
    br label %6
; 6:
    ret i1 0
    }

In the constructor, the field is initialized with a static string. The compiler makes an optimization and doesn't generate the increment (which does nothing for a static string).

I think that the way I manage strings with Entity could be very useful for Rust. If it would be decided that it must be done for Rust, I would do the development (or provide some help) with pleasure.

You can have a look at the Entity language (http://code.google.com/p/entity-language/). The final compiler with llvm generation is very young and only a few things are managed by this compiler.

The text was updated successfully, but these errors were encountered:

brson · 2012-08-28T17:08:42Z

Thanks for the analysis and description of how Entity handles strings.

String efficiency has long been a problem in Rust because they have been unique types, ~str. Unique things tend to get copied often, but are safe to send between tasks. Rust now has a garbage collected (actually reference counted) string type called @str that should behave more like you expect, but the libraries haven't caught up to the language, so @str is rarely used.

I suggest that the fix for this problem is to modify core::str to support @str in more places. In particular, none of the functions that create new strings can yet create @str. @nikomatsakis has some ideas for how to improve core here.

Atomic reference counting in strings doesn't fit so well with Rust's memory model, and shouldn't be needed.

pcwalton · 2012-08-28T17:52:06Z

You should use &str, @str, or core::arc<~str> here. Closing.

change new line point in the case of no args

Automatic Rustup

Related change: - rust-lang@24e41f1d13 Resolves rust-lang#3294 By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 and MIT licenses.

pcwalton closed this as completed Aug 28, 2012

bors pushed a commit to rust-lang-ci/rust that referenced this issue May 15, 2021

Merge pull request rust-lang#3294 from rchaser53/issue-3278

203e6d2

change new line point in the case of no args

RalfJung pushed a commit to RalfJung/rust that referenced this issue Feb 17, 2024

Auto merge of rust-lang#3294 - rust-lang:rustup-2024-02-11, r=saethlin

a40a100

Automatic Rustup

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

string operations are quite inefficient #3294

string operations are quite inefficient #3294

Vincent-Belliard commented Aug 28, 2012

brson commented Aug 28, 2012

pcwalton commented Aug 28, 2012

string operations are quite inefficient #3294

string operations are quite inefficient #3294

Comments

Vincent-Belliard commented Aug 28, 2012

brson commented Aug 28, 2012

pcwalton commented Aug 28, 2012