Fails to recognize delimiters that fall on the boundary between two buffers #4

Open
@hxtk

This is related to Issue #3 in that both solutions require that we are able to read multiple buffers of information without consuming them.

Terminating Delimiters

Currently we search for terminating delimiters using the following algorithm:

buffer = first_buffer
while (buffer does not contain delimiter):
    append buffer to result string
    consume buffer.size()
    buffer = next_buffer
append buffer[0..match.start()] to result string
consume match.start()

Note that in the first test case below, the delimiter is "  " (two spaces). The first buffer ends with one space and the second buffer begins with one space. Together they form a delimiter, but the delimiter is not contained in either buffer alone, so the entirety of both buffers is read.
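
To make the failure concrete, here is a minimal sketch of the per-buffer search over plain byte slices. The function name naive_token is hypothetical, the "buffers" are simulated with chunks() rather than a real BufRead, and the delimiter is assumed to be a non-empty literal rather than a regex.

```rust
/// Naive search: look for `delim` inside each fixed-size buffer independently.
/// Returns the bytes accumulated before the first delimiter found this way,
/// or the whole input when no single buffer contains the delimiter.
/// Assumes `delim` is non-empty and `buf_size` is non-zero.
fn naive_token(input: &[u8], buf_size: usize, delim: &[u8]) -> Vec<u8> {
    let mut result = Vec::new();
    for buffer in input.chunks(buf_size) {
        // Position of the delimiter within this buffer alone, if any.
        let hit = buffer.windows(delim.len()).position(|w| w == delim);
        match hit {
            Some(start) => {
                result.extend_from_slice(&buffer[..start]);
                return result;
            }
            None => result.extend_from_slice(buffer),
        }
    }
    result
}

fn main() {
    // The buffers are "foo " and " bar"; neither contains two consecutive
    // spaces, so the delimiter is missed and everything is returned.
    assert_eq!(naive_token(b"foo  bar", 4, b"  "), b"foo  bar".to_vec());
    // With a buffer large enough to hold the delimiter whole, it is found.
    assert_eq!(naive_token(b"foo  bar", 8, b"  "), b"foo".to_vec());
}
```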

The best algorithm I can come up with, which follows, depends on rust-lang/regex#425. Otherwise we must write our own regex engine and implement hits_end() -> bool, which returns true if the DFA is not in a dead state after consuming the last input character. A true result means the buffer ends with a prefix of a word in that regex's language, so consuming additional input may or may not complete a delimiter.
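
As an illustration of what "not in a dead state" means, here is a hand-written DFA for the single regex /a+b/ used in the second test case. This is only a sketch: State, step, and this hits_end are hypothetical names, the match is anchored at the start of the buffer, and none of it is the regex crate's API.

```rust
/// States of a hand-written DFA for /a+b/ (illustration only).
#[derive(Clone, Copy, Debug, PartialEq)]
enum State {
    Start,  // nothing read yet
    InAs,   // one or more 'a' read, still waiting for 'b'
    Accept, // read a complete a+b
    Dead,   // no continuation of the input can match
}

/// One DFA transition.
fn step(state: State, byte: u8) -> State {
    match (state, byte) {
        (State::Start, b'a') | (State::InAs, b'a') => State::InAs,
        (State::InAs, b'b') => State::Accept,
        _ => State::Dead,
    }
}

/// Hypothetical stand-in for hits_end(): run the DFA over the whole buffer
/// and report whether it is still alive (not dead) after the last byte.
fn hits_end(buffer: &[u8]) -> bool {
    let end = buffer.iter().fold(State::Start, |s, &b| step(s, b));
    end != State::Dead
}

fn main() {
    // "aaaa" is not in the language, but one more 'b' would complete a match.
    assert!(hits_end(b"aaaa"));
    // "foo" kills the automaton on the first byte, so no more input is needed.
    assert!(!hits_end(b"foo"));
}
```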

buffer = first_buffer
loop:
    while (buffer does not contain delimiter
            AND buffer does not end with delimiter prefix):
        append buffer to result string
        consume buffer.size()
        buffer = next_buffer
    while (buffer ends in delimiter prefix):
        buffer += next_buffer
    if (buffer contains delimiter):
        append buffer[0..match.start()] to result string
        consume match.start()
        break
    # Note: if the combined buffer does not contain a delimiter, the first
    # inner loop appends it to the result string on the next iteration.
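
For literal (non-regex) delimiters, the "ends with delimiter prefix" check needs no regex support. The sketch below is a simplified variant of the pseudocode above, not a drop-in implementation: it carries over only the tail of the buffer that could begin a delimiter. The names trailing_prefix_len and next_token are hypothetical, buffers are simulated with chunks(), and the delimiter is assumed to be non-empty.

```rust
/// Length of the longest proper prefix of `delim` that is a suffix of `buf`.
/// A non-zero result means the buffer ends inside a possible delimiter.
fn trailing_prefix_len(buf: &[u8], delim: &[u8]) -> usize {
    let max = delim.len().saturating_sub(1).min(buf.len());
    (1..=max)
        .rev()
        .find(|&n| buf.ends_with(&delim[..n]))
        .unwrap_or(0)
}

/// Returns the bytes before the first occurrence of `delim`, reading `input`
/// in `buf_size` chunks. Assumes `delim` is non-empty and `buf_size` non-zero.
fn next_token(input: &[u8], buf_size: usize, delim: &[u8]) -> Vec<u8> {
    let mut result = Vec::new();
    let mut buffer: Vec<u8> = Vec::new(); // carried-over tail plus new chunk
    for chunk in input.chunks(buf_size) {
        buffer.extend_from_slice(chunk);
        if let Some(start) = buffer.windows(delim.len()).position(|w| w == delim) {
            result.extend_from_slice(&buffer[..start]);
            return result;
        }
        // Flush everything except the tail that might begin a delimiter.
        let keep = trailing_prefix_len(&buffer, delim);
        let tail = buffer.split_off(buffer.len() - keep);
        result.extend_from_slice(&buffer);
        buffer = tail;
    }
    // EOF: a dangling partial delimiter is ordinary data.
    result.extend_from_slice(&buffer);
    result
}

fn main() {
    // The delimiter straddles the boundary between "foo " and " bar".
    assert_eq!(next_token(b"foo  bar", 4, b"  "), b"foo".to_vec());
    // A single space at the boundary is held back, then released as data.
    assert_eq!(next_token(b"foo bar", 4, b"  "), b"foo bar".to_vec());
}
```

Holding back at most delim.len() - 1 bytes keeps memory bounded for literals. A regex delimiter has no such bound, which is why the general case still needs something like hits_end().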

Such functionality would also help us parse arbitrarily long starting delimiters. However, since there is no simple way to get that functionality today, we must consider an alternative. I propose the following:

buffer = first_buffer
while (result string does not contain delimiter):
    append buffer to result string
    buffer = next_buffer
consume match.start()
return result_string[0..match.start()]

Note that this requires reading multiple buffers without consuming them. By default, BufRead::fill_buf() will happily return the same buffer over and over again if you do not call consume() between calls.
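
The fill_buf() behavior can be checked with the standard library alone. In this small sketch the helper name peek_twice_then_advance is hypothetical; it uses the same 4-byte BufReader setup as the test cases.

```rust
use std::io::{self, BufRead, BufReader};

/// Calls fill_buf() twice without consuming, then consumes the buffer and
/// calls fill_buf() once more. Returns copies of the three views.
fn peek_twice_then_advance(
    data: &[u8],
    capacity: usize,
) -> io::Result<(Vec<u8>, Vec<u8>, Vec<u8>)> {
    let mut br = BufReader::with_capacity(capacity, data);

    // Without consume(), repeated fill_buf() calls yield the same bytes.
    let first = br.fill_buf()?.to_vec();
    let again = br.fill_buf()?.to_vec();

    // Only after the buffer is fully consumed does the reader refill it.
    br.consume(first.len());
    let next = br.fill_buf()?.to_vec();

    Ok((first, again, next))
}

fn main() -> io::Result<()> {
    let (first, again, next) = peek_twice_then_advance(b"foo  bar", 4)?;
    assert_eq!(first, b"foo ".to_vec());
    assert_eq!(again, first);
    assert_eq!(next, b" bar".to_vec());
    Ok(())
}
```

So seeing the second buffer through this interface means consuming the first, which in turn means any bytes we may still need have to be copied somewhere we own.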

Precedent Delimiters

This has all of the same problems as the terminating case above, with one added caveat: without being able to check for delimiter prefixes, it is impossible to tell the difference between a string that has no precedent delimiters and a string that has a very long precedent delimiter, except by reading to EOF.

This issue technically still exists for terminating delimiters, but there it costs nothing, because reading to EOF when there is no terminating delimiter is the desired behavior.

Test Cases

/// This test will fail if we do not solve the above problem in a way that
/// preserves the tail of the original buffer, because in this test case the
/// terminating delimiter begins within the first buffer size and
/// ends within the second.
#[test]
fn buffer_ends_within_end_delim() {
    let string: &[u8] = b"foo  bar";
    let mut br = BufReader::with_capacity(4, string);
    let mut test = Scanner::new(&mut br);
    test.set_delim_str("  ");

    assert_eq!(test.next(), Some(String::from("foo")));
}

/// This test will fail if we cannot detect partial matches of the delimiter
/// when skipping over prefixed delimiters. Because the buffer size is 4, it
/// will read "aaaa", which is not in the language of /a+b/, however the
/// automaton is not in a dead state either: reading a "b" would put us in
/// an accepting state, thus we must read more input to know if the regex is
/// satisfied. Reading an additional character will result in "aaaab", which
/// is a valid delimiter in this language and should therefore be skipped.
#[test]
fn buffer_ends_within_start_delim() {
    let string: &[u8] = b"aaaabfoo";
    let mut br = BufReader::with_capacity(4, string);
    let mut test = Scanner::new(&mut br);
    test.set_delim(Regex::new(r"a+b").unwrap());

    assert_eq!(test.next(), Some(String::from("foo")));
}

Labels: bug, help wanted
