Description
This is related to Issue #3 in that both solutions require that we are able to read multiple buffers of information without consuming them.
Terminating Delimiters
Currently we search for terminating delimiters using the following algorithm:
buffer = first_buffer
while (buffer does not contain delimiter):
    append buffer to result string
    consume buffer.size()
    buffer = next_buffer
append buffer[0..match.start()] to result string
consume match.start()
Note that in the first test case below, the delimiter is "  " (two spaces). The first buffer ends with one space and the second buffer begins with one space. Together they form a delimiter, but that delimiter is not contained in either buffer on its own, so the entirety of both buffers is read.
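The failure can be reproduced with the standard library alone. The following is a minimal sketch, not the Scanner's actual code: `find`, `naive_next`, and `naive_on_foo_bar` are hypothetical helpers, and the delimiter is assumed to be a non-empty literal byte string rather than a regex.

```rust
use std::io::{self, BufRead, BufReader};

/// Position of the first occurrence of `needle` in `haystack`.
/// Assumes `needle` is non-empty (`windows(0)` would panic).
fn find(haystack: &[u8], needle: &[u8]) -> Option<usize> {
    haystack.windows(needle.len()).position(|w| w == needle)
}

/// Hypothetical stand-in for the current algorithm:
/// every buffer is searched in isolation.
fn naive_next<R: BufRead>(reader: &mut R, delim: &[u8]) -> io::Result<Vec<u8>> {
    let mut result = Vec::new();
    loop {
        let buf = reader.fill_buf()?;
        if buf.is_empty() {
            return Ok(result); // EOF: no delimiter was ever seen
        }
        if let Some(pos) = find(buf, delim) {
            // Delimiter found inside this buffer: keep only the text before it.
            result.extend_from_slice(&buf[..pos]);
            reader.consume(pos);
            return Ok(result);
        }
        // No delimiter in this buffer: append all of it and move on.
        let len = buf.len();
        result.extend_from_slice(buf);
        reader.consume(len);
    }
}

/// Runs the naive search over "foo  bar" (two spaces) with the given capacity.
fn naive_on_foo_bar(capacity: usize) -> Vec<u8> {
    let mut br = BufReader::with_capacity(capacity, &b"foo  bar"[..]);
    naive_next(&mut br, b"  ").unwrap()
}
```

With a capacity of 4 the buffers are "foo " and " bar", so `naive_on_foo_bar(4)` returns the whole input. With a capacity of 8 the delimiter sits inside a single buffer and the result is "foo".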
The best algorithm I can come up with, which follows, depends on rust-lang/regex#425; otherwise we must write our own regex engine and implement hits_end() -> bool, which returns true if the DFA is not in a dead state after we consume the last input character. A true result would imply that the buffer ends with a prefix of a word in that regex's language, so consuming additional input may or may not complete a delimiter.
buffer = first_buffer
loop:
    while (buffer does not contain delimiter
           AND buffer does not end with delimiter prefix):
        append buffer to result string
        consume buffer.size()
        buffer = next_buffer
    while (buffer ends in delimiter prefix):
        buffer += next_buffer
    if (buffer contains delimiter):
        append buffer[0..match.start()] to result string
        consume match.start()
        break
    # note that if the combined buffer didn't end up containing a delim,
    # it will be appended to the result string on the next iteration of the loop at the top
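For a literal delimiter, the "ends with delimiter prefix" check needs no regex support, which makes the loop above easy to sketch. This is a simplified illustration under stated assumptions: the delimiter is a non-empty literal, the buffers arrive as a slice of chunks, and `find`, `ends_with_delim_prefix`, and `next_token` are hypothetical names. For a regex delimiter the prefix check is exactly what hits_end() would have to supply.

```rust
/// Position of the first occurrence of `needle` in `haystack`.
/// Assumes `needle` is non-empty.
fn find(haystack: &[u8], needle: &[u8]) -> Option<usize> {
    haystack.windows(needle.len()).position(|w| w == needle)
}

/// True if `buf` ends with a non-empty proper prefix of `delim`,
/// i.e. more input could still complete a delimiter.
fn ends_with_delim_prefix(buf: &[u8], delim: &[u8]) -> bool {
    (1..delim.len()).any(|k| buf.ends_with(&delim[..k]))
}

/// Returns the text before the first delimiter, or all input if there is none.
fn next_token(chunks: &[&[u8]], delim: &[u8]) -> Vec<u8> {
    let mut result = Vec::new();
    let mut buffer: Vec<u8> = Vec::new();
    for chunk in chunks {
        // Corresponds to `buffer = next_buffer` or `buffer += next_buffer`,
        // depending on whether the previous buffer was flushed.
        buffer.extend_from_slice(chunk);
        if let Some(pos) = find(&buffer, delim) {
            result.extend_from_slice(&buffer[..pos]);
            return result;
        }
        if !ends_with_delim_prefix(&buffer, delim) {
            // Nothing here can be part of a delimiter: flush to the result.
            result.append(&mut buffer);
        }
        // Otherwise keep the buffer so the next chunk is joined to it.
    }
    // EOF: whatever was held back is ordinary text after all.
    result.append(&mut buffer);
    result
}
```

Called with the chunks "foo " and " bar" and the delimiter "  ", this returns "foo", because the first chunk is held back when it ends in a single space. The sketch stops at the first token and does not track the bytes after the delimiter.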
Such functionality would also help us with the issue of parsing arbitrarily long starting delimiters. However, since there is no simple way for us to get that functionality, we must consider an alternative. I propose the following:
while (result string does not contain delimiter):
    append buffer to result string
    buffer = next_buffer
consume match.start()
return result_string[0..match.start()]
Note that this requires reading multiple buffers without consuming them. By default, BufRead.fill_buf() will happily return the same buffer over and over again if you do not consume() between calls.
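The fill_buf() behavior is easy to observe with a small buffer. This is a minimal sketch using only the standard library; `fill_buf_three_times` is a hypothetical helper written for this illustration.

```rust
use std::io::{self, BufRead, BufReader};

/// Calls `fill_buf()` twice without consuming, then once more after
/// consuming the first buffer, and returns a copy of each view.
fn fill_buf_three_times(
    data: &[u8],
    capacity: usize,
) -> io::Result<(Vec<u8>, Vec<u8>, Vec<u8>)> {
    let mut br = BufReader::with_capacity(capacity, data);
    let first = br.fill_buf()?.to_vec();
    // No consume() in between: the same bytes come back.
    let second = br.fill_buf()?.to_vec();
    // Only after consuming does the reader move on to new input.
    br.consume(first.len());
    let third = br.fill_buf()?.to_vec();
    Ok((first, second, third))
}
```

For the input "foo  bar" and a capacity of 4, the first two views are both "foo " and the third is " bar". There is no call that returns " bar" while "foo " is still unconsumed, which is why the proposal needs some other way to hold on to earlier buffers.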
Precedent Delimiters
This has all of the same problems as the above, with one added caveat: without being able to check for delimiter prefixes, it is impossible to tell the difference between a string that has no precedent delimiters and a string that has a very long precedent delimiter, except by reading to EOF.
This issue technically still exists in the case of terminating delimiters, but there it causes no efficiency hit, because reading to EOF when there is no terminating delimiter is actually the desired behavior.
Test Cases
/// This test will fail if we do not solve the above problem in a way that
/// preserves the tail of the original buffer, because in this test case the
/// terminating delimiter begins within the first buffer and
/// ends within the second.
#[test]
fn buffer_ends_within_end_delim() {
    let string: &[u8] = b"foo  bar";
    let mut br = BufReader::with_capacity(4, string);
    let mut test = Scanner::new(&mut br);
    test.set_delim_str("  ");
    assert_eq!(test.next(), Some(String::from("foo")));
}
/// This test will fail if we cannot detect partial matches of the delimiter
/// when skipping over prefixed delimiters. Because the buffer size is 4, it
/// will read "aaaa", which is not in the language of /a+b/, however the
/// automaton is not in a dead state either: reading a "b" would put us in
/// an accepting state, thus we must read more input to know if the regex is
/// satisfied. Reading an additional character will result in "aaaab", which
/// is a valid delimiter in this language and should therefore be skipped.
#[test]
fn buffer_ends_within_start_delim() {
    let string: &[u8] = b"aaaabfoo";
    let mut br = BufReader::with_capacity(4, string);
    let mut test = Scanner::new(&mut br);
    test.set_delim(Regex::new(r"a+b").unwrap());
    assert_eq!(test.next(), Some(String::from("foo")));
}
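The dead-state reasoning in the second test's comment can be made concrete with a hand-written automaton for /a+b/, matched from the start of the input. This is only an illustration: `State`, `step`, `run`, and this `hits_end` are hypothetical names written for the sketch and are not taken from the regex crate.

```rust
/// States of a hand-written DFA for /a+b/.
#[derive(Clone, Copy, Debug, PartialEq)]
enum State {
    Start,  // nothing read yet
    InAs,   // one or more 'a' read, waiting for more 'a' or a 'b'
    Accept, // read a+b exactly
    Dead,   // no continuation can ever match
}

/// One transition of the automaton.
fn step(state: State, byte: u8) -> State {
    match (state, byte) {
        (State::Start, b'a') | (State::InAs, b'a') => State::InAs,
        (State::InAs, b'b') => State::Accept,
        _ => State::Dead,
    }
}

/// Runs the automaton over the whole input and returns the final state.
fn run(input: &[u8]) -> State {
    input.iter().fold(State::Start, |state, &byte| step(state, byte))
}

/// Sketch of the proposed check: true if the automaton is not in a dead
/// state after the last input byte.
fn hits_end(input: &[u8]) -> bool {
    run(input) != State::Dead
}
```

After the first buffer "aaaa" the automaton is in `InAs`, so `hits_end` is true and more input is needed. After "aaaab" it is in `Accept`. For "foo" it is dead after the first byte, so the scanner would know immediately that no precedent delimiter is present.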