Commit 8b662c7

tools/content: Support systematically surveying unimplemented content features.
We added 2 scripts and a wrapper for them both.

- fetch_messages.dart, the script that fetches messages from a given
  Zulip server. It does not depend on Flutter or other involved Zulip
  Flutter packages, so it can run without Flutter. It is meant to be
  run first to produce the corpora needed for surveying the
  unimplemented features.

  The fetched messages are stored in JSON Lines format, where each
  entry is a JSON object containing the message ID and the rendered
  HTML content. The script stores output in a separate file for each
  server, because message IDs are not unique across servers.

- unimplemented_features_test.dart, a test that goes over all messages
  collected, parses them with the content parser, and reports the
  unimplemented features it discovers. This is implemented as a test
  mainly because of its dependency on the content parser, which
  depends on the Flutter engine (and `flutter test` conveniently sets
  up a test device).

We mostly avoid prints (https://dart.dev/tools/linter-rules/avoid_print)
in both scripts. We wouldn't lose much by disabling this lint rule for
them, because they are supposed to be CLI programs after all, but the
rule (potentially) helps with reducing developer inclination to be
verbose.

See the comments in the scripts for more details on the implementations.

=====

The main benefit of having the wrapper script in front of the Dart code
is that we can provide a more intuitive interface, consistent with
other tools, for fetching message corpora and/or running the check for
unimplemented features.

Very rarely, you might want to use fetch_messages.dart directly, for
example to use the `fetch-newer` flag to update an existing corpus
file. If we find it helpful, the flag can be added to check-features
as well, but we are skipping that for now.

The script is intended to be run manually, not as a part of CI,
because it is very slow, and it relies on some out-of-tree files like
API configs (zuliprc files) and big dumps of chat history.

For the most part, we intend to keep the detailed explanations only in
the underlying scripts, close to the implementation, and selectively
repeat some of the helpful information in the wrapper.

The wrapper also repeats some easy checks for options, so that we can
produce nicer error messages for some common errors (like a missing
zuliprc for `fetch`).

Fixes: zulip#190
1 parent e7791f2 commit 8b662c7

4 files changed, +519 -0 lines changed

tools/content/check-features

+108
@@ -0,0 +1,108 @@
+#!/usr/bin/env bash
+set -euo pipefail
+
+default_steps=(fetch check)
+
+usage() {
+    cat <<EOF
+usage: tools/content/check-features [OPTION]... [STEP]... <CORPUS_DIR>
+
+Fetch messages from a Zulip server and check the content parser for
+unimplemented features.
+
+By default, run the following steps:
+  ${default_steps[*]}
+
+CORPUS_DIR is required. It is the directory to store or read corpus files.
+This directory will be created if it does not exist already.
+
+The steps are:
+
+  fetch      Fetch the corpus needed from the server specified via the config
+             file into \`CORPUS_DIR\` incrementally. This step can take a long
+             time on servers with a lot of public messages when starting from
+             scratch.
+             This wraps around tools/content/fetch_messages.dart.
+
+  check      Check for unimplemented content parser features. This requires
+             the corpus directory \`CORPUS_DIR\` to contain at least one corpus
+             file.
+             This wraps around tools/content/unimplemented_features_test.dart.
+
+Options:
+
+  --config <FILE>
+             A zuliprc file with identity information including email, API key
+             and the Zulip server URL to fetch the messages from.
+             Mandatory if running step \`fetch\`. To get the file, see
+             https://zulip.com/api/configuring-python-bindings#download-a-zuliprc-file.
+
+  --verbose  Print more details about everything, especially when checking for
+             unsupported features.
+
+  --help     Show this help message.
+EOF
+}
+
+opt_corpus_dir=
+opt_zuliprc=
+opt_verbose=
+opt_steps=()
+while (( $# )); do
+    case "$1" in
+        fetch|check) opt_steps+=("$1"); shift;;
+        --config) shift; opt_zuliprc="$1"; shift;;
+        --verbose) opt_verbose=1; shift;;
+        --help) usage; exit 0;;
+        *)
+            if [ -n "$opt_corpus_dir" ]; then
+                # Forbid passing multiple corpus directories.
+                usage >&2; exit 2
+            fi
+            opt_corpus_dir="$1"; shift;;
+    esac
+done
+
+if [ -z "$opt_corpus_dir" ]; then
+    echo >&2 "Error: Positional argument CORPUS_DIR is required."
+    echo >&2
+    usage >&2; exit 2
+fi
+
+if (( ! "${#opt_steps[@]}" )); then
+    opt_steps=( "${default_steps[@]}" )
+fi
+
+run_fetch() {
+    if [ -z "$opt_zuliprc" ]; then
+        echo >&2 "Error: Option \`--config\` is required for step \`fetch\`."
+        echo >&2
+        usage >&2; exit 2
+    fi
+
+    if [ -n "$opt_verbose" ]; then
+        echo "Fetching all public messages using API config \"$opt_zuliprc\"." \
+            " This can take a long time."
+    fi
+    # This may have a side effect of creating or modifying the corpus
+    # file named after the Zulip server's host name.
+    dart tools/content/fetch_messages.dart --config-file "$opt_zuliprc" \
+        --corpus-dir "$opt_corpus_dir" \
+    || return 1
+}
+
+run_check() {
+    flutter test tools/content/unimplemented_features_test.dart \
+        --dart-define=corpusDir="$opt_corpus_dir" \
+        --dart-define=verbose="$opt_verbose" \
+    || return 1
+}
+
+for step in "${opt_steps[@]}"; do
+    echo "Running ${step}"
+    case "${step}" in
+        fetch) run_fetch ;;
+        check) run_check ;;
+        *) echo >&2 "Internal error: unknown step ${step}" ;;
+    esac
+done
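
The `fetch` step only checks that `--config` was passed. The file's contents are validated later, when fetch_messages.dart reads `email`, `key` and `site` from the `[api]` section. A pre-flight check in the wrapper could catch a malformed file earlier. The following is only a sketch: `check_zuliprc` is a hypothetical helper that is not part of this commit, and it does not verify that the keys actually sit under `[api]`.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper, not part of this commit: succeed only if the given
# zuliprc file exists, has an [api] section header, and mentions the three
# keys that fetch_messages.dart reads. Simplification: the keys are not
# required to appear inside the [api] section.
check_zuliprc() {
    local file="$1" key
    [ -f "$file" ] \
        || { echo >&2 "Error: \"$file\" does not exist."; return 1; }
    grep -q '^\[api\]' "$file" \
        || { echo >&2 "Error: \"$file\" has no [api] section."; return 1; }
    for key in email key site; do
        grep -Eq "^${key}[[:space:]]*[=:]" "$file" \
            || { echo >&2 "Error: \"$file\" is missing \`$key\`."; return 1; }
    done
}
```

If adopted, it could be called at the top of `run_fetch`, for example as `check_zuliprc "$opt_zuliprc" || exit 2`.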

tools/content/fetch_messages.dart

+231
@@ -0,0 +1,231 @@
+import 'dart:convert';
+import 'dart:io';
+import 'dart:math';
+
+// Avoid any Flutter-related dependencies so this can be run as a CLI program.
+import 'package:args/args.dart';
+import 'package:http/http.dart';
+import 'package:ini/ini.dart' as ini;
+import 'package:zulip/api/backoff.dart';
+
+import 'model.dart';
+
+/// Fetch all public message contents from a Zulip server in bulk.
+///
+/// It outputs JSON entries of the message IDs and the rendered HTML contents in
+/// JSON Lines (https://jsonlines.org) format. The output can be used later to
+/// perform checks for discovering unimplemented features.
+///
+/// Because message IDs are only unique within a single server, the script
+/// names corpora by the server host names.
+///
+/// This script is meant to be run via `tools/content/check-features`.
+///
+/// For more help, run `tools/content/check-features --help`.
+///
+/// See also:
+///  * tools/content/unimplemented_features_test.dart, which runs checks against
+///    the fetched corpora.
+void main(List<String> args) async {
+  final argParser = ArgParser();
+  argParser.addOption(
+    'config-file',
+    help: 'A zuliprc file with identity information including email, API key\n'
+          'and the Zulip server URL to fetch the messages from (required).\n\n'
+          'To get the file, see\n'
+          'https://zulip.com/api/configuring-python-bindings#download-a-zuliprc-file.',
+    valueHelp: 'path/to/zuliprc',
+  );
+  argParser.addOption(
+    'corpus-dir',
+    help: 'The directory to look for/store the corpus file (required).\n'
+          'The script will first read from the existing corpus file\n'
+          '(assumed to be named as "your-zulip-server.com.jsonl")\n'
+          'to avoid duplicates before fetching more messages.',
+    valueHelp: 'path/to/corpus-dir',
+  );
+  argParser.addFlag(
+    'fetch-newer',
+    help: 'Fetch newer messages instead of older ones.\n'
+          'Only useful when there is a matching corpus file in corpus-dir.',
+    defaultsTo: false,
+  );
+  argParser.addFlag(
+    'help', abbr: 'h',
+    negatable: false,
+    help: 'Show this help message.',
+  );
+
+  void printUsage() {
+    // Give it a pass when printing the help message.
+    // ignore: avoid_print
+    print('usage: fetch_messages --config-file <CONFIG_FILE>\n\n'
+      'Fetch message contents from a Zulip server in bulk.\n\n'
+      '${argParser.usage}');
+  }
+
+  Never throwWithUsage(String error) {
+    printUsage();
+    throw Exception('\nError: $error');
+  }
+
+  final parsedArguments = argParser.parse(args);
+  if (parsedArguments['help'] as bool) {
+    printUsage();
+    exit(0);
+  }
+
+  final zuliprc = parsedArguments['config-file'] as String?;
+  if (zuliprc == null) {
+    throwWithUsage('"config-file" is required');
+  }
+
+  final configFile = File(zuliprc);
+  if (!configFile.existsSync()) {
+    throwWithUsage('Config file "$zuliprc" does not exist');
+  }
+
+  // `zuliprc` is a file in INI format containing the user's identity
+  // information.
+  //
+  // See also:
+  //   https://zulip.com/api/configuring-python-bindings#configuration-keys-and-environment-variables
+  final parsedConfig = ini.Config.fromString(configFile.readAsStringSync());
+  await fetchMessages(
+    email: parsedConfig.get('api', 'email') as String,
+    apiKey: parsedConfig.get('api', 'key') as String,
+    site: Uri.parse(parsedConfig.get('api', 'site') as String),
+    outputDirStr: parsedArguments['corpus-dir'] as String,
+    fetchNewer: parsedArguments['fetch-newer'] as bool,
+  );
+}
+
+Future<void> fetchMessages({
+  required String email,
+  required String apiKey,
+  required Uri site,
+  required String outputDirStr,
+  required bool fetchNewer,
+}) async {
+  int? anchorMessageId;
+  IOSink output = stdout;
+  final outputDir = Directory(outputDirStr);
+  outputDir.createSync(recursive: true);
+  final outputFile = File('$outputDirStr/${site.host}.jsonl');
+  if (!outputFile.existsSync()) outputFile.createSync();
+  // Look for the known newest/oldest message so that we can continue
+  // fetching from where we left off.
+  await for (final message in readMessagesFromJsonl(outputFile)) {
+    anchorMessageId ??= message.id;
+    // Newer Zulip messages have higher message IDs.
+    anchorMessageId = (fetchNewer ? max : min)(message.id, anchorMessageId);
+  }
+  output = outputFile.openWrite(mode: FileMode.writeOnlyAppend);
+
+  final client = Client();
+  final authHeader = 'Basic ${base64Encode(utf8.encode('$email:$apiKey'))}';
+
+  // These are working constants chosen arbitrarily.
+  const batchSize = 5000;
+  const maxRetries = 10;
+  const fetchInterval = Duration(seconds: 5);
+
+  int retries = 0;
+  BackoffMachine? backoff;
+
+  while (true) {
+    // This loops until there is no message fetched in an iteration.
+    final _GetMessagesResult result;
+    try {
+      // This is the one place where some output would be helpful,
+      // for indicating progress.
+      // ignore: avoid_print
+      print('Fetching $batchSize messages starting from message ID $anchorMessageId');
+      result = await _getMessages(client, realmUrl: site,
+        authHeader: authHeader,
+        anchorMessageId: anchorMessageId,
+        numBefore: (!fetchNewer) ? batchSize : 0,
+        numAfter: (fetchNewer) ? batchSize : 0,
+      );
+    } catch (e) {
+      // We could have more fine-grained error handling and avoid retrying on
+      // non-network-related failures, but that's not necessary.
+      if (retries >= maxRetries) {
+        rethrow;
+      }
+      retries++;
+      await (backoff ??= BackoffMachine()).wait();
+      continue;
+    }
+
+    final messageEntries = result.messages.map(MessageEntry.fromJson);
+    if (messageEntries.isEmpty) {
+      // Sanity check to ensure that the server agrees
+      // there are no more messages to fetch.
+      if (fetchNewer) assert(result.foundNewest);
+      if (!fetchNewer) assert(result.foundOldest);
+      break;
+    }
+
+    // Find and use the newest/oldest message as the next message fetch anchor.
+    anchorMessageId = messageEntries.map((x) => x.id).reduce(fetchNewer ? max : min);
+    messageEntries.map(jsonEncode).forEach((json) => output.writeln(json));
+
+    // This I/O operation could fail, but crashing is fine here.
+    final flushFuture = output.flush();
+    // Make sure the delay happens concurrently to the flush.
+    await Future<void>.delayed(fetchInterval);
+    await flushFuture;
+    backoff = null;
+  }
+}
+
+/// https://zulip.com/api/get-messages#response
+// Partially ported from [GetMessagesResult] to avoid depending on Flutter libraries.
+class _GetMessagesResult {
+  const _GetMessagesResult(this.foundOldest, this.foundNewest, this.messages);
+
+  final bool foundOldest;
+  final bool foundNewest;
+  final List<Map<String, Object?>> messages;
+
+  factory _GetMessagesResult.fromJson(Map<String, Object?> json) =>
+    _GetMessagesResult(
+      json['found_oldest'] as bool,
+      json['found_newest'] as bool,
+      (json['messages'] as List<Object?>).map((x) => (x as Map<String, Object?>)).toList());
+}
+
+/// https://zulip.com/api/get-messages
+Future<_GetMessagesResult> _getMessages(Client client, {
+  required Uri realmUrl,
+  required String authHeader,
+  required int numBefore,
+  required int numAfter,
+  int? anchorMessageId,
+}) async {
+  final url = realmUrl.replace(
+    path: '/api/v1/messages',
+    queryParameters: {
+      // This fallback will only be used when first fetching from a server.
+      'anchor': anchorMessageId != null ? jsonEncode(anchorMessageId) : 'newest',
+      // The anchor message already exists in the corpus,
+      // so avoid fetching it again.
+      'include_anchor': jsonEncode(anchorMessageId == null),
+      'num_before': jsonEncode(numBefore),
+      'num_after': jsonEncode(numAfter),
+      'narrow': jsonEncode([{'operator': 'channels', 'operand': 'public'}]),
+    });
+  final response = await client.send(
+    Request('GET', url)..headers['Authorization'] = authHeader);
+  final bytes = await response.stream.toBytes();
+  final json = jsonDecode(utf8.decode(bytes)) as Map<String, dynamic>?;
+
+  if (response.statusCode != 200 || json == null) {
+    // Just crashing early here should be fine for this tool. We don't need
+    // to handle the specific error codes.
+    throw Exception('Failed to get messages. Code: ${response.statusCode}\n'
+      'Details: ${json ?? 'unknown'}');
+  }
+  return _GetMessagesResult.fromJson(json);
+}
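
Before a fetch resumes, `fetchMessages` scans the existing corpus file and anchors on the smallest message ID it finds, or on the largest with `--fetch-newer`. To see from the command line which ID a resumed run would start from, something like the sketch below may help. `corpus_anchor` is a hypothetical helper that is not part of this commit. It assumes every line starts with `{"id":<number>,`, which is what `jsonEncode` emits for the `{'id': ..., 'content': ...}` map returned by `MessageEntry.toJson`.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper, not part of this commit: print the message ID that a
# resumed fetch would anchor on. The default is the oldest (smallest) ID;
# pass "newer" as the second argument for the newest (largest) ID.
# Prints nothing for an empty corpus, where the script falls back to the
# anchor "newest".
corpus_anchor() {
    local file="$1" direction="${2:-older}" ids
    # Assumes each line begins with {"id":<number>, as written by the script.
    ids=$(sed -n 's/^{"id":\([0-9][0-9]*\),.*/\1/p' "$file" | sort -n)
    [ -n "$ids" ] || return 0
    if [ "$direction" = newer ]; then
        printf '%s\n' "$ids" | tail -n 1
    else
        printf '%s\n' "$ids" | head -n 1
    fi
}
```

For example, `corpus_anchor corpus/chat.example.com.jsonl newer` would print the ID that a `--fetch-newer` run continues from; the path here is a placeholder.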

tools/content/model.dart

+40
@@ -0,0 +1,40 @@
+import 'dart:io';
+import 'dart:convert';
+
+/// A data structure representing a message.
+final class MessageEntry {
+  const MessageEntry({
+    required this.id,
+    required this.content,
+  });
+
+  /// Selectively parses from get-message responses.
+  ///
+  /// See also: https://zulip.com/api/get-messages#response
+  factory MessageEntry.fromJson(Map<String, Object?> json) {
+    try {
+      return MessageEntry(
+        id: (json['id'] as num).toInt(), content: json['content'] as String);
+    } catch (e) {
+      throw FormatException(
+        'Malformed corpus data entry. Got: $e\n'
+        'When parsing: $json');
+    }
+  }
+
+  Map<String, Object> toJson() => {'id': id, 'content': content};
+
+  /// The message ID, unique within a server.
+  final int id;
+
+  /// The rendered HTML of the message.
+  final String content;
+}
+
+/// Open the given JSON Lines file and read [MessageEntry]'s from it.
+///
+/// We store the entries in JSON Lines format and return them from a stream to
+/// avoid excessive use of memory.
+Stream<MessageEntry> readMessagesFromJsonl(File file) => file.openRead()
+  .transform(utf8.decoder).transform(const LineSplitter())
+  .map(jsonDecode).map((x) => MessageEntry.fromJson(x as Map<String, Object?>));
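
Since each corpus line is one `MessageEntry` serialized as `{"id":...,"content":...}`, a corpus file can be sanity-checked with line-oriented tools before running the slow `check` step. The sketch below counts entries and distinct message IDs; a mismatch would suggest duplicated messages, for example after mixing `--fetch-newer` runs with hand-edited files. `corpus_stats` is a hypothetical helper that is not part of this commit, and it assumes the `id` key comes first on every line, as `toJson` writes it.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper, not part of this commit: report how many lines a
# JSON Lines corpus has and how many distinct message IDs they carry.
corpus_stats() {
    local file="$1" total unique
    # grep -c exits non-zero when the count is 0, hence the "|| true".
    total=$(grep -c '' "$file" || true)
    # Assumes each line begins with {"id":<number>, as written by toJson.
    unique=$(sed -n 's/^{"id":\([0-9][0-9]*\),.*/\1/p' "$file" \
        | sort -un | wc -l | tr -d ' ')
    echo "entries=$total unique_ids=$unique"
}
```

The pipeline reads the file as a stream, in the same spirit as `readMessagesFromJsonl`, although `sort` still holds the extracted IDs while it deduplicates them.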
