WordSeg model and tests #498

texasmichelle · 2020-05-08T15:54:40Z

Adds a word segmentation model with tests.

This is an implementation of the paper "Learning to Discover, Ground, and Use Words with Segmental Neural Language Models" by Kazuya Kawakami, Chris Dyer, and Phil Blunsom. This implementation is not affiliated with DeepMind and has not been verified by the authors.

This implementation was authored by: @marcrasi @compnerd @saeta

BradLarson · 2020-05-08T18:42:14Z

Models/Text/WordSeg/DataSet.swift

+
+import Foundation
+
+public struct DataSet {


Do we want to move the dataset out into the main Datasets (and give it a name other than DataSet) in a follow-on? Related: do we want to start the dataset tests out in the DatasetsTests grouping or move them later?

Yes, this needs to be rearranged as part of adding the dataset for a full example. This unblocks @asuhan while I put that together.

So, this is modelling the fact that there are three different datasets (training, validation, testing). However, the testing and validation sets are not needed in most cases, and even the training set doesn't need to be provided once we can snapshot the weights. So, I think it makes sense to perhaps mark this as internal for now?

Marked as internal

#500 for adding a dataset and cleaning this up.
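To illustrate the point in this thread, here is a minimal sketch of a three-split container kept at internal scope. The name `DataSetSketch`, its fields, and the choice to make validation and testing optional are assumptions for illustration only; the actual `DataSet` in this PR may be shaped differently.

```swift
// Hypothetical sketch; not the DataSet type from this PR.
// `internal` keeps the type out of the module's public API until the
// dataset is moved and renamed in a follow-on change.
internal struct DataSetSketch {
  /// Character sequences used for training.
  let training: [String]
  /// Held-out splits; `nil` when the caller does not supply them.
  let validation: [String]?
  let testing: [String]?

  init(training: [String], validation: [String]? = nil, testing: [String]? = nil) {
    self.training = training
    self.validation = validation
    self.testing = testing
  }
}

// Only the training split is supplied here.
let data = DataSetSketch(training: ["alpha", "beta"])
```

Making the held-out splits optional is one possible way to reflect that they are rarely needed; it is a design assumption, not something the thread settles.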

marcrasi

I didn't read the code. I'm just approving of the plan to move this model here.

compnerd · 2020-05-08T21:56:36Z

Models/Text/WordSeg/DataSet.swift

+
+import Foundation
+
+public struct DataSet {


So, this is modelling the fact that there are three different datasets (training, validation, testing). However, the testing and validation sets are not needed in most cases, and even the training set doesn't need to be provided once we can snapshot the weights. So, I think it makes sense to perhaps mark this as internal for now?

compnerd · 2020-05-08T21:57:13Z

Models/Text/WordSeg/Lattice.swift

+/// Lattice
+///
+/// Represents the lattice used by the WordSeg algorithm.
+public struct Lattice: Differentiable {


Do we need Lattice to be public? I think that this should probably be made internal.

Changing this scope looks to be a bit more involved. Punting to #499

compnerd · 2020-05-08T21:57:59Z

Models/Text/WordSeg/Lattice.swift

+  /// Edge
+  ///
+  /// Represents an Edge
+  public struct Edge: Differentiable {


compnerd · 2020-05-08T21:58:03Z

Models/Text/WordSeg/Lattice.swift

+  /// Node
+  ///
+  /// Represents a node in the lattice
+  public struct Node: Differentiable {


Models/Text/WordSeg/Model.swift

compnerd · 2020-05-08T22:04:57Z

Models/Text/WordSeg/SE-0259.swift

+  /// - Returns: `true` if `self` is almost equal to `other`; otherwise
+  ///   `false`.
+  @inlinable
+  public func isAlmostEqual(


s/public/internal/ ... just to ensure that we don't accidentally end up with conflicting definitions

This is a bit tricky, given the @inlinable.
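For context, `@inlinable` exposes a function's body to client modules, so the attribute and the access level have to be changed together with some care. One possible way to follow the suggestion is to drop `@inlinable` along with `public`. The sketch below assumes that approach and uses a hypothetical, simplified helper named `isRoughlyEqual`; it is not the SE-0259 implementation carried in this PR, and the thread does not say which approach was taken.

```swift
extension FloatingPoint {
  /// Hypothetical, simplified stand-in for an almost-equal check.
  /// Declared `internal` and without `@inlinable`, so it cannot clash
  /// with a public definition provided by another module.
  internal func isRoughlyEqual(
    to other: Self,
    tolerance: Self = Self.ulpOfOne.squareRoot()
  ) -> Bool {
    // Non-finite values only match when they compare equal (NaN never does).
    guard self.isFinite && other.isFinite else { return self == other }
    // Scale the tolerance by the larger magnitude to get a relative comparison.
    let scale = max(abs(self), abs(other), .leastNormalMagnitude)
    return abs(self - other) < scale * tolerance
  }
}

let close = (0.1 + 0.2).isRoughlyEqual(to: 0.3)
let far = (1.0).isRoughlyEqual(to: 1.1)
```

The trade-off in this sketch is that removing `@inlinable` gives up cross-module inlining of the helper, which may or may not matter for test-only use.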

compnerd · 2020-05-08T22:05:40Z

Models/Text/WordSeg/Vocabularies.swift

+/// Note: we map from String in order to support multi-character metadata sequences such as </s>.
+///
+/// In Python implementations, this is sometimes called the character vocabulary.
+public struct Alphabet {


Do we need to worry about Alphabet being confusing for multiple text models?
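As a rough illustration of why the doc comment maps from `String` rather than `Character`, here is a minimal sketch of a symbol table that stores single characters and a multi-character marker such as `</s>` side by side. The type name `AlphabetSketch` and its members are assumptions made for this example; they are not the `Alphabet` API from this PR.

```swift
// Hypothetical sketch; not the Alphabet type from this PR.
struct AlphabetSketch {
  /// Symbol -> index lookup.
  private(set) var indices: [String: Int] = [:]
  /// Index -> symbol lookup.
  private(set) var symbols: [String] = []

  init(characters: String, markers: [String]) {
    // Each character is stored as a one-character String.
    for c in characters { insert(String(c)) }
    // Markers may span several characters, which a Character key could not hold.
    for m in markers { insert(m) }
  }

  /// Adds `symbol` if it is new and returns its index either way.
  @discardableResult
  mutating func insert(_ symbol: String) -> Int {
    if let existing = indices[symbol] { return existing }
    let index = symbols.count
    indices[symbol] = index
    symbols.append(symbol)
    return index
  }
}

let alphabet = AlphabetSketch(characters: "abca", markers: ["</s>"])
```

In this sketch the repeated `a` is stored once, and the end-of-sequence marker receives the next free index after the distinct characters.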

Tests/TextTests/WordSegmentationTests/ProbeLayers.swift

texasmichelle · 2020-05-08T23:09:41Z

OK, most of @compnerd's comments have been addressed and I created #499 for scope cleanup.

saeta

There's a fair bit to clean up, but I'm okay with merging now to unblock progress.

texasmichelle added 2 commits May 8, 2020 11:46

WordSeg model and tests

8c2891e

Add model files to CMakeLists

09fea49

texasmichelle requested review from compnerd, marcrasi and saeta May 8, 2020 18:19

BradLarson reviewed May 8, 2020

marcrasi approved these changes May 8, 2020

compnerd reviewed May 8, 2020

texasmichelle added 3 commits May 8, 2020 18:39

Rename structs and remove commented code

e0515b5

Rename Conf -> SNLM.Parameters, lint

46914df

Make DataSet internal

33a0599

compnerd approved these changes May 8, 2020

saeta approved these changes May 9, 2020

texasmichelle merged commit 62f932d into tensorflow:master May 9, 2020

texasmichelle deleted the wordseg branch May 9, 2020 04:19
