Skip to content
This repository was archived by the owner on Apr 23, 2025. It is now read-only.

WordSeg model and tests #498

Merged
merged 5 commits into from
May 9, 2020
Merged

Conversation

texasmichelle
Copy link
Member

Adds a word segmentation model with tests.

This is an implementation of the paper "Learning to Discover, Ground, and Use Words with Segmental Neural Language Models" by Kazuya Kawakami, Chris Dyer, and Phil Blunsom. This implementation is not affiliated with DeepMind and has not been verified by the authors.

This implementation authored by: @marcrasi @compnerd @saeta


import Foundation

public struct DataSet {
Copy link
Contributor

@BradLarson BradLarson May 8, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to move the dataset out into the main Datasets (and give it a name other than DataSet) in a follow-on? Related: do we want to start the dataset tests out in the DatasetsTests grouping or move them later?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this needs to be rearranged as part of adding the dataset for a full example. This unblocks @asuhan while I put that together.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, this is modelling the fact that there are three different datasets (training, validation, testing). However, the testing and validation is not something that is needed in most cases, and even the training doesn't need to be provided once we can snapshot the weights. So, I think it makes sense to perhaps mark this as internal for now?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Marked as internal

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#500 for adding a dataset and cleaning this up.

Copy link
Contributor

@marcrasi marcrasi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't read the code. I'm just approving of the plan to move this model here.


import Foundation

public struct DataSet {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, this is modelling the fact that there are three different datasets (training, validation, testing). However, the testing and validation is not something that is needed in most cases, and even the training doesn't need to be provided once we can snapshot the weights. So, I think it makes sense to perhaps mark this as internal for now?

/// Lattice
///
/// Represents the lattice used by the WordSeg algorithm.
public struct Lattice: Differentiable {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need Lattice to be public? I think that that this should probably be made internal.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changing this scope looks to be a bit more involved. Punting to #499

/// Edge
///
/// Represents an Edge
public struct Edge: Differentiable {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar

/// Node
///
/// Represents a node in the lattice
public struct Node: Differentiable {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar

/// - Returns: `true` if `self` is almost equal to `other`; otherwise
/// `false`.
@inlinable
public func isAlmostEqual(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/public/internal/ ... just to ensure that we don't accidentally end up with conflicting definitions

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bit tricky, given the @inlinable.

/// Note: we map from String in order to support multi-character metadata sequences such as </s>.
///
/// In Python implementations, this is sometimes called the character vocabulary.
public struct Alphabet {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to worry about Alphabet being confusing for multiple text models?

@texasmichelle
Copy link
Member Author

OK, most of @compnerd's comments have been addressed and I created #499 for scope cleanup.

Copy link
Contributor

@saeta saeta left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a fair bit to clean up, but I'm okay with merging now to unblock progress.

@texasmichelle texasmichelle merged commit 62f932d into tensorflow:master May 9, 2020
@texasmichelle texasmichelle deleted the wordseg branch May 9, 2020 04:19
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants