-
Notifications
You must be signed in to change notification settings - Fork 149
Conversation
Models/Text/WordSeg/DataSet.swift
Outdated
|
||
import Foundation | ||
|
||
public struct DataSet { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want to move the dataset out into the main Datasets (and give it a name other than DataSet) in a follow-on? Related: do we want to start the dataset tests out in the DatasetsTests grouping or move them later?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this needs to be rearranged as part of adding the dataset for a full example. This unblocks @asuhan while I put that together.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, this is modelling the fact that there are three different datasets (training, validation, testing). However, the testing and validation is not something that is needed in most cases, and even the training doesn't need to be provided once we can snapshot the weights. So, I think it makes sense to perhaps mark this as internal
for now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Marked as internal
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#500 for adding a dataset and cleaning this up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't read the code. I'm just approving of the plan to move this model here.
Models/Text/WordSeg/DataSet.swift
Outdated
|
||
import Foundation | ||
|
||
public struct DataSet { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, this is modelling the fact that there are three different datasets (training, validation, testing). However, the testing and validation is not something that is needed in most cases, and even the training doesn't need to be provided once we can snapshot the weights. So, I think it makes sense to perhaps mark this as internal
for now?
/// Lattice | ||
/// | ||
/// Represents the lattice used by the WordSeg algorithm. | ||
public struct Lattice: Differentiable { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need Lattice
to be public
? I think that that this should probably be made internal
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changing this scope looks to be a bit more involved. Punting to #499
/// Edge | ||
/// | ||
/// Represents an Edge | ||
public struct Edge: Differentiable { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar
/// Node | ||
/// | ||
/// Represents a node in the lattice | ||
public struct Node: Differentiable { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar
/// - Returns: `true` if `self` is almost equal to `other`; otherwise | ||
/// `false`. | ||
@inlinable | ||
public func isAlmostEqual( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/public/internal/
... just to ensure that we don't accidentally end up with conflicting definitions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a bit tricky, given the @inlinable
.
/// Note: we map from String in order to support multi-character metadata sequences such as </s>. | ||
/// | ||
/// In Python implementations, this is sometimes called the character vocabulary. | ||
public struct Alphabet { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to worry about Alphabet
being confusing for multiple text models?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a fair bit to clean up, but I'm okay with merging now to unblock progress.
Adds a word segmentation model with tests.
This is an implementation of the paper "Learning to Discover, Ground, and Use Words with Segmental Neural Language Models" by Kazuya Kawakami, Chris Dyer, and Phil Blunsom. This implementation is not affiliated with DeepMind and has not been verified by the authors.
This implementation authored by: @marcrasi @compnerd @saeta