Skip to content

New concatenate function to join tree sequences #3164

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
hyanwong opened this issue May 15, 2025 · 5 comments · Fixed by #3165
Closed

New concatenate function to join tree sequences #3164

hyanwong opened this issue May 15, 2025 · 5 comments · Fixed by #3165

Comments

@hyanwong
Copy link
Member

hyanwong commented May 15, 2025

A number of people (including me) have been trying to join two tree sequences together, end to end, e.g. when obtained from different chromosomes. I have messed this up (message in slack, I forgot that keep_intervals can also change node IDs because it automatically simplifies), and I suspect others find this tricky too.

I propose two new main TableCollection methods, shift and concatenate (based on the numpy function, which concatenates multiple things together), and possibly an additional shortcut method, provisionally called append, which simply concatenates two TableCollections. These can be exposed in the normal way as TreeSequence methods.

As this essentially only relies on union, I think most of the behavioural edge cases & errors should be caught in that function. The helpful addition here is to (a) shift the coordinates of the other TC to fit and (b) by default, only map the samples together.

def shift(self, value=None, sequence_length=None, record_provenance=True):
    """
    Shift the coordinate system (used by edges and sites) of this TableCollection by a given
    value. Positive values shift the coordinate system to the right, negative values to the left.
    The sequence length of the tree sequence will be changed by `value`, unless
    `sequence_length` is given, in which case the sequence length will be set to that value.

    .. note::
        No attempt is made to check that the new coordinate system or sequence length is valid:
        if you wish to do this, use the equivalent TreeSequence.shift() method instead.
    """
    if value is None:
        value = 0
    if value != 0:
        self.edges.left += value
        self.edges.right += value
        self.sites.position += value
    if sequence_length is None:
        self.sequence_length += value
    else:
        self.sequence_length = sequence_length
    
    if record_provenance:
        parameters = {
            "command": "shift", "value": value, "sequence_length": sequence_length
        }
        self.provenances.add_row(
            record=json.dumps(provenance.get_provenance_dict(parameters))
        )


def concatenate(self, others, *, node_mappings=None, record_provenance=True, **kwargs):
    """
    Concatenate a set of table collections, with a comparable set of samples, to the
    right of the existing table collection. This shifts the other tables
    rightwards then repeatedly calls {meth}`union`. By default, the sample nodes
    in the other node table, in numerical order, are mapped to the sample nodes in the
    existing node table, in numerical order.

    .. note::
        To add gaps between the concatenated tables, use :meth:`shift` or
        to remove gaps, use :meth:`trim` before concatenating.

    :param TableCollection other: A list of additional table collections.
    :param list node_mappings: A list of node mappings, of the same length as `others`.
        Each node mapping is an array of integers
        of the same length as the number of nodes in the equivalent table collection
        in `others`, where the k'th element gives the id of the node in the current table
        collection corresponding to node k in the other table collection (see {meth}`union`).
        If any node mapping is None, only the sample nodes between the two nodes
        tables are mapped to each other, with samples taken in numerical order.
        If None (default), ``node_mappings`` is treated as a list of None values.
    :param bool record_provenance: If True (default), record details of this call to
            ``concatenate`` in the returned tree sequence's provenance information
            (Default: True).
    :param kwargs: Additional keyword arguments to pass to {meth}`union`.


    returns: a new tree sequence
    """
    if node_mappings is None:
        node_mappings = [None] * len(others)
    if len(node_mappings) != len(others):
        raise ValueError("`node_mappings` must have the same length as `others`")
    samples = np.where(self.nodes.flags & tskit.NODE_IS_SAMPLE)[0]
    for other, node_mapping in zip(others, node_mappings):
        other_samples = np.where(other.nodes.flags & tskit.NODE_IS_SAMPLE)[0]
        if len(other_samples) != len(samples):
            raise ValueError("each `other` must have the same number of samples as `self`")
        other.shift(self.sequence_length, record_provenance=False)
        self.sequence_length += other.sequence_length
        if node_mapping is None:
            node_mapping = np.full(other.nodes.num_rows, tskit.NULL, dtype=np.int32)
            node_mapping[other_samples] = samples
        self.union(other, node_mapping=node_mapping, **kwargs)
    if record_provenance:
        parameters = {
            "command": "concatenate",
            "TODO": "add concatenate parameters"  # tricky because each other TC has provenances
        }
        self.provenances.add_row(
            record=json.dumps(provenance.get_provenance_dict(parameters))
        )

def append(self, other, node_mapping=None, **kwargs, record_provenance=True):
    """
    Append another table collections to the right of the existing table collection.
    This is a shortcut for ``.concatenate([other], sample_maps=[sample_map], ...)``
    intended for adding a single table collection to another.
    """
    self.concatenate(
        [other],
        node_mappings=[node_mapping],
        record_provenance=False,
        **kwargs
    )
    if record_provenance:
        parameters = {
            "command": "append",
            "TODO": "add append parameters"  # tricky because the other TC has provenances
        }
        self.provenances.add_row(
            record=json.dumps(provenance.get_provenance_dict(parameters))
        )

I think it's helpful to have a function that doesn't require the user to wrap the [other] in a list, but rather than have a separate append method, we could allow concatenate to take either a list of Table Collections, or a single table collection? Or we could take all non-keyword arguments to be additional TableCollections.

@jeromekelleher
Copy link
Member

This would be very useful I think. A few minor points.

I don't see any point in having a default value for shift, just make the value arg a required positional param.

I think it would be clearer if we make the pairwise "append" operation the basic unit here rather than the more complex concat operator.

So, if we focus on clearly defining the semantics of ``tc1.append(ts2) then I think defining

for tc in tcs:
    tc1.append(tc)

will be straightforward

@hyanwong
Copy link
Member Author

This would be very useful I think. A few minor points.

I don't see any point in having a default value for shift, just make the value arg a required positional param.

My thought was that we could then use ts.shift(sequence_length=1e6) to simply change the sequence length. At the moment, we can't do that without resorting to table operations.

I think it would be clearer if we make the pairwise "append" operation the basic unit here rather than the more complex concat operator.

So, if we focus on clearly defining the semantics of ``tc1.append(ts2) then I think defining

for tc in tcs:
tc1.append(tc)
will be straightforward

Yes, I wondered about that. But this would mean lots of creating multiple tree sequences, in the TreeSequence form, I think?

for ts in tree_seqs:
. new_ts = new_ts.append(ts)

I guess we could just say to drop into doing things on tables if that's a concern.

I slightly worried whether TC.append() would be confused with TC.nodes.append(): is concatenate a bit more discoverable? I suppose if we just define TC.append, we could easily define an additional TreeSequence-only .concatenate() method that simply does the loop that you mention?

@jeromekelleher
Copy link
Member

My thought was that we could then use ts.shift(sequence_length=1e6) to simply change the sequence length. At the moment, we can't do that without resorting to table operations.

That sounds like a different operation then.

I think the operation you want is TableCollection.concatenate(other). That can be called sequentially on other TableCollections.

We can define the TreeSequence equivalents later on on top of the TableCollection ops.

@hyanwong
Copy link
Member Author

That sounds like a different operation then.

Yeah, the name is a bit overloaded. I guess we could just tell people who wanted to change the sequence length to use ts.shift(0, sequence_length=new-length), which is pretty clear. So on reflection, I agree that we want to force the shift value.

I think the operation you want is TableCollection.concatenate(other).

Yep, this seems sensible. We can figure out any other operations on top of this, as you say.

I think both should be exposed as TC and TS methods.

@petrelharp
Copy link
Contributor

Great idea. Minor note: in the docstring for concatenate perhaps just say "defaults to matching samples and no others" and refer to union for more docs. (Also: as discussed on slack, we'll need to do check_shared_overlap=False in union.)

hyanwong added a commit to hyanwong/tskit that referenced this issue May 15, 2025
hyanwong added a commit to hyanwong/tskit that referenced this issue May 15, 2025
hyanwong added a commit to hyanwong/tskit that referenced this issue May 15, 2025
hyanwong added a commit to hyanwong/tskit that referenced this issue May 15, 2025
hyanwong added a commit to hyanwong/tskit that referenced this issue May 16, 2025
Fixes tskit-dev#3164

Update python/tskit/tables.py
hyanwong added a commit to hyanwong/tskit that referenced this issue May 16, 2025
Fixes tskit-dev#3164

Update python/tskit/tables.py
hyanwong added a commit to hyanwong/tskit that referenced this issue May 16, 2025
Fixes tskit-dev#3164

Update python/tskit/tables.py
hyanwong added a commit to hyanwong/tskit that referenced this issue May 19, 2025
hyanwong added a commit to hyanwong/tskit that referenced this issue May 19, 2025
hyanwong added a commit to hyanwong/tskit that referenced this issue May 20, 2025
github-merge-queue bot pushed a commit that referenced this issue May 21, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants