Adding Multiple Training Files to the Pipeline? #192

Closed
cflint987 opened this issue May 19, 2018 · 4 comments
Labels
enhancement New feature or request question Further information is requested

Comments

@cflint987

cflint987 commented May 19, 2018

System information
OS version/distro: Windows 7 Home
.NET Version (e.g., dotnet --info): ML.NET v0.1.0

Issue:
What is the correct way to add multiple training files to a Learning Pipeline?

In the Taxi Fare example, simply adding another TextLoader and/or ColumnCopier, etc., does not seem to be correct.

Example:
pipeline.Add(new TextLoader(DataPath, useHeader: true, separator: ","));
pipeline.Add(new TextLoader(DataPath2, useHeader: true, separator: ","));

@cflint987 cflint987 changed the title .Net Framework Support? Adding Multiple Training Files to the Pipeline? May 19, 2018
@shauheen shauheen added the question Further information is requested label May 21, 2018
@GalOshri
Contributor

Thanks for asking! This is not currently possible, but let's use this issue to track enabling multiple inputs in a pipeline.

Just to clarify: is your intention to concatenate the two files as soon as they are loaded, or to apply different transforms/trainers to them?

A potential workaround for now is to read in the examples from both files into memory and use the CollectionDataSource (see example usage here). You could also concatenate the two files into one CSV outside of the ML.NET pipeline.
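The second workaround can be sketched as a small pre-processing step outside the ML.NET pipeline. This is only an illustration under assumptions: the file names and paths are hypothetical, and both files are assumed to share the same header row, which is kept only from the first file.

```csharp
using System.IO;
using System.Linq;

class ConcatCsvFiles
{
    static void Main()
    {
        // Hypothetical input files; both are assumed to share the same header row.
        var inputs = new[] { "taxi-fare-train-1.csv", "taxi-fare-train-2.csv" };
        var combined = "taxi-fare-train-combined.csv";

        using (var writer = new StreamWriter(combined))
        {
            for (int i = 0; i < inputs.Length; i++)
            {
                // Skip the header row of every file after the first,
                // so the combined CSV has exactly one header.
                var lines = File.ReadLines(inputs[i]).Skip(i == 0 ? 0 : 1);
                foreach (var line in lines)
                    writer.WriteLine(line);
            }
        }

        // The combined file can then be used with a single loader, e.g.:
        // pipeline.Add(new TextLoader(combined, useHeader: true, separator: ","));
    }
}
```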

@GalOshri GalOshri added the enhancement New feature or request label May 21, 2018
@cflint987
Author

My intention is to make creating and testing ML structures with large datasets modular and less taxing on file transfers to and from servers. For example, moving 100 GB to a server is easier if the data is split by time or another parameter. It also allows ML structures to be updated as new data comes in, without having to concatenate onto what is already a large file.

Reducing the memory footprint by loading subsets of the data would be nice, but as I understand it, that is not possible for all ML structures.

I have concatenated the files and it works properly, but this would be a nice feature to have.

Thanks for the answer.

@glebuk
Contributor

glebuk commented May 23, 2018

@cflint987,
We have a work item to address your exact scenario. Please take a look at PR #61. Feel free to comment and ask @tyclintw.

@Ivanidzo4ka
Contributor

DRI RESPONSE: You can do this with the new API:

// Create the reader: define the data columns and where to find them in the text file.
var reader = TextLoader.CreateReader(env, ctx => (
        // A boolean column depicting the 'target label'.
        IsOver50K: ctx.LoadBool(14),
        // Three text columns.
        Workclass: ctx.LoadText(1),
        Education: ctx.LoadText(3),
        MaritalStatus: ctx.LoadText(5)),
    hasHeader: true);
// Now read the files (remember though, readers are lazy, so the actual reading will happen when the data is accessed).
var data = reader.Read(exampleFile1, exampleFile2);

Please let me know if this satisfies you. I intend to close this issue within the next few days.

@ghost ghost locked as resolved and limited conversation to collaborators Mar 30, 2022