Skip to content

break apart illinois-common-resources #186

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
cogcomp-dev opened this issue Sep 15, 2016 · 34 comments
Closed

break apart illinois-common-resources #186

cogcomp-dev opened this issue Sep 15, 2016 · 34 comments
Assignees

Comments

@cogcomp-dev
Copy link

...into several smaller pieces. Gazetteers can go in one; brown clusters (and maybe other clusters) in another. This may help reduce the problems with large dependencies in CI.

@cowchipkid
Copy link
Contributor

I would suggest anything approaching 100M within this jar file warrants it's own jar file.

@mssammon mssammon assigned mayhewsw and shatu and unassigned mayhewsw Sep 17, 2016
@mssammon
Copy link
Contributor

There's already a version just for SRL; please check and see if this overlaps with the NER when deciding how to break down the common-resources jar. The SRL resources jar itself may merit further segmentation. Before this ticket is closed (after testing NER -- including running the training/evaluation) please open a ticket on the illinois-srl project to migrate to the new jars, if the corresponding resources have been affected.

@shatu
Copy link
Member

shatu commented Sep 28, 2016

NER uses an older version of illinois-common-resources (1.3). So, which one should I break apart -- the one that NER uses, or the latest one?

Also, what should be the right approach :-

  1. Should we keep the current illinois-common-resources as it is and create new jars containing portions of it? Or,
  2. Should we modify the current illinois-common-resources to contain only a portion of its current members and move the rest to separate jars. -- This will probably work only if nothing else depends on common-resources. Or, we'll have to open issue in all such projects to update their dependencies correspondingly.

@shatu
Copy link
Member

shatu commented Sep 29, 2016

BTW, I can also see a illinois-common-resources-1.5-ner jar in the repo .. which is around 200M .. any idea what is that being used for?

@mssammon
Copy link
Contributor

I think that was me trying to generate an ner-only jar.

@shatu
Copy link
Member

shatu commented Sep 29, 2016

Okay, so can someone familiar with the following resources, suggest me a logical partitioning, along with the suggested names of the new jars :-

Size -- Name (File/Dir)
64K -- levin-verbClass.txt (File)
504K -- verb-lemDict.txt (File)
12M -- cbcData (Dir)

145M -- CORLEX (Dir)

91M -- WordEmbedding (Dir)

2.6M -- rogetThesaurus (Dir)
97M -- gazetteers (Dir)

144M -- lin-clusters (Dir)
57M -- brown-clusters (Dir)

217M -- lin-similarity (Dir)

Also, kindly address the queries posed in the previous comment.

Thanks,
Shashank

@cowchipkid
Copy link
Contributor

Definately used the 1.5 version, it includes an updated gazetteer.

I can only speak to a few of resources, but these are the ones I know are used by NER, although there may be others. The brown-clusters and gazetteers are well named and these are used in NER. I would suggest for these two datums, you split them each into their own jar files, I suspect there are cases where each are used independently. I would name them gazetteers.jar and brown-clusters.jar.

I would suggest you retain the structure within the jar to minimize code changes. In that way, all we should have to change is the pom.xml files…

@danyaljj
Copy link
Member

Briefly repeating Shashank, here is what the file contains:
screen shot 2016-09-29 at 1 21 39 pm

My suggestion is one jar file per resource. For example I would put brownCluster folder inside a jar artifact with the same name (and retain the current version 1.3).

@danyaljj
Copy link
Member

Something to notice though: the latest version number on m2repo is 1.5, not 1.3 (which is currently being used inside the Edison code).
Also there three files:

  • illinois-common-resources-1.5-illinoisSRL.jar
  • illinois-common-resources-1.5-ner.jar
  • illinois-common-resources-1.5.jar

which seem to have some commonality. Here is what I see inside each:

screen shot 2016-09-29 at 2 43 29 pm

@danyaljj danyaljj mentioned this issue Sep 29, 2016
3 tasks
@danyaljj
Copy link
Member

@cowchipkid
Copy link
Contributor

@shatu where do we stand on this? I am again in a holding pattern waiting for something, and I badly want to get the NER release wrapped up. Can I suggest we do as @danyaljj suggests and put everything in it's own separate jar and release them all. As long as we don't trash the existing illinois-common-resources, everything should still work, but we should create tickets for all the projects needing them to convert to these new resources. Is there a way to deprecate a mvn artifact?

@shatu
Copy link
Member

shatu commented Oct 4, 2016

Sure, I'll finish it off by the end of the day; I was waiting for some other proposals, but I think we should now go ahead with what you suggested -- Sorry for the delay.

I'm not sure about deprecating a mvn artifact; in fact am not even sure about its relevance to the current thread.

@cowchipkid
Copy link
Contributor

@shatu I may not be using the right words here, I am not even close to mvn expert. The illinois-common-resources is what I am referring to, I am just wondering if there is a way to deprecate a resource that can be referenced as a dependency in a pom file. That would be very useful say if you encounter a major bug in a software library or something.

@shatu
Copy link
Member

shatu commented Oct 5, 2016

Okay, one last question ... what do we want the artifactIds to be .... "gazetteers" or as Daniel suggested above "resources.gazetteers"?

@danyaljj
Copy link
Member

danyaljj commented Oct 5, 2016

I like "gazetteers" more as the artifactId. "resources" can/should be part of the group id.

@cowchipkid
Copy link
Contributor

what @danyaljj said

@shatu
Copy link
Member

shatu commented Oct 5, 2016

I'm done breaking apart the common-resources. For the time-being, I've kept the names of the jars as they were inside the common-resources, and have deployed them on the repo. Will it make sense to do something similar for cogcomp-resources jar as well (or is that already deprecated?)?

I'm not sure what all things from the common-resources, the NER depends upon, so I'm not sure what all dependencies to include in place of common-resources. Can someone familiar with NER help me with that? Also, what all tests to run in order to close this ticket?

@danyaljj
Copy link
Member

danyaljj commented Oct 5, 2016

Will it make sense to do something similar for cogcomp-resources jar as well?

Nah I think it's deprecated. We probably should remove these resources from m2repo.

I'm not sure what all things from the common-resources, the NER depends upon, so I'm not sure what all dependencies to include in place of common-resources. Can someone familiar with NER help me with that?

why not doing the greedy approach? I'd first start by Edison, drop the common-resources and add the necessary resources. As far as I know, we always load the resource with IOUtils.lsResources. So if you search for this keyword, you will see what resource is used where....

Also, what all tests to run in order to close this ticket?

I think we have to replace all definitions of common-resources in the pom files, and make sure all tests pass.

@cowchipkid
Copy link
Contributor

Awesome! I will integrate this with NER as part of the NER release (ere-reader) fork. It's really just a matter of making sure I get everything I need.

@shatu
Copy link
Member

shatu commented Oct 5, 2016

@danyaljj .. I added all the dependencies separately to Edison's pom file, and am getting the following error :-

testGazetteerFeatures(edu.illinois.cs.cogcomp.edison.features.factory.TestWordFeatureFactory)
java.lang.RuntimeException: java.io.EOFException at edu.illinois.cs.cogcomp.edison.annotators.GazetteerViewGenerator.addView(GazetteerViewGenerator.java:402)
at edu.illinois.cs.cogcomp.annotation.Annotator.lazyAddView(Annotator.java:176)
at edu.illinois.cs.cogcomp.annotation.Annotator.getView(Annotator.java:161)
....

Does this has to do something with #146 ?

Is it possible to confirm whether it's a edison only issue or not? Is there any other project that uses gazetteers and is easy to test?

@shatu
Copy link
Member

shatu commented Oct 5, 2016

BTW, if anyone was using the 1.3 version of common-resources, I've deployed the corresponding (broken-apart) individual jars for that as well.

@danyaljj
Copy link
Member

danyaljj commented Oct 5, 2016

Your stack trace is cut. I don't completely know what is causing it, but I don't think it related to anything else. We are just replacing the containers of the resources. Everything should work the same, if the foldering structure is the same.

I would look more carefully into the stack trace...

@shatu
Copy link
Member

shatu commented Oct 6, 2016

Okay, here's what you need to replace common-resources with ..

    <dependency>
        <groupId>edu.illinois.cs.cogcomp.resources</groupId>
        <artifactId>brown-clusters</artifactId>
        <version>1.3</version>
        <scope>runtime</scope>
    </dependency>
    <dependency>
        <groupId>edu.illinois.cs.cogcomp.resources</groupId>
        <artifactId>cbc-clusters</artifactId>
        <version>1.3</version>
        <scope>runtime</scope>
    </dependency>
    <dependency>
        <groupId>edu.illinois.cs.cogcomp.resources</groupId>
        <artifactId>corlex</artifactId>
        <version>1.3</version>
        <scope>runtime</scope>
    </dependency>
    <dependency>
        <groupId>edu.illinois.cs.cogcomp.resources</groupId>
        <artifactId>gazetteers</artifactId>
        <version>1.3</version>
        <scope>runtime</scope>
    </dependency>
    <dependency>
        <groupId>edu.illinois.cs.cogcomp.resources</groupId>
        <artifactId>levin-verbClass</artifactId>
        <version>1.3</version>
        <scope>runtime</scope>
    </dependency>
    <dependency>
        <groupId>edu.illinois.cs.cogcomp.resources</groupId>
        <artifactId>lin-clusters</artifactId>
        <version>1.3</version>
        <scope>runtime</scope>
    </dependency>
    <dependency>
        <groupId>edu.illinois.cs.cogcomp.resources</groupId>
        <artifactId>lin-similarity</artifactId>
        <version>1.3</version>
        <scope>runtime</scope>
    </dependency>
    <dependency>
        <groupId>edu.illinois.cs.cogcomp.resources</groupId>
        <artifactId>rogetThesaurus</artifactId>
        <version>1.3</version>
        <scope>runtime</scope>
    </dependency>
    <dependency>
        <groupId>edu.illinois.cs.cogcomp.resources</groupId>
        <artifactId>verb-lemDict</artifactId>
        <version>1.3</version>
        <scope>runtime</scope>
    </dependency>
    <dependency>
        <groupId>edu.illinois.cs.cogcomp.resources</groupId>
        <artifactId>wordEmbedding</artifactId>
        <version>1.3</version>
        <scope>runtime</scope>
    </dependency>

@shatu
Copy link
Member

shatu commented Oct 6, 2016

Also, here's the full stacktrace :-

testGazetteerFeatures(edu.illinois.cs.cogcomp.edison.features.factory.TestWordFeatureFactory) Time elapsed: 0.013 sec <<< ERROR!
java.lang.RuntimeException: java.io.EOFException
at edu.illinois.cs.cogcomp.edison.annotators.GazetteerViewGenerator.addView(GazetteerViewGenerator.java:402)
at edu.illinois.cs.cogcomp.annotation.Annotator.lazyAddView(Annotator.java:176)
at edu.illinois.cs.cogcomp.annotation.Annotator.getView(Annotator.java:161)
at edu.illinois.cs.cogcomp.core.datastructures.textannotation.TextAnnotation.addView(TextAnnotation.java:111)
at edu.illinois.cs.cogcomp.edison.features.factory.WordFeatureExtractorFactory$12.getWordFeatures(WordFeatureExtractorFactory.java:423)
at edu.illinois.cs.cogcomp.edison.features.factory.TestWordFeatureFactory.runTest(TestWordFeatureFactory.java:209)
at edu.illinois.cs.cogcomp.edison.features.factory.TestWordFeatureFactory.testGazetteerFeatures(TestWordFeatureFactory.java:202)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at junit.framework.TestCase.runTest(TestCase.java:168)
at junit.framework.TestCase.runBare(TestCase.java:134)
at junit.framework.TestResult$1.protect(TestResult.java:110)
at junit.framework.TestResult.runProtected(TestResult.java:128)
at junit.framework.TestResult.run(TestResult.java:113)
at junit.framework.TestCase.run(TestCase.java:124)
at junit.framework.TestSuite.runTest(TestSuite.java:232)
at junit.framework.TestSuite.run(TestSuite.java:227)
at org.junit.internal.runners.JUnit38ClassRunner.run(JUnit38ClassRunner.java:81)
at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:252)
at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:141)
at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:112)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.maven.surefire.util.ReflectionUtils.invokeMethodWithArray(ReflectionUtils.java:189)
at org.apache.maven.surefire.booter.ProviderFactory$ProviderProxy.invoke(ProviderFactory.java:165)
at org.apache.maven.surefire.booter.ProviderFactory.invokeProvider(ProviderFactory.java:85)
at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:115)
at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:75)
Caused by: java.io.EOFException
at java.util.zip.GZIPInputStream.readUByte(GZIPInputStream.java:268)
at java.util.zip.GZIPInputStream.readUShort(GZIPInputStream.java:258)
at java.util.zip.GZIPInputStream.readHeader(GZIPInputStream.java:164)
at java.util.zip.GZIPInputStream.(GZIPInputStream.java:79)
at java.util.zip.GZIPInputStream.(GZIPInputStream.java:91)
at edu.illinois.cs.cogcomp.edison.annotators.GazetteerViewGenerator.lazyLoadGazetteers(GazetteerViewGenerator.java:319)
at edu.illinois.cs.cogcomp.edison.annotators.GazetteerViewGenerator.addView(GazetteerViewGenerator.java:400)
... 31 more

@shatu
Copy link
Member

shatu commented Oct 6, 2016

I manually downloaded the jars, decompressed them and did a diff with the corresponding resources present in common-resources .. they are exactly same -- So, not sure where the problem might be.

@danyaljj
Copy link
Member

danyaljj commented Oct 6, 2016

So all the other tests pass except this?

Seems like GazetteerViewGenerator is not able to extract the resource. Have you tried creating Gazetteer instance directly?

 new GazetteerViewGenerator("resources/gazetteers/gazetteers", ViewNames.GAZETTEER + "Gazetteers");

@shatu
Copy link
Member

shatu commented Oct 6, 2016

Yup, other tests work just fine :-

Tests run: 103, Failures: 0, Errors: 1, Skipped: 0

And, yes, with your above suggested replacement, it works! I'm not sure I understand why is that the case.

@danyaljj
Copy link
Member

danyaljj commented Oct 6, 2016

Actually that probably wasn't a good suggestion, as the annotators are lazy. So creating an instance doesn't cause it to load the resource. How about you run instance.doInitialize() afterwards?

@shatu
Copy link
Member

shatu commented Oct 6, 2016

doInitialize works for your suggestion above, but not for the original code i.e.

new GazetteerViewGenerator("gazetteers", ViewNames.GAZETTEER)

@danyaljj
Copy link
Member

danyaljj commented Oct 6, 2016

ok, so there must a problematic/missing resource. Why don't you add enough logging into the constructor to see for what input it's failing?

@shatu
Copy link
Member

shatu commented Oct 6, 2016

Sure, I'll debug it more thoroughly then -- just wanted to see if you guys have encountered similar problems elsewhere, and if it's an edison-only issue or not.

@danyaljj
Copy link
Member

@shatu no progress on this?

@shatu
Copy link
Member

shatu commented Oct 19, 2016

@danyaljj .. apparently, this was over long ago. The same code just worked now; most likely, our maven repo had some issues back then in fetching the resources jar.

Edison tests pass now; should we now create tickets in the projects that use common-resources?

@mssammon
Copy link
Contributor

@shatu great -- and yes, it would be helpful to open tickets in affected projects...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants