break apart illinois-common-resources #186
Comments
I would suggest that anything approaching 100M within this jar file warrants its own jar file. |
There's already a version just for SRL; please check and see if this overlaps with the NER when deciding how to break down the common-resources jar. The SRL resources jar itself may merit further segmentation. Before this ticket is closed (after testing NER -- including running the training/evaluation) please open a ticket on the illinois-srl project to migrate to the new jars, if the corresponding resources have been affected. |
NER uses an older version of illinois-common-resources (1.3). So, which one should I break apart -- the one that NER uses, or the latest one? Also, what should be the right approach :-
|
BTW, I can also see an illinois-common-resources-1.5-ner jar in the repo, which is around 200M. Any idea what it is being used for? |
I think that was me trying to generate an ner-only jar. |
Okay, so can someone familiar with the following resources suggest a logical partitioning, along with names for the new jars?
Size -- Name (File/Dir)
145M -- CORLEX (Dir)
91M -- WordEmbedding (Dir)
2.6M -- rogetThesaurus (Dir)
144M -- lin-clusters (Dir)
217M -- lin-similarity (Dir)
Also, kindly address the queries posed in the previous comment. Thanks, |
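For deciding where to cut, a per-directory size report like the one above can be produced directly from the jar. The following is a minimal sketch, not part of the project: the class name `JarSizeReport` and the choice to group by the first path segment are assumptions, and it sums uncompressed sizes as recorded in the archive.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Enumeration;
import java.util.Map;
import java.util.TreeMap;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper: reports how much each top-level directory in a jar contributes.
final class JarSizeReport {

    /** Sums uncompressed entry sizes, grouped by the first path segment of each entry. */
    static Map<String, Long> sizeByTopLevelDir(Path jar) throws IOException {
        Map<String, Long> totals = new TreeMap<>();
        try (ZipFile zip = new ZipFile(jar.toFile())) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                if (entry.isDirectory()) {
                    continue; // directory entries carry no data
                }
                String name = entry.getName();
                int slash = name.indexOf('/');
                String top = slash < 0 ? "(root)" : name.substring(0, slash);
                long size = Math.max(entry.getSize(), 0L); // getSize() is -1 when unknown
                totals.merge(top, size, Long::sum);
            }
        }
        return totals;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java JarSizeReport path/to/some.jar
        sizeByTopLevelDir(Paths.get(args[0])).forEach((dir, bytes) ->
                System.out.printf("%8.1fM -- %s%n", bytes / (1024.0 * 1024.0), dir));
    }
}
```

Directories that come out near the 100M mark suggested earlier in the thread would be the natural candidates for their own jar.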
Definitely used the 1.5 version; it includes an updated gazetteer. I can only speak to a few of the resources, but these are the ones I know are used by NER, although there may be others. The brown-clusters and gazetteers are well named, and both are used in NER. I would suggest splitting these two into their own jar files, since I suspect there are cases where each is used independently. I would name them gazetteers.jar and brown-clusters.jar. I would also suggest retaining the directory structure within each jar to minimize code changes. That way, all we should have to change is the pom.xml files… |
Something to notice though: the latest version number on m2repo is 1.5, not 1.3 (which is currently being used inside the Edison code).
which seem to have some commonality. Here is what I see inside each: |
I just realized that there used to be a resource-based separation of resources. Some of the existing resources are already packaged here. That could potentially save some time. |
@shatu where do we stand on this? I am again in a holding pattern waiting for something, and I badly want to get the NER release wrapped up. Can I suggest we do as @danyaljj suggests and put everything in its own separate jar and release them all? As long as we don't trash the existing illinois-common-resources, everything should still work, but we should create tickets for all the projects that need to convert to these new resources. Is there a way to deprecate a mvn artifact? |
Sure, I'll finish it off by the end of the day. I was waiting for some other proposals, but I think we should now go ahead with what you suggested -- sorry for the delay. I'm not sure about deprecating a mvn artifact; in fact, I'm not even sure how it relates to the current thread. |
@shatu I may not be using the right words here; I am not even close to being a mvn expert. The illinois-common-resources is what I am referring to. I am just wondering if there is a way to deprecate a resource that can be referenced as a dependency in a pom file. That would be very useful, say, if you encounter a major bug in a software library. |
Okay, one last question ... what do we want the artifactIds to be .... "gazetteers" or as Daniel suggested above "resources.gazetteers"? |
I like "gazetteers" more as the artifactId. "resources" can/should be part of the group id. |
what @danyaljj said |
I'm done breaking apart the common-resources. For the time being, I've kept the names of the jars as they were inside the common-resources, and have deployed them on the repo. Would it make sense to do something similar for the cogcomp-resources jar as well (or is that already deprecated)? I'm not sure which parts of common-resources NER depends on, so I'm not sure which dependencies to include in its place. Can someone familiar with NER help me with that? Also, which tests should be run in order to close this ticket? |
Nah I think it's deprecated. We probably should remove these resources from
Why not take the greedy approach? I'd start with Edison: drop the common-resources and add the necessary resources. As far as I know, we always load the resource with
I think we have to replace all definitions of common-resources in the pom files, and make sure all tests pass. |
Awesome! I will integrate this with NER as part of the NER release (ere-reader) fork. It's really just a matter of making sure I get everything I need. |
@danyaljj .. I added all the dependencies separately to Edison's pom file, and am getting the following error :-
Does this have something to do with #146? Is it possible to confirm whether it's an Edison-only issue or not? Is there any other project that uses gazetteers and is easy to test? |
BTW, if anyone was using the 1.3 version of common-resources, I've deployed the corresponding (broken-apart) individual jars for that as well. |
Your stack trace is cut off. I don't know exactly what is causing it, but I don't think it is related to anything else. We are just replacing the containers of the resources. Everything should work the same if the folder structure is the same. I would look more carefully into the stack trace... |
Okay, here's what you need to replace common-resources with ..
|
Also, here's the full stacktrace :-
|
I manually downloaded the jars, decompressed them and did a diff with the corresponding resources present in common-resources .. they are exactly same -- So, not sure where the problem might be. |
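The manual unpack-and-diff described above can also be done without extracting anything, by comparing the CRC-32 values stored in each archive. This is a minimal sketch under stated assumptions: the class `JarEntryDiff` and its method names are hypothetical, and it assumes the split jar keeps the same internal paths as the combined jar, as suggested earlier in the thread.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Enumeration;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper: checks that a split jar carries the same files as the combined jar.
final class JarEntryDiff {

    /** Maps each non-directory entry under the prefix to the CRC-32 recorded in the archive. */
    static Map<String, Long> checksums(Path jar, String prefix) throws IOException {
        Map<String, Long> crcs = new TreeMap<>();
        try (ZipFile zip = new ZipFile(jar.toFile())) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                if (!entry.isDirectory() && entry.getName().startsWith(prefix)) {
                    crcs.put(entry.getName(), entry.getCrc());
                }
            }
        }
        return crcs;
    }

    /** Returns entries under the prefix in the combined jar that are absent from, or differ in, the split jar. */
    static SortedSet<String> mismatches(Path combined, Path split, String prefix) throws IOException {
        Map<String, Long> expected = checksums(combined, prefix);
        Map<String, Long> actual = checksums(split, prefix);
        SortedSet<String> bad = new TreeSet<>();
        for (Map.Entry<String, Long> e : expected.entrySet()) {
            if (!e.getValue().equals(actual.get(e.getKey()))) {
                bad.add(e.getKey());
            }
        }
        return bad;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java JarEntryDiff combined.jar split.jar some/prefix/
        mismatches(Paths.get(args[0]), Paths.get(args[1]), args[2]).forEach(System.out::println);
    }
}
```

An empty result only shows that the packaged files match; it says nothing about whether the build actually fetched the jar that was inspected.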
So all the other tests pass except this? Seems like new GazetteerViewGenerator("resources/gazetteers/gazetteers", ViewNames.GAZETTEER + "Gazetteers"); |
Yup, other tests work just fine :-
And, yes, with the replacement you suggested above, it works! I'm not sure I understand why that is the case. |
Actually that probably wasn't a good suggestion, as the annotators are lazy. So creating an instance doesn't cause it to load the resource. How about you run |
doInitialize works for your suggestion above, but not for the original code i.e.
|
OK, so there must be a problematic or missing resource. Why don't you add enough logging to the constructor to see which input it is failing on? |
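One way to narrow this down before touching the constructor is to probe the classpath for each resource path and log what resolves. The sketch below is an illustration only: `ResourceProbe` is a hypothetical class, not Edison code, and it relies solely on the standard `ClassLoader.getResource` call. The paths to probe would be whatever the failing code passes in, for example the `resources/gazetteers/gazetteers` path mentioned above.

```java
import java.net.URL;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.logging.Logger;

// Hypothetical helper: logs which resource paths resolve on the classpath and which do not.
final class ResourceProbe {
    private static final Logger LOG = Logger.getLogger(ResourceProbe.class.getName());

    /** Returns the subset of paths that the given class loader cannot resolve. */
    static List<String> findMissing(ClassLoader loader, List<String> paths) {
        List<String> missing = new ArrayList<>();
        for (String path : paths) {
            // Note: a directory path inside a jar resolves only if the jar
            // contains an explicit entry for that directory.
            URL url = loader.getResource(path);
            if (url == null) {
                LOG.warning("not on classpath: " + path);
                missing.add(path);
            } else {
                LOG.info(path + " -> " + url);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Usage: java -cp <test classpath> ResourceProbe path1 path2 ...
        List<String> missing = findMissing(ClassLoader.getSystemClassLoader(), Arrays.asList(args));
        System.out.println(missing.isEmpty() ? "all resources found" : "missing: " + missing);
    }
}
```

The logged URL also shows which jar a resource was loaded from, which helps distinguish a packaging problem from a stale or partially fetched dependency.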
Sure, I'll debug it more thoroughly then -- just wanted to see if you guys have encountered similar problems elsewhere, and if it's an edison-only issue or not. |
@shatu no progress on this? |
@danyaljj .. apparently, this was over long ago. The same code just worked now; most likely, our maven repo had some issues back then in fetching the resources jar. Edison tests pass now; should we now create tickets in the projects that use common-resources? |
@shatu great -- and yes, it would be helpful to open tickets in affected projects... |
...into several smaller pieces. Gazetteers can go in one; brown clusters (and maybe other clusters) in another. This may help reduce the problems with large dependencies in CI.