Prepare for public release

jfowkes · jfowkes · commit f438b26539a1 · 2016-08-02T15:13:07.000+01:00
diff --git a/.travis.yml b/.travis.yml
@@ -0,0 +1,3 @@
+language: java
+jdk:
+  - oraclejdk8
diff --git a/LICENSE b/LICENSE
diff --git a/README.md b/README.md
@@ -0,0 +1,161 @@
+PAM: Probabilistic API Miner [![Build Status](https://travis-ci.org/mast-group/api-mining.svg?branch=master)](https://travis-ci.org/mast-group/api-mining)
+================
+ 
+PAM is a near parameter-free probabilistic algorithm for mining the most interesting API patterns from a list of API call sequences. PAM largely avoids returning redundant and spurious sequences, unlike API mining approaches based on frequent pattern mining.
+
+This is an implementation of the API miner from our paper:  
+[*Parameter-Free Probabilistic API Mining across GitHub*](http://arxiv.org/abs/1512.05558)  
+J. Fowkes and C. Sutton. FSE 2016.   
+
+
+Installation 
+------------
+
+#### Installing in Eclipse
+
+Simply import as a maven project into [Eclipse](https://eclipse.org/) using the *File -> Import...* menu option (note that this requires [m2eclipse](http://eclipse.org/m2e/)). 
+
+It's also possible to export a runnable jar from Eclipse using the *File -> Export...* menu option.
+
+#### Compiling a Runnable Jar
+
+To compile a standalone runnable jar, simply run
+
+```
+mvn package
+```
+
+in the top-level directory (note that this requires [maven](https://maven.apache.org/)). This will create the standalone runnable jar ```api-mining-1.0.jar``` in the api-mining/target subdirectory. The main class is *apimining.pam.main.PAM* (see below).
+
+
+Running PAM
+-----------
+
+PAM uses a probabilistic model to determine which API patterns are the most interesting in a given dataset.  
+
+#### Mining API Patterns 
+
+Main class *apimining.pam.main.PAM* mines API patterns from a specified API call sequence file. It has the following command line options:
+
+* **-f**  &nbsp;  API call sequence file to mine (in [ARFF](https://weka.wikispaces.com/ARFF+%28stable+version%29) format, see below)
+* **-o**  &nbsp;  output file
+* **-i**  &nbsp;  max. no. iterations
+* **-s**  &nbsp;  max. no. structure steps
+* **-r**  &nbsp;  max. runtime (min)
+* **-l**  &nbsp;  log level (INFO/FINE/FINER/FINEST)
+* **-v**  &nbsp;  log to console instead of log file   
+
+See the class javadocs in *apimining.pam.main.PAM* for information on the Java interface.
+In Eclipse you can set command line arguments for the PAM interface using the *Run Configurations...* menu option. 
+
+#### Example Usage
+
+The following is a complete example using the command line interface of the runnable jar. To mine the provided dataset ```netty.arff```, run: 
+
+  ```sh 
+  $ java -jar api-mining/target/api-mining-1.0.jar -i 1000 -f datasets/calls/all/netty.arff -o patterns.txt -v 
+  ```
+
+which will write the mined API patterns to ```patterns.txt```. Omitting the ```-v``` flag will redirect logging to a log file in ```/tmp/```. 
+
+Input/Output Formats
+--------------------
+
+#### Input Format
+
+PAM takes as input a list of API call sequences in [ARFF](https://weka.wikispaces.com/ARFF+%28stable+version%29) file format.
+The ARFF format is very simple and best illustrated by example. The first few lines from ```netty.arff``` are:
+
+```text
+@relation netty
+
+@attribute fqCaller string
+@attribute fqCalls string
+
+@data
+'com.torrent4j.net.peerwire.AbstractPeerWireMessage.write','io.netty.buffer.ChannelBuffer.writeByte'
+'com.torrent4j.net.peerwire.messages.BitFieldMessage.writeImpl','io.netty.buffer.ChannelBuffer.writeByte'
+'com.torrent4j.net.peerwire.messages.BitFieldMessage.readImpl','io.netty.buffer.ChannelBuffer.readable io.netty.buffer.ChannelBuffer.readByte'
+'com.torrent4j.net.peerwire.messages.BlockMessage.writeImpl','io.netty.buffer.ChannelBuffer.writeInt io.netty.buffer.ChannelBuffer.writeInt io.netty.buffer.ChannelBuffer.writeBytes'
+'com.torrent4j.net.peerwire.messages.BlockMessage.readImpl','io.netty.buffer.ChannelBuffer.readInt io.netty.buffer.ChannelBuffer.readInt io.netty.buffer.ChannelBuffer.readableBytes io.netty.buffer.ChannelBuffer.readBytes'
+```
+
+The ```@relation``` declaration names the dataset and the following two ```@attribute``` statements declare that the dataset consists of two comma separated attributes:   
+* ```fqCaller``` &nbsp; the fully-qualified name of the client method, enclosed in single quotes  
+* ```fqCalls``` &nbsp; a space-separated list of fully-qualified names of API method calls, enclosed in single quotes.
+
+The dataset is listed after the ```@data``` declaration: each line contains a specific method (```fqCaller```) and its API call 
+sequence (```fqCalls```). Note that the ```fqCaller``` attribute can be empty for PAM and UPMiner; it is only required for MAPO (see below).
+Note that while this example uses Java, PAM is language agnostic and can use API call sequences from *any* language.
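To make the two-attribute layout described above concrete, here is a minimal shell sketch that prints only the `fqCalls` sequences from an ARFF file. The function name `arff_calls` is hypothetical and not part of PAM. The sketch assumes that every data row has exactly the `'fqCaller','fqCalls'` shape shown in the example and that neither value contains a quote character.

```sh
#!/bin/bash
# Print the fqCalls field (the API call sequence) of every data row in an
# ARFF file, one sequence per line.
# Assumption: rows look like 'fqCaller','fqCalls' with no embedded quotes.
# $1 = path to the ARFF file
arff_calls() {
  awk '
    data && NF && !/^%/ {            # non-empty, non-comment rows after @data
      split($0, field, "\047")       # \047 is the single-quote character
      print field[4]                 # 2nd quoted value: the call sequence
    }
    tolower($1) == "@data" { data = 1 }   # data rows start on the next line
  ' "$1"
}
```

For instance, `arff_calls datasets/calls/all/netty.arff | head` would list the first few call sequences. Because an empty `fqCaller` still occupies the first quoted slot, the same field index applies to those rows.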
+
+#### Output Format
+
+PAM outputs a list of the most interesting API call patterns (i.e. subsequences of the original API call sequences) ordered by their probability under the model. 
+For example, the first few lines in the output file ```patterns.txt``` for the usage example above are:
+
+```text
+prob: 0.04878
+[io.netty.channel.Channel.write]
+
+prob: 0.04065
+[io.netty.channel.ExceptionEvent.getCause, io.netty.channel.ExceptionEvent.getChannel]
+
+prob: 0.04065
+[io.netty.channel.ChannelHandlerContext.getChannel]
+
+prob: 0.03252
+[io.netty.channel.Channel.close]
+```
+
+See the accompanying [paper](http://arxiv.org/abs/1512.05558) for details.
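As an illustration of the output layout shown above, the following shell sketch keeps only the patterns whose probability is at least a given threshold and prints them as tab-separated `probability`, `pattern` pairs. The function name `patterns_above` is hypothetical. The sketch assumes that every pattern is a single bracketed line directly after its `prob:` line, as in the excerpt.

```sh
#!/bin/bash
# List patterns from a PAM output file whose probability is >= a threshold.
# Assumption: each "prob: <p>" line is followed by one "[...]" pattern line.
# $1 = path to the output file, $2 = minimum probability
patterns_above() {
  awk -v min="$2" '
    $1 == "prob:" { p = $2; next }       # remember the latest probability
    /^\[/ && p + 0 >= min + 0 {          # bracketed line: the pattern itself
      print p "\t" substr($0, 2, length($0) - 2)   # drop the [ and ]
    }
  ' "$1"
}
```

For example, `patterns_above patterns.txt 0.04` applied to the excerpt above would keep the first three patterns and drop `io.netty.channel.Channel.close`.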
+
+
+Java API Call Extractor
+-----------------------
+
+The class *apimining.java.APICallExtractor* contains our 'best-effort' API call sequence extractor for Java source files. 
+We used it to create the API call sequence datasets for our paper.  
+
+It takes folders of API client source files as input and generates API call sequences files (in ARFF format) for each API library given. For best performance, it requires a folder of namespaces used in the libraries so that it can resolve wildcarded namespaces. These can be collected using the provided Wildcard Namespace Collector class: *apimining.java.WildcardNamespaceCollector*. 
+
+See the individual class javadocs in *apimining.java* for details of their use.
+
+
+MAPO and UPMiner Implementations
+--------------------------------
+
+For comparison purposes, we implemented the API miners [MAPO](https://www.cs.sfu.ca/~jpei/publications/Mapo-ecoop09.pdf) and [UPMiner](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/07/miningsuccincthighcoverageapiusagepatternsfromsourcecode.pdf) from scratch using the [Weka](http://www.cs.waikato.ac.nz/ml/weka/) hierarchical clusterer. These are provided in the 
+*apimining.mapo.MAPO* and *apimining.upminer.UPMiner* classes respectively. They have the following command line options:
+
+* **-f**  &nbsp;  API call sequence file to mine (in [ARFF](https://weka.wikispaces.com/ARFF+%28stable+version%29) format, see above)
+* **-o**  &nbsp;  output folder
+* **-s**  &nbsp;  minimum support threshold
+
+See the individual class files for information on the Java interface. Note that these implementations are not particularly fast, as Weka's hierarchical clusterer is rather slow and inefficient. Moreover, as both API miners are based on frequent pattern mining algorithms, they can suffer from pattern explosion, a known problem with such algorithms.
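The three options above can be combined as in the following shell sketch, which mirrors the invocation used in the `run-MAPO-all.sh` and `run-UPMINER-all.sh` scripts of this commit. The wrapper `run_miner` and the `DRY_RUN` variable are hypothetical helpers added for illustration. Whether the miners create a missing output folder themselves is not stated here, so the sketch does not rely on either behaviour.

```sh
#!/bin/bash
# Run one of the baseline miners, or just print the command when DRY_RUN is set.
# $1 = main class (apimining.mapo.MAPO or apimining.upminer.UPMiner)
# $2 = ARFF file, $3 = output folder, $4 = minimum support threshold
run_miner() {
  # Rebuild the positional parameters as the full java command line.
  set -- java -cp api-mining/target/api-mining-1.0.jar "$1" -f "$2" -o "$3" -s "$4"
  if [ -n "${DRY_RUN:-}" ]; then
    echo "$*"      # show the command without executing it
  else
    "$@"           # execute the miner
  fi
}
```

With the netty threshold from `run-MAPO-all.sh`, a call would look like `run_miner apimining.mapo.MAPO datasets/calls/all/netty.arff output/all/netty/mapo/ 0.2`.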
+
+
+Datasets
+--------
+
+All datasets used in the paper are available in the ```datasets/``` subdirectory: 
+* ```datasets/calls/all``` contains API call sequences for each of the 17 Java libraries described in our [paper](http://arxiv.org/abs/1512.05558) (see Table 1)
+* ```datasets/calls/train``` contains the subset of API call sequences used as the 'training set' in the paper
+ 
+Both datasets use the [ARFF](https://weka.wikispaces.com/ARFF+%28stable+version%29) file format described above. In addition, so that it is possible to replicate our evaluation, we have provided the Java source files for:
+* each of the library client classes in ```datasets/source/client_files.tar.xz``` 
+* the library example classes in ```datasets/source/example_files.tar.xz``` 
+* the namespaces necessary for our *API Call Extractor* in ```namespaces.tar.xz```
+
+Finally, the ```datasets/source/test_train_split``` subdirectory details the training/test set assignments for each client class.
+
+
+Bugs
+----
+
+Please report any bugs using GitHub's issue tracker.
+
+
+License
+-------
+
+This algorithm is released under the GNU GPLv3 license. Other licenses are available on request.
diff --git a/run-MAPO-all.sh b/run-MAPO-all.sh
@@ -0,0 +1,26 @@
+#!/bin/bash
+for i in \
+andengine,0.2 \
+camel,0.005 \
+cloud9,0.06 \
+drools,0.2 \
+hornetq,0.34 \
+mahout,0.02 \
+neo4j,0.05 \
+netty,0.2 \
+resteasy,0.1 \
+restlet-framework-java,0.3 \
+spring-data-mongodb,0.005 \
+spring-data-neo4j,0.005 \
+twitter4j,0.2 \
+webobjects,0.006 \
+weld,0.005 \
+wicket,0.13 \
+hadoop,0.4
+do IFS=',' 
+set $i
+java -cp api-mining/target/api-mining-1.0.jar apimining.mapo.MAPO \
+-f datasets/calls/all/$1.arff \
+-o output/all/$1/mapo/ \
+-s $2
+done 
diff --git a/run-MAPO-train.sh b/run-MAPO-train.sh
@@ -0,0 +1,26 @@
+#!/bin/bash
+for i in \
+andengine,0.09 \
+camel,0.007 \
+cloud9,0.02 \
+drools,0.06 \
+hornetq,0.06 \
+mahout,0.3 \
+neo4j,0.005 \
+netty,0.005 \
+resteasy,0.07 \
+restlet-framework-java,0.2 \
+spring-data-mongodb,0.005 \
+spring-data-neo4j,0.005 \
+twitter4j,0.3 \
+webobjects,0.007 \
+weld,0.005 \
+wicket,0.2 \
+hadoop,0.008
+do IFS=',' 
+set $i
+java -cp api-mining/target/api-mining-1.0.jar apimining.mapo.MAPO \
+-f datasets/calls/train/$1.arff \
+-o output/train/$1/mapo/ \
+-s $2
+done 
diff --git a/run-PAM-all.sh b/run-PAM-all.sh
@@ -0,0 +1,22 @@
+#!/bin/bash
+for p in \
+andengine \
+camel \
+cloud9 \
+drools \
+hadoop \
+hornetq \
+mahout \
+neo4j \
+netty \
+resteasy \
+restlet-framework-java \
+spring-data-mongodb \
+spring-data-neo4j \
+twitter4j \
+webobjects \
+weld \
+wicket
+do
+java -cp api-mining/target/api-mining-1.0.jar apimining.pam.main.PAM -f datasets/calls/all/$p.arff -i 10000 -o output/all/$p/PAM_seqs.txt
+done
diff --git a/run-PAM-train.sh b/run-PAM-train.sh
@@ -0,0 +1,22 @@
+#!/bin/bash
+for p in \
+andengine \
+camel \
+cloud9 \
+drools \
+hadoop \
+hornetq \
+mahout \
+neo4j \
+netty \
+resteasy \
+restlet-framework-java \
+spring-data-mongodb \
+spring-data-neo4j \
+twitter4j \
+webobjects \
+weld \
+wicket
+do
+java -cp api-mining/target/api-mining-1.0.jar apimining.pam.main.PAM -f datasets/calls/train/$p.arff -i 10000 -o output/train/$p/PAM_seqs.txt
+done
diff --git a/run-UPMINER-all.sh b/run-UPMINER-all.sh
@@ -0,0 +1,26 @@
+#!/bin/bash
+for i in \
+andengine,0.4 \
+camel,0.2 \
+cloud9,0.005 \
+drools,0.2 \
+hornetq,0.005 \
+mahout,0.005 \
+neo4j,0.06 \
+netty,0.005 \
+resteasy,0.005 \
+restlet-framework-java,0.03 \
+spring-data-mongodb,0.005 \
+spring-data-neo4j,0.005 \
+twitter4j,0.005 \
+webobjects,0.12 \
+weld,0.005 \
+wicket,0.3 \
+hadoop,0.3
+do IFS=',' 
+set $i
+java -cp api-mining/target/api-mining-1.0.jar apimining.upminer.UPMiner \
+-f datasets/calls/all/$1.arff \
+-o output/all/$1/upminer/ \
+-s $2
+done 
diff --git a/run-UPMINER-train.sh b/run-UPMINER-train.sh
@@ -0,0 +1,26 @@
+#!/bin/bash
+for i in \
+andengine,0.05 \
+camel,0.4 \
+cloud9,0.01 \
+drools,0.1 \
+hornetq,0.01 \
+mahout,0.01 \
+neo4j,0.1 \
+netty,0.005 \
+resteasy,0.3 \
+restlet-framework-java,0.01 \
+spring-data-mongodb,0.005 \
+spring-data-neo4j,0.005 \
+twitter4j,0.005 \
+webobjects,0.072 \
+weld,0.005 \
+wicket,0.38 \
+hadoop,0.1
+do IFS=',' 
+set $i
+java -cp api-mining/target/api-mining-1.0.jar apimining.upminer.UPMiner \
+-f datasets/calls/train/$1.arff \
+-o output/train/$1/upminer/ \
+-s $2
+done 
