This repository was archived by the owner on Dec 1, 2021. It is now read-only.

Commit f438b26 (parent d6e0c16): Prepare for public release

9 files changed: +987 -0 lines

.travis.yml (+3 lines)

```yaml
language: java
jdk:
  - oraclejdk8
```

LICENSE (+675 lines; large diff not rendered)

README.md (+161 lines)

PAM: Probabilistic API Miner [![Build Status](https://travis-ci.org/mast-group/api-mining.svg?branch=master)](https://travis-ci.org/mast-group/api-mining)
================

PAM is a near parameter-free probabilistic algorithm for mining the most interesting API patterns from a list of API call sequences. Unlike API mining approaches based on frequent pattern mining, PAM largely avoids returning redundant and spurious sequences.

This is an implementation of the API miner from our paper:
[*Parameter-Free Probabilistic API Mining across GitHub*](http://arxiv.org/abs/1512.05558)
J. Fowkes and C. Sutton. FSE 2016.


Installation
------------

#### Installing in Eclipse

Simply import as a Maven project into [Eclipse](https://eclipse.org/) using the *File -> Import...* menu option (note that this requires [m2eclipse](http://eclipse.org/m2e/)).

It's also possible to export a runnable jar from Eclipse using the *File -> Export...* menu option.

#### Compiling a Runnable Jar

To compile a standalone runnable jar, simply run

```
mvn package
```

in the top-level directory (note that this requires [Maven](https://maven.apache.org/)). This will create the standalone runnable jar ```api-mining-1.0.jar``` in the ```api-mining/target``` subdirectory. The main class is *apimining.pam.main.PAM* (see below).


Running PAM
-----------

PAM uses a probabilistic model to determine which API patterns are the most interesting in a given dataset.

#### Mining API Patterns

The main class *apimining.pam.main.PAM* mines API patterns from a specified API call sequence file. It has the following command line options:

* **-f**   API call sequence file to mine (in [ARFF](https://weka.wikispaces.com/ARFF+%28stable+version%29) format, see below)
* **-o**   output file
* **-i**   max. no. iterations
* **-s**   max. no. structure steps
* **-r**   max. runtime (min)
* **-l**   log level (INFO/FINE/FINER/FINEST)
* **-v**   log to console instead of a log file

See the individual file javadocs in *apimining.pam.main.PAM* for information on the Java interface.
In Eclipse you can set command line arguments for the PAM interface using the *Run Configurations...* menu option.
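When launching PAM from Java rather than the shell, the flags above can be assembled into an ordinary argument array. The sketch below is illustrative only: the ```PamArgs``` helper is not part of PAM, the paths are examples, and actually invoking ```PAM.main``` requires the built jar on the classpath.

```java
/** Illustrative sketch (not part of PAM): assembling PAM's command-line flags in Java. */
public class PamArgs {
    /** Build an argument array for apimining.pam.main.PAM. */
    public static String[] build(String arffFile, String outFile, int maxIterations) {
        return new String[] {
            "-f", arffFile,                       // API call sequence file (ARFF)
            "-o", outFile,                        // output file for mined patterns
            "-i", String.valueOf(maxIterations),  // max. no. iterations
            "-v"                                  // log to console
        };
    }

    public static void main(String[] args) {
        String[] pamArgs = build("datasets/calls/all/netty.arff", "patterns.txt", 1000);
        // apimining.pam.main.PAM.main(pamArgs);  // requires api-mining-1.0.jar on the classpath
        System.out.println(String.join(" ", pamArgs));
    }
}
```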
#### Example Usage

A complete example using the command line interface on a runnable jar. We can mine the provided dataset ```netty.arff``` as follows:

```sh
$ java -jar api-mining/target/api-mining-1.0.jar -i 1000 -f datasets/calls/all/netty.arff -o patterns.txt -v
```

which will write the mined API patterns to ```patterns.txt```. Omitting the ```-v``` flag will redirect logging to a log file in ```/tmp/```.

Input/Output Formats
--------------------

#### Input Format

PAM takes as input a list of API call sequences in [ARFF](https://weka.wikispaces.com/ARFF+%28stable+version%29) file format.
The ARFF format is very simple and best illustrated by example. The first few lines of ```netty.arff``` are:

```text
@relation netty

@attribute fqCaller string
@attribute fqCalls string

@data
'com.torrent4j.net.peerwire.AbstractPeerWireMessage.write','io.netty.buffer.ChannelBuffer.writeByte'
'com.torrent4j.net.peerwire.messages.BitFieldMessage.writeImpl','io.netty.buffer.ChannelBuffer.writeByte'
'com.torrent4j.net.peerwire.messages.BitFieldMessage.readImpl','io.netty.buffer.ChannelBuffer.readable io.netty.buffer.ChannelBuffer.readByte'
'com.torrent4j.net.peerwire.messages.BlockMessage.writeImpl','io.netty.buffer.ChannelBuffer.writeInt io.netty.buffer.ChannelBuffer.writeInt io.netty.buffer.ChannelBuffer.writeBytes'
'com.torrent4j.net.peerwire.messages.BlockMessage.readImpl','io.netty.buffer.ChannelBuffer.readInt io.netty.buffer.ChannelBuffer.readInt io.netty.buffer.ChannelBuffer.readableBytes io.netty.buffer.ChannelBuffer.readBytes'
```

The ```@relation``` declaration names the dataset and the following two ```@attribute``` statements declare that the dataset consists of two comma-separated attributes:
* ```fqCaller```   the fully-qualified name of the client method, enclosed in single quotes
* ```fqCalls```   a space-separated list of fully-qualified names of API method calls, enclosed in single quotes.

The dataset is listed after the ```@data``` declaration: each line contains a specific method (```fqCaller```) and its API call sequence (```fqCalls```). Note that the ```fqCaller``` attribute can be empty for PAM and UPMiner; it is only required for MAPO (see below).
Note that while this example uses Java, PAM is language agnostic and can use API call sequences from *any* language.
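Since each ```@data``` line is just two single-quoted attributes separated by a comma, it is straightforward to read back programmatically. A minimal sketch, with the caveat that the ```ArffCallLine``` class is illustrative rather than part of PAM's API, and it ignores quoting edge cases:

```java
import java.util.Arrays;
import java.util.List;

/** Illustrative sketch (not part of PAM): parsing one @data line of a PAM input file. */
public class ArffCallLine {
    public final String fqCaller;
    public final List<String> fqCalls;

    public ArffCallLine(String fqCaller, List<String> fqCalls) {
        this.fqCaller = fqCaller;
        this.fqCalls = fqCalls;
    }

    /** Split a line of the form 'caller','call1 call2 ...' into its two attributes. */
    public static ArffCallLine parse(String line) {
        // Split on the ',' between the two quoted attributes, then strip the outer quotes
        String[] parts = line.split("','", 2);
        String caller = parts[0].substring(1);                        // drop leading '
        String calls = parts[1].substring(0, parts[1].length() - 1);  // drop trailing '
        return new ArffCallLine(caller, Arrays.asList(calls.split(" ")));
    }

    public static void main(String[] args) {
        ArffCallLine l = parse("'com.torrent4j.net.peerwire.messages.BitFieldMessage.readImpl',"
                + "'io.netty.buffer.ChannelBuffer.readable io.netty.buffer.ChannelBuffer.readByte'");
        System.out.println(l.fqCaller + " -> " + l.fqCalls);
    }
}
```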
#### Output Format

PAM outputs a list of the most interesting API call patterns (i.e. subsequences of the original API call sequences) ordered by their probability under the model.
For example, the first few lines of the output file ```patterns.txt``` for the usage example above are:

```text
prob: 0.04878
[io.netty.channel.Channel.write]

prob: 0.04065
[io.netty.channel.ExceptionEvent.getCause, io.netty.channel.ExceptionEvent.getChannel]

prob: 0.04065
[io.netty.channel.ChannelHandlerContext.getChannel]

prob: 0.03252
[io.netty.channel.Channel.close]
```

See the accompanying [paper](http://arxiv.org/abs/1512.05558) for details.
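The output is equally easy to post-process: each pattern is a ```prob:``` line followed by a bracketed, comma-separated call list. A hypothetical reader sketch (again, not part of PAM's API) that collects patterns with their probabilities:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative sketch (not part of PAM): reading patterns.txt output blocks. */
public class PatternOutputReader {
    /** Map each pattern (list of calls) to its probability, preserving file order. */
    public static Map<List<String>, Double> read(List<String> lines) {
        Map<List<String>, Double> patterns = new LinkedHashMap<>();
        double prob = 0.0;
        for (String line : lines) {
            line = line.trim();
            if (line.startsWith("prob:")) {
                // Remember the probability for the pattern on the next line
                prob = Double.parseDouble(line.substring(5).trim());
            } else if (line.startsWith("[")) {
                // Parse "[call1, call2, ...]" into a list of fully-qualified calls
                String body = line.substring(1, line.length() - 1);
                List<String> calls = new ArrayList<>();
                for (String call : body.split(",")) calls.add(call.trim());
                patterns.put(calls, prob);
            }
        }
        return patterns;
    }
}
```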
Java API Call Extractor
-----------------------

The class *apimining.java.APICallExtractor* contains our 'best-effort' API call sequence extractor for Java source files.
We used it to create the API call sequence datasets for our paper.

It takes folders of API client source files as input and generates API call sequence files (in ARFF format) for each API library given. For best performance, it requires a folder of namespaces used in the libraries so that it can resolve wildcarded namespaces. These can be collected using the provided wildcard namespace collector class: *apimining.java.WildcardNamespaceCollector*.

See the individual class javadocs in *apimining.java* for details of their use.


MAPO and UPMiner Implementations
--------------------------------

For comparison purposes, we implemented the API miners [MAPO](https://www.cs.sfu.ca/~jpei/publications/Mapo-ecoop09.pdf) and [UPMiner](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/07/miningsuccincthighcoverageapiusagepatternsfromsourcecode.pdf) from scratch using the [Weka](http://www.cs.waikato.ac.nz/ml/weka/) hierarchical clusterer. These are provided in the *apimining.mapo.MAPO* and *apimining.upminer.UPMiner* classes respectively. They have the following command line options:

* **-f**   API call sequence file to mine (in [ARFF](https://weka.wikispaces.com/ARFF+%28stable+version%29) format, see above)
* **-o**   output folder
* **-s**   minimum support threshold

See the individual class files for information on the Java interface. Note that these are not particularly fast implementations, as Weka's hierarchical clusterer is rather slow and inefficient. Moreover, as both API miners are based on frequent pattern mining algorithms, they can suffer from pattern explosion (a known problem with frequent pattern mining algorithms).


Datasets
--------

All datasets used in the paper are available in the ```datasets/``` subdirectory:
* ```datasets/calls/all``` contains API call sequences for each of the 17 Java libraries described in our [paper](http://arxiv.org/abs/1512.05558) (see Table 1)
* ```datasets/calls/train``` contains the subset of API call sequences used as the 'training set' in the paper

Both datasets use the [ARFF](https://weka.wikispaces.com/ARFF+%28stable+version%29) file format described above. In addition, so that it is possible to replicate our evaluation, we have provided the Java source files for:
* each of the library client classes in ```datasets/source/client_files.tar.xz```
* the library example classes in ```datasets/source/example_files.tar.xz```
* the namespaces necessary for our *API Call Extractor* in ```namespaces.tar.xz```

Finally, the ```datasets/source/test_train_split``` subdirectory details the training/test set assignments for each client class.


Bugs
----

Please report any bugs using GitHub's issue tracker.


License
-------

This algorithm is released under the GNU GPLv3 license. Other licenses are available on request.

run-MAPO-all.sh (+26 lines)

```sh
#!/bin/bash
for i in \
    andengine,0.2 \
    camel,0.005 \
    cloud9,0.06 \
    drools,0.2 \
    hornetq,0.34 \
    mahout,0.02 \
    neo4j,0.05 \
    netty,0.2 \
    resteasy,0.1 \
    restlet-framework-java,0.3 \
    spring-data-mongodb,0.005 \
    spring-data-neo4j,0.005 \
    twitter4j,0.2 \
    webobjects,0.006 \
    weld,0.005 \
    wicket,0.13 \
    hadoop,0.4
do IFS=','
    set $i
    java -cp api-mining/target/api-mining-1.0.jar apimining.mapo.MAPO \
        -f datasets/calls/all/$1.arff \
        -o output/all/$1/mapo/ \
        -s $2
done
```

run-MAPO-train.sh (+26 lines)

```sh
#!/bin/bash
for i in \
    andengine,0.09 \
    camel,0.007 \
    cloud9,0.02 \
    drools,0.06 \
    hornetq,0.06 \
    mahout,0.3 \
    neo4j,0.005 \
    netty,0.005 \
    resteasy,0.07 \
    restlet-framework-java,0.2 \
    spring-data-mongodb,0.005 \
    spring-data-neo4j,0.005 \
    twitter4j,0.3 \
    webobjects,0.007 \
    weld,0.005 \
    wicket,0.2 \
    hadoop,0.008
do IFS=','
    set $i
    java -cp api-mining/target/api-mining-1.0.jar apimining.mapo.MAPO \
        -f datasets/calls/train/$1.arff \
        -o output/train/$1/mapo/ \
        -s $2
done
```

run-PAM-all.sh (+22 lines)

```sh
#!/bin/bash
for p in \
    andengine \
    camel \
    cloud9 \
    drools \
    hadoop \
    hornetq \
    mahout \
    neo4j \
    netty \
    resteasy \
    restlet-framework-java \
    spring-data-mongodb \
    spring-data-neo4j \
    twitter4j \
    webobjects \
    weld \
    wicket
do
    java -cp api-mining/target/api-mining-1.0.jar apimining.pam.main.PAM -f datasets/calls/all/$p.arff -i 10000 -o output/all/$p/PAM_seqs.txt
done
```

run-PAM-train.sh (+22 lines)

```sh
#!/bin/bash
for p in \
    andengine \
    camel \
    cloud9 \
    drools \
    hadoop \
    hornetq \
    mahout \
    neo4j \
    netty \
    resteasy \
    restlet-framework-java \
    spring-data-mongodb \
    spring-data-neo4j \
    twitter4j \
    webobjects \
    weld \
    wicket
do
    java -cp api-mining/target/api-mining-1.0.jar apimining.pam.main.PAM -f datasets/calls/train/$p.arff -i 10000 -o output/train/$p/PAM_seqs.txt
done
```

run-UPMINER-all.sh (+26 lines)

```sh
#!/bin/bash
for i in \
    andengine,0.4 \
    camel,0.2 \
    cloud9,0.005 \
    drools,0.2 \
    hornetq,0.005 \
    mahout,0.005 \
    neo4j,0.06 \
    netty,0.005 \
    resteasy,0.005 \
    restlet-framework-java,0.03 \
    spring-data-mongodb,0.005 \
    spring-data-neo4j,0.005 \
    twitter4j,0.005 \
    webobjects,0.12 \
    weld,0.005 \
    wicket,0.3 \
    hadoop,0.3
do IFS=','
    set $i
    java -cp api-mining/target/api-mining-1.0.jar apimining.upminer.UPMiner \
        -f datasets/calls/all/$1.arff \
        -o output/all/$1/upminer/ \
        -s $2
done
```

run-UPMINER-train.sh (+26 lines)

```sh
#!/bin/bash
for i in \
    andengine,0.05 \
    camel,0.4 \
    cloud9,0.01 \
    drools,0.1 \
    hornetq,0.01 \
    mahout,0.01 \
    neo4j,0.1 \
    netty,0.005 \
    resteasy,0.3 \
    restlet-framework-java,0.01 \
    spring-data-mongodb,0.005 \
    spring-data-neo4j,0.005 \
    twitter4j,0.005 \
    webobjects,0.072 \
    weld,0.005 \
    wicket,0.38 \
    hadoop,0.1
do IFS=','
    set $i
    java -cp api-mining/target/api-mining-1.0.jar apimining.upminer.UPMiner \
        -f datasets/calls/train/$1.arff \
        -o output/train/$1/upminer/ \
        -s $2
done
```

0 commit comments