Skip to content

GSOC Application: OCR prototype for ancient documents #12881

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

lareinahu-2023
Copy link

GSOC Application: OCR Implementation Prototype

This PR contains a prototype implementation of OCR support for ancient documents
as part of my Google Summer of Code application. It is meant to demonstrate my
understanding of JabRef's architecture and design patterns.

Components Implemented

  1. OCR Service Interface Prototype

    • Created interface following JabRef's hexagonal architecture
    • Implemented adapter pattern for multiple OCR engines
    • Integrated with JabRef's existing patterns
  2. PDF Text Layer Utility

    • Built utilities to add OCR text as searchable PDF layers
    • Used PDFBox in JabRef's style
    • Demonstrated Lucene integration for indexing
  3. Preference Panel UI

    • Designed JavaFX panel following JabRef's UI patterns
    • Used property binding with validation
    • Integrated with JabRef's preference system

Implementation Approach

The implementation follows JabRef's architecture with clean separation of:

  • Domain models (OcrResult, OcrLanguage)
  • Interfaces/ports (OcrService)
  • Adapters (TesseractAdapter)
  • Application logic (OcrManager)
  • UI components (OcrTab)

I've included detailed documentation in code comments and the implementation plan.

Note: This is not intended for merging but for demonstrating my understanding
of JabRef's codebase for my GSoC application.

  This commit demonstrates my understanding of JabRef's architecture by:
  - Creating OCR service interface with hexagonal architecture
  - Implementing PDF text layer utilities with PDFBox
  - Designing preference panel for OCR configuration

@Override
public String getTabName() {
return Localization.lang("OCR");
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The string "OCR" should be localized using the proper localization method to ensure consistency across different languages.

@jabref-machine
Copy link
Collaborator

Your pull request needs to link an issue.

To ease organizational workflows, please link this pull-request to the issue with syntax as described in https://docs.github.com/en/issues/tracking-your-work-with-issues/linking-a-pull-request-to-an-issue:

Linking a pull request to an issue using a keyword

You can link a pull request to an issue by using a supported keyword in the pull request's description or in a commit message. The pull request must be on the default branch.

  • close
  • closes
  • closed
  • fix
  • fixes
  • fixed
  • resolve
  • resolves
  • resolved

If you use a keyword to reference a pull request comment in another pull request, the pull requests will be linked. Merging the referencing pull request also closes the referenced pull request.

The syntax for closing keywords depends on whether the issue is in the same repository as the pull request.

Examples

  • Fixes #xyz links pull-request to issue. Merging the PR will close the issue.
  • Fixes https://github.com/JabRef/jabref/issues/xyz links pull-request to issue. Merging the PR will close the issue.
  • Fixes https://github.com/Koppor/jabref/issues/xyz links pull-request to issue. Merging the PR will close the issue.
  • Fixes [#xyz](https://github.com/JabRef/jabref/issues/xyz) links pull-request to issue. Merging the PR will NOT close the issue.

return true; // Empty path is allowed (will use system default)
}

Path tesseractDir = Paths.get(path);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use modern Java best practices by replacing Paths.get() with Path.of() to improve readability and maintainability.

* @return true if path is valid (either empty or existing directory)
*/
private boolean validateTesseractPath(String path) {
if (path == null || path.trim().isEmpty()) {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code should follow the fail fast principle by immediately handling invalid states and returning early instead of nesting logic inside else branches.

@jabref-machine
Copy link
Collaborator

JUnit tests are failing. You can see which checks are failing by locating the box "Some checks were not successful" on the pull request page. To see the test output, locate "Tests / Unit tests (pull_request)" and click on it.

You can then run these tests in IntelliJ to reproduce the failing tests locally. We offer a quick test running howto in the section Final build system checks in our setup guide.

@jabref-machine
Copy link
Collaborator

You have removed the "Mandatory Checks" section from your pull request description. Please adhere to our pull request template.

@jabref-machine
Copy link
Collaborator

Note that your PR will not be reviewed/accepted until you have gone through the mandatory checks in the description and marked each of them them exactly in the format of [x] (done), [ ] (not done yet) or [/] (not applicable).

public OcrService getOcrService(String engineName) throws OcrProcessException {
OcrService service = ocrEngines.get(engineName);

if (service == null) {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code uses exceptions for normal control flow by throwing an exception when the service is null. Exceptions should be used for exceptional states, not for normal control flow.

* @throws OcrProcessException if no engine is available
*/
public OcrService getDefaultOcrService() throws OcrProcessException {
if (defaultEngineName == null) {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code uses exceptions for normal control flow by throwing an exception when the defaultEngineName is null. Exceptions should be used for exceptional states, not for normal control flow.

registerEngines(tesseractPath);

// Set default engine (first one registered)
this.defaultEngineName = ocrEngines.isEmpty() ? null : ocrEngines.keySet().iterator().next();
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Returning null for defaultEngineName is not recommended. Use java.util.Optional to avoid null returns.

* @return OCR result containing extracted text and metadata
* @throws OcrProcessException if OCR processing fails
*/
OcrResult processPdf(Path pdfPath) throws OcrProcessException;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The method does not return an Optional, which is recommended for new public methods to avoid returning null.

* @return OCR result containing extracted text and metadata
* @throws OcrProcessException if OCR processing fails
*/
OcrResult processImage(Path imagePath) throws OcrProcessException;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The method does not return an Optional, which is recommended for new public methods to avoid returning null.

*
* @return Set of supported languages
*/
Set<OcrLanguage> getSupportedLanguages();
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The method does not return an Optional, which is recommended for new public methods to avoid returning null.

*
* @return Engine name
*/
String getEngineName();
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The method does not return an Optional, which is recommended for new public methods to avoid returning null.

*
* @return true if the engine is ready to use
*/
boolean isAvailable();
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The method does not return an Optional, which is recommended for new public methods to avoid returning null.

* @param config Configuration to apply
* @throws OcrProcessException if the configuration cannot be applied
*/
void applyConfig(OcrEngineConfig config) throws OcrProcessException;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The method does not return an Optional, which is recommended for new public methods to avoid returning null.

* @param ocrResult OCR result containing extracted text to add
* @throws OcrProcessException if adding the text layer fails
*/
void addTextLayerToPdf(Path pdfPath, Path outputPath, OcrResult ocrResult) throws OcrProcessException;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The method does not return an Optional, which is recommended for new public methods to avoid returning null.

* @param language OCR language to use
* @throws OcrProcessException if the language is not supported or cannot be set
*/
void setLanguage(OcrLanguage language) throws OcrProcessException;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The method does not return an Optional, which is recommended for new public methods to avoid returning null.

*/
@Override
public Set<OcrLanguage> getSupportedLanguages() {
return new HashSet<>(supportedLanguages);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using new HashSet<>(...) is not optimal. Modern Java best practices suggest using Set.of(...) for better readability and performance.

Comment on lines +34 to +35
throw new OcrProcessException(getEngineName(),
String.format("Language '%s' is not supported", language.getDisplayName()));
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The exception message should be localized using Localization.lang for consistency with other strings in the application.

Comment on lines +56 to +58
throw new OcrProcessException(getEngineName(),
String.format("Configuration is for engine '%s', not compatible with '%s'",
config.getEngineName(), getEngineName()));
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The exception message should be localized using Localization.lang for consistency with other strings in the application.

Comment on lines +92 to +94
protected OcrEngineConfig getCurrentConfig() {
return currentConfig;
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New public methods should not return null. Consider using java.util.Optional to avoid returning null.

* {@inheritDoc}
*/
@Override
public Set<OcrLanguage> getSupportedLanguages() {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The method should return a modern Java data structure like Set.of(...) instead of new HashSet<>(...) for better readability and performance.

*/
@Override
public OcrResult processPdf(Path pdfPath) throws OcrProcessException {
if (!isAvailable()) {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code should follow the fail fast principle by immediately handling invalid states and returning early instead of nesting logic inside else branches.

throw new OcrProcessException(ENGINE_NAME, "Tesseract is not available");
}

if (!Files.exists(pdfPath)) {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code should follow the fail fast principle by immediately handling invalid states and returning early instead of nesting logic inside else branches.

private final QualityPreset qualityPreset;

private OcrEngineConfig(Builder builder) {
this.engineName = Objects.requireNonNull(builder.engineName);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The use of Objects.requireNonNull is discouraged. Instead, the @nonnull annotation from JSpecify should be used to ensure non-null values.


private OcrEngineConfig(Builder builder) {
this.engineName = Objects.requireNonNull(builder.engineName);
this.language = Objects.requireNonNull(builder.language);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The use of Objects.requireNonNull is discouraged. Instead, the @nonnull annotation from JSpecify should be used to ensure non-null values.

* @param isAncient Whether this is an ancient language
*/
public OcrLanguage(String isoCode, String displayName, boolean isAncient) {
this.isoCode = Objects.requireNonNull(isoCode);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The use of Objects.requireNonNull is discouraged. Instead, the @nonnull annotation from JSpecify should be used to ensure non-null parameters.

*/
public OcrLanguage(String isoCode, String displayName, boolean isAncient) {
this.isoCode = Objects.requireNonNull(isoCode);
this.displayName = Objects.requireNonNull(displayName);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The use of Objects.requireNonNull is discouraged. Instead, the @nonnull annotation from JSpecify should be used to ensure non-null parameters.

private final OcrLanguage language;

private OcrResult(Builder builder) {
this.extractedText = Objects.requireNonNull(builder.extractedText);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The use of Objects.requireNonNull is discouraged. Instead, the @nonnull annotation from JSpecify should be used to ensure non-null values.

*/
public OcrResult processAndAddTextLayer(Path pdfPath, Path outputPath) throws OcrProcessException {
// Check if PDF already has text
if (hasTextLayer(pdfPath)) {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code does not follow the fail-fast principle. It should return early if the condition is met, instead of nesting the logic inside an else branch.

* @param isInvisible Whether the text should be invisible
*/
private SearchableTextLayer(Builder builder) {
this.sourceText = Objects.requireNonNull(builder.sourceText);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The use of Objects.requireNonNull is discouraged. Instead, the @nonnull annotation from JSpecify should be used to ensure non-null values.


// Process each page
for (int i = 0; i < document.getNumberOfPages(); i++) {
if (i < pageTexts.size()) {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code should follow the fail fast principle by immediately handling invalid states and returning early instead of nesting logic inside else branches. This line should be refactored to check for invalid states first.

contentStream.newLineAtOffset(0, 0);

// Set text rendering mode to invisible (3 = invisible)
contentStream.setRenderingMode(3);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The use of '3' as a magic number for setting the rendering mode to invisible should be replaced with a named constant for better readability and maintainability.

String text = stripper.getText(document);

// If we get more than a threshold of characters, assume there's a text layer
return text.trim().length() > 50;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The threshold of 50 characters is a magic number and should be defined as a constant with a descriptive name to improve code clarity.

@calixtus calixtus marked this pull request as draft April 3, 2025 18:53
import javafx.scene.control.TextField;
import javafx.stage.DirectoryChooser;

import org.jabref.gui.Globals;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @lareinahu-2023, where did you get this from?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @subhramit

The imports for javafx.scene.control.TextField and javafx.stage.DirectoryChooser are both standard JavaFX classes that are used in the code. TextField is used for the tesseractPathTextField field to display and edit the Tesseract path, while DirectoryChooser is indirectly used through DialogService to implement the directory selection functionality.

Regarding org.jabref.gui.Globals, I apologize for the confusion. I was unable to find where I originally saw this pattern. Looking at the codebase now, I see this approach doesn't match JabRef's current implementation style. Other tabs like LinkedFilesTab and GeneralTab use a different pattern, accessing preferences through view models rather than a static Globals class.

I'll update my implementation to follow the project's current pattern for directory selection, using the DirectoryDialogConfiguration with proper preferences injection, similar to how it's done in other preference tabs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants