-
-
Notifications
You must be signed in to change notification settings - Fork 2.8k
GSOC Application: OCR prototype for ancient documents #12881
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
This commit demonstrates my understanding of JabRef's architecture by: - Creating OCR service interface with hexagonal architecture - Implementing PDF text layer utilities with PDFBox - Designing preference panel for OCR configuration
|
||
@Override | ||
public String getTabName() { | ||
return Localization.lang("OCR"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The string "OCR" should be localized using the proper localization method to ensure consistency across different languages.
Your pull request needs to link an issue. To ease organizational workflows, please link this pull-request to the issue with syntax as described in https://docs.github.com/en/issues/tracking-your-work-with-issues/linking-a-pull-request-to-an-issue:
Examples
|
return true; // Empty path is allowed (will use system default) | ||
} | ||
|
||
Path tesseractDir = Paths.get(path); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use modern Java best practices by replacing Paths.get() with Path.of() to improve readability and maintainability.
* @return true if path is valid (either empty or existing directory) | ||
*/ | ||
private boolean validateTesseractPath(String path) { | ||
if (path == null || path.trim().isEmpty()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code should follow the fail fast principle by immediately handling invalid states and returning early instead of nesting logic inside else branches.
JUnit tests are failing. You can see which checks are failing by locating the box "Some checks were not successful" on the pull request page. To see the test output, locate "Tests / Unit tests (pull_request)" and click on it. You can then run these tests in IntelliJ to reproduce the failing tests locally. We offer a quick test running howto in the section Final build system checks in our setup guide. |
You have removed the "Mandatory Checks" section from your pull request description. Please adhere to our pull request template. |
Note that your PR will not be reviewed/accepted until you have gone through the mandatory checks in the description and marked each of them them exactly in the format of |
public OcrService getOcrService(String engineName) throws OcrProcessException { | ||
OcrService service = ocrEngines.get(engineName); | ||
|
||
if (service == null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code uses exceptions for normal control flow by throwing an exception when the service is null. Exceptions should be used for exceptional states, not for normal control flow.
* @throws OcrProcessException if no engine is available | ||
*/ | ||
public OcrService getDefaultOcrService() throws OcrProcessException { | ||
if (defaultEngineName == null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code uses exceptions for normal control flow by throwing an exception when the defaultEngineName is null. Exceptions should be used for exceptional states, not for normal control flow.
registerEngines(tesseractPath); | ||
|
||
// Set default engine (first one registered) | ||
this.defaultEngineName = ocrEngines.isEmpty() ? null : ocrEngines.keySet().iterator().next(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Returning null for defaultEngineName is not recommended. Use java.util.Optional to avoid null returns.
* @return OCR result containing extracted text and metadata | ||
* @throws OcrProcessException if OCR processing fails | ||
*/ | ||
OcrResult processPdf(Path pdfPath) throws OcrProcessException; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The method does not return an Optional, which is recommended for new public methods to avoid returning null.
* @return OCR result containing extracted text and metadata | ||
* @throws OcrProcessException if OCR processing fails | ||
*/ | ||
OcrResult processImage(Path imagePath) throws OcrProcessException; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The method does not return an Optional, which is recommended for new public methods to avoid returning null.
* | ||
* @return Set of supported languages | ||
*/ | ||
Set<OcrLanguage> getSupportedLanguages(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The method does not return an Optional, which is recommended for new public methods to avoid returning null.
* | ||
* @return Engine name | ||
*/ | ||
String getEngineName(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The method does not return an Optional, which is recommended for new public methods to avoid returning null.
* | ||
* @return true if the engine is ready to use | ||
*/ | ||
boolean isAvailable(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The method does not return an Optional, which is recommended for new public methods to avoid returning null.
* @param config Configuration to apply | ||
* @throws OcrProcessException if the configuration cannot be applied | ||
*/ | ||
void applyConfig(OcrEngineConfig config) throws OcrProcessException; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The method does not return an Optional, which is recommended for new public methods to avoid returning null.
* @param ocrResult OCR result containing extracted text to add | ||
* @throws OcrProcessException if adding the text layer fails | ||
*/ | ||
void addTextLayerToPdf(Path pdfPath, Path outputPath, OcrResult ocrResult) throws OcrProcessException; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The method does not return an Optional, which is recommended for new public methods to avoid returning null.
* @param language OCR language to use | ||
* @throws OcrProcessException if the language is not supported or cannot be set | ||
*/ | ||
void setLanguage(OcrLanguage language) throws OcrProcessException; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The method does not return an Optional, which is recommended for new public methods to avoid returning null.
*/ | ||
@Override | ||
public Set<OcrLanguage> getSupportedLanguages() { | ||
return new HashSet<>(supportedLanguages); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using new HashSet<>(...) is not optimal. Modern Java best practices suggest using Set.of(...) for better readability and performance.
throw new OcrProcessException(getEngineName(), | ||
String.format("Language '%s' is not supported", language.getDisplayName())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The exception message should be localized using Localization.lang for consistency with other strings in the application.
throw new OcrProcessException(getEngineName(), | ||
String.format("Configuration is for engine '%s', not compatible with '%s'", | ||
config.getEngineName(), getEngineName())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The exception message should be localized using Localization.lang for consistency with other strings in the application.
protected OcrEngineConfig getCurrentConfig() { | ||
return currentConfig; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
New public methods should not return null. Consider using java.util.Optional to avoid returning null.
* {@inheritDoc} | ||
*/ | ||
@Override | ||
public Set<OcrLanguage> getSupportedLanguages() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The method should return a modern Java data structure like Set.of(...) instead of new HashSet<>(...) for better readability and performance.
*/ | ||
@Override | ||
public OcrResult processPdf(Path pdfPath) throws OcrProcessException { | ||
if (!isAvailable()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code should follow the fail fast principle by immediately handling invalid states and returning early instead of nesting logic inside else branches.
throw new OcrProcessException(ENGINE_NAME, "Tesseract is not available"); | ||
} | ||
|
||
if (!Files.exists(pdfPath)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code should follow the fail fast principle by immediately handling invalid states and returning early instead of nesting logic inside else branches.
private final QualityPreset qualityPreset; | ||
|
||
private OcrEngineConfig(Builder builder) { | ||
this.engineName = Objects.requireNonNull(builder.engineName); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The use of Objects.requireNonNull is discouraged. Instead, the @nonnull annotation from JSpecify should be used to ensure non-null values.
|
||
private OcrEngineConfig(Builder builder) { | ||
this.engineName = Objects.requireNonNull(builder.engineName); | ||
this.language = Objects.requireNonNull(builder.language); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The use of Objects.requireNonNull is discouraged. Instead, the @nonnull annotation from JSpecify should be used to ensure non-null values.
* @param isAncient Whether this is an ancient language | ||
*/ | ||
public OcrLanguage(String isoCode, String displayName, boolean isAncient) { | ||
this.isoCode = Objects.requireNonNull(isoCode); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The use of Objects.requireNonNull is discouraged. Instead, the @nonnull annotation from JSpecify should be used to ensure non-null parameters.
*/ | ||
public OcrLanguage(String isoCode, String displayName, boolean isAncient) { | ||
this.isoCode = Objects.requireNonNull(isoCode); | ||
this.displayName = Objects.requireNonNull(displayName); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The use of Objects.requireNonNull is discouraged. Instead, the @nonnull annotation from JSpecify should be used to ensure non-null parameters.
private final OcrLanguage language; | ||
|
||
private OcrResult(Builder builder) { | ||
this.extractedText = Objects.requireNonNull(builder.extractedText); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The use of Objects.requireNonNull is discouraged. Instead, the @nonnull annotation from JSpecify should be used to ensure non-null values.
*/ | ||
public OcrResult processAndAddTextLayer(Path pdfPath, Path outputPath) throws OcrProcessException { | ||
// Check if PDF already has text | ||
if (hasTextLayer(pdfPath)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code does not follow the fail-fast principle. It should return early if the condition is met, instead of nesting the logic inside an else branch.
* @param isInvisible Whether the text should be invisible | ||
*/ | ||
private SearchableTextLayer(Builder builder) { | ||
this.sourceText = Objects.requireNonNull(builder.sourceText); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The use of Objects.requireNonNull is discouraged. Instead, the @nonnull annotation from JSpecify should be used to ensure non-null values.
|
||
// Process each page | ||
for (int i = 0; i < document.getNumberOfPages(); i++) { | ||
if (i < pageTexts.size()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code should follow the fail fast principle by immediately handling invalid states and returning early instead of nesting logic inside else branches. This line should be refactored to check for invalid states first.
contentStream.newLineAtOffset(0, 0); | ||
|
||
// Set text rendering mode to invisible (3 = invisible) | ||
contentStream.setRenderingMode(3); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The use of '3' as a magic number for setting the rendering mode to invisible should be replaced with a named constant for better readability and maintainability.
String text = stripper.getText(document); | ||
|
||
// If we get more than a threshold of characters, assume there's a text layer | ||
return text.trim().length() > 50; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The threshold of 50 characters is a magic number and should be defined as a constant with a descriptive name to improve code clarity.
import javafx.scene.control.TextField; | ||
import javafx.stage.DirectoryChooser; | ||
|
||
import org.jabref.gui.Globals; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @lareinahu-2023, where did you get this from?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @subhramit
The imports for javafx.scene.control.TextField and javafx.stage.DirectoryChooser are both standard JavaFX classes that are used in the code. TextField is used for the tesseractPathTextField field to display and edit the Tesseract path, while DirectoryChooser is indirectly used through DialogService to implement the directory selection functionality.
Regarding org.jabref.gui.Globals, I apologize for the confusion. I was unable to find where I originally saw this pattern. Looking at the codebase now, I see this approach doesn't match JabRef's current implementation style. Other tabs like LinkedFilesTab and GeneralTab use a different pattern, accessing preferences through view models rather than a static Globals class.
I'll update my implementation to follow the project's current pattern for directory selection, using the DirectoryDialogConfiguration with proper preferences injection, similar to how it's done in other preference tabs.
GSOC Application: OCR Implementation Prototype
This PR contains a prototype implementation of OCR support for ancient documents
as part of my Google Summer of Code application. It is meant to demonstrate my
understanding of JabRef's architecture and design patterns.
Components Implemented
OCR Service Interface Prototype
PDF Text Layer Utility
Preference Panel UI
Implementation Approach
The implementation follows JabRef's architecture with clean separation of:
I've included detailed documentation in code comments and the implementation plan.
Note: This is not intended for merging but for demonstrating my understanding
of JabRef's codebase for my GSoC application.