Updated README.md

Can convert, OCR and combine image files to text in parallel
2023-09-08 11:30:05 +03:00 · 2023-09-08 11:25:59 +03:00
3 changed files with 110 additions and 5 deletions
--- a/README.md
+++ b/README.md
@@ -1,11 +1,14 @@
 # csv2ankicards

-A simple tool to convert CSV files into Anki deck packages (.apkg files).
+A simple toolkit that offers:
+- Conversion of CSV files into Anki deck packages (.apkg files).
+- Conversion of image files in a directory to a text file using Optical Character Recognition (OCR).

 ## Features

 - Converts a CSV file with questions and answers into an Anki deck package.
-- There are only two columns in the CSV file, separated by the first comma encountered.
+- Converts image files from a specified directory to a single text file using OCR.
+- For CSV: there are only two columns in the CSV file, separated by the first comma encountered.
 - CSV files should have a "Front" column for questions and a "Back" column for answers.

 ## Installation
@@ -29,6 +32,8 @@ A simple tool to convert CSV files into Anki deck packages (.apkg files).

 ## Usage

+### CSV to Anki Conversion
+
 To convert a CSV file into an Anki deck package:

 ```bash
@@ -37,23 +42,37 @@ python csv2ankicards.py /path/to/your/csvfile.csv output.apkg

 This will produce an `output.apkg` file which can then be imported into Anki.

-### CSV Format
+#### CSV Format

 The CSV file should follow this format:

 ```
 Front,Back
-Your question here,Your answer here, and here
+Your question here,Your answer here
 Another question,list of: answer1, answer2, answer3
 ...
 ```

 **Note:** If your answers contain commas, they will be considered as part of the answer. Only the first comma is used to separate the question from the answer.

+### Image to Text Conversion
+
+To convert images from a directory to a single text file using OCR:
+
+```bash
+python images2text.py /path/to/your/image_directory/
+```
+
+This will produce a `final.txt` file which contains the text extracted from the images.
+
+#### Supported Image Formats
+
+Currently supported formats for the images are: `.png`, `.jpg`, and `.jpeg`.
+
 ## License

 [MIT License](LICENSE)

 ## Contributing

-Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.
+Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.
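Review note: the README's rule that only the first comma separates the question from the answer can be sketched as below. This is a minimal illustration, not code from `csv2ankicards.py` (which this diff does not show); `split_card` is a hypothetical helper name.

```python
def split_card(line):
    """Split one CSV line into (front, back) at the first comma only."""
    # str.partition splits at the first occurrence of the separator,
    # so any later commas stay inside the answer text.
    front, sep, back = line.rstrip("\n").partition(",")
    if not sep:
        raise ValueError(f"no comma found in line: {line!r}")
    return front.strip(), back.strip()


# Sample rows taken from the README's CSV format example.
rows = [
    "Your question here,Your answer here",
    "Another question,list of: answer1, answer2, answer3",
]
cards = [split_card(row) for row in rows]
for front, back in cards:
    print(f"{front!r} -> {back!r}")
```

A standard CSV parser would treat every unquoted comma as a column separator, so a first-comma split like this is one way to match the behavior the README describes.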
--- a/images2text.py
+++ b/images2text.py
@@ -0,0 +1,85 @@
+import os
+import sys
+from subprocess import run, CalledProcessError
+from concurrent.futures import ThreadPoolExecutor
+
+converted_dir = "converted"
+
+def is_image_file(path):
+    lower_path = path.lower()
+    return lower_path.endswith('.png') or lower_path.endswith('.jpg') or lower_path.endswith('.jpeg')
+
+def convert_image(image_path):
+    print(f"Converting {image_path}...")
+    converted_path = os.path.join(converted_dir, os.path.basename(image_path))
+    cmd = [
+        "convert",
+        image_path,
+        "-colorspace", "Gray",
+        "-resize", "300%",
+        "-threshold", "55%",
+        "-type", "Grayscale",
+        converted_path
+    ]
+    
+    try:
+        run(cmd, check=True)
+        print(f"Converted image output to {converted_path}!")
+        return converted_path
+    except CalledProcessError:
+        print(f"Error converting {image_path} with ImageMagick. Using original for Tesseract.")
+        return image_path
+
+def ocr_image(image_path):
+    print(f"OCR'ing {image_path}...")
+    text_filename = os.path.splitext(os.path.basename(image_path))[0] + ".txt"  # any extension, any case
+    text_path = os.path.join(converted_dir, text_filename)
+    cmd = ["tesseract", image_path, os.path.splitext(text_path)[0]]  # tesseract appends ".txt" itself
+    try:
+        run(cmd, check=True)
+        print(f"OCRed to {text_path}!")
+        return text_path
+    except CalledProcessError:
+        print(f"Error processing {image_path} with Tesseract. Skipping.")
+        return None
+
+def process_image(image_path):
+    converted_path = convert_image(image_path)
+    print(f"OCR'ing image {image_path} (now at {converted_path})...")
+    text_path = ocr_image(converted_path)
+    if text_path and os.path.exists(text_path):
+        with open(text_path, 'r') as text_file:
+            text_content = text_file.read()
+            print(f"Added text from {text_path} to final output.")
+            return text_content
+    else:
+        print(f"No OCR output found for {image_path}! Cannot add text to final output!")
+        return None
+
+def main(directory_path):
+    final_text = []
+
+    if not os.path.exists(converted_dir):
+        os.mkdir(converted_dir)
+
+    image_paths = []
+    for root, dirs, files in os.walk(directory_path):
+        for file in files:
+            image_path = os.path.join(root, file)
+            if is_image_file(image_path):
+                image_paths.append(image_path)
+
+    # Use a ThreadPoolExecutor to process images in parallel
+    with ThreadPoolExecutor() as executor:
+        final_text = list(executor.map(process_image, image_paths))
+    
+    # Filter out any None values and write the text to final.txt
+    final_text = [text for text in final_text if text is not None]
+    with open("final.txt", 'w') as f:
+        f.write("\n".join(final_text))
+
+if __name__ == "__main__":
+    if len(sys.argv) != 2:
+        print("Usage: python images2text.py <directory_path>")
+        sys.exit(1)
+    main(sys.argv[1])
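Review note: `executor.map` returns results in the order of its input, regardless of which worker finishes first, so the order of text in `final.txt` follows the order of `image_paths`. Since `os.walk` does not guarantee a sorted listing, sorting the paths first is one way to make the output deterministic. The sketch below shows this with a stand-in worker; `fake_ocr` and the file names are hypothetical and replace `process_image` only for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_ocr(name):
    """Stand-in for process_image: returns text, or None on failure."""
    # Make the first page the slowest to show that completion order
    # does not affect the order of the results.
    time.sleep(0.05 if name == "page1.png" else 0.0)
    if name.endswith(".bad"):
        return None  # simulates a file that OCR could not process
    return f"text of {name}"


# Sort so the combined output does not depend on directory listing order.
paths = sorted(["page2.png", "page1.png", "page3.bad"])

with ThreadPoolExecutor() as executor:
    results = list(executor.map(fake_ocr, paths))

# Drop failed pages, then join the rest in input order.
combined = "\n".join(text for text in results if text is not None)
print(combined)
```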
--- a/requirements.txt
+++ b/requirements.txt
@@ -1 +1,2 @@
 genanki==0.8.0
+Pillow
Author	SHA1	Message	Date
Benjamin Dweck	51401ba964	Updated README.md	2023-09-08 11:30:05 +03:00
Benjamin Dweck	18a9cb0dd9	Can convert, OCR and combine image files to text in parallel	2023-09-08 11:25:59 +03:00