Wanda

This tool is designed to help digitializing the Kunzer library. It uses the scan of one page of the Kunzer Buch, to extract and detect the different entries. The information needs to be verified at the end and can also be changed if errors occured.

There are different tools implemented to ease the validation process, such as automatic SISF extraction, plant name search, region updates, swiss location search as well as a local location search.

The program is called with: python main.py

Dependecies

numpy
pandas
geopandas
shapely
QtPY5
cv2
transformers
datetime
matplotlib

Models

We use the EAST model to find the regions of the scaned page, where text is written. The model can be downloaded here and needs to be put in a folder 'Checkpoints' in the same directory as Wanda. you can follow this Tutorial if you have problems loading the model.

For the OCR (Optical Character Recognition) part of our program we use the TrOCR model from the HuggingFace library

Datasets

These datasets need to be downloaded such that the programme can run correctly:

We use the swissBOUNDARIES3D dataset for the current boundaries of the different communities/cantons. The repsective GPKG must be manually downloaded.

We extract locations in Switzerland from swissNAMES3D. More specifically the points and polygons.

The plantnames and their respective SISF numbers are downloaded from ArtFlora.

Functionality

The program first looks for two title, which are present at every page ('Nr.' and 'Datum'). With these the found textboxes can be classified into plant names, dates and locations. The classified boxes are then merged. Since every page has exactly 25 entries the programme knows when not all or too many plants were found. In that case the user is notified to manually select/remove a bounding box via a GUI. The extracted text boxes are saved temporarly in the folder 'Extractions'. On these boxes we perform OCR and search for the most similar word in our respective database. The user is then shown a summary of what the programme found, with the image boxes used for the result. The output is then saved in the folder 'Outputs'.

Known Bugs/Issues

These Issues are known, they are however not fixed as it did not hinder the workflow of digitilizing the Kunzer collection or the collection was already digitilized when the bugs became known

When no matching plantname can be found (with a similarity of at least 1%) from the OCR output of the program throws an unhandled expection.
When removing a box (due to too many recognized boxes) the user can select a point. The programme removes the box with the nearest lower left corner. This does not always correspond to the box which the point is inside of.
The merger only supports lists of size 25 for the computation of Kunzer numbers (the other metric are correctly calculated)
After selecting the last box when adding/removing boxes, the selction window does not close until the OCR prediction is completed.

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
README.md		README.md
dataextractor.py		dataextractor.py
handrecognition.py		handrecognition.py
identification.py		identification.py
locationselector.py		locationselector.py
main.py		main.py
merger.py		merger.py
regionselector.py		regionselector.py
segmenter.py		segmenter.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Wanda

Dependecies

Models

Datasets

Functionality

Known Bugs/Issues

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Wanda

Dependecies

Models

Datasets

Functionality

Known Bugs/Issues

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages