[ 🏠 Home / 📋 About / 📧 Contact / 🏆 WOTM ] [ b ] [ wd / ui / css / resp ] [ seo / serp / loc / tech ] [ sm / cont / conv / ana ] [ case / tool / q / job ]

/q/ - Q&A Central

Help, troubleshooting & advice for practitioners


483ad No.1848

the logic for pulling data from an invoice is totally different from what you'd use for a tax form, which makes automated classification basically mandatory if you want a pipeline to work. without it, your routing rules are just broken
>does anyone have a favorite library for this?

found this here: https://dzone.com/articles/how-to-classify-documents-in-c
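
for anyone wondering what "classify first, then route" looks like in practice, here's a minimal sketch in plain Python. it is not the method from the linked article. the keyword lists, the `classify` and `route` helpers, and the stub extractors are all made up for illustration, and a real pipeline would likely use a trained model or a hosted service instead of keyword counts.

```python
import re
from collections import Counter

# Hypothetical keyword lists (assumption); a real system would learn these
# from labelled samples or use a trained classifier instead.
KEYWORDS = {
    "invoice": {"invoice", "subtotal", "due", "remit", "bill"},
    "tax_form": {"taxpayer", "withholding", "deduction", "filing"},
}


def classify(text: str) -> str:
    """Return the label whose keywords occur most often, or 'unknown'."""
    tokens = Counter(re.findall(r"[a-z]+", text.lower()))
    scores = {
        label: sum(tokens[word] for word in words)
        for label, words in KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"


# Stub extractors (assumption): each document type gets its own logic.
def extract_invoice(text: str) -> dict:
    return {"doc_type": "invoice", "extractor": "extract_invoice"}


def extract_tax_form(text: str) -> dict:
    return {"doc_type": "tax_form", "extractor": "extract_tax_form"}


ROUTES = {"invoice": extract_invoice, "tax_form": extract_tax_form}


def route(text: str) -> dict:
    """Classify the document, then hand it to the matching extractor."""
    handler = ROUTES.get(classify(text))
    if handler is None:
        # No extractor matches, so flag the document for a human.
        return {"doc_type": "unknown", "needs_review": True}
    return handler(text)
```

the point of the dispatch table is that adding a new document type means adding one label and one extractor, and nothing else in the routing has to change.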

06a82 No.1849


lowkey if you try to rely on regex alone for this, you're going to hit a wall as soon as a vendor changes their template. i used to manage a similar pipeline where we tried hardcoding field positions, but it was a total nightmare once the scan quality dropped. instead of looking for a library that does everything, look into using Azure AI Document Intelligence or AWS Textract for the heavy lifting. they handle the layout analysis so you don't have to write custom logic for every single form type. it's much more about extracting the semantic meaning of the text than just finding strings.
>without it, your routing rules are just broken

that part is spot on; if the classification fails, the downstream automation basically becomes a manual data entry job. do you have a specific volume of documents per day you're trying to process?
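
one way to keep a failed classification from silently poisoning the downstream steps is to send low-confidence results to a manual review queue and track how often that happens. rough sketch below; the `Router` class, the 0.8 threshold, and the queue names are all assumptions for illustration, and the label and confidence would come from whatever classifier or service you end up using.

```python
from dataclasses import dataclass, field


@dataclass
class Router:
    # Assumed cutoff; tune it against your own labelled documents.
    threshold: float = 0.8
    queues: dict = field(default_factory=lambda: {"manual_review": []})

    def dispatch(self, doc_id: str, label: str, confidence: float) -> str:
        """Queue the document under its label, or for review if unsure."""
        target = label if confidence >= self.threshold else "manual_review"
        self.queues.setdefault(target, []).append(doc_id)
        return target

    def manual_rate(self) -> float:
        """Fraction of all dispatched documents that needed a human."""
        total = sum(len(queue) for queue in self.queues.values())
        if total == 0:
            return 0.0
        return len(self.queues["manual_review"]) / total
```

usage would be something like `Router().dispatch("doc-1", "invoice", 0.95)`, which lands in the `invoice` queue, while the same call with a confidence of 0.4 lands in `manual_review`. watching `manual_rate()` over time tells you whether the automation is actually saving work at your daily volume.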


