Channel: security – Cloudera Engineering Blog

How-to: Index Scanned PDFs at Scale Using Fewer Than 50 Lines of Code


Learn how to use OCR tools, Apache Spark, and other Apache Hadoop components to process PDF images at scale.

Optical character recognition (OCR) technologies have advanced significantly over the last 20 years. However, during that time, there has been little or no effort to marry OCR with distributed architectures such as Apache Hadoop to process large numbers of images in near-real time.

In this post, you will learn how to use standard open source tools along with Hadoop components such as Apache Spark,


The post How-to: Index Scanned PDFs at Scale Using Fewer Than 50 Lines of Code appeared first on Cloudera Engineering Blog.

