Extracting plain text from PDF documents for further processing, analysis or full-text search is a common task. This article explains how to extract plain text from PDF documents programmatically with .NET C# and TX Text Control.
Typically, a PDF document contains a collection of characters at specific locations rather than a continuous flow of text, so an import filter is required to reconstruct the text to be extracted.
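To illustrate why such a filter is needed, the following is a minimal sketch that turns positioned characters back into lines of text. It is not how TX Text Control works internally; the Glyph record, the GlyphLayout class and the tolerance value are all hypothetical names introduced for this illustration, and gaps between words are not handled.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical model of one positioned character on a PDF page.
public record Glyph(char Character, double X, double Y);

public static class GlyphLayout
{
    // Groups glyphs whose vertical positions lie within `tolerance` of the
    // first glyph of a line, then orders each line from left to right.
    public static List<string> ToLines(IEnumerable<Glyph> glyphs, double tolerance = 2.0)
    {
        var lines = new List<(double Y, List<Glyph> Glyphs)>();

        // PDF coordinates start at the bottom-left corner,
        // so a larger Y value is higher up on the page.
        foreach (var glyph in glyphs.OrderByDescending(g => g.Y))
        {
            if (lines.Count > 0 && Math.Abs(lines[^1].Y - glyph.Y) <= tolerance)
                lines[^1].Glyphs.Add(glyph);
            else
                lines.Add((glyph.Y, new List<Glyph> { glyph }));
        }

        return lines
            .Select(line => new string(
                line.Glyphs.OrderBy(g => g.X).Select(g => g.Character).ToArray()))
            .ToList();
    }
}
```

Even this simplified version has to guess which characters belong together, which is the kind of work an import filter performs on a much larger scale.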
With TX Text Control, all typical word processing formats such as DOC, DOCX, RTF and PDF can be loaded for plain text extraction. This demo uses a .NET 6 console application. The following code loads a PDF document and then extracts the plain text from it.
using (TXTextControl.ServerTextControl tx = new TXTextControl.ServerTextControl())
{
    tx.Create();

    TXTextControl.LoadSettings ls = new TXTextControl.LoadSettings()
    {
        PDFImportSettings = TXTextControl.PDFImportSettings.GenerateParagraphs
    };

    // load PDF document
    tx.Load("App_Data/sample.pdf", TXTextControl.StreamType.AdobePDF, ls);

    // retrieve plain text
    var text = tx.Text;
    Console.WriteLine(text);
}
Alternatively, if the document is stored in a database or by some other method, you can load the document from a byte array.
byte[] document = File.ReadAllBytes("App_Data/sample.pdf");
tx.Load(document, TXTextControl.BinaryStreamType.AdobePDF, ls);
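The two lines above rely on an existing ServerTextControl instance (tx) and LoadSettings object (ls). As a sketch, both steps could be wrapped in a single helper that accepts the byte array directly. The class name PdfTextExtractor and the method name ExtractText are hypothetical; the TX Text Control calls are the same ones used in this article, and the code requires a reference to the TX Text Control assemblies.

```csharp
using System;

public static class PdfTextExtractor
{
    // Hypothetical helper: loads a PDF document from memory
    // and returns its plain text.
    public static string ExtractText(byte[] pdfBytes)
    {
        using (TXTextControl.ServerTextControl tx = new TXTextControl.ServerTextControl())
        {
            tx.Create();

            TXTextControl.LoadSettings ls = new TXTextControl.LoadSettings()
            {
                PDFImportSettings = TXTextControl.PDFImportSettings.GenerateParagraphs
            };

            // load the PDF document from the byte array
            tx.Load(pdfBytes, TXTextControl.BinaryStreamType.AdobePDF, ls);

            // retrieve plain text
            return tx.Text;
        }
    }
}
```

A caller that reads the document from a database column or a file could then write, for example, Console.WriteLine(PdfTextExtractor.ExtractText(document)).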
With the GenerateParagraphs import setting, TX Text Control recognizes matching words and combines them into paragraphs. The resulting plain text is then written to the console.
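Once the plain text is available as a string, it can be processed with standard .NET string handling. The following sketch shows a simple case-insensitive search that returns the paragraphs containing a given term. The class PlainTextSearch and the method FindParagraphs are hypothetical, and the code assumes that paragraphs in the extracted text are separated by line breaks.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PlainTextSearch
{
    // Returns the 1-based number and the text of every non-empty
    // paragraph that contains `term`, ignoring case.
    public static List<(int Number, string Text)> FindParagraphs(string text, string term)
    {
        return text
            // assumption: paragraphs are separated by "\r\n" or "\n"
            .Split(new[] { "\r\n", "\n" }, StringSplitOptions.RemoveEmptyEntries)
            .Select((paragraph, index) => (Number: index + 1, Text: paragraph.Trim()))
            .Where(p => p.Text.Contains(term, StringComparison.OrdinalIgnoreCase))
            .ToList();
    }
}
```

Passing the extracted text and a search term to PlainTextSearch.FindParagraphs returns the matching paragraphs, which is enough for a basic search; larger document collections would typically use a dedicated search index instead.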
Copyright © 2024 Text Control GmbH. All rights reserved. Impressum.