Converting PDF Documents to Plain Text TXT in C#

Extracting plain text from PDF documents for further processing, analysis, or full-text search is a common task. This article explains how to extract plain text from PDF documents programmatically with .NET C# and TX Text Control.

A PDF document typically stores text as a collection of characters at specific positions rather than as flowing paragraphs, so a filter is required to import the text to be extracted.

With TX Text Control, all typical word processing formats such as DOC, DOCX, RTF and PDF can be loaded for plain text extraction. The following code shows how to create a simple console application that loads a PDF document and then extracts the plain text from it.

Preparing the Application

A .NET 6 console application is created for the purposes of this demo.

Prerequisites

  1. In Visual Studio, create a new Console App using .NET 6.
  2. In the Solution Explorer, select your created project and choose Manage NuGet Packages from the Project main menu. Select Text Control Offline Packages from the Package source drop-down. Install the latest version of the following package:

Adding a PDF

  1. Create a folder named App_Data in the root of your project. Copy your PDF documents into this folder. In this example, the name of the PDF document that will be loaded is sample.pdf.
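
Because the code in this article loads the document through the relative path App_Data/sample.pdf, the file has to be reachable from the working directory of the running process. The following is a minimal sketch of a check that can be run before loading. The DocumentPaths class and its Resolve method are hypothetical names introduced for illustration; they are not part of TX Text Control.

```csharp
using System;
using System.IO;

// A relative path such as "App_Data/sample.pdf" is resolved against the
// current working directory, so build the full path and verify it first.
string path = DocumentPaths.Resolve(Directory.GetCurrentDirectory(), "sample.pdf");

Console.WriteLine(File.Exists(path)
    ? $"Found {path}"
    : $"Missing {path}");

// Hypothetical helper for illustration, not part of TX Text Control.
static class DocumentPaths
{
    // Combines a base directory with the App_Data folder and a file name.
    public static string Resolve(string baseDirectory, string fileName) =>
        Path.Combine(baseDirectory, "App_Data", fileName);
}
```

If the file is reported as missing when the application is started from Visual Studio, the working directory is probably the build output folder rather than the project root.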

Adding the Code

  1. Open the Program.cs file and add the following code:

using (TXTextControl.ServerTextControl tx = new TXTextControl.ServerTextControl())
{
    tx.Create();

    TXTextControl.LoadSettings ls = new TXTextControl.LoadSettings()
    {
        PDFImportSettings = TXTextControl.PDFImportSettings.GenerateParagraphs
    };

    // load PDF document
    tx.Load("App_Data/sample.pdf", TXTextControl.StreamType.AdobePDF, ls);

    // retrieve plain text
    var text = tx.Text;
    Console.WriteLine(text);
}

Alternatively, if the document is stored in a database or by some other method, you can load the document from a byte array.

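// read the PDF document into a byte array
byte[] document = File.ReadAllBytes("App_Data/sample.pdf");
// load the document from the byte array (tx and ls are created as shown above)
tx.Load(document, TXTextControl.BinaryStreamType.AdobePDF, ls);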
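
When the document comes from a database or another source, it often arrives as a stream rather than as a file. The following is a minimal sketch of turning such a stream into the byte array that the Load call above expects. StreamHelper and ToByteArray are hypothetical names, and the MemoryStream merely stands in for a real data source.

```csharp
using System;
using System.IO;

// Stand-in for a stream returned by a database or web request (assumption).
using Stream source = new MemoryStream(new byte[] { 0x25, 0x50, 0x44, 0x46 });

byte[] document = StreamHelper.ToByteArray(source);
Console.WriteLine(document.Length);

// Hypothetical helper for illustration, not part of TX Text Control.
static class StreamHelper
{
    // Copies the remaining content of the input stream into a new byte array.
    public static byte[] ToByteArray(Stream input)
    {
        using var buffer = new MemoryStream();
        input.CopyTo(buffer);
        return buffer.ToArray();
    }
}
```

The resulting array can then be passed to the Load call in place of the array read from the file.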

TX Text Control recognizes words that belong together, groups them into paragraphs, and returns the plain text, which is then written to the console.
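
For further processing, the extracted string can be split into individual paragraphs. The following is a minimal sketch, assuming that paragraphs in the returned text are separated by line breaks; this is an assumption, so verify it against your own output. TextHelper and SplitParagraphs are hypothetical names, and the sample string stands in for the value of tx.Text.

```csharp
using System;

// Sample input standing in for the extracted text (assumption).
string text = "First paragraph.\r\nSecond paragraph.\n\nThird paragraph.";

foreach (string paragraph in TextHelper.SplitParagraphs(text))
{
    Console.WriteLine(paragraph);
}

// Hypothetical helper for illustration, not part of TX Text Control.
static class TextHelper
{
    // Splits on Windows and Unix line breaks and drops empty entries.
    public static string[] SplitParagraphs(string text) =>
        text.Split(new[] { "\r\n", "\n" }, StringSplitOptions.RemoveEmptyEntries);
}
```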

Copyright © 2024 Text Control GmbH. All rights reserved. Impressum.
