How to Extract Text from PDF Documents

The PDF Mosaic library can extract text from PDF documents and makes the text contents of a PDF available as Unicode strings. With PDF Mosaic you can convert Adobe PDF documents to text files. Our PDF SDK provides access to the text content of PDF files without requiring any Adobe product. Use the PDFPage.GetText() method to extract the text of a page in plain text format.

This sample shows how to extract plain text from PDF documents using the PDF Mosaic library.

C# :

using PDFMosaic;
using System.IO;
using System.Diagnostics;

namespace ExtractText
{
  class ExtractText
  {
    static void Main()
    {
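      // Open the source PDF (the path is relative to the working directory).
      PDFDocument document = new PDFDocument("..\\..\\residential.pdf");

      // Write the text of every page to one file.
      // The using block closes the writer even if extraction throws.
      using (StreamWriter writer = new StreamWriter("Document text.txt"))
      {
        for (int i = 0; i < document.Pages.Count; ++i)
          writer.WriteLine(document.Pages[i].GetText());
      }

      document.Save("ExtractText.pdf", true);

      // Open the extracted text with the default associated application.
      Process.Start("Document text.txt");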
    }
  }
}
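
As a variation on the C# sample above, the following sketch writes each page to its own UTF-8 text file instead of one combined file. It uses only the PDF Mosaic calls already shown (PDFDocument, Pages.Count, GetText()) and assumes that GetText() returns a string. The "pages" output folder and the page-N.txt naming scheme are hypothetical choices made for this illustration, not part of the library.

```csharp
using PDFMosaic;
using System.IO;
using System.Text;

namespace ExtractTextPerPage
{
  class ExtractTextPerPage
  {
    static void Main()
    {
      // Same input file as in the sample above.
      PDFDocument document = new PDFDocument("..\\..\\residential.pdf");

      // Hypothetical output folder; CreateDirectory does nothing if it already exists.
      Directory.CreateDirectory("pages");

      for (int i = 0; i < document.Pages.Count; ++i)
      {
        // Hypothetical naming scheme: page-1.txt, page-2.txt, ...
        string path = Path.Combine("pages", "page-" + (i + 1) + ".txt");

        // Assumes GetText() returns a string; the text is written as UTF-8.
        File.WriteAllText(path, document.Pages[i].GetText(), Encoding.UTF8);
      }
    }
  }
}
```

Keeping one file per page can make it easier to trace a piece of extracted text back to the page it came from.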


VB.NET :

Imports PDFMosaic
Imports System.IO
Imports System.Diagnostics

Module ExtractText
  Sub Main()
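    ' Open the source PDF (the path is relative to the working directory).
    ' VB.NET string literals do not use backslash escapes, so single backslashes are correct here.
    Dim document As New PDFDocument("..\..\residential.pdf")

    ' Write the text of every page to one file.
    ' The Using block closes the writer even if extraction throws.
    Using writer As New StreamWriter("Document text.txt")
      For i As Integer = 0 To document.Pages.Count - 1
        writer.WriteLine(document.Pages(i).GetText())
      Next
    End Using

    document.Save("ExtractText.pdf", True)

    ' Open the extracted text with the default associated application.
    Process.Start("Document text.txt")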
  End Sub
End Module