How to programmatically search a PDF document in C# [closed]

0 votes
asked Feb 20, 2009 by nathan

I need to search a PDF file to see if a certain string is present. The string in question is definitely encoded as text (i.e. it is not an image or anything). I have tried just searching the file as though it were plain text, but this does not work.

Is it possible to do this? Are there any libraries out there for .NET 2.0 that will extract/decode all the text out of a PDF file for me?

3 Answers

0 votes
answered by rowan

In the vast majority of cases, it's not possible to search the contents of a PDF directly by opening it in Notepad, because the page content is usually stored in compressed streams. Even in the minority of cases where it works (depending on how the PDF was constructed), you'll only ever be able to search for individual words, because PDF often stores text as separate fragments rather than as continuous sentences.
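
To illustrate why a raw search tends to fail, here is a minimal sketch that uses only the .NET base class library. The content-stream snippets are made up for illustration, and the class and method names are hypothetical. DeflateStream only approximates the Flate compression PDF uses (PDF wraps the data in a zlib header and checksum), so treat this as a demonstration of the idea, not as a PDF parser.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

public static class RawSearchDemo
{
    // Returns true if 'needle' occurs as a contiguous run of bytes in 'haystack'.
    public static bool ContainsBytes(byte[] haystack, byte[] needle)
    {
        for (int i = 0; i + needle.Length <= haystack.Length; i++)
        {
            int j = 0;
            while (j < needle.Length && haystack[i + j] == needle[j])
            {
                j++;
            }
            if (j == needle.Length)
            {
                return true;
            }
        }
        return false;
    }

    // Compresses 'data' with raw Deflate, similar in spirit to a Flate-encoded PDF stream.
    public static byte[] Deflate(byte[] data)
    {
        using (MemoryStream output = new MemoryStream())
        {
            using (DeflateStream deflate = new DeflateStream(output, CompressionMode.Compress))
            {
                deflate.Write(data, 0, data.Length);
            }
            // ToArray still works after the stream has been closed.
            return output.ToArray();
        }
    }

    public static void Main()
    {
        byte[] needle = Encoding.ASCII.GetBytes("last sunday");

        // Made-up content-stream lines: one shows the phrase as a single string,
        // the other splits it into two fragments with a spacing adjustment.
        string whole = "BT /F1 12 Tf (See you last sunday) Tj ET\n";
        string split = "BT /F1 12 Tf [(See you last) -250 (sunday)] TJ ET\n";

        // Repeat the line so the compressor has something to work with.
        StringBuilder page = new StringBuilder();
        for (int i = 0; i < 20; i++)
        {
            page.Append(whole);
        }
        byte[] raw = Encoding.ASCII.GetBytes(page.ToString());

        Console.WriteLine(ContainsBytes(raw, needle));                            // True
        Console.WriteLine(ContainsBytes(Encoding.ASCII.GetBytes(split), needle)); // False
        Console.WriteLine(ContainsBytes(Deflate(raw), needle));                   // False
    }
}
```

The first check succeeds only because the stream is uncompressed and the phrase is stored in one piece. Splitting the phrase or compressing the stream is enough to hide it from a plain byte search, which is why a library that decodes the page content is needed.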

My company has a commercial solution that will let you extract text from a PDF file. I've included some sample code for you below, as shown on this page, that demonstrates how to search through the text from a PDF file for a particular string.

using System;
using System.IO;
using QuickPDFDLL0718;

namespace QPLConsoleApp
{
    public class QPL
    {
        public static void Main()
        {
            // This example uses the DLL edition of Quick PDF Library
            // Create an instance of the class and give it the path to the DLL
            PDFLibrary QP = new PDFLibrary("QuickPDFDLL0718.dll");

            // Check if the DLL was loaded successfully
            if (QP.LibraryLoaded())
            {
                // Insert license key here / Check the license key
                if (QP.UnlockKey("...") == 1)
                {
                    QP.LoadFromFile(@"C:\Program Files\Quick PDF Library\DLL\GettingStarted.pdf");

                    int iPageCount = QP.PageCount();
                    int PageNumber = 1;
                    int MatchesFound = 0;

                    while (PageNumber <= iPageCount)
                    {
                        QP.SelectPage(PageNumber);
                        string PageText = QP.GetPageText(3);

                        // Split the CSV-style page text into lines, then into fields.
                        // PageText is already in memory, so no temp file is needed.
                        string[] lines = PageText.Split(new char[] { '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);
                        string[][] grid = new string[lines.Length][];

                        for (int i = 0; i < lines.Length; i++)
                        {
                            grid[i] = lines[i].Split(',');
                        }

                        foreach (string[] line in grid)
                        {
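                            // The sample expects the text in the 12th CSV field (index 11);
                            // skip rows with fewer fields to avoid an IndexOutOfRangeException.
                            if (line.Length < 12)
                            {
                                continue;
                            }

                            string FindMatch = line[11];

                            // Update this string to the word that you're searching for.
                            // It can be one or more words (e.g. "sunday" or "last sunday").
                            if (FindMatch.Contains("characters"))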
                            {
                                Console.WriteLine("Success! Word match found on page: " + PageNumber);
                                MatchesFound++;
                            }
                        }
                        PageNumber++;
                    }

                    if (MatchesFound == 0)
                    {
                        Console.WriteLine("Sorry! No matches found.");
                    }
                    else
                    {
                        Console.WriteLine();
                        Console.WriteLine("Total: " + MatchesFound + " matches found!");
                    }
                    Console.ReadLine();
                }
            }
        }
    }
}

0 votes
answered Feb 20, 2009 by volatilsis

There are a few libraries available out there. Check out http://www.codeproject.com/KB/cs/PDFToText.aspx and http://itextsharp.sourceforge.net/

It takes a little bit of effort but it's possible.
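
As a rough idea of what that effort looks like, here is a minimal sketch using iTextSharp. It assumes a later iTextSharp 5.x release, where the `PdfTextExtractor` class in `iTextSharp.text.pdf.parser` is available; the `PdfSearch` class and `FindPages` method are hypothetical names made up for this example. Extraction quality depends on how the PDF was produced, so a phrase that is visible on the page may still not match.

```csharp
using System;
using System.Collections.Generic;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

public static class PdfSearch
{
    // Returns the 1-based numbers of the pages whose extracted text contains 'needle'.
    public static List<int> FindPages(string path, string needle)
    {
        List<int> hits = new List<int>();
        PdfReader reader = new PdfReader(path);
        try
        {
            // iTextSharp numbers pages starting at 1.
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                string text = PdfTextExtractor.GetTextFromPage(reader, page);
                if (text.IndexOf(needle, StringComparison.OrdinalIgnoreCase) >= 0)
                {
                    hits.Add(page);
                }
            }
        }
        finally
        {
            reader.Close();
        }
        return hits;
    }
}
```

Calling `PdfSearch.FindPages(@"C:\docs\sample.pdf", "last sunday")` (the path is a placeholder) returns the list of matching page numbers, or an empty list if the string is not found.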

0 votes
answered Feb 21, 2012 by bobrovsky

You can use the Docotic.Pdf library to search for text in PDF files.

Here is some sample code:

static void searchForText(string path, string text)
{
    using (PdfDocument pdf = new PdfDocument(path))
    {
        for (int i = 0; i < pdf.Pages.Count; i++)
        {
            string pageText = pdf.Pages[i].GetText();
            int index = pageText.IndexOf(text, 0, StringComparison.CurrentCultureIgnoreCase);
            if (index != -1)
                Console.WriteLine("'{0}' found on page {1}", text, i + 1); // i is zero-based; report 1-based page numbers
        }
    }
}

The library can also extract formatted and plain text from the whole document or any document page.

Disclaimer: I work for Bit Miracle, vendor of the library.
