Announcing the PDF Document Parser in Bamboo Labs

Bamboo PDF Document Parser is a custom document parser for PDF documents and PDF forms.

It allows you to import metadata and form data from your PDF form into your SharePoint list (Promotion) and also populate your PDF document with metadata from the List (Demotion).

Download the installer from Bamboo Labs (http://community.bamboosolutions.com/media/p/4597.aspx)

Extract the zip file:

Run Setup.bat

 

You will see this setup screen:

Start by installing the PDF Document Parser on all of your web front ends. This is necessary because the parser is a COM object and therefore can't be installed using the WSP solution.

After this is done, install the PDF Document Parser build 0.5.0.0 on one of the servers. This is a WSP solution so it will be deployed to all servers in the farm.

Now it's time to activate the feature. It has farm scope, so you need to log in to Central Admin and go to Operations > Manage Farm Features.

Now you have activated parsing for all files with the PDF extension. Now it's time to create a document library which has properties corresponding to fields in the PDF document. We will use an IRS form SS-4 for our sample.

The document parser comes with a command-line tool, PDFhelp.exe, which lists the fields in a PDF form and can also create a document library with properties corresponding to those fields.

It has two commands, list and create. List does what it sounds like: it lists all fields in the specified document. Create creates a field for each of those properties in a document library that you specify. If the library doesn't exist, it will create it for you.

Properties from the PDF document are mapped to list fields by matching the property name to the internal name of the field. After the field has been created, you can rename it to something descriptive; this is called the field's Title or Display Name. (Because of this, you cannot remap properties by renaming an existing field; you have to create a new field.)
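The matching rule described above can be illustrated with a small Python sketch. This is not the parser's actual code; the Field class, the promote function, and the sample field names and values are hypothetical, and the sketch ignores any name encoding SharePoint may apply to internal names.

```python
from dataclasses import dataclass


@dataclass
class Field:
    internal_name: str  # fixed when the field is created; used for matching
    display_name: str   # the Title; can be renamed freely afterwards


def promote(pdf_properties: dict[str, str], fields: list[Field]) -> dict[str, str]:
    """Return {display name: value} for each PDF property whose name
    equals the internal name of a list field. Unmatched properties are skipped."""
    by_internal = {f.internal_name: f for f in fields}
    promoted = {}
    for name, value in pdf_properties.items():
        field = by_internal.get(name)
        if field is not None:
            promoted[field.display_name] = value
    return promoted


# Hypothetical fields and form data, for illustration only.
fields = [Field("f1_01(0)", "Legal name"), Field("f1_02(0)", "Trade name")]
pdf = {"f1_01(0)": "Acme Corp", "f1_99(0)": "unmapped"}
result = promote(pdf, fields)
```

Renaming a field's display name does not affect the match, because only the internal name is compared. That is also why a property can't be pointed at a different field by renaming.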

If you run the list command on the IRS form SS-4 you will see something like this:

 

The listing has the format [Field Name] == [Value]. Since we have not filled in anything, all values are empty.
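If you want to post-process that listing, for example to find the fields that are still empty, a short Python sketch like the following could do it. It assumes each output line is a field name and a value separated by "==", which is only inferred from the format described above; parse_listing and the sample text are hypothetical.

```python
def parse_listing(text: str) -> dict[str, str]:
    """Parse lines of the assumed form 'name == value' into a dict.
    Lines without '==' are ignored."""
    fields = {}
    for line in text.splitlines():
        if "==" not in line:
            continue
        # Split on the first '==' only, so values may contain '==' themselves.
        name, _, value = line.partition("==")
        fields[name.strip()] = value.strip()
    return fields


# Hypothetical sample output, not captured from the real tool.
sample = "f1_01(0) == \nf1_02(0) == Acme Corp\n"
parsed = parse_listing(sample)
empty = [name for name, value in parsed.items() if not value]
```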

If we run the create command on this file:

PDFhelp.exe -create fss4.PDF http://localhost/ss4library

 

You will get a document library looking like this:

As you can see, the field names don't tell us much about the fields; they are just f1_nn(0). If you have a form with field names like these, there's an easy way to give most of the fields a descriptive name: import a filled-out form. When a field has a value, the import program creates the field with the proper internal name but uses the field's value as its display name.
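The naming trick can be sketched in Python as follows. This only illustrates the rule described above, not how PDFhelp.exe is implemented; display_name_for and the sample form values are hypothetical, and falling back to the internal name for empty fields is an assumption based on the result of importing the blank form.

```python
def display_name_for(internal_name: str, value: str) -> str:
    """Use the filled-in value as the display name; if the field is
    empty, fall back to the internal name (assumed behavior)."""
    value = value.strip()
    return value if value else internal_name


# A template form in which each field is filled in with the label we want.
filled_form = {
    "f1_01(0)": "Legal name",
    "f1_02(0)": "Trade name",
    "f1_03(0)": "",  # left blank, so it keeps its cryptic name
}
columns = {name: display_name_for(name, value) for name, value in filled_form.items()}
```

The keys of columns stay the internal names used for matching, while the values are what users would see as column titles.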

Let's try it on the SS-4 form.

 

Now let's run the same import again with this form.

 

And your list looks like this:

 

Now it's time to upload this form to see that the properties are actually read from the PDF and stored in the list. We will take the same document we just used as a template.

Since this is a pre-release version you should be aware of some limitations: 

 

  • Property Demotion is not yet implemented.
  • All properties are currently promoted as strings.

 


Posted Sep 12 2008, 11:49 AM by Jonas Nilsson

Comments


dmac wrote re: Announcing the PDF Document Parser in Bamboo Labs
on Fri, Jul 17 2009 1:48 PM

I have followed the steps above and successfully created a doc library with the descriptive names for the fields. However, when uploading a new PDF, none of the form data is imported to the library; just the form itself shows up as an attachment. Did I misunderstand the idea behind this, that it would import the form data to the library fields if you upload a PDF?

Thanks,

Drew

Jonas Nilsson wrote re: Announcing the PDF Document Parser in Bamboo Labs
on Fri, Jul 17 2009 1:53 PM

Drew,

No, you did not misunderstand it. The properties in the PDF file should be promoted into the list properties.

Is it possible for you to send us a sample Pdf file?

/Jonas

dmac wrote re: Announcing the PDF Document Parser in Bamboo Labs
on Fri, Jul 17 2009 2:19 PM

Yes - please let me know where to send it.

thanks,

Drew

Jonas Nilsson wrote re: Announcing the PDF Document Parser in Bamboo Labs
on Fri, Jul 17 2009 2:30 PM

Drew,

You can post it in the Forum community.bamboosolutions.com/.../179.aspx

Thanks

/Jonas

Jonas Nilsson wrote re: Announcing the PDF Document Parser in Bamboo Labs
on Thu, Jul 23 2009 2:37 PM

Update

The problem dmac had was because of a bug in the x64 installer. It won't properly register Bamboo.PdfParser.dll as a COM object.

The workaround right now is to run regasm.exe and register the assembly located in the GAC:

C:\Windows\assembly\gac_msil\Bamboo.PdfParser\0.5.0.0__2cc91efae2d531be

We will update the installer.

/Jonas

pkingswood wrote re: Announcing the PDF Document Parser in Bamboo Labs
on Wed, Sep 2 2009 12:35 PM

Following on from my earlier post about the file not being parsed correctly: I have now found that a number of fields are dropdown lists, and the value taken from these fields and populated into the list is the ID value, not the text. Any chance of an update to give the text, or is there source available or a workaround to get this data?

Jonas Nilsson wrote re: Announcing the PDF Document Parser in Bamboo Labs
on Wed, Sep 2 2009 1:49 PM

pkingswood,

We will take a look at this and post an updated parser.

Thanks for submitting this feedback.

/Jonas


Bamboo Solutions Corporation, 2002-2010