Reading a Microsoft Word Document in Java

Java has no built-in classes for reading Microsoft Office Word documents, but the Apache POI package, developed by the Apache Software Foundation, lets you read Word documents in Java. The example below uses POI's HWPF classes, which handle the binary .doc format. More information on the package can be found on the Apache POI website.

import org.apache.poi.poifs.filesystem.*;
import org.apache.poi.hwpf.*;
import org.apache.poi.hwpf.extractor.*;
import java.io.*;

public class ReadDoc
{
	public static void main( String[] args )
	{
		String fileName = "Hello.doc";
		// try-with-resources closes the stream even if parsing fails
		try ( InputStream in = new FileInputStream(fileName) )
		{
			// Open the OLE2 container that wraps the binary .doc file
			POIFSFileSystem fs = new POIFSFileSystem(in);

			// Parse the Word document stored in that container
			HWPFDocument doc = new HWPFDocument(fs);

			WordExtractor we = new WordExtractor(doc);

			// One array entry per paragraph
			String[] paragraphs = we.getParagraphText();

			System.out.println( "Word Document has " + paragraphs.length + " paragraphs" );
			for( int i=0; i<paragraphs.length; i++ ) {
				// Strip the line terminator before measuring the paragraph
				paragraphs[i] = paragraphs[i].replaceAll("\\cM?\r?\n","");
				System.out.println( "Length:"+paragraphs[ i ].length());
			}
		}
		catch(Exception e) {
			e.printStackTrace();
		}
	}
}
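
Because Hello.doc is given as a relative name, the JVM resolves it against the directory it was started from, which is not necessarily the directory holding the .class file. The following is a minimal sketch for checking where the name resolves before opening it; the class name DocLocator and the method describe are made up for illustration and are not part of POI.

```java
import java.io.File;

public class DocLocator
{
	// Reports where a file name resolves to and whether a readable file exists there.
	static String describe( String fileName )
	{
		File f = new File(fileName);
		return f.getAbsolutePath() + (f.isFile() && f.canRead() ? " (found)" : " (missing)");
	}

	public static void main( String[] args )
	{
		// Relative names are resolved against this directory
		System.out.println( "Working directory: " + System.getProperty("user.dir") );
		System.out.println( describe("Hello.doc") );
	}
}
```

If the file is reported as missing, either move it into the working directory or pass an absolute path to FileInputStream.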

Code Explanation:

  • A new POIFSFileSystem object is created from an input stream over the Microsoft Word document
  • A new HWPFDocument object is created from it; this class is responsible for handling the binary Microsoft Word document format
  • WordExtractor extracts the text from the Word document
  • getParagraphText() returns that text as an array with one entry per paragraph
  • Finally, the loop strips the line terminator from each paragraph and prints its length
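
The effect of the replaceAll call in the last step can be seen in isolation, without POI. In the pattern, \cM is the regex escape for Ctrl-M (a carriage return), so the expression matches an optional carriage return or two followed by a line feed and deletes it. This is a small sketch; the class name ParagraphCleanup and the sample string are made up for illustration.

```java
public class ParagraphCleanup
{
	// Same pattern as the listing: optional Ctrl-M (carriage return), optional CR, then a line feed.
	static String stripTerminator( String paragraph )
	{
		return paragraph.replaceAll("\\cM?\r?\n", "");
	}

	public static void main( String[] args )
	{
		String sample = "Hello World\r\n";   // made-up paragraph text
		String cleaned = stripTerminator(sample);
		// prints "13 -> 11"
		System.out.println( sample.length() + " -> " + cleaned.length() );
	}
}
```

Note that replaceAll removes every match, so line feeds in the middle of a paragraph are deleted as well, while a carriage return that is not followed by a line feed is left alone.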


  1. November 24th, 2008 at 08:15 | #1
    Subramanyam

    Hi,

    I am getting below exception while running this example.

    Could you please let me know if I am missing any jars/ need to do anything else to execute this java class.

    Thanks in advance for your help.

    Regards,
    Subramanyam.

  2. November 24th, 2008 at 08:16 | #2
    Subramanyam

    Hi,

    Sorry for the spam. Attaching the exception.

    I am getting below exception while running this example.

    java.io.IOException: Invalid header signature; read 7021802808062469458, expected -2226271756974174256
    at org.apache.poi.poifs.storage.HeaderBlockReader.<init>(HeaderBlockReader.java:112)
    at org.apache.poi.poifs.filesystem.POIFSFileSystem.<init>(POIFSFileSystem.java:151)
    at com.general.test.ReadDoc.main(ReadDoc.java:16)

    Could you please let me know if I am missing any jars/ need to do anything else to execute this java class.

    Thanks in advance for your help.

    Regards,
    Subramanyam.

  3. December 18th, 2008 at 05:16 | #3
    Nishikanta Sahoo

    After running this code I got the exception below. Please suggest a solution. I have already added the jar, but I still get this exception. One thing: I could not find EncryptedDocumentException.class in the jar.

    Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/poi/EncryptedDocumentException
    at ws.WordRead.main(WordRead.java:38)
    ERROR: JDWP Unable to get JNI 1.2 environment, jvm->GetEnv() return code = -2
    JDWP exit error AGENT_ERROR_NO_JNI_ENV(183): [../../../src/share/back/util.c:820]

  4. December 18th, 2008 at 23:33 | #4

    Hi Nishikanta,
    I used the POI-3.0.2-Final.jar and poi-scratchpad-3.0.2-FINAL-20080204.jar packages for this code.

  5. March 18th, 2009 at 02:00 | #5
    slim

    After running this code, the exception "java.io.FileNotFoundException: hello.doc (The system cannot find the file specified)" was generated.
    So where must I place hello.doc? (I created it on my desktop.) Thanks.

  6. March 18th, 2009 at 09:50 | #6

    Hi Slim,
    Just place hello.doc where the .class file resides. If you put the doc file in another location, then specify its path in the source code. It will work fine.

    Thanks,
    Hitesh Agrawal

  7. March 24th, 2009 at 03:51 | #7
    slim

    Hi,
    thanks for the answer.
    The script works very well.
    What is the effect of using paragraphs[i] = paragraphs[i].replaceAll("\\cM?\r?\n","");

    thanks

  8. March 24th, 2009 at 04:12 | #8
    laker

    Hi,
    Thanks for this post, it’s very useful.
    I’m trying to find a word in my Word file after reading the file.
    How can I do it?

    Thanks a lot

  9. April 9th, 2009 at 06:23 | #9
    amit

    java.io.IOException: Unable to read entire header; 6 bytes read; expected 512 bytes
    at org.apache.poi.poifs.storage.HeaderBlockReader.<init>(HeaderBlockReader.java:78)
    at org.apache.poi.poifs.filesystem.POIFSFileSystem.<init>(POIFSFileSystem.java:83)
    at org.apache.poi.hwpf.HWPFDocument.verifyAndBuildPOIFS(HWPFDocument.java:133)
    at org.apache.poi.hwpf.HWPFDocument.<init>(HWPFDocument.java:146)
    at transactionDB.changeFormat.main(changeFormat.java:45)

    This error is displayed. Please tell me what I have to do.

  10. May 20th, 2009 at 21:53 | #10
    Ankur Raiyani

    Hello Hitesh,

    Thanks for sharing this example. I have a different requirement for a Word file: I want to add an image to a Word document using POI, but I don’t know how to do this.

    Thanks,
    Ankur Raiyani

  11. July 2nd, 2009 at 08:57 | #11

    How do I read Word comments and bookmarks using Java? Do you have sample code? Any help would be appreciated.
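
The header exceptions reported in comments #2 and #9 are raised by POIFSFileSystem when the input is not an OLE2 container, the binary format used by .doc files. The expected value -2226271756974174256 in comment #2 is the eight signature bytes D0 CF 11 E0 A1 B1 1A E1 read as a little-endian long. The value actually read, 7021802808062469458, decodes to the ASCII text "Registra", which suggests that file was plain text, and the file in comment #9 supplied only 6 bytes. The following is a hedged sketch of a pre-check using only the JDK; Ole2Check and its methods are hypothetical names, not part of POI.

```java
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class Ole2Check
{
	// OLE2 signature bytes D0 CF 11 E0 A1 B1 1A E1 as a little-endian long.
	static final long OLE2_SIGNATURE = 0xE11AB1A1E011CFD0L;

	// Interprets the first eight bytes of a header as a little-endian long.
	static long signatureOf( byte[] header )
	{
		return ByteBuffer.wrap(header, 0, 8).order(ByteOrder.LITTLE_ENDIAN).getLong();
	}

	// Returns true if the file starts with the OLE2 signature; false if it is too short or differs.
	static boolean looksLikeOle2( String fileName ) throws IOException
	{
		byte[] header = new byte[8];
		try ( DataInputStream in = new DataInputStream(new FileInputStream(fileName)) )
		{
			in.readFully(header);
		}
		catch ( EOFException tooShort )
		{
			return false;
		}
		return signatureOf(header) == OLE2_SIGNATURE;
	}

	public static void main( String[] args ) throws IOException
	{
		String fileName = args.length > 0 ? args[0] : "Hello.doc";
		System.out.println( looksLikeOle2(fileName) ? "OLE2 container" : "not an OLE2 file" );
	}
}
```

A file that fails this check (for example a plain-text, RTF, or .docx file saved with a .doc extension) cannot be opened with POIFSFileSystem and HWPFDocument, whichever jars are on the classpath.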
