Reading a Microsoft Word Document in Java

Java has no built-in classes for reading Microsoft Office Word documents, but the Apache POI library, developed by the Apache Software Foundation, lets you read them from Java. The example here uses POI's HWPF classes, which handle the binary .doc format. More information on the package can be found on the Apache POI website.

import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.extractor.WordExtractor;
import org.apache.poi.poifs.filesystem.POIFSFileSystem;

public class ReadDoc
{
    public static void main( String[] args )
    {
        // Path to the .doc file; a relative path is resolved against the working directory
        String fileName = "Hello.doc";

        try ( InputStream in = new FileInputStream( fileName ) )
        {
            // Open the OLE2 container that holds the Word document
            POIFSFileSystem fs = new POIFSFileSystem( in );

            // Parse the Word document stored in the container
            HWPFDocument doc = new HWPFDocument( fs );

            // Extract the text, one array entry per paragraph
            WordExtractor we = new WordExtractor( doc );
            String[] paragraphs = we.getParagraphText();

            System.out.println( "Word Document has " + paragraphs.length + " paragraphs" );
            for ( int i = 0; i < paragraphs.length; i++ ) {
                // Strip line terminators (CR/LF) from the paragraph text
                paragraphs[i] = paragraphs[i].replaceAll( "\\cM?\r?\n", "" );
                System.out.println( "Length:" + paragraphs[i].length() );
            }
        }
        catch ( Exception e ) {
            e.printStackTrace();
        }
    }
}
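
The replaceAll call in the loop above is easy to misread, so here is a minimal, self-contained sketch of what its pattern does. It does not use POI at all; the class name ParagraphCleanup, the helper stripLineEnd and the sample strings are made up for illustration. In the pattern, \cM is the regex escape for Ctrl-M (a carriage return), so the expression matches a line feed preceded by up to two optional carriage returns.

```java
class ParagraphCleanup {
    // Same pattern as in the article: optional Ctrl-M (carriage return, written \cM),
    // a second optional carriage return, then a line feed.
    static final String LINE_END = "\\cM?\r?\n";

    // Removes every match of the pattern from the given text.
    static String stripLineEnd(String paragraph) {
        return paragraph.replaceAll(LINE_END, "");
    }

    public static void main(String[] args) {
        // Hypothetical sample paragraphs with different line endings
        String[] samples = { "Hello world\r\n", "Unix style\n", "No terminator" };
        for (String s : samples) {
            String cleaned = stripLineEnd(s);
            System.out.println(cleaned + " -> length " + cleaned.length());
        }
    }
}
```

Because the pattern always requires a line feed, a carriage return that is not followed by one is left in the text.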

Code Explanation:

  • Create a new POIFSFileSystem object and pass it an input stream for the Microsoft Word document
  • Create a new HWPFDocument object; this class is responsible for handling Microsoft Word (.doc) documents
  • WordExtractor extracts the text from the Word document
  • getParagraphText() returns that text paragraph by paragraph
  • Finally, we loop over the paragraphs, strip their line terminators and print the length of each one
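
Several of the "Invalid header signature" exceptions reported in the comments come from the very first step: POIFSFileSystem only accepts OLE2 compound files, the container used by the binary .doc format. A small pre-check can tell you whether a file is really in that format before you hand it to POI. What follows is a sketch, not part of POI; the class name Ole2Check and its methods are hypothetical. The eight signature bytes are the little-endian form of the value 0xE11AB1A1E011CFD0 that one of the stack traces in the comments reports as expected.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class Ole2Check {
    // First eight bytes of an OLE2 compound file (the container of binary .doc files).
    static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, (byte) 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, (byte) 0x1A, (byte) 0xE1
    };

    // True if the given bytes start with the OLE2 signature.
    static boolean hasOle2Signature(byte[] header) {
        if (header.length < OLE2_MAGIC.length) {
            return false;
        }
        for (int i = 0; i < OLE2_MAGIC.length; i++) {
            if (header[i] != OLE2_MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    // Reads up to eight bytes from the file and checks them against the signature.
    static boolean looksLikeOle2(Path file) throws IOException {
        byte[] header = new byte[OLE2_MAGIC.length];
        int filled = 0;
        try (InputStream in = Files.newInputStream(file)) {
            while (filled < header.length) {
                int n = in.read(header, filled, header.length - filled);
                if (n < 0) {
                    break; // file is shorter than the signature
                }
                filled += n;
            }
        }
        return filled == header.length && hasOle2Signature(header);
    }

    public static void main(String[] args) throws IOException {
        // Assumed default file name, taken from the article's example
        Path file = Paths.get(args.length > 0 ? args[0] : "Hello.doc");
        if (!Files.isRegularFile(file)) {
            System.out.println(file + " not found");
        } else if (looksLikeOle2(file)) {
            System.out.println(file + " looks like an OLE2 file");
        } else {
            System.out.println(file + " is not an OLE2 file (maybe .docx, RTF, or too short)");
        }
    }
}
```

A .docx file is a ZIP archive that starts with the bytes PK, and an RTF file starts with {\rtf, so both fail this check even when they carry a .doc extension. For .docx files, POI provides the separate XWPF classes instead of HWPF.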


Categories: Java
  1. Subramanyam
    November 24th, 2008 at 08:15 | #1

    Hi,

    I am getting below exception while running this example.

    Could you please let me know if I am missing any jars/ need to do anything else to execute this java class.

    Thanks in advance for your help.

    Regards,
    Subramanyam.

  2. Subramanyam
    November 24th, 2008 at 08:16 | #2

    Hi,

    sorry for the spam. attaching exception.

    I am getting below exception while running this example.

    java.io.IOException: Invalid header signature; read 7021802808062469458, expected -2226271756974174256
    at org.apache.poi.poifs.storage.HeaderBlockReader.<init>(HeaderBlockReader.java:112)
    at org.apache.poi.poifs.filesystem.POIFSFileSystem.<init>(POIFSFileSystem.java:151)
    at com.general.test.ReadDoc.main(ReadDoc.java:16)

    Could you please let me know if I am missing any jars/ need to do anything else to execute this java class.

    Thanks in advance for your help.

    Regards,
    Subramanyam.

  3. Nishikanta Sahoo
    December 18th, 2008 at 05:16 | #3

    After running this code I got the exception below. Please give me a solution for this exception. I already added the jar, but I still get this exception. One thing: I could not find EncryptedDocumentException.class in the jar.

    Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/poi/EncryptedDocumentException
    at ws.WordRead.main(WordRead.java:38)
    ERROR: JDWP Unable to get JNI 1.2 environment, jvm->GetEnv() return code = -2
    JDWP exit error AGENT_ERROR_NO_JNI_ENV(183): [../../../src/share/back/util.c:820]

  4. December 18th, 2008 at 23:33 | #4

    Hi Nishikanta,
    I used the POI-3.0.2-Final.jar and poi-scratchpad-3.0.2-FINAL-20080204.jar packages for this code.

  5. slim
    March 18th, 2009 at 02:00 | #5

    After running this code, the exception "java.io.FileNotFoundException: hello.doc (The system cannot find the file specified)" was generated.
    So where must I place hello.doc (I created it on my desktop)? Thanks.

    • March 18th, 2009 at 09:50 | #6

      Hi Slim,
      Just place hello.doc where the .class file resides. If you put the doc file at another location, then specify the location path in the source code. It will work fine.

      Thanks,
      Hitesh Agrawal

  6. slim
    March 24th, 2009 at 03:51 | #7

    hi,
    thanks for the answer.
    the script works very well.
    What is the effect of using paragraphs[i] = paragraphs[i].replaceAll("\\cM?\r?\n","") in the loop?

    thanks

  7. laker
    March 24th, 2009 at 04:12 | #8

    Hi,
    Thanks for this post, it’s very useful.
    I’m trying to find a word on my word file after reading the file.
    How can i do it??

    Thanks a lot

  8. amit
    April 9th, 2009 at 06:23 | #9

    java.io.IOException: Unable to read entire header; 6 bytes read; expected 512 bytes
    at org.apache.poi.poifs.storage.HeaderBlockReader.<init>(HeaderBlockReader.java:78)
    at org.apache.poi.poifs.filesystem.POIFSFileSystem.<init>(POIFSFileSystem.java:83)
    at org.apache.poi.hwpf.HWPFDocument.verifyAndBuildPOIFS(HWPFDocument.java:133)
    at org.apache.poi.hwpf.HWPFDocument.<init>(HWPFDocument.java:146)
    at transactionDB.changeFormat.main(changeFormat.java:45)

    This error is displayed. Please tell me what I have to do.

  9. Ankur Raiyani
    May 20th, 2009 at 21:53 | #10

    Hello hitesh,

    thanks for sharing this example. I have a different requirement with word file. I want to add an image into word document using POI, but don’t know how to do this.

    Thanks,
    Ankur Raiyani

  10. July 2nd, 2009 at 08:57 | #11

    How do I read Word comments and bookmarks using Java? Do you have sample code? Any help would be appreciated.

  11. Sathish Raja
    July 10th, 2009 at 04:31 | #12

    hi friends,
    Can anyone help me with this? I used this code and I am getting these exceptions. I am using the poi-2.5.1-final-20040804.jar and poi-scratchpad-3.5-beta5-20090219.jar files. How do I specify the location path in the source code? I kept the file on the desktop.

    java.io.IOException: Invalid header signature; read 85966670672, expected -2226271756974174256
    at org.apache.poi.poifs.storage.HeaderBlockReader.<init>(HeaderBlockReader.java:88)
    at org.apache.poi.poifs.filesystem.POIFSFileSystem.<init>(POIFSFileSystem.java:83)
    at rb.action.FileRead.main(FileRead.java:15)

  12. prabhu
    July 11th, 2009 at 00:40 | #13

    Sathish Raja,

    Have you fixed the issue, if fixed please post the steps

  13. Darren Slevin
    July 15th, 2009 at 14:02 | #14

    Hi Hitesh,

    where do I store the POI-3.0.2-Final.jar and poi-scratchpad-3.0.2-FINAL-20080204.jar files. I am just trying to get the example above working. Cheers for the help.

    Darren

  14. devday
    July 17th, 2009 at 23:20 | #15

    Hi friends,

    On executing this code I am getting the following error. Can anyone tell me how to resolve this problem?

    java.io.IOException: Unable to read entire header; -1 bytes read; expected 512 bytes
    at org.apache.poi.poifs.storage.HeaderBlockReader.<init>(HeaderBlockReader.java:78)
    at org.apache.poi.poifs.filesystem.POIFSFileSystem.<init>(POIFSFileSystem.java:83)

  15. miche
    July 18th, 2009 at 18:33 | #16

    Hello! I'm really lost. I am very new to POI, but I have to use it for my project, which is to read a Word doc using Java. How can I "import" the org.apache.poi package? I have downloaded poi-3.5-beta6 and it asked me to install Ant and Forrest, and to set the ANT_HOME and FORREST_HOME environment variables. Please help me, I'm confused!

  16. Sulabh
    August 4th, 2009 at 04:14 | #17

    Hi friends,

    I am trying to change the font size of a text.
    To do this I am writing one HWPF stream to another and hence can change the font, but what I exactly need is to have different font(and/or size) for each word/paragraph. Basically to have more than one font size in a single piece of word file.
    Can anybody please tell me how to go about doing this ??

  17. Sulabh
    August 4th, 2009 at 04:15 | #18

    what I exactly need is …
    dgd gedgfe
    rbr brbr gbntghth
    rghh rtfhtyh bnfgh
    that is each word having different font properties

  18. Shriddha
    August 27th, 2009 at 02:29 | #19

    getting error:
    java.lang.NoClassDefFoundError: org/apache/poi/hpsf/WritingNotSupportedException

  19. gokul
    September 10th, 2009 at 02:43 | #20

    hi,

    I have executed your Java program to read a Word document. It works fine, but if the Word document has tables, your code produces a malicious script and runs in an infinite loop.

    Please tell me whether there are any methods to read data from tables in a Word document.

  20. Josh
    November 13th, 2009 at 09:05 | #21

    @Ankur Raiyani
    Did you have any luck getting apache POI to insert images into a word document. I am trying to do the same thing.

  21. December 27th, 2009 at 01:11 | #22

    Thank you very much.

  22. shady
    January 6th, 2010 at 23:04 | #23

    Please, I need help quickly: I use two files, one with a header and one without. When I pass in the file without a header, it gives me this error:
    java.io.IOException: Invalid header signature; read 0x665C316674725C7B, expected 0xE11AB1A1E011CFD0
    at org.apache.poi.poifs.storage.HeaderBlockReader.<init>(HeaderBlockReader.java:107)
    at org.apache.poi.poifs.filesystem.POIFSFileSystem.<init>(POIFSFileSystem.java:151)
    at wordtotext.Main.main(Main.java:30)
    The second file runs fine. Please help me.

  23. W.K.Kasun Chamika
    March 21st, 2010 at 21:51 | #24

    Thank you for the code

    System.out.println(paragraphs[ i ].toString()); // to print the paragraphs

  24. Sushree Das
    March 23rd, 2010 at 04:30 | #25

    Can anyone please provide me with Java code to insert an image into an MS Word file at any location? Consider that the file already has some content in it. Please reply.

  25. Sushree Das
    March 23rd, 2010 at 04:31 | #26

    please let me know how to insert image into a word doc file

  26. param
    April 1st, 2010 at 01:52 | #27

    please let me know how can we read images of .doc file along with text using java

  27. April 21st, 2010 at 15:29 | #28

    Excellent.

    Thank you very much.

  28. UJJAL
    May 6th, 2010 at 02:14 | #29

    I am a beginner in Java. When I compile this example I get 9 errors.
    Help me please…

    package org.apache.poi.poifs.filesystem does not exist
    import org.apache.poi.poifs.filesystem.*;

    package org.apache.poi.hwpf does not exist
    import org.apache.poi.hwpf.*;

    package org.apache.poi.hwpf.extractor does not exist
    import org.apache.poi.hwpf.extractor.*;

    cannot find symbol
    symbol : class POIFSFileSystem
    location: class readDoc
    POIFSFileSystem fs = null;

    cannot find symbol
    symbol : class POIFSFileSystem
    location: class readDoc
    fs = new POIFSFileSystem(new FileInputStream(filesname));

    cannot find symbol
    symbol : class HWPFDocument
    location: class readDoc
    HWPFDocument doc = new HWPFDocument(fs);

    cannot find symbol
    symbol : class HWPFDocument
    location: class readDoc
    HWPFDocument doc = new HWPFDocument(fs);

    cannot find symbol
    symbol : class WordExtractor
    location: class readDoc
    WordExtractor we = new WordExtractor(doc);

    cannot find symbol
    symbol : class WordExtractor
    location: class readDoc
    WordExtractor we = new WordExtractor(doc);

    9 errors

  29. UJJAL
    May 6th, 2010 at 07:04 | #31

    Please anyone help me…
    Let me know about the basic job of mine to read from a document..

  30. May 12th, 2010 at 18:22 | #32

    Very nice information.

  31. Piotr Rychlik
    May 14th, 2010 at 09:35 | #33

    Is it possible to edit .doc and/or .docx documents with POI? I’d like to be able to replace certain text fragments in several Word documents and then save updated documents to disk.

  32. UJJAL
    May 16th, 2010 at 05:27 | #34

    This code reads a .doc file paragraph by paragraph.
    How can I read this file sentence by sentence?

    Thanks in advance.

  33. melaal
    May 22nd, 2010 at 01:51 | #35

    How can I read doc with text and images ?

  34. melaal
    May 22nd, 2010 at 01:56 | #36

    and how I can read text with style ?

  35. Piotr Rychlik
    May 24th, 2010 at 04:39 | #37

    Hi,

    How to replace one string for another in .doc documents?

  36. Piotr Rychlik
    May 26th, 2010 at 13:35 | #38

    I think there are a lot of serious bugs in the implementation of HWPF format, e.g. the following:

    HWPFDocument doc = new HWPFDocument(inputStream);
    doc.write(outputStream);

    turns .doc files into something that cannot be opened with Word anymore.

  37. bshirota
    June 11th, 2010 at 15:32 | #39

    Hitesh,

    Thanks for this. Excellent post..saved me a ton of searching.

  38. gayan
    June 22nd, 2010 at 03:31 | #40

    How do I identify the heading of the .doc file?

    please…

    send me the code…

  39. gayan
    June 22nd, 2010 at 03:33 | #41

    How do I identify the heading of the .doc file using Apache POI?

    please…

    send me the code…

  40. Brijesh
    July 7th, 2010 at 23:17 | #42

    Hi

    Can you please tell me how to read a .doc file that has images in it?

    Post some code if possible..

  41. August 18th, 2010 at 01:57 | #43

    @Subramanyam
    hi
    I want to read a .doc file using the POI interface, but I am getting an error on the package and on WordExtractor. Please help me.

    thank you in advance

  42. shehan weerasinghe
    September 1st, 2010 at 08:04 | #44

    Hi
    I just want to load some information from database table and load that data into a word document. Finally I want to create a way to load database’s data into a word document on a single button click in my java application. Thank you
    Need help as soon as possible…

    Thanks and best regards,
    Shehan.
