
Monday 29 July 2013

Error parsing: xxx failed(2,0): Unable to read 512 bytes from 65536 in stream of length 65421


This error is caused by the length of the file, so the byte counts in the message will differ from one file to the next.

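By default Nutch limits how much content it will fetch for each document (65536 bytes in a stock nutch-default.xml at the time of writing). If your file is larger than that limit, Nutch truncates the content, and the parser may then give up because the file it receives is incomplete.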


SOLUTION:

To remove the limit, set the content limit to -1. This tells Nutch not to truncate content at all, so it will fetch and parse the whole file regardless of its length. Be aware that this can cause performance problems with large files.
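
To make the rule concrete, here is a minimal Python sketch of the truncation behaviour as the property descriptions below state it: a nonnegative limit cuts the content to that many bytes, and a negative limit leaves it alone. It is an illustration only, not Nutch's own code, and the function name `apply_content_limit` is made up for this example.

```python
def apply_content_limit(content: bytes, limit: int) -> bytes:
    """Illustrative only: mimic how a content limit is described to behave."""
    if limit >= 0:
        # Nonnegative limit: keep at most `limit` bytes and drop the rest.
        return content[:limit]
    # Negative limit (for example -1): no truncation at all.
    return content


document = b"x" * 70000  # stand-in for a file larger than 65536 bytes

print(len(apply_content_limit(document, 65536)))  # 65536
print(len(apply_content_limit(document, -1)))     # 70000
```

A parser handed the first result sees a file that stops short of its real end, which is the situation the error message describes.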

If you are using a web crawler:

Add this property to nutch-site.xml, inside the <configuration> element:
<property>
  <name>http.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content, in bytes.
  If this value is nonnegative (>=0), content longer than it will be
  truncated; otherwise, no truncation at all.
  </description>
</property>

If you are using a filesystem crawler:

Add this property to nutch-site.xml, inside the <configuration> element:
<property>
  <name>file.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content, in bytes.
  If this value is nonnegative (>=0), content longer than it will be
  truncated; otherwise, no truncation at all.
  </description>
</property>
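
If the error persists after the change, it is worth checking that the property really ended up in the file Nutch reads. The sketch below uses only the Python standard library to parse a Hadoop-style configuration file and report the two limits. The helper name `read_content_limits` and the inline `SAMPLE` are assumptions made for illustration, not part of Nutch.

```python
import xml.etree.ElementTree as ET

LIMIT_NAMES = ("http.content.limit", "file.content.limit")


def read_content_limits(xml_data):
    """Return {property name: int value} for the content-limit properties.

    `xml_data` is the content of a Hadoop-style configuration file
    (str or bytes), such as nutch-site.xml.
    """
    root = ET.fromstring(xml_data)
    limits = {}
    for prop in root.iter("property"):
        name = (prop.findtext("name") or "").strip()
        value = (prop.findtext("value") or "").strip()
        # Only record the two limits, and skip entries with an empty value.
        if name in LIMIT_NAMES and value:
            limits[name] = int(value)
    return limits


# Small inline sample so the sketch runs without a Nutch installation.
SAMPLE = """
<configuration>
  <property>
    <name>http.content.limit</name>
    <value>-1</value>
  </property>
</configuration>
"""

found = read_content_limits(SAMPLE)
for name in LIMIT_NAMES:
    value = found.get(name)
    if value is None:
        print(name, "is not set here, so the default limit applies")
    elif value < 0:
        print(name, "is", value, "(no truncation)")
    else:
        print(name, "is", value, "bytes (content may be truncated)")
```

To check a real file, read it in binary mode and pass the bytes to `read_content_limits`, for example the contents of `conf/nutch-site.xml` (that path is an assumption; adjust it to your installation).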
