The error you are receiving is caused by the length of the file, so the exact numbers in the message will vary from file to file. By default Nutch limits how much data it will download and parse; if your file is larger than that limit, Nutch will either truncate the content or give up on it entirely.
SOLUTION:
To remove that limit, set the content limit to -1. This tells Nutch that there is no limit on the length of a file, so it will keep fetching and parsing until the file is complete. Be aware that this can cause performance issues with very large files.
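If unbounded downloads worry you, note that the property also accepts any nonnegative byte count, so you can raise the ceiling instead of removing it. A minimal sketch, assuming a 64 MB cap suits your files (the exact value is your choice, not something Nutch prescribes):

<property>
  <name>http.content.limit</name>
  <value>67108864</value>
  <description>Truncate any downloaded content larger than 64 MB (67108864 bytes).</description>
</property>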
If you are using a web crawler, add this to nutch-site.xml:

<property>
  <name>http.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content, in bytes. If this value is nonnegative (>=0), content longer than it will be truncated; otherwise, no truncation at all.</description>
</property>
If you are using a filesystem crawler, add this to nutch-site.xml instead:

<property>
  <name>file.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content, in bytes. If this value is nonnegative (>=0), content longer than it will be truncated; otherwise, no truncation at all.</description>
</property>
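For context, nutch-site.xml is a standard Hadoop-style configuration file, so each <property> element must sit inside the top-level <configuration> element. A minimal sketch of the whole file for the web-crawler case (assuming nutch-site.xml lives in Nutch's conf directory, which is the usual layout):

<?xml version="1.0"?>
<configuration>
  <!-- Remove the download size limit so large files are fetched in full -->
  <property>
    <name>http.content.limit</name>
    <value>-1</value>
  </property>
</configuration>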