
Friday, 16 August 2013

How to fix /windows/system32/config/system missing or corrupt WITHOUT XP Recovery CD

So I recently came across the dreaded "/windows/system32/config/system file is missing or corrupt" error and I didn't have my XP disk to hand.  No matter which forum I went to or blog post I read, nobody really had a viable solution, so I thought I would write up the guide myself.

Obviously you don't have the XP Recovery CD if you are reading this, but if you do then Microsoft's guide on MSDN is pretty thorough and worth reading, although this way is much quicker if you have what you need at hand.


What you need:


Because you don't have the Recovery disk you need some way of accessing the files on the hard drive, so I have suggested a few ways of doing so.  There will be others, so please comment if you know of other simple ways.
  1. USB to SATA/IDE cable.  Because we are working with Windows XP there is a high chance you may still have a Hard Drive with an IDE interface so buy your cable accordingly.  There is a picture below to help if you are unsure. You will need a second PC to plug this into to allow you to access the files. (Cheap)
  2. USB Caddy or Enclosure.  Again, ensure your Hard Drive interface matches that of the caddy otherwise it will not fit. SATA to SATA, IDE to IDE. You will then just need a USB to USB cable to connect the two. You will need a second PC to plug this into to allow you to access the files. (Cheap)
  3. Download Windows Mini XP or Mini 7.  You will need to download a .iso file of either of these. They are available from many sources, one of which is Hiren's Boot CD.  Once you have downloaded it, use software like ImgBurn to burn it onto a CD.  When your computer is booting, go to set-up (usually by pressing F2) and ensure that booting from CD is your first option. Once you boot from the CD you should be working in a cut-down version of either of these operating systems and should have access to the files on your hard drive.  (Free)
  4. Download Knoppix.  Knoppix is similar to Mini XP or Mini 7: it is a small operating system that you can run from a CD.  Download the image (.iso file) and burn it to a CD using ImgBurn. Go to set-up (usually by pressing F2) and ensure that booting from CD is your first option. Once you boot from the CD you should be in an environment where you have access to your hard drive.  There are guides to downloading and burning Knoppix to a CD on the internet that will be helpful to you. (Free)
SATA Interface - Hard Drive





IDE Interface - Hard Drive





The Guide:


By this point, whichever method you chose, you should have access to the file-system.  Navigate to C:/System Volume Information  (assuming C: is your drive letter).   This folder contains a folder of past restore points that will help you solve your problem.  If the folder is hidden, there is a guide at the end you can follow to make it visible.  You will then want to navigate to the _restore{ [letters & digits] } folder. Inside here you will see lots of folders beginning with RP followed by some digits. Navigate into any of these folders (the files normally sit in its snapshot subfolder) and copy the following files:
  1. _REGISTRY_MACHINE_SECURITY
  2. _REGISTRY_MACHINE_SOFTWARE
  3. _REGISTRY_MACHINE_SYSTEM
  4. _REGISTRY_MACHINE_SAM
  5. _REGISTRY_USER_.DEFAULT
Once you have copied these files you need to navigate to the /windows/system32/config folder and paste the files there.
If any of the following files exist, you should delete them:
  1. SECURITY
  2. SOFTWARE
  3. SYSTEM
  4. SAM
  5. DEFAULT
You then want to rename the following:
  1. _REGISTRY_USER_.DEFAULT to DEFAULT
  2. _REGISTRY_MACHINE_SECURITY to SECURITY
  3. _REGISTRY_MACHINE_SOFTWARE to SOFTWARE
  4. _REGISTRY_MACHINE_SYSTEM to SYSTEM
  5. _REGISTRY_MACHINE_SAM to SAM
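If you are working from a second PC or a live CD that has Python on it, the copy, delete and rename steps above can be sketched as a small script. This is only a rough sketch under my own assumptions: the function name restore_hives and the two folder arguments are made up for illustration, and instead of deleting the existing files it renames them with a .old suffix so you can undo the change.

```python
import shutil
from pathlib import Path

# Restore-point file name -> the name Windows expects in the config folder.
HIVE_NAMES = {
    "_REGISTRY_USER_.DEFAULT": "DEFAULT",
    "_REGISTRY_MACHINE_SECURITY": "SECURITY",
    "_REGISTRY_MACHINE_SOFTWARE": "SOFTWARE",
    "_REGISTRY_MACHINE_SYSTEM": "SYSTEM",
    "_REGISTRY_MACHINE_SAM": "SAM",
}


def restore_hives(restore_point_dir, config_dir):
    """Copy the five registry backups into config_dir under their live names.

    Both arguments are hypothetical paths chosen by the caller, e.g. the
    folder holding the _REGISTRY_* files and the windows/system32/config
    folder of the broken installation.
    """
    src = Path(restore_point_dir)
    dst = Path(config_dir)

    # Refuse to touch anything unless every backup file is present.
    missing = [name for name in HIVE_NAMES if not (src / name).is_file()]
    if missing:
        raise FileNotFoundError("restore point is missing: " + ", ".join(missing))

    for backup_name, hive_name in HIVE_NAMES.items():
        target = dst / hive_name
        if target.exists():
            # Keep the old (possibly corrupt) file as NAME.old instead of deleting it.
            target.replace(dst / (hive_name + ".old"))
        shutil.copy2(src / backup_name, target)
    return sorted(HIVE_NAMES.values())
```

Keeping the .old copies costs a little disk space, but it means a bad restore point does not leave you worse off than when you started.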
You can then reboot your machine.  You now want to boot to the Hard Drive so if you changed any settings in the BIOS Set-up then change them back.  You should successfully boot into Windows XP and everything should seem normal, but we now need to do a System Restore.  To do a system restore you:
  1. Click Start, then click All Programs.
  2. Click Accessories, then click System Tools.
  3. Click System Restore, and then click Restore to a previous RestorePoint.
  4. Choose a restore point, preferably not too far in the past otherwise your applications may all require updating or changing. It will not delete any of your data.
 Once the system restore has completed you should have a fully functioning machine again. 

Showing Hidden Files & Folders On Microsoft Windows XP Professional or Windows XP Home Edition:

  1. Click Start, then My Computer.
  2. On the Tools menu, click Folder Options.
  3. On the View tab, click Show hidden files and folders.
  4. Clear the Hide protected operating system files (Recommended) check box. Click Yes when you are prompted to confirm the change.
  5. Click OK.

















Nutch Re-crawling

Nutch allows you to crawl the web or a filesystem in order to build up an index store of all its content. If your objective is simply to crawl the content once, it is fairly easy. But if you want to continuously monitor a site and crawl updates, it can be harder, mainly because the Nutch documentation does not have many details about it.

Nutch will update any previously indexed urls/files, delete any inactive ones and add any new ones it encounters while re-crawling. There are a few very simple settings that need to be changed if you wish Nutch to do this.
Nutch stores a record of all the files/urls it has encountered whilst doing its crawl; this record is called the crawldb. Initially it is built from the list of urls/files provided by the user using the inject command, which will normally be taken from your seed.txt file (in our case, the nutch.txt file).
Nutch uses a generate/fetch/update process:
generate: This command looks at the crawldb for all the urls/files that are due for fetching and groups them into a segment. A url/file is due for fetching if it is new, or if its fetch interval has expired and it is now due for re-crawling (the default is 30 days).
fetch: This command will go and fetch all the urls/files specified in the segment.
update: This command will add the results of the crawl, which have been stored in the segment, into the crawldb. Each url/file will be updated to indicate the time it was fetched and when its next scheduled fetch is. If any new urls/files have been discovered, they will be added and marked as not fetched.
How can Nutch detect whether a page has changed or not? Each time a page is fetched, Nutch computes a signature for the page. At the next fetch, if the signature is the same (or if a 304 is returned by the web server because of the If-Modified-Since header), Nutch knows the page has not been modified; otherwise it is marked as modified. (It is not just the content: if the http headers or metatags have changed it will be marked as modified.) If a document no longer exists the server returns a 404 and the url/file will be marked DB_GONE. During the update cycle Nutch has the ability to purge all the urls/files that have been marked DB_GONE.
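To make the idea concrete, here is a minimal sketch of that decision in Python. It is not Nutch's code: the function names are my own, the status strings are only labels for this example, and it hashes the page body alone with MD5, so it ignores the header and metatag changes mentioned above.

```python
import hashlib


def content_signature(content: bytes) -> str:
    """Return an MD5 hex digest of the page body (illustrative signature)."""
    return hashlib.md5(content).hexdigest()


def classify_fetch(previous_signature, http_status, content=b""):
    """Decide what a re-fetch tells us about a url/file.

    previous_signature is the signature stored at the last fetch,
    or None if the url/file has never been fetched before.
    """
    if http_status == 404:
        return "db_gone"          # document no longer exists
    if http_status == 304:
        return "db_notmodified"   # server says nothing changed
    if previous_signature is not None and content_signature(content) == previous_signature:
        return "db_notmodified"   # same signature, same page
    return "db_fetched"           # new or changed page
```

For example, classify_fetch(content_signature(b"hello"), 200, b"hello") reports the page as not modified, while the same call with different content reports it as fetched.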
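The linkdb stores the inverted link information that Nutch has gathered from the crawl (for each url, the known incoming links and their anchor text). During the solrindex process Nutch combines this with the crawldb and the fetched segments to build the data it passes to the Solr Server.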

Set-Up

The first thing is, you need to allow Nutch to re-crawl. You may have noticed that if you try to run a crawl on the same source using the same folder name as you used before that it will tell you there are no more urls/files to crawl. There is a default setting for a url/file that states it may not be re-crawled for 30 days. So edit your nutch-site.xml and edit the value to your liking:
<property>
  <name>db.fetch.interval.default</name>
  <value>43200</value>
  <description>The number of seconds between re-fetches of a page. The value here is 43200 (12 hours); the Nutch default is 2592000 (30 days).
  </description>
</property>
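Since the value is given in seconds it is easy to get the arithmetic wrong, so here is a tiny helper for working it out. The function name is just something I made up for this example.

```python
def interval_seconds(days=0, hours=0):
    """Convert a re-fetch interval in days and/or hours to seconds."""
    return int(days * 86400 + hours * 3600)


if __name__ == "__main__":
    print(interval_seconds(days=30))   # 2592000
    print(interval_seconds(hours=12))  # 43200
```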
If you set the number too low you may get into an infinite loop. When the crawl finishes one of its cycles, it will notice that it is already time for that url to be re-crawled, and at the end of the next cycle it will notice the same thing again. So ensure that the number of seconds you select is longer than the Nutch crawler will take to complete its crawl of everything.
Adding the following code will delete all urls/files marked with DB_GONE, ensuring that all remaining urls/files are active and up-to-date.
<property>
  <name>db.update.purge.404</name>
  <value>true</value>
  <description>If true, updatedb will purge records with status DB_GONE
  from the CrawlDB.
  </description>
</property>
You are now all set to re-crawl! Just issue the same crawl command as normal, but ensure the folder containing the Nutch indexes is the one you used before. So if you performed the crawl previously and your folder was named "Test":
./nutch crawl urls -dir Test -solr http://localhost:8080/solr/ -depth 3 -topN 100
Then issue exactly the same command again to re-crawl, although you are free to change the depth and topN to your own liking:
./nutch crawl urls -dir Test -solr http://localhost:8080/solr/ -depth 3 -topN 100

Issue

There is an issue when it comes to indexing this to Solr. Nutch passes all the indexes over and Solr updates the ones that require updating. But if a url/file has been deleted, it has been purged from Nutch's indexes, so when Nutch passes them over, Solr does not know that it has been deleted and still stores a record of the old url/file. The issue can be combatted by deleting the Solr indexes and then passing the Nutch indexes over to Solr, and it comes with very little additional performance cost.
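One way to clear the Solr index before re-indexing is to post a delete-by-query for everything to the update handler. Below is a rough sketch using only the Python standard library. The function names are my own, and it assumes your Solr lives at the same http://localhost:8080/solr/ address used in the crawl command above and that the /update handler accepts XML.

```python
import urllib.request


def build_delete_all_request(solr_url):
    """Build (but do not send) a request that deletes every document."""
    url = solr_url.rstrip("/") + "/update?commit=true"
    body = b"<delete><query>*:*</query></delete>"
    return urllib.request.Request(
        url,
        data=body,  # supplying data makes this a POST
        headers={"Content-Type": "text/xml; charset=utf-8"},
    )


def delete_all(solr_url):
    """Send the delete request and return the HTTP status code."""
    with urllib.request.urlopen(build_delete_all_request(solr_url)) as response:
        return response.status


if __name__ == "__main__":
    # WARNING: this wipes the whole index at the given address.
    print(delete_all("http://localhost:8080/solr/"))
```

Run it just before the re-crawl command so the index is only empty for as long as the crawl takes.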

Wednesday, 14 August 2013

Configuring Solr 4 Data Import Handler with JDBC

Preparation

You should have set-up Solr on Tomcat along with Tika's extracting request handler as shown in the previous two guides:
   http://amac4.blogspot.co.uk/2013/07/setting-up-solr-with-apache-tomcat-be.html
   http://amac4.blogspot.co.uk/2013/07/setting-up-tika-extracting-request.html


Set-Up

  • Firstly we need the libraries that are required to use Data Import Handler. Create a folder and name it dih (preferably in your $SOLR_HOME), and place solr-dataimporthandler-4.0.0.jar and solr-dataimporthandler-extras-4.0.0.jar from $SOLR/dist directory in the dih folder. 

    Add this to the solrconfig.xml file:
<lib dir="../../dih" regex=".*\.jar" />
  • Now we modify the solrconfig.xml file. Add the following :
<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
<lst name="defaults">
<str name="config">db-data-config.xml</str>
</lst>
</requestHandler>
  • Create a db-data-config.xml. This is for the Data Import Handler configuration. It should look similar to:
<dataConfig>
<dataSource driver="org.postgresql.Driver"
url="jdbc:postgresql://localhost:1234/users" user="users"
password="secret" />
<document>
<entity name="user" query="SELECT id, name from users">
<field column="id" name="id" />
<field column="name" name="name" />
<entity name="birthday" query="select birthday from table where id=${user.id}">
<field column="birthday" name="birthday" />
</entity>
</entity>
</document>
</dataConfig>
  • For other database engines enter the specific driver, url, user and password attributes.
  • We now need to modify the fields section of the schema.xml file to something like the following snippet:
<field name="id" type="string" indexed="true" stored="true"
required="true"/>
<field name="name" type="text" indexed="true" stored="true" />
<field name="birthday" type="text" indexed="true" stored="true"/>
<field name="description" type="text" indexed="true"
stored="true"/>

Solr may not like type="text"; if so, change it to text_general.

One more thing before the indexing: you should copy an appropriate JDBC driver to the $SOLR_HOME/lib directory of your Solr installation or to the dih directory we created before. You can get the driver libraries from the database vendors' websites.
To index, request the following URL: http://localhost:8080/solr/dataimport?command=full-import. The import runs asynchronously: the HTTP request returns straight away, so the response will not tell you when indexing has finished. To check on the progress of the import, request http://localhost:8080/solr/dataimport?command=status.
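If you want to script the import rather than use a browser, something like the following can build the URLs and interpret the status response. Treat it as a sketch: the function names are mine, and it assumes that asking for wt=json returns a JSON body whose "status" field reads "busy" while an import is running and "idle" otherwise.

```python
import json
from urllib.parse import urlencode


def dih_url(solr_url, command):
    """Build a Data Import Handler URL for the given command."""
    query = urlencode({"command": command, "wt": "json"})
    return solr_url.rstrip("/") + "/dataimport?" + query


def import_finished(status_body):
    """Return True if a status response says no import is running.

    Assumes the JSON body has a top-level "status" field ("idle"/"busy").
    """
    return json.loads(status_body).get("status") == "idle"


if __name__ == "__main__":
    print(dih_url("http://localhost:8080/solr/", "full-import"))
    print(dih_url("http://localhost:8080/solr/", "status"))
```

You would fetch the first URL once to start the import, then fetch the second every few seconds until import_finished returns True.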

Tuesday, 13 August 2013

Setting Up Highlighting For Solr 4

The large search engines like Google and Bing show you a small snippet of text that often contains one or more of the keywords that have been searched for. Setting this up on Solr is very straightforward, and this is a short guide on how to do it.

Schema.xml

Always ensure that your schema.xml files for Nutch and for Solr are identical, otherwise you will encounter problems, so these changes must be applied to both.
Ensure that the content field (or whichever fields you wish to highlight) is set to stored.
  <field name="content" type="text_general" stored="true" indexed="true"/>
NOTE: For pre-solr 4.0.0 "text_general" is called "text"
You also need to ensure that these two lines are present:
  <field name="id" type="string" indexed="true" stored="true" required="true" />
  <field name="name" type="text_general" indexed="true" stored="true" />

Solrconfig.xml

There do not need to be any major changes to this file, but it contains a lot of settings that you are able to change, allowing you to tweak the highlighting functionality to your own liking.
  <!-- Highlighting defaults -->
     <str name="hl">on</str>
     <str name="hl.fl">content</str>
     <str name="hl.encoder">html</str>
     <str name="hl.simple.pre">&lt;b&gt;</str>
     <str name="hl.simple.post">&lt;/b&gt;</str>
     <str name="f.title.hl.fragsize">0</str>
     <str name="f.title.hl.alternateField">title</str>
     <str name="f.name.hl.fragsize">0</str>
     <str name="f.name.hl.alternateField">name</str>
     <str name="f.content.hl.snippets">3</str>
     <str name="f.content.hl.fragsize">200</str>
     <str name="f.content.hl.alternateField">content</str>
     <str name="f.content.hl.maxAlternateFieldLength">750</str>

Querying

When you query the Solr Server with highlighting enabled, the response will contain an extra element named highlighting. The name attribute of each nested lst element matches the id of a document in the results, so the snippets can easily be matched up using an XPath expression.
  <lst name="highlighting">
    <lst name="file:/C:/Users/alamil/Documents/TextFiles/a.doc">
       <arr name="content">
        <str>Budget and Council Tax  POLICY AND RESOURCES COMMITTEE  BUDGET<em>STRATEGY</em></str>
      </arr>
    </lst>
    <lst name="file:/C:/Users/alamil/Documents/TextFiles/b.doc">
      <arr name="content">
        <str>CONTENTS Introduction Customer Care Standards <em>Strategy</em></str>
      </arr>
    </lst>
  </lst>
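As a rough illustration of matching snippets to document ids, here is a sketch using Python's built-in ElementTree, which understands a small subset of XPath. The function name is my own, and it assumes the response is the XML format with the highlighting section nested somewhere inside the root element.

```python
import xml.etree.ElementTree as ET


def snippets_by_doc(response_xml, field="content"):
    """Map each document id to its list of highlighted snippets."""
    root = ET.fromstring(response_xml)
    result = {}
    # Each <lst> directly under <lst name="highlighting"> is one document.
    for doc in root.findall(".//lst[@name='highlighting']/lst"):
        arr = doc.find(f"arr[@name='{field}']")
        if arr is None:
            snippets = []  # nothing highlighted in this field
        else:
            # itertext() copes with snippet markup whether it is escaped or not.
            snippets = ["".join(s.itertext()) for s in arr.findall("str")]
        result[doc.get("name")] = snippets
    return result
```

The returned dictionary can then be looked up with the id of each document in the main result list.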
The most common highlighting parameters available to the user are:
hl=true: If you want highlighting this must ALWAYS be true. Any blank, missing or "false" value disables the highlighting feature.
hl.fl=content: Enables highlighting in that field. By default you are probably going to want to use content, but other fields can also be selected.
hl.snippets=5: Accepts a number as its value, which sets the maximum number of highlighted snippets to be returned per field in a query response. The default value is 1.
hl.requireFieldMatch: Accepts true or false. If true, a field is only highlighted when the keyword was matched in that field. The default value is "false".
hl.maxAnalyzedChars: Decides how many characters into a document should be considered for highlighting. The default value is "51200".
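These parameters are simply added to the query string. A small sketch of building such a URL is below. The function name is made up, and it assumes the default /select handler on the same http://localhost:8080/solr/ address used elsewhere on this blog.

```python
from urllib.parse import urlencode


def highlight_query_url(solr_url, query, field="content", snippets=3):
    """Build a select URL with highlighting switched on for one field."""
    params = {
        "q": query,
        "hl": "true",            # highlighting must be enabled explicitly
        "hl.fl": field,          # field to highlight
        "hl.snippets": snippets, # maximum snippets per field
    }
    return solr_url.rstrip("/") + "/select?" + urlencode(params)


if __name__ == "__main__":
    print(highlight_query_url("http://localhost:8080/solr/", "strategy"))
```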

Friday, 9 August 2013

Solr 4 Deduplication On Windows

Setting Up Deduplication

I followed the Solr wiki to try and set up deduplication and all it did was cause my Solr server to break. The settings below worked for me and should work for you too.

Any changes made to the schema.xml file for Solr should also be reflected in the schema.xml file in Nutch.

schema.xml

You need a separate field to store the signature:


 <field name="signature" type="string" stored="true" indexed="true" multiValued="false" />

solrconfig.xml

The SignatureUpdateProcessorFactory has to be registered in the solrconfig.xml as part of the UpdateRequest Chain:
<updateRequestProcessorChain name="dedupe">
    <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
      <bool name="enabled">true</bool>
      <bool name="overwriteDupes">true</bool>
      <str name="signatureField">signature</str>
      <str name="fields">id</str>
      <str name="signatureClass">org.apache.solr.update.processor.Lookup3Signature</str>
    </processor>
    <processor class="solr.LogUpdateProcessorFactory" />
    <processor class="solr.RunUpdateProcessorFactory" />
  </updateRequestProcessorChain>
Also be sure to change your update handlers to use the defined chain:


 <requestHandler name="/update" class="solr.XmlUpdateRequestHandler" >
    <lst name="defaults">
      <str name="update.chain">dedupe</str>
    </lst>
  </requestHandler>
The update processor can also be specified per request with a parameter of update.chain=dedupe.
Note that for pre-Solr3.2 you need to use update.processor instead
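If it helps to picture what the chain is doing, here is a toy version in Python. It is not Solr's implementation: it uses MD5 from the standard library in place of Lookup3Signature, and the function names are my own. The idea is the same though: a signature is computed from the chosen fields, and with overwriteDupes on, a later document with the same signature replaces the earlier one.

```python
import hashlib


def signature(doc, fields):
    """Hash the chosen fields of a document (MD5 stands in for Lookup3)."""
    digest = hashlib.md5()
    for name in fields:
        digest.update(str(doc.get(name, "")).encode("utf-8"))
        digest.update(b"\x00")  # separator so "ab"+"c" differs from "a"+"bc"
    return digest.hexdigest()


def dedupe(docs, fields):
    """Keep one document per signature; later documents overwrite earlier ones."""
    kept = {}
    for doc in docs:
        kept[signature(doc, fields)] = doc
    return list(kept.values())
```

With fields set to id, as in the config above, only documents sharing an id collide. Listing content fields instead would also collapse documents that have different ids but identical text.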

Deleting Dead URLs & Files That No Longer Exist in Nutch

Nutch allows you to crawl the web or a filesystem in order to build up an index store of all its content. If your objective is simply to crawl the content once, it is fairly easy. But if you want to continuously monitor a site and crawl updates, it can be harder, mainly because the Nutch documentation does not have many details about that.

When you re-crawl your source there can be previously active URLs or files that no longer exist, and you would like Nutch to remove them from its indexes.  By default Nutch updates any changes made to documents it has indexed in the past, but any files that have been deleted still remain in the indexes.

If you wish to skip straight to the solution then just go to the end of the post, but if you wish to understand what is happening then read on.

Nutch stores a record of all the files/urls it has encountered whilst doing its crawl; this record is called the crawldb.  Initially it is built from the list of urls/files provided by the user using the inject command, which will normally be taken from your seed.txt file.

Nutch uses a generate/fetch/update process:
generate:  This command looks at the crawldb for all the urls/files that are due for fetching and groups them into a segment. A url/file is due for fetching if it is new, or if its fetch interval has expired and it is now due for re-crawling (the default is 30 days).
fetch:  This command will go and fetch all the urls/files specified in the segment.
update:  This command will add the results of the crawl, which have been stored in the segment, into the crawldb. Each url/file will be updated to indicate the time it was fetched and when its next scheduled fetch is.  If any new urls/files have been discovered, they will be added and marked as not fetched.
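The "due for fetching" test in the generate step can be sketched like this. It is a simplification under my own assumptions: the crawldb is modelled as a plain dictionary, the field names fetch_time and interval are invented for the example, and urls are picked in alphabetical order, which is not how Nutch ranks them.

```python
DEFAULT_INTERVAL = 2592000  # 30 days, in seconds


def is_due(entry, now, default_interval=DEFAULT_INTERVAL):
    """A url/file is due if it was never fetched or its interval has expired."""
    last_fetch = entry.get("fetch_time")
    if last_fetch is None:
        return True
    return last_fetch + entry.get("interval", default_interval) <= now


def generate(crawldb, now, top_n=None):
    """Return the urls that would go into the next segment."""
    due = sorted(url for url, entry in crawldb.items() if is_due(entry, now))
    return due if top_n is None else due[:top_n]
```

This also shows why a second crawl straight after the first finds nothing to do: every url has just been fetched, so none of them is due until its interval runs out.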

How can Nutch detect whether a page has changed or not? Each time a page is fetched, Nutch computes a signature for the page. At the next fetch, if the signature is the same (or if a 304 is returned by the web server because of the If-Modified-Since header), Nutch knows the page has not been modified; otherwise it is marked as modified. (It is not just the content: if the http headers or metatags have changed it will be marked as modified.)  If a document no longer exists the server returns a 404 and the url/file will be marked DB_GONE. During the update cycle Nutch has the ability to purge all the urls/files that have been marked DB_GONE.

The linkdb stores the inverted link information that Nutch has gathered from the crawl (for each url, the known incoming links and their anchor text). During the solrindex process Nutch combines this with the crawldb and the fetched segments to build the data it passes to the Solr Server.

To tell Nutch that you would like all the urls/files that have been deleted to be removed, you need to add the following code to your nutch-site.xml:

<property>
  <name>db.update.purge.404</name>
  <value>true</value>
  <description>If true, updatedb will purge records with status DB_GONE
  from the CrawlDB.
  </description>
</property>
I hope this is of some help to you.