Class Robots.Host

  • Enclosing class:
    Robots

    protected class Robots.Host
    extends java.lang.Object
    This class maintains status for a given host. There's an instance of this class for each host in the robots cache.
    • Field Summary

      Fields 
      Modifier and Type Field Description
      protected int checkingRobots
      This will be set to nonzero if the robots structure is currently in use.
      protected java.lang.String hostName
      Host name
      protected long invalidTime
      Timestamp.
      protected boolean isValid
      This flag describes whether or not the host record is valid yet.
      protected int port
      Port
      protected java.lang.String protocol
      Protocol
      protected boolean readingRobots
      This will be set to "true" if the robots.txt for this host is in the process of being read.
      protected java.util.ArrayList records
      This is the list of robots records for the host, or null if no robots.txt was found.
    • Constructor Summary

      Constructors 
      Constructor Description
      Host(java.lang.String protocol, int port, java.lang.String hostName)
      Constructor.
    • Method Summary

      All Methods Instance Methods Concrete Methods 
      Modifier and Type Method Description
      boolean canBeFlushed(long currentTime)
      Check if the current record can be flushed.
      boolean isFetchAllowed(org.apache.manifoldcf.core.interfaces.IThreadContext threadContext, java.lang.String throttleGroupName, long currentTime, java.lang.String pathString, java.lang.String userAgent, java.lang.String from, java.lang.String proxyHost, int proxyPort, java.lang.String proxyAuthDomain, java.lang.String proxyAuthUsername, java.lang.String proxyAuthPassword, org.apache.manifoldcf.crawler.interfaces.IProcessActivity activities, int connectionLimit)
      Check a given path string against this host's robots file.
      protected void makeValid(org.apache.manifoldcf.core.interfaces.IThreadContext threadContext, java.lang.String throttleGroupName, long currentTime, java.lang.String userAgent, java.lang.String from, java.lang.String proxyHost, int proxyPort, java.lang.String proxyAuthDomain, java.lang.String proxyAuthUsername, java.lang.String proxyAuthPassword, java.lang.String hostName, org.apache.manifoldcf.crawler.interfaces.IProcessActivity activities, int connectionLimit)
      Initialize the record.
      protected void parseRobotsTxt(java.io.BufferedReader r, java.lang.String hostName, org.apache.manifoldcf.crawler.interfaces.IProcessActivity activities)
      Parse the robots.txt file using a reader.
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Field Detail

      • protocol

        protected java.lang.String protocol
        Protocol
      • port

        protected int port
        Port
      • hostName

        protected java.lang.String hostName
        Host name
      • invalidTime

        protected long invalidTime
        Timestamp. This is the time at which the cache record becomes invalid.
      • isValid

        protected boolean isValid
        This flag describes whether or not the host record is valid yet.
      • records

        protected java.util.ArrayList records
        This is the list of robots records for the host, or null if no robots.txt was found.
      • readingRobots

        protected boolean readingRobots
        This will be set to "true" if the robots.txt for this host is in the process of being read.
      • checkingRobots

        protected int checkingRobots
        This will be set to nonzero if the robots structure is currently in use.
    • Constructor Detail

      • Host

        public Host(java.lang.String protocol,
                    int port,
                    java.lang.String hostName)
        Constructor.
    • Method Detail

      • isFetchAllowed

        public boolean isFetchAllowed(org.apache.manifoldcf.core.interfaces.IThreadContext threadContext,
                                      java.lang.String throttleGroupName,
                                      long currentTime,
                                      java.lang.String pathString,
                                      java.lang.String userAgent,
                                      java.lang.String from,
                                      java.lang.String proxyHost,
                                      int proxyPort,
                                      java.lang.String proxyAuthDomain,
                                      java.lang.String proxyAuthUsername,
                                      java.lang.String proxyAuthPassword,
                                      org.apache.manifoldcf.crawler.interfaces.IProcessActivity activities,
                                      int connectionLimit)
                               throws org.apache.manifoldcf.agents.interfaces.ServiceInterruption,
                                      org.apache.manifoldcf.core.interfaces.ManifoldCFException
        Check a given path string against this host's robots file.
        Parameters:
        currentTime - is the current time in milliseconds since epoch.
        pathString - is the path string to check.
        Returns:
        true if crawling is allowed, false otherwise.
        Throws:
        org.apache.manifoldcf.agents.interfaces.ServiceInterruption
        org.apache.manifoldcf.core.interfaces.ManifoldCFException
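        The path check described above can be illustrated with a minimal, standalone sketch. It is not the ManifoldCF implementation: the class FetchCheckSketch, the method isPathAllowed, and the simple prefix-matching rule are all hypothetical, and treating a null list as "no restrictions" is an assumption based on the records field being null when no robots.txt was found.

```java
import java.util.List;

// Hypothetical helper; not part of Robots.Host.
class FetchCheckSketch {
    // Returns true if pathString is not covered by any Disallow prefix.
    static boolean isPathAllowed(List<String> disallowPrefixes, String pathString) {
        // Assumption: a null list means no robots.txt was found, so nothing is restricted.
        if (disallowPrefixes == null) {
            return true;
        }
        for (String prefix : disallowPrefixes) {
            // An empty Disallow value restricts nothing, so it is skipped.
            if (!prefix.isEmpty() && pathString.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }
}
```

        For example, with the single prefix "/private/", the path "/private/a.html" would be rejected and "/public/a.html" accepted.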
      • canBeFlushed

        public boolean canBeFlushed(long currentTime)
        Check if the current record can be flushed. This is not quite the same as checking whether the record is valid: a not-yet-valid record should still not be flushed while there is activity going on with that record.
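        One way to read this rule, using the fields documented above, is sketched below. The class HostRecordSketch is a hypothetical stand-in, and the exact conditions are an assumption rather than the actual ManifoldCF logic: a record with activity in progress is never flushable, and an idle record is flushable if it is not yet valid or has passed its invalidTime.

```java
// Hypothetical stand-in for the bookkeeping fields of Robots.Host.
class HostRecordSketch {
    boolean isValid = false;        // whether the record has been initialized
    long invalidTime = -1L;         // time at which the record becomes invalid
    boolean readingRobots = false;  // robots.txt is currently being read
    int checkingRobots = 0;         // number of checks currently using the record

    // Assumed rule, not confirmed by the source.
    synchronized boolean canBeFlushed(long currentTime) {
        // Any activity on the record blocks flushing, valid or not.
        if (readingRobots || checkingRobots > 0) {
            return false;
        }
        // An idle record that was never made valid holds nothing worth keeping.
        if (!isValid) {
            return true;
        }
        // An idle, valid record can go once it has expired.
        return currentTime >= invalidTime;
    }
}
```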
      • makeValid

        protected void makeValid(org.apache.manifoldcf.core.interfaces.IThreadContext threadContext,
                                 java.lang.String throttleGroupName,
                                 long currentTime,
                                 java.lang.String userAgent,
                                 java.lang.String from,
                                 java.lang.String proxyHost,
                                 int proxyPort,
                                 java.lang.String proxyAuthDomain,
                                 java.lang.String proxyAuthUsername,
                                 java.lang.String proxyAuthPassword,
                                 java.lang.String hostName,
                                 org.apache.manifoldcf.crawler.interfaces.IProcessActivity activities,
                                 int connectionLimit)
                          throws org.apache.manifoldcf.agents.interfaces.ServiceInterruption,
                                 org.apache.manifoldcf.core.interfaces.ManifoldCFException
        Initialize the record. This method reads the robots file on the specified protocol/host/port, and parses it according to the rules.
        Throws:
        org.apache.manifoldcf.agents.interfaces.ServiceInterruption
        org.apache.manifoldcf.core.interfaces.ManifoldCFException
      • parseRobotsTxt

        protected void parseRobotsTxt(java.io.BufferedReader r,
                                      java.lang.String hostName,
                                      org.apache.manifoldcf.crawler.interfaces.IProcessActivity activities)
                               throws java.io.IOException,
                                      org.apache.manifoldcf.core.interfaces.ManifoldCFException
        Parse the robots.txt file using a reader. This method is not expected to close the stream.
        Throws:
        java.io.IOException
        org.apache.manifoldcf.core.interfaces.ManifoldCFException
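        As a rough illustration of reader-based parsing that leaves the stream open, the sketch below groups User-agent lines into records and collects their Disallow prefixes. The names RobotsParseSketch, Record, and parse are hypothetical, only a small subset of robots.txt is handled, and the real method also takes a host name and an activities object that are omitted here.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical simplified parser; not the ManifoldCF implementation.
class RobotsParseSketch {
    // One record: the agent names it applies to and its Disallow prefixes.
    static class Record {
        final List<String> agents = new ArrayList<>();
        final List<String> disallows = new ArrayList<>();
    }

    static List<Record> parse(BufferedReader r) throws IOException {
        List<Record> records = new ArrayList<>();
        Record current = null;
        boolean inRules = false; // true once the current record has seen a rule line
        String line;
        while ((line = r.readLine()) != null) {
            // Drop trailing comments, then surrounding whitespace.
            int hash = line.indexOf('#');
            if (hash >= 0) {
                line = line.substring(0, hash);
            }
            line = line.trim();
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // blank or unrecognized line
            }
            String key = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (key.equals("user-agent")) {
                // A User-agent line after rule lines starts a new record.
                if (current == null || inRules) {
                    current = new Record();
                    records.add(current);
                    inRules = false;
                }
                current.agents.add(value.toLowerCase(Locale.ROOT));
            } else if (current != null) {
                inRules = true;
                if (key.equals("disallow") && !value.isEmpty()) {
                    current.disallows.add(value);
                }
            }
        }
        // The reader is deliberately left open; closing it is the caller's job.
        return records;
    }
}
```

        Because the sketch never closes the reader, a caller can wrap the call in its own try-with-resources block and keep ownership of the underlying connection.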